Abstract
Depth map estimation from images is a crucial task in self-driving applications. Existing methods fall into two groups: multi-view stereo and monocular depth estimation. The former requires cameras with large overlapping areas and a sufficient baseline between them, while the latter processes each image independently and therefore can hardly guarantee structural consistency between cameras. In this paper, we propose a novel self-supervised multi-camera collaborative depth prediction method based on latent diffusion models, which does not require large overlapping areas yet maintains structural consistency between cameras. Specifically, we introduce MCDP, a new generative foundation model for estimating depth attributes in multi-camera setups. We formulate depth estimation as a weighted combination of depth bases, in which the weights are updated iteratively by a recurrent refinement strategy. During the iterative update, the depth estimates are compared across cameras, and the information from overlapping areas is propagated to the whole depth maps through the basis formulation in the diffusion process. We integrate a GRU-based Weight Net into the diffusion process, allowing the refined hidden state to serve as a conditional input that accurately controls the next iterative denoising step. Furthermore, by incorporating the proposed depth consistency loss, we ensure structural consistency across cameras, even in regions with minimal overlap. Experimental results on the DDAD, NuScenes, Cityscapes, and Waymo Open datasets demonstrate the superior performance of our method and its benefit to the downstream task.
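The basis formulation in the abstract can be illustrated with a minimal sketch. It is not the MCDP implementation: plain gradient descent stands in for the learned GRU-based Weight Net, the diffusion process is omitted entirely, and the names `combine_depth_bases`, `refine_weights`, and `overlap_mask` are hypothetical. It only shows why per-image weights over shared bases let a constraint observed in a small overlap region change the whole depth map.

```python
import numpy as np

def combine_depth_bases(bases, weights):
    """Depth map as a weighted combination of K depth bases.

    bases:   array of shape (K, H, W)
    weights: array of shape (K,)
    returns: array of shape (H, W)
    """
    return np.tensordot(weights, bases, axes=1)

def refine_weights(bases, weights, target, overlap_mask, steps=2000, lr=0.5):
    """Iteratively update the weights using only the overlap region.

    ASSUMPTION: gradient descent on a masked squared error is a stand-in
    for the learned recurrent (GRU-based) update described in the abstract.
    """
    w = np.asarray(weights, dtype=float).copy()
    n = overlap_mask.sum()
    for _ in range(steps):
        # Residual is non-zero only where the cameras overlap.
        residual = (combine_depth_bases(bases, w) - target) * overlap_mask
        # Gradient of the mean masked squared error w.r.t. each weight.
        grad = 2.0 * np.tensordot(bases, residual, axes=([1, 2], [0, 1])) / n
        w -= lr * grad
    return w

# Toy setup: two bases (constant plane and horizontal ramp).
H, W = 4, 8
x = np.tile(np.linspace(0.0, 1.0, W), (H, 1))
bases = np.stack([np.ones((H, W)), x])
target = 2.0 + 3.0 * x             # depth implied by a neighbouring camera
overlap_mask = np.zeros((H, W))
overlap_mask[:, :3] = 1.0          # only the three leftmost columns overlap

w = refine_weights(bases, np.array([1.0, 0.0]), target, overlap_mask)
depth = combine_depth_bases(bases, w)
```

Because the weights are global to the image, fitting them in the three overlapping columns also fixes the depth in the five columns that no other camera sees, which is the propagation effect the abstract describes.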
| Original language | English |
|---|---|
| Pages (from-to) | 9609-9624 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Intelligent Transportation Systems |
| Volume | 26 |
| Issue number | 7 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
Keywords
- Multiple cameras
- depth estimation
- diffusion models
Fingerprint
Dive into the research topics of 'Self-Supervised Multi-Camera Collaborative Depth Prediction With Latent Diffusion Models'. Together they form a unique fingerprint.