CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow

Burhaneddin Yaman; Chenbin Pan; Liu Ren; Senem Velipasalar

arxiv: 2403.08919 · v2 · pith:KCZOIKWMnew · submitted 2024-03-13 · 💻 cs.CV


classification 💻 cs.CV

keywords clip-bevformer · flow · ground · multi-view · truth · achieves · address · approach


Autonomous driving stands as a pivotal domain in computer vision, shaping the future of transportation. Within this paradigm, the backbone of the system plays a crucial role in interpreting the complex environment. However, a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View (BEV) elements. To address this limitation, we introduce CLIP-BEVFormer, a novel approach that leverages the power of contrastive learning techniques to enhance the multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive 8.5% and 9.2% enhancement in terms of NDS and mAP, respectively, over the previous best BEV model on the 3D object detection task.
