arxiv: 2606.11507 · v1 · pith:EYZZQOTUnew · submitted 2026-06-09 · 💻 cs.CV

SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Pith reviewed 2026-06-27 13:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene mining · bird's-eye-view · multi-task fine-tuning · cross-task interference · identity-preserving tuning · driving logs · vision-language backbone · BEV pipeline

The pith

Zero-initializing new sub-modules and freezing shared parameters eliminates cross-task interference in multi-head BEV models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding heads to a shared vision-language backbone for scene mining causes cross-task interference: new parameters shift the activation stream and degrade existing heads even when those heads stay frozen. Their identity-preserving multi-task fine-tuning counters this by zero-initializing every new sub-module and freezing all parameters that feed the shared stream. This keeps every original mining head bit-identical while training only about 102k parameters. A reader would care because it lets one camera-only BEV pipeline produce retrieval embeddings, scene tags, and risk scores together without separate models or performance loss on any signal.

Core claim

Cross-task interference occurs when a new head is added because its parameters alter the shared activation stream, degrading weight-frozen sibling heads. Identity-preserving multi-task fine-tuning removes the interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream, so the mining heads remain preserved bit-identically while only ~102k parameters are trained. The resulting model emits a text-prompted retrieval embedding, a 20-tag multi-label distribution, and a physics-based risk score from a single forward pass on camera input.

What carries the argument

Identity-preserving multi-task fine-tuning: zero-initializing every new sub-module and freezing every parameter that feeds the shared activation stream, which preserves existing heads bit-identically.
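The mechanism can be illustrated with a toy model. The following is a minimal sketch, not the paper's architecture: the encoder, the residual adapter (standing in for a "new sub-module"), the dimensions, and all names are hypothetical. It shows that a randomly initialized adapter on the shared stream changes the output of an old head whose weights never change, while an adapter with a zero-initialized output projection leaves that output exactly equal.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def shared_stream(x, W_enc, adapter=None):
    """Frozen toy encoder, optionally followed by a residual adapter (hypothetical)."""
    h = [math.tanh(z) for z in matvec(W_enc, x)]
    if adapter is not None:
        W_down, W_up = adapter
        delta = matvec(W_up, [math.tanh(z) for z in matvec(W_down, h)])
        h = [a + b for a, b in zip(h, delta)]  # residual add onto the shared stream
    return h

def old_head(h, W_head):
    """Existing head; its weights are never modified in this sketch."""
    return matvec(W_head, h)

rng = random.Random(0)
D, R = 8, 2  # stream width and adapter bottleneck (arbitrary toy sizes)
W_enc = rand_matrix(D, D, rng)
W_head = rand_matrix(3, D, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(D)]

baseline = old_head(shared_stream(x, W_enc), W_head)

# Randomly initialized adapter: the old head's input shifts, so its output shifts.
random_adapter = (rand_matrix(R, D, rng), rand_matrix(D, R, rng))
shifted = old_head(shared_stream(x, W_enc, random_adapter), W_head)

# Zero-initialized output projection: the adapter adds exactly 0.0 to the stream.
zero_adapter = (rand_matrix(R, D, rng), zeros(D, R))
preserved = old_head(shared_stream(x, W_enc, zero_adapter), W_head)

interference = max(abs(a - b) for a, b in zip(baseline, shifted))
print("max |delta| with random init:", interference)
print("identical with zero init:", preserved == baseline)
```

Zero-initialization alone only guarantees identity at the moment the module is added. Keeping the old heads unchanged during training additionally requires that no trainable parameter feeds the stream they read, which is the second half of the stated recipe.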

If this is right

  • A single frozen vision-language backbone can produce retrieval, tagging, and risk signals in one forward pass without LiDAR or radar.
  • Existing mining heads stay bit-identical after new heads are added and trained.
  • Only ~102k parameters require training when a new head is introduced.
  • The tagging head reaches mAP 0.4614 and micro-F1 0.5557 on 20 scene tags by pooling scenes into 32 visual tokens.
  • Text-prompted retrieval becomes possible while the other heads remain untouched.
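For readers unfamiliar with the two tagging metrics quoted above, here is a minimal sketch of how micro-F1 and mAP are commonly computed for multi-label predictions. The toy labels, scores, and the 0.5 decision threshold are illustrative assumptions; the paper's evaluation protocol is not described in the text available here.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (scene, tag) pair."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def average_precision(labels, scores):
    """AP for one tag: mean of precision measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(y_true, y_score):
    """mAP: unweighted mean of per-tag AP."""
    n_tags = len(y_true[0])
    aps = [
        average_precision([row[k] for row in y_true], [row[k] for row in y_score])
        for k in range(n_tags)
    ]
    return sum(aps) / n_tags

# Toy data: 4 scenes x 3 tags (the paper uses 20 tags; these values are made up).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_score = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.6, 0.3, 0.2], [0.7, 0.1, 0.9]]
y_pred = [[1 if s >= 0.5 else 0 for s in row] for row in y_score]  # assumed threshold

f1 = micro_f1(y_true, y_pred)
m_ap = mean_average_precision(y_true, y_score)
print("micro-F1:", f1)
print("mAP:", m_ap)
```

Micro-F1 depends on the chosen threshold, while mAP depends only on the ranking of scores within each tag, which is one reason the two numbers are usually reported together.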

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same zero-init and freeze pattern could be tested on multi-task models outside driving scenes, such as general video understanding.
  • It may support adding more than three heads without interference if the shared stream remains frozen.
  • Exact bit-identity of weights before and after could be verified by direct parameter comparison on public checkpoints.
  • The approach might reduce the need for task-specific fine-tuning runs in any setting where heads share an activation backbone.
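The direct parameter comparison suggested above could look roughly like the following. This is a sketch under stated assumptions: checkpoints are modeled as plain dictionaries mapping a parameter name to a flat list of floats, and the parameter names are hypothetical. Comparing raw IEEE-754 bytes rather than using `==` is deliberate, since `0.0 == -0.0` is true even though the bit patterns differ.

```python
import struct

def float_bits(x):
    """Raw IEEE-754 binary64 bytes of a Python float."""
    return struct.pack("<d", x)

def changed_params(before, after):
    """Names of parameters in `before` that are missing or not bit-identical in `after`.

    Parameters that exist only in `after` (newly added heads) are ignored.
    """
    changed = []
    for name in sorted(before):
        a, b = before[name], after.get(name)
        if b is None or len(a) != len(b):
            changed.append(name)
        elif any(float_bits(u) != float_bits(v) for u, v in zip(a, b)):
            changed.append(name)
    return changed

# Hypothetical checkpoints before and after adding a tagging head.
before = {
    "retrieval_head.weight": [0.1, -0.2, 0.3],
    "risk_head.weight": [0.0, 1.5],
}
after_ok = {
    "retrieval_head.weight": [0.1, -0.2, 0.3],
    "risk_head.weight": [0.0, 1.5],
    "tag_head.weight": [0.01, 0.02],  # new parameters are allowed to differ
}
after_bad = dict(after_ok, **{"risk_head.weight": [-0.0, 1.5]})  # sign bit flipped

print(changed_params(before, after_ok))
print(changed_params(before, after_bad))
```

Identical weights are necessary but not sufficient for identical outputs, because an old head can still receive a shifted input; an output-level check is needed as well.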

Load-bearing premise

Zero-initializing new sub-modules and freezing parameters that feed the shared stream will keep the original mining heads completely unchanged with no measurable activation shift.

What would settle it

The claim would be falsified by any non-zero difference in the output logits or weights of an existing head after a new head has been added and trained with zero-initialized modules and frozen shared-stream parameters.
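Such a test can be sketched on a toy model. Everything below is a hypothetical stand-in, not the released code: a one-layer `tanh` encoder plays the shared stream, `old_head` is an existing head, and `new_head` is a zero-initialized head trained with hand-written SGD on a squared-error loss. The same run is repeated with the encoder left trainable as a control.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(params, x):
    h = [math.tanh(z) for z in matvec(params["enc"], x)]  # shared activation stream
    return h, matvec(params["old_head"], h), matvec(params["new_head"], h)

def sgd_step(params, x, target, lr, trainable):
    """One SGD step on 0.5 * ||y_new - target||^2, updating only `trainable` entries."""
    h, _, y_new = forward(params, x)
    err = [y - t for y, t in zip(y_new, target)]
    # Gradient flowing back into the shared stream from the new head.
    grad_h = [
        sum(params["new_head"][k][j] * err[k] for k in range(len(err)))
        for j in range(len(h))
    ]
    grads = {
        "new_head": [[e * hj for hj in h] for e in err],
        "enc": [[g * (1.0 - hj * hj) * xi for xi in x] for g, hj in zip(grad_h, h)],
    }
    for name in trainable:
        params[name] = [
            [w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(params[name], grads[name])
        ]

def run(trainable, steps=5):
    rng = random.Random(0)
    D = 4
    params = {
        "enc": [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)],
        "old_head": [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(2)],
        "new_head": [[0.0] * D for _ in range(2)],  # zero-initialized new head
    }
    x = [rng.uniform(-1, 1) for _ in range(D)]
    _, y_old_before, _ = forward(params, x)
    for _ in range(steps):
        sgd_step(params, x, target=[1.0, -1.0], lr=0.1, trainable=trainable)
    _, y_old_after, y_new = forward(params, x)
    return y_old_before, y_old_after, y_new

# Recipe under test: only the new head trains; the shared stream stays frozen.
frozen_before, frozen_after, y_new = run(trainable=("new_head",))
# Control: the encoder feeding the shared stream is also allowed to train.
leaky_before, leaky_after, _ = run(trainable=("new_head", "enc"))

print("old head unchanged, shared stream frozen:", frozen_before == frozen_after)
print("old head unchanged, encoder trainable:  ", leaky_before == leaky_after)
```

In the control run the old head's own weights never change, yet its output moves once the new head's gradient reaches the encoder. That is the interference pattern the paper describes, reproduced here only in miniature.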

Figures

Figures reproduced from arXiv: 2606.11507 by Abdalmalek Aburaddaha, Keval Thaker, Samir A. Rawashdeh, Venkatraman Narayanan.

Figure 1: SceneMiner architecture. A frozen SigLIP2 ...
Figure 2: Qualitative text-prompted retrieval. Each row shows the top-3 scenes for one ...
Original abstract

Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: https://anonymous.4open.science/r/sceneminer_anonymous-64E5
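Text-prompted retrieval over a shared embedding space usually reduces to ranking scenes by similarity to an embedded prompt. The sketch below assumes cosine similarity and uses made-up 3-dimensional vectors and scene identifiers; the abstract does not specify the similarity measure or the embedding size.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, scene_embeddings, k=3):
    """Return the k scene ids most similar to the query embedding, best first."""
    scored = [(cosine(query, emb), scene_id) for scene_id, emb in scene_embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [scene_id for _, scene_id in scored[:k]]

# Hypothetical scene embeddings and an already-embedded text prompt.
scenes = {
    "scene_a": [1.0, 0.0, 0.0],
    "scene_b": [0.9, 0.1, 0.0],
    "scene_c": [0.0, 1.0, 0.0],
    "scene_d": [-1.0, 0.0, 0.0],
}
prompt_embedding = [0.8, 0.1, 0.0]

ranked = top_k(prompt_embedding, scenes, k=3)
print(ranked)
```

Because only the ranking matters, scene embeddings can be computed once and reused for any number of prompts.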

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents SceneMiner, a unified camera-only BEV pipeline that extracts complementary mining signals (text-prompted retrieval embedding, multi-label scene-tag distribution, and physics-based risk score) from a frozen vision-language backbone in a single forward pass. It identifies cross-task interference as a failure mode where adding or upgrading one head degrades frozen sibling heads, and proposes identity-preserving multi-task fine-tuning via zero-initialization of new sub-modules and freezing of all parameters feeding the shared activation stream. This is claimed to preserve original heads bit-identically while training only ~102k parameters. The tagging head achieves mAP 0.4614 (micro-F1 0.5557) on 20 tags using 32 visual tokens per scene; retrieval is validated qualitatively. Code is released.

Significance. If the preservation mechanism holds under empirical scrutiny, the approach offers a lightweight way to extend pre-trained multi-head models without interference, which could be useful for scene mining and multi-task BEV perception in autonomous driving. The availability of code is a positive factor for reproducibility.

major comments (2)
  1. [Abstract] Abstract (paragraph on cross-task interference): The assertion that zero-initializing every new sub-module and freezing every parameter feeding the shared stream preserves the original mining heads bit-identically is load-bearing for the central contribution, yet the manuscript provides no quantitative verification such as output deltas, activation equality checks, or before/after comparisons on the retrieval embedding or risk-score heads.
  2. [Abstract] Abstract and method description: No ablation or control experiment is reported that demonstrates degradation of sibling heads when the identity-preserving steps are omitted, leaving the necessity and effectiveness of the proposed fine-tuning unverified despite the reported tagging mAP of 0.4614.
minor comments (1)
  1. [Abstract] The abstract states that a motion forecast is a byproduct but provides no details on its formulation or evaluation; if this is not central, it should be clarified as out of scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger empirical verification of the identity-preserving mechanism. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph on cross-task interference): The assertion that zero-initializing every new sub-module and freezing every parameter feeding the shared stream preserves the original mining heads bit-identically is load-bearing for the central contribution, yet the manuscript provides no quantitative verification such as output deltas, activation equality checks, or before/after comparisons on the retrieval embedding or risk-score heads.

    Authors: We agree that explicit quantitative verification strengthens the claim. The preservation holds by construction because new sub-modules are zero-initialized (contributing zero to the shared stream) and all parameters feeding the shared activation stream are frozen, leaving the forward pass through the original heads unchanged. To address the concern, we will add before/after comparisons in the revision, including L2 output deltas, activation equality checks, and numerical confirmation of identical outputs on the retrieval embedding and risk-score heads. revision: yes

  2. Referee: [Abstract] Abstract and method description: No ablation or control experiment is reported that demonstrates degradation of sibling heads when the identity-preserving steps are omitted, leaving the necessity and effectiveness of the proposed fine-tuning unverified despite the reported tagging mAP of 0.4614.

    Authors: The manuscript identifies cross-task interference as an observed failure mode, but we acknowledge the absence of a dedicated ablation. In the revision we will include a control experiment that omits zero-initialization of new sub-modules and/or the freezing of shared-stream parameters, reporting the resulting changes to the original heads (e.g., shifts in retrieval embeddings and risk scores) to demonstrate both necessity and effectiveness. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is architectural and empirical


The paper presents an engineering solution for multi-task fine-tuning via zero-initialization of new sub-modules and freezing of parameters feeding the shared stream, with the bit-identical preservation stated as a direct architectural consequence rather than a derived prediction. Reported metrics (e.g., tagging mAP 0.4614) are empirical outcomes on held-out data, not quantities forced by fitting the same inputs. No equations, self-citations, uniqueness theorems, or ansatzes appear in the provided text that reduce claims to inputs by construction. The derivation chain is self-contained as a practical design choice with released code.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The pipeline assumes a frozen vision-language backbone already encodes useful BEV features for all three heads; the interference-removal claim rests on the untested premise that zero-init plus selective freezing is sufficient to isolate new heads.

free parameters (1)
  • number of visual tokens = 32
    Scenes are pooled into exactly 32 visual tokens before the tagging head; this is a design choice that directly affects the reported mAP.
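How the pooling to 32 tokens is implemented is not stated in the text available here. As one plausible reading, the sketch below averages contiguous groups of tokens in the manner of adaptive average pooling; the function name, the grouping rule, and the toy input are assumptions rather than the authors' method.

```python
def pool_tokens(tokens, n_out=32):
    """Average contiguous groups of token vectors so that exactly n_out tokens remain.

    Group i covers indices floor(i * n_in / n_out) up to ceil((i + 1) * n_in / n_out),
    so groups may overlap by one token when n_in is not a multiple of n_out.
    """
    n_in = len(tokens)
    if n_in < n_out:
        raise ValueError("need at least n_out input tokens")
    dim = len(tokens[0])
    pooled = []
    for i in range(n_out):
        start = (i * n_in) // n_out
        end = ((i + 1) * n_in + n_out - 1) // n_out  # ceiling division
        group = tokens[start:end]
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

# Toy input: 64 tokens of dimension 2 (real visual tokens would be far wider).
tokens = [[float(i), 1.0] for i in range(64)]
pooled = pool_tokens(tokens, n_out=32)
print(len(pooled), pooled[0], pooled[-1])
```

Changing `n_out` trades spatial detail against sequence length for the tagging head, which is why the ledger treats the value 32 as a free parameter.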
axioms (1)
  • domain assumption: A frozen vision-language backbone supplies sufficiently rich features for simultaneous retrieval, tagging, and risk scoring in BEV without task-specific adaptation of the backbone itself.
    Invoked when the pipeline is described as emitting all signals from the frozen backbone in one forward pass.

pith-pipeline@v0.9.1-grok · 5811 in / 1396 out tokens · 31203 ms · 2026-06-27T13:00:53.766792+00:00 · methodology

