SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Abdalmalek Aburaddaha; Keval Thaker; Samir A. Rawashdeh; Venkatraman Narayanan

arxiv: 2606.11507 · v1 · pith:EYZZQOTUnew · submitted 2026-06-09 · 💻 cs.CV


Pith reviewed 2026-06-27 13:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords scene mining · bird's-eye-view · multi-task fine-tuning · cross-task interference · identity-preserving tuning · driving logs · vision-language backbone · BEV pipeline

The pith

Zero-initializing new sub-modules and freezing shared parameters eliminates cross-task interference in multi-head BEV models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding heads to a shared vision-language backbone for scene mining causes cross-task interference: new parameters shift the activation stream and degrade existing heads even when those heads stay frozen. Their identity-preserving multi-task fine-tuning counters this by zero-initializing every new sub-module and freezing all parameters that feed the shared stream. This keeps every original mining head bit-identical while training only about 102k parameters. A reader would care because it lets one camera-only BEV pipeline produce retrieval embeddings, scene tags, and risk scores together without separate models or performance loss on any signal.

Core claim

Cross-task interference occurs when a new head is added because its parameters alter the shared activation stream, degrading weight-frozen sibling heads. Identity-preserving multi-task fine-tuning removes the interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream, so the mining heads remain preserved bit-identically while only ~102k parameters are trained. The resulting model emits a text-prompted retrieval embedding, a 20-tag multi-label distribution, and a physics-based risk score from a single forward pass on camera input.
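The failure mode and the fix can be illustrated with a toy in plain Python. Everything in it (the two-dimensional stream, `frozen_head`, `with_new_branch`, and the weight values) is a hypothetical stand-in, not the paper's architecture; the sketch only assumes that a new sub-module writes into the shared stream through a residual addition.

```python
# Toy model: a shared activation stream feeds a frozen sibling head.
# All names, shapes, and weights are hypothetical illustrations.

def linear(x, weight, bias):
    """Dense layer on plain lists: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def frozen_head(stream):
    """Existing head whose own weights never change."""
    return linear(stream, [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.3])

def with_new_branch(stream, weight, bias):
    """New sub-module adds its output to the shared stream (residual)."""
    delta = linear(stream, weight, bias)
    return [s + d for s, d in zip(stream, delta)]

stream = [1.5, -0.75]
baseline = frozen_head(stream)

# Non-zero init: the stream shifts, so the frozen head's output shifts too.
shifted = frozen_head(
    with_new_branch(stream, [[0.01, 0.0], [0.0, 0.01]], [0.0, 0.0]))

# Zero init: the branch contributes exactly 0.0, so the stream is unchanged.
preserved = frozen_head(
    with_new_branch(stream, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]))

print(baseline == shifted)    # False
print(baseline == preserved)  # True
```

The toy shows identity only at initialization. Whether it survives training depends on which parameters are allowed to update, which is the role the source assigns to freezing every parameter that feeds the shared stream.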

What carries the argument

Identity-preserving multi-task fine-tuning: zero-initializing every new sub-module and freezing every parameter that feeds the shared activation stream, which preserves existing heads bit-identically.

If this is right

A single frozen vision-language backbone can produce retrieval, tagging, and risk signals in one forward pass without LiDAR or radar.
Existing mining heads stay bit-identical after new heads are added and trained.
Only ~102k parameters require training when a new head is introduced.
The tagging head reaches mAP 0.4614 and micro-F1 0.5557 on 20 scene tags by pooling each scene into 32 visual tokens.
Text-prompted retrieval becomes possible while the other heads remain untouched.
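For readers less familiar with the two tagging metrics quoted above, a minimal sketch of how micro-F1 and mAP are commonly computed for multi-label output. The toy labels, scores, and the 0.5 threshold are invented for illustration and are not the paper's data; the source does not say which AP variant or threshold the authors use, so the non-interpolated AP below is an assumption.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (scene, tag) pair."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(truth_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def average_precision(labels, scores):
    """AP for one tag: mean precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(y_true, y_score):
    """mAP: average the per-tag AP over all tag columns."""
    n_tags = len(y_true[0])
    aps = [average_precision([row[k] for row in y_true],
                             [row[k] for row in y_score])
           for k in range(n_tags)]
    return sum(aps) / n_tags

# Toy data: 4 scenes, 2 tags (hypothetical values).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.8, 0.7], [0.4, 0.6], [0.1, 0.3]]
y_pred = [[int(s >= 0.5) for s in row] for row in y_score]

print(micro_f1(y_true, y_pred))                 # 0.75
print(mean_average_precision(y_true, y_score))  # about 0.9167
```

Micro-F1 depends on the decision threshold while mAP depends only on the ranking of scores, which is why the two numbers can move independently.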

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same zero-init and freeze pattern could be tested on multi-task models outside driving scenes, such as general video understanding.
It may support adding more than three heads without interference if the shared stream remains frozen.
Exact bit-identity of weights before and after could be verified by direct parameter comparison on public checkpoints.
The approach might reduce the need for task-specific fine-tuning runs in any setting where heads share an activation backbone.
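The direct parameter comparison suggested in the list above could look roughly like the following. It is a sketch under stated assumptions: checkpoints are represented as plain dicts mapping a parameter name to a flat list of float32 values, and the names `retrieval_head.weight`, `risk_head.bias`, and `tag_head.weight` are hypothetical. Loading a real checkpoint would require the framework it was saved with.

```python
import hashlib
import struct

def tensor_digest(values):
    """SHA-256 over the little-endian float32 bytes of a flat weight list."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(packed).hexdigest()

def changed_parameters(before, after):
    """Names whose bytes differ, or that exist in only one checkpoint."""
    names = sorted(set(before) | set(after))
    return [n for n in names
            if n not in before or n not in after
            or tensor_digest(before[n]) != tensor_digest(after[n])]

# Hypothetical checkpoints before and after adding a tagging head.
before = {"retrieval_head.weight": [0.25, -0.5], "risk_head.bias": [0.125]}
after = dict(before)
after["tag_head.weight"] = [0.0, 0.0]

print(changed_parameters(before, after))  # ['tag_head.weight']

# A drifted sibling parameter is reported as well.
drifted = dict(after)
drifted["risk_head.bias"] = [0.126]
print(changed_parameters(before, drifted))  # ['risk_head.bias', 'tag_head.weight']
```

Identical weights are necessary but not sufficient for identical outputs, since a shifted input stream changes a head's output without touching its weights.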

Load-bearing premise

Zero-initializing new sub-modules and freezing parameters that feed the shared stream will keep the original mining heads completely unchanged with no measurable activation shift.

What would settle it

After adding and training a new head with zero-initialized modules and frozen shared-stream parameters, any non-zero difference in the output logits or weights of an existing head would refute the bit-identical preservation claim.
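One way to run that test on head outputs is a strict bit-pattern comparison rather than a tolerance check. The helper names `float_bits`, `bit_identical`, and `max_abs_diff` are hypothetical, and the sketch assumes outputs have been exported as lists of Python floats (64-bit); the logits shown are placeholders, not measurements.

```python
import struct

def float_bits(x):
    """Raw 64-bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bit_identical(a, b):
    """True only if both sequences match bit for bit (stricter than ==)."""
    return len(a) == len(b) and all(
        float_bits(x) == float_bits(y) for x, y in zip(a, b))

def max_abs_diff(a, b):
    """Largest element-wise absolute difference, 0.0 for empty input."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

# Placeholder logits of an existing head before and after training a new head.
logits_before = [0.1, -2.5, 0.0]
logits_same = list(logits_before)
logits_shift = [0.1, -2.5 + 1e-12, 0.0]

print(bit_identical(logits_before, logits_same))   # True
print(bit_identical(logits_before, logits_shift))  # False
print(bit_identical([0.0], [-0.0]))                # False, although 0.0 == -0.0
```

The signed-zero case shows why a bit-level check is stricter than `==`; whether that distinction matters for the paper's claim is not stated in the source.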

Figures

Figures reproduced from arXiv: 2606.11507 by Abdalmalek Aburaddaha, Keval Thaker, Samir A. Rawashdeh, Venkatraman Narayanan.

**Figure 2.** Qualitative text-prompted retrieval. Each row shows the top-3 scenes for one…

Original abstract

Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy, collision risk, trajectory ambiguity, or semantic rarity suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: https://anonymous.4open.science/r/sceneminer_anonymous-64E5
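Text-prompted retrieval over scene embeddings can be sketched as a nearest-neighbour ranking. The abstract does not state the similarity measure, so cosine similarity is an assumption here, and the text encoder is omitted: `query` stands for an already-embedded prompt. The scene ids and two-dimensional vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, scene_embeddings, k=3):
    """Return the k scene ids with the highest similarity to the query."""
    ranked = sorted(scene_embeddings,
                    key=lambda sid: cosine(query, scene_embeddings[sid]),
                    reverse=True)
    return ranked[:k]

# Hypothetical embeddings for four scenes.
scenes = {
    "a": [0.9, 0.1],
    "b": [0.0, 1.0],
    "c": [0.5, 0.5],
    "d": [-1.0, 0.0],
}
query = [1.0, 0.0]

print(top_k(query, scenes))  # ['a', 'c', 'b']
```

With `k=3` this mirrors the top-3 layout described for Figure 2.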

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SceneMiner gives a practical recipe for adding multiple mining heads to a frozen BEV backbone without degrading the originals, but the bit-identical preservation claim has no direct checks.

The main takeaway is a targeted fix for cross-task interference when stacking heads on a shared vision-language backbone for camera-only scene mining in driving logs. Zero-initializing new sub-modules and freezing the shared feeders is meant to keep the original heads unchanged while training only ~102k parameters.

The paper lays out the real bottleneck cleanly: no single proxy finds safety-critical scenes, so the authors want retrieval embeddings, multi-label tags, and a risk score from one forward pass. The tagging head hits mAP 0.4614 on 20 tags by pooling each scene to 32 visual tokens, and the retrieval head is shown working qualitatively. Releasing code is useful for anyone trying to replicate the setup.

The soft spot is the lack of evidence for the central claim. The abstract describes the preservation method but gives no before-after output deltas, activation comparisons, or ablation that shows sibling heads degrade without the zero-init and freeze steps. The reported mAP stands alone without a control run.

This is for people working on AV data curation and multi-task fine-tuning on frozen backbones. A reader who already runs similar models could test the recipe quickly.

It deserves peer review because the engineering problem is concrete and the numbers are given, even though the preservation part needs quantitative backing in a revision.

Referee Report

2 major / 1 minor

Summary. The paper presents SceneMiner, a unified camera-only BEV pipeline that extracts complementary mining signals (text-prompted retrieval embedding, multi-label scene-tag distribution, and physics-based risk score) from a frozen vision-language backbone in a single forward pass. It identifies cross-task interference as a failure mode where adding or upgrading one head degrades frozen sibling heads, and proposes identity-preserving multi-task fine-tuning via zero-initialization of new sub-modules and freezing of all parameters feeding the shared activation stream. This is claimed to preserve original heads bit-identically while training only ~102k parameters. The tagging head achieves mAP 0.4614 (micro-F1 0.5557) on 20 tags using 32 visual tokens per scene; retrieval is validated qualitatively. Code is released.

Significance. If the preservation mechanism holds under empirical scrutiny, the approach offers a lightweight way to extend pre-trained multi-head models without interference, which could be useful for scene mining and multi-task BEV perception in autonomous driving. The availability of code is a positive factor for reproducibility.

major comments (2)

[Abstract] Abstract (paragraph on cross-task interference): The assertion that zero-initializing every new sub-module and freezing every parameter feeding the shared stream preserves the original mining heads bit-identically is load-bearing for the central contribution, yet the manuscript provides no quantitative verification such as output deltas, activation equality checks, or before/after comparisons on the retrieval embedding or risk-score heads.
[Abstract] Abstract and method description: No ablation or control experiment is reported that demonstrates degradation of sibling heads when the identity-preserving steps are omitted, leaving the necessity and effectiveness of the proposed fine-tuning unverified despite the reported tagging mAP of 0.4614.

minor comments (1)

[Abstract] The abstract states that a motion forecast is a byproduct but provides no details on its formulation or evaluation; if this is not central, it should be clarified as out of scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger empirical verification of the identity-preserving mechanism. We address each point below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract (paragraph on cross-task interference): The assertion that zero-initializing every new sub-module and freezing every parameter feeding the shared stream preserves the original mining heads bit-identically is load-bearing for the central contribution, yet the manuscript provides no quantitative verification such as output deltas, activation equality checks, or before/after comparisons on the retrieval embedding or risk-score heads.

Authors: We agree that explicit quantitative verification strengthens the claim. The preservation holds by construction because new sub-modules are zero-initialized (contributing zero to the shared stream) and all parameters feeding the shared activation stream are frozen, leaving the forward pass through the original heads unchanged. To address the concern, we will add before/after comparisons in the revision, including L2 output deltas, activation equality checks, and numerical confirmation of identical outputs on the retrieval embedding and risk-score heads.
revision: yes

Referee: [Abstract] Abstract and method description: No ablation or control experiment is reported that demonstrates degradation of sibling heads when the identity-preserving steps are omitted, leaving the necessity and effectiveness of the proposed fine-tuning unverified despite the reported tagging mAP of 0.4614.

Authors: The manuscript identifies cross-task interference as an observed failure mode, but we acknowledge the absence of a dedicated ablation. In the revision we will include a control experiment that omits zero-initialization of new sub-modules and/or the freezing of shared-stream parameters, reporting the resulting changes to the original heads (e.g., shifts in retrieval embeddings and risk scores) to demonstrate both necessity and effectiveness.
revision: yes
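The before/after comparison promised in the responses above could be reported with a small per-run summary of L2 output deltas. This is a sketch only: `l2_delta` and `summarize` are hypothetical helpers, the run names are invented, and the vectors are placeholders rather than results from the paper.

```python
import math

def l2_delta(before, after):
    """Euclidean norm of the change in one head's output vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))

def summarize(run_name, before_batch, after_batch):
    """Per-run report: largest delta and whether every delta is exactly zero."""
    deltas = [l2_delta(b, a) for b, a in zip(before_batch, after_batch)]
    return {"run": run_name,
            "max_l2": max(deltas),
            "all_zero": all(d == 0.0 for d in deltas)}

# Placeholder outputs of an existing head on two scenes.
reference = [[0.25, 0.75], [1.0, -1.0]]
identity_run = [[0.25, 0.75], [1.0, -1.0]]  # zero-init + freeze (hypothetical)
control_run = [[0.25, 0.75], [1.75, 0.0]]   # steps omitted (hypothetical)

print(summarize("identity-preserving", reference, identity_run))
print(summarize("control", reference, control_run))
```

Reporting `all_zero` alongside `max_l2` separates the exact-preservation claim from a merely small drift.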

Circularity Check

0 steps flagged

No significant circularity; method is architectural and empirical

The paper presents an engineering solution for multi-task fine-tuning via zero-initialization of new sub-modules and freezing of parameters feeding the shared stream, with the bit-identical preservation stated as a direct architectural consequence rather than a derived prediction. Reported metrics (e.g., tagging mAP 0.4614) are empirical outcomes on held-out data, not quantities forced by fitting the same inputs. No equations, self-citations, uniqueness theorems, or ansatzes appear in the provided text that reduce claims to inputs by construction. The derivation chain is self-contained as a practical design choice with released code.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The pipeline assumes a frozen vision-language backbone already encodes useful BEV features for all three heads; the interference-removal claim rests on the untested premise that zero-init plus selective freezing is sufficient to isolate new heads.

free parameters (1)

number of visual tokens = 32
Scenes are pooled into exactly 32 visual tokens before the tagging head; this is a design choice that directly affects the reported mAP.
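The text says each scene is pooled into 32 visual tokens but does not say how. As one possible reading, the sketch below averages contiguous chunks of a longer token sequence; average pooling, the function name `pool_tokens`, and the toy input are all assumptions, not the paper's method.

```python
def pool_tokens(tokens, n_out=32):
    """Average contiguous chunks so a sequence of tokens becomes n_out tokens."""
    n_in = len(tokens)
    if n_in < n_out:
        raise ValueError("need at least n_out input tokens")
    dim = len(tokens[0])
    pooled = []
    for i in range(n_out):
        # Integer chunk boundaries; every chunk holds at least one token.
        start = (i * n_in) // n_out
        end = ((i + 1) * n_in) // n_out
        chunk = tokens[start:end]
        pooled.append([sum(t[d] for t in chunk) / len(chunk)
                       for d in range(dim)])
    return pooled

# Toy input: 64 two-dimensional tokens.
tokens = [[float(i), 1.0] for i in range(64)]
pooled = pool_tokens(tokens)

print(len(pooled))  # 32
print(pooled[0])    # [0.5, 1.0]
print(pooled[31])   # [62.5, 1.0]
```

Changing `n_out` changes what the tagging head sees, which is why the ledger lists the token count as a free parameter.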

axioms (1)

Domain assumption: A frozen vision-language backbone supplies sufficiently rich features for simultaneous retrieval, tagging, and risk scoring in BEV without task-specific adaptation of the backbone itself.
Invoked when the pipeline is described as emitting all signals from the frozen backbone in one forward pass.


[1] [1]

OptFlow: Fast optimization-based scene flow estimation without supervision

Rahul Ahuja, Chris Baker, and Wilko Schwarting. OptFlow: Fast optimization-based scene flow estimation without supervision. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), 2024. doi: 10.1109/ W ACV57701.2024.00313

arXiv 2024

[2] [2]

FishRoPE: Projective rotary position embeddings for omnidirectional visual perception.arXiv preprint arXiv:2604.10391, 2026

Rahul Ahuja, Mudit Jain, Bala Murali Manoghar Sai Sudhakar, Venkatraman Narayanan, Pratik Likhar, Varun Ravi Kumar, and Senthil Yogamani. FishRoPE: Projective rotary position embeddings for omnidirectional visual perception.arXiv preprint arXiv:2604.10391, 2026

Pith/arXiv arXiv 2026

[3] [3]

Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.ar...

Pith/arXiv arXiv 2025

[4] [4]

2020 , volume =

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 11621–11631, 2020. doi: 10.1109/CVPR42600.2020.01164

work page doi:10.1109/cvpr42600.2020.01164 2020

[5] [5]

nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021. CVPR ADP3 Workshop

Pith/arXiv arXiv 2021

[6]

Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations (ICLR), 2016

[7]

Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, and K. Madhava Krishna. Talk2BEV: Language-enhanced bird’s-eye view maps for autonomous driving. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024. doi: 10.1109/ICRA57147.2024.10611485

[8]

Cainan Davidson, Deva Ramanan, and Neehar Peri. RefAV: Towards planning-centric scenario mining. arXiv preprint arXiv:2505.20981, 2025

[9]

Yuankai He and Weisong Shi. CARScenes: Semantic VLM dataset for safe autonomous driving. arXiv preprint arXiv:2511.10701, 2025

[10]

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019

[11]

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

[12]

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

[13]

Yuichi Inoue, Yuki Yada, Kotaro Tanahashi, and Yu Yamaguchi. NuScenes-MQA: Integrated evaluation of captions and QA for autonomous driving datasets using markup annotations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 930–938, 2024

[14]

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. doi: 10.1109/ICCV51070.2023.00766

[15]

Nidhi Kalra and Susan M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Technical Report RR-1478-RC, RAND Corporation, 2016

[16]

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):352...

2017

[17]

Chi-Hsi Kung, Chieh-Chi Yang, Pang-Yuan Pao, Shu-Wei Lu, Pin-Lun Chen, Hsin-Cheng Lu, and Yi-Ting Chen. RiskBench: A scenario-based benchmark for risk identification. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024. doi: 10.1109/ICRA57147.2024.10610270

[18]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 19730–19742, 2023

[19]

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 1–18, 2022. doi: 10.1007/978-3-031-20077-9_1

[20]

Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(12):2935–2947, 2018. doi: 10.1109/TPAMI.2017.2773081

[21]

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In Advances in Neural Information Processing Systems (NeurIPS), 2021

[22]

Venkatraman Narayanan, Bala Sai, Rahul Ahuja, Pratik Likhar, Varun Ravi Kumar, and Senthil Yogamani. MambaFusion: Adaptive state-space fusion for multimodal 3d object detection. arXiv preprint arXiv:2602.08126, 2026

[23]

Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S. Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple and efficient attention networks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2980–2987, 2023. doi: 10.1109/ICRA48891.2023.10160609

[24]

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024. doi: 10.1609/aaai.v38i5.28253

[25]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763, 2021

[26]

Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 683–700, 2020. doi: 10.1007/978-3-030-58523-5_40

[28]

Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S. Refaat, Rami Al-Rfou, and Benjamin Sapp. MotionLM: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. doi: 10.1109/ICCV51070.2023.00788

[29]

Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. In Advances in Neural Information Processing Systems (NeurIPS), 2022

[30]

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi: 10.1007/978-3-031-72943-0_15

[31]

Santokh Singh. Critical reasons for crashes investigated in the national motor vehicle crash causation survey. Traffic Safety Facts Crash Stats DOT HS 812 115, National Highway Traffic Safety Administration, 2015

[32]

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. doi: 10.1109/CVPR52734.2025.02089

[33]

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo Open Dataset...

2020

[34]

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

[35]

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

[36]

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

[37]

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and José M. Álvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. doi: 10.1109/CVPR52734.2025.02090

[38]

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the Neural Information Processing Systems Trac...

2021

[39]

Dongming Wu, Wencheng Han, Yingfei Liu, Tiancai Wang, Cheng-zhong Xu, Xiangyu Zhang, and Jianbing Shen. Language prompt for autonomous driving. arXiv preprint arXiv:2309.04379, 2023

[40]

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020

[41]

Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: A baseline for network adaptation via additive side networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi: 10.1007/978-3-030-58580-8_41

[42]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. doi: 10.1109/ICCV51070.2023.00355

[43]

Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, and Jiwen Lu. BEVerse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022