SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining
Pith reviewed 2026-06-27 13:00 UTC · model grok-4.3
The pith
Zero-initializing new sub-modules and freezing shared parameters eliminates cross-task interference in multi-head BEV models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-task interference occurs when a new head is added because its parameters alter the shared activation stream, degrading weight-frozen sibling heads. Identity-preserving multi-task fine-tuning removes the interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream, so the existing mining heads are preserved bit-identically while only ~102k parameters are trained. The resulting model emits a text-prompted retrieval embedding, a 20-tag multi-label distribution, and a physics-based risk score from a single forward pass on camera input.
What carries the argument
Identity-preserving multi-task fine-tuning: zero-initializing every new sub-module and freezing every parameter that feeds the shared activation stream, which preserves existing heads bit-identically.
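A toy sketch in plain Python may make the mechanism concrete. It is an illustration under stated assumptions, not the authors' code: the names `backbone`, `adapter`, `old_head`, `new_head`, and `shared_stream` are hypothetical, the dimensions are arbitrary, and the additive way the new sub-module writes into the stream is an assumption that the summary does not confirm.

```python
import random

random.seed(0)
DIM = 4

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Frozen parameters: nothing below ever updates them.
backbone = rand_matrix(DIM, DIM)   # produces the shared activation stream
old_head = rand_matrix(2, DIM)     # existing mining head, reads the stream

# New sub-module that writes into the stream (assumed additive): all zeros,
# and kept frozen in this sketch because it feeds the shared stream.
adapter = [[0.0] * DIM for _ in range(DIM)]

# New head: reads the stream, never writes to it. The only trainable part.
new_head = rand_matrix(1, DIM)

def shared_stream(x):
    base = matvec(backbone, x)
    return [b + d for b, d in zip(base, matvec(adapter, x))]

x = [0.5, -1.0, 2.0, 0.25]
reference = matvec(old_head, matvec(backbone, x))  # old head before any change

# One toy gradient step on the new head only (squared error, target 1.0).
stream = shared_stream(x)
pred = matvec(new_head, stream)[0]
grad = [2.0 * (pred - 1.0) * s for s in stream]
new_head_before = [row[:] for row in new_head]
new_head = [[w - 0.1 * g for w, g in zip(new_head[0], grad)]]

after = matvec(old_head, shared_stream(x))
trainable_count = sum(len(row) for row in new_head)
print("old head unchanged:", after == reference)
print("trainable parameters:", trainable_count)
```

In this toy setting the old head's output is unchanged because the only updated weights sit downstream of the stream. If `adapter` were trained instead of frozen, the stream would move and the old head's output would move with it, which is the situation the load-bearing premise has to rule out.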
If this is right
- A single frozen vision-language backbone can produce retrieval, tagging, and risk signals in one forward pass without LiDAR or radar.
- Existing mining heads stay bit-identical after new heads are added and trained.
- Only ~102k parameters require training when a new head is introduced.
- The tagging head reaches mAP 0.4614 and micro-F1 0.5557 on 20 scene tags by pooling scenes into 32 visual tokens.
- Text-prompted retrieval becomes possible while the other heads remain untouched.
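For readers less familiar with the two tagging metrics quoted above, the following sketch shows one common way to compute micro-F1 and mean average precision for multi-label predictions. It is a hedged illustration only: the toy data uses 3 tags instead of 20, the 0.5 decision threshold is an assumption, and the paper's exact evaluation protocol may differ.

```python
def micro_f1(y_true, y_score, threshold=0.5):
    """Micro-averaged F1: pool TP/FP/FN over every (scene, tag) pair."""
    tp = fp = fn = 0
    for truths, scores in zip(y_true, y_score):
        for t, s in zip(truths, scores):
            predicted = s >= threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def average_precision(truths, scores):
    """AP for one tag: mean precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if truths[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(y_true, y_score):
    """mAP: average the per-tag AP over all tags."""
    n_tags = len(y_true[0])
    aps = [
        average_precision([row[k] for row in y_true], [row[k] for row in y_score])
        for k in range(n_tags)
    ]
    return sum(aps) / n_tags

# Toy data: 4 scenes, 3 tags (rows are scenes, columns are tags).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_score = [[0.9, 0.2, 0.4], [0.3, 0.8, 0.1], [0.6, 0.4, 0.7], [0.2, 0.1, 0.9]]

print("micro-F1:", round(micro_f1(y_true, y_score), 4))
print("mAP:", round(mean_average_precision(y_true, y_score), 4))
```

Micro-F1 depends on the chosen threshold while mAP depends only on the ranking of scores, which is one reason the two numbers are usually reported together.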
Where Pith is reading between the lines
- The same zero-init and freeze pattern could be tested on multi-task models outside driving scenes, such as general video understanding.
- It may support adding more than three heads without interference if the shared stream remains frozen.
- Exact bit-identity of weights before and after could be verified by direct parameter comparison on public checkpoints.
- The approach might reduce the need for task-specific fine-tuning runs in any setting where heads share an activation backbone.
Load-bearing premise
Zero-initializing new sub-modules and freezing parameters that feed the shared stream will keep the original mining heads completely unchanged with no measurable activation shift.
What would settle it
After adding and training a new head with zero-initialized modules and frozen shared-stream parameters, any non-zero difference in the output logits or weights of an existing head would falsify the claim.
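Such a test has to be bit-level rather than approximate, since "bit-identical" is a stronger statement than "numerically equal". The sketch below shows one way to do that with the Python standard library. The helper names `float_bits`, `fingerprint`, and `bit_identical` are hypothetical, and values are packed as 64-bit doubles because that is what Python floats are; a real check would walk over the framework's tensors in their native dtype.

```python
import hashlib
import struct

def float_bits(values):
    """Pack floats as little-endian IEEE 754 doubles for bit-level comparison."""
    return struct.pack(f"<{len(values)}d", *values)

def fingerprint(values):
    """SHA-256 digest of the raw bits, convenient for comparing checkpoints."""
    return hashlib.sha256(float_bits(values)).hexdigest()

def bit_identical(a, b):
    return len(a) == len(b) and float_bits(a) == float_bits(b)

weights_before = [0.25, -1.5, 0.0, 3.0]
weights_after_same = [0.25, -1.5, 0.0, 3.0]
weights_after_drift = [0.25, -1.5, -0.0, 3.0]  # equal under ==, different sign bit

print("same checkpoint:", bit_identical(weights_before, weights_after_same))
print("== on drifted copy:", weights_before == weights_after_drift)
print("bit check on drifted copy:", bit_identical(weights_before, weights_after_drift))
```

The drifted copy shows why a plain equality test is not enough: `0.0 == -0.0` holds numerically, yet the stored bits differ.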
Original abstract
Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: https://anonymous.4open.science/r/sceneminer_anonymous-64E5
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SceneMiner, a unified camera-only BEV pipeline that extracts complementary mining signals (text-prompted retrieval embedding, multi-label scene-tag distribution, and physics-based risk score) from a frozen vision-language backbone in a single forward pass. It identifies cross-task interference as a failure mode where adding or upgrading one head degrades frozen sibling heads, and proposes identity-preserving multi-task fine-tuning via zero-initialization of new sub-modules and freezing of all parameters feeding the shared activation stream. This is claimed to preserve original heads bit-identically while training only ~102k parameters. The tagging head achieves mAP 0.4614 (micro-F1 0.5557) on 20 tags using 32 visual tokens per scene; retrieval is validated qualitatively. Code is released.
Significance. If the preservation mechanism holds under empirical scrutiny, the approach offers a lightweight way to extend pre-trained multi-head models without interference, which could be useful for scene mining and multi-task BEV perception in autonomous driving. The availability of code is a positive factor for reproducibility.
major comments (2)
- [Abstract] Abstract (paragraph on cross-task interference): The assertion that zero-initializing every new sub-module and freezing every parameter feeding the shared stream preserves the original mining heads bit-identically is load-bearing for the central contribution, yet the manuscript provides no quantitative verification such as output deltas, activation equality checks, or before/after comparisons on the retrieval embedding or risk-score heads.
- [Abstract] Abstract and method description: No ablation or control experiment is reported that demonstrates degradation of sibling heads when the identity-preserving steps are omitted, leaving the necessity and effectiveness of the proposed fine-tuning unverified despite the reported tagging mAP of 0.4614.
minor comments (1)
- [Abstract] The abstract states that a motion forecast is a byproduct but provides no details on its formulation or evaluation; if this is not central, it should be clarified as out of scope.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for stronger empirical verification of the identity-preserving mechanism. We address each point below and will revise the manuscript accordingly.
- Referee: [Abstract] Abstract (paragraph on cross-task interference): The assertion that zero-initializing every new sub-module and freezing every parameter feeding the shared stream preserves the original mining heads bit-identically is load-bearing for the central contribution, yet the manuscript provides no quantitative verification such as output deltas, activation equality checks, or before/after comparisons on the retrieval embedding or risk-score heads.
  Authors: We agree that explicit quantitative verification strengthens the claim. The preservation holds by construction because new sub-modules are zero-initialized (contributing zero to the shared stream) and all parameters feeding the shared activation stream are frozen, leaving the forward pass through the original heads unchanged. To address the concern, we will add before/after comparisons in the revision, including L2 output deltas, activation equality checks, and numerical confirmation of identical outputs on the retrieval embedding and risk-score heads.
  Revision: yes
- Referee: [Abstract] Abstract and method description: No ablation or control experiment is reported that demonstrates degradation of sibling heads when the identity-preserving steps are omitted, leaving the necessity and effectiveness of the proposed fine-tuning unverified despite the reported tagging mAP of 0.4614.
  Authors: The manuscript identifies cross-task interference as an observed failure mode, but we acknowledge the absence of a dedicated ablation. In the revision we will include a control experiment that omits zero-initialization of new sub-modules and/or the freezing of shared-stream parameters, reporting the resulting changes to the original heads (e.g., shifts in retrieval embeddings and risk scores) to demonstrate both necessity and effectiveness.
  Revision: yes
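The control experiment and the L2 output deltas promised in the rebuttal can be illustrated with a toy comparison. The sketch below is an assumption-laden stand-in, not the authors' planned experiment: `sibling_output`, `l2_delta`, and the additive adapter are hypothetical, and it contrasts only zero initialization against a small random initialization of a module that writes into the shared stream.

```python
import math
import random

random.seed(1)
DIM = 4

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def rand_matrix(rows, cols, scale=1.0):
    return [[scale * random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

backbone = rand_matrix(DIM, DIM)      # frozen, produces the shared stream
sibling_head = rand_matrix(3, DIM)    # frozen existing head

def sibling_output(x, adapter):
    """Existing head's output when a new module adds adapter @ x to the stream."""
    base = matvec(backbone, x)
    stream = [b + d for b, d in zip(base, matvec(adapter, x))]
    return matvec(sibling_head, stream)

def l2_delta(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

x = [1.0, -0.5, 0.75, 2.0]
baseline = matvec(sibling_head, matvec(backbone, x))  # before any new module

zero_adapter = [[0.0] * DIM for _ in range(DIM)]      # identity-preserving init
random_adapter = rand_matrix(DIM, DIM, scale=0.01)    # control: small random init

delta_identity = l2_delta(baseline, sibling_output(x, zero_adapter))
delta_control = l2_delta(baseline, sibling_output(x, random_adapter))
print("L2 delta with zero init:", delta_identity)
print("L2 delta with random init:", delta_control)
```

Even a small random initialization moves the frozen head's output in this toy, while the zero-initialized variant leaves it exactly where it was. A real ablation would additionally need to report how the shift evolves during training and how it affects downstream retrieval and risk scores.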
Circularity Check
No significant circularity; method is architectural and empirical
The paper presents an engineering solution for multi-task fine-tuning via zero-initialization of new sub-modules and freezing of parameters feeding the shared stream, with the bit-identical preservation stated as a direct architectural consequence rather than a derived prediction. Reported metrics (e.g., tagging mAP 0.4614) are empirical outcomes on held-out data, not quantities forced by fitting the same inputs. No equations, self-citations, uniqueness theorems, or ansatzes appear in the provided text that reduce claims to inputs by construction. The derivation chain is self-contained as a practical design choice with released code.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of visual tokens = 32
axioms (1)
- domain assumption: A frozen vision-language backbone supplies sufficiently rich features for simultaneous retrieval, tagging, and risk scoring in BEV without task-specific adaptation of the backbone itself.