Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

Dan Xu; Hongzhi Ruan; Jun Ma; Kun Zhan; Pei Liu; Weiliang Ma; Xueyang Zhang; Zhengning Li

arxiv: 2605.21372 · v1 · pith:MIT34CK2new · submitted 2026-05-20 · 💻 cs.CV · cs.AI· cs.LG· cs.RO

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

Hongzhi Ruan , Pei Liu , Weiliang Ma , Zhengning Li , Xueyang Zhang , Jun Ma , Dan Xu , Kun Zhan This is my paper

Pith reviewed 2026-05-21 04:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LGcs.RO

keywords autonomous drivingreal-synthetic co-trainingdata mixture optimizationclosed-loop evaluationsynthetic data selectionscene representationend-to-end learning

0 comments

The pith

Closed-loop simulation feedback dynamically optimizes real-synthetic data mixtures for better autonomous driving performance with fewer samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that mixing real and synthetic driving data for end-to-end models works best when treated as an iterative optimization guided by how the model performs in closed-loop simulation. Real data is costly and biased toward common scenes, so the method clusters scenes, scores their value from evaluation feedback, and pulls in only the most useful synthetic examples. This matters because simply adding more synthetic data often creates harmful distribution shifts or wastes compute under tight training budgets. A sympathetic reader would care because it offers a concrete way to scale learning without collecting ever-larger real-world datasets.

Core claim

AutoScale formulates data mixture as a dynamic optimization process that iteratively adjusts scene types and quantities to maximize model performance using closed-loop evaluation feedback. It uses Graph Regularized AutoEncoder to represent driving scenes, Cluster-aware Gradient Ascent to estimate cluster importance and reweighting, and cluster-guided vector retrieval to select high-value synthetic samples. Experiments on NavSim show this outperforms vanilla co-training and cross-domain baselines while delivering better results with fewer synthetic samples under constrained budgets.

What carries the argument

The closed-loop data engine that unifies scene representation, cluster-wise importance estimation from evaluation feedback, and retrieval to dynamically adjust the real-synthetic training mixture.

Load-bearing premise

That signals from running the model in simulation reliably identify which scene clusters are most worth supplementing with synthetic data and that those choices improve performance on real data without creating new mismatches.

What would settle it

Training a model with the AutoScale mixture and then measuring its performance on a large set of real-world driving logs; if accuracy or robustness falls below that of a model trained with a fixed or random synthetic mixture, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.21372 by Dan Xu, Hongzhi Ruan, Jun Ma, Kun Zhan, Pei Liu, Weiliang Ma, Xueyang Zhang, Zhengning Li.

**Figure 2.** Figure 2: Overview of the proposed AutoScale with three core modules. (a) Scene representation [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative results of clustering. (a) t-SNE of clusters on the training set. (b) Visualization [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results of retrieval. The green dotted curve, red dotted curve, and dark yellow [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Clustering Visualization of Common and Long-Tail Driving Scenes in the [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Additional Visualization of Underperforming Scenes, Scene Retrieval, and Performance [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

read the original abstract

Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing data mixture under practical training budgets remains a critical yet under-explored problem. In this sense, we claim that the mixture of training data requires clear guidance in terms of scene types and quantities. Particularly in this work, we conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoScale closes the loop on real-synthetic driving data mixtures with graph scene encodings and cluster gradient ascent, claiming efficiency gains on NavSim but carrying clear risks of simulator overfitting.

read the letter

The main point is that this paper gives a concrete pipeline for dynamically mixing real and synthetic driving data by feeding simulator performance back into scene selection. They encode scenes with a Graph Regularized AutoEncoder, estimate cluster importance via Cluster-aware Gradient Ascent, and retrieve high-value samples to adjust the mixture iteratively. The result is reported better performance than vanilla co-training while using fewer synthetic samples under budget limits on NavSim tests.

Referee Report

3 major / 2 minor

Summary. The paper proposes AutoScale, a closed-loop data engine for real-synthetic co-training in autonomous driving. It introduces Graph Regularized AutoEncoder (Graph-RAE) for scene representations, Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and cluster-guided vector retrieval to select high-value samples. The central claim is that this dynamic optimization of data mixtures, guided by closed-loop NavSim feedback, outperforms vanilla co-training and cross-domain baselines while achieving better performance with fewer synthetic samples under constrained budgets.

Significance. If the results hold under rigorous validation, the work could meaningfully advance efficient data scaling for end-to-end driving models by automating mixture optimization. The unification of representation learning, gradient-based reweighting, and retrieval into a closed-loop engine addresses a practical bottleneck in real-synthetic co-training, with potential for broader impact in data-efficient training regimes.

major comments (3)

[Abstract and Experimental Results] Abstract and Experimental Results: the reported gains on NavSim lack details on statistical significance testing, exact baseline implementations, data exclusion rules, error bars, or number of runs. Without these, it is difficult to assess whether the improvements with fewer samples are robust or could be explained by variance in training or evaluation.
[Method (Cluster-GA and retrieval)] Method section on Cluster-GA: cluster importance weights and retrieval criteria appear derived from the same data distribution being optimized, and both gradient ascent steps and final performance measurement occur inside NavSim. This raises a load-bearing concern that reported gains may reflect fitting to simulator-specific artifacts rather than transferable improvements; an external hold-out (real data or alternate simulator) is needed to break potential circularity.
[Experimental Results] Experimental Results: cluster definitions and the number of gradient ascent steps are described without ablation on their sensitivity or post-hoc selection criteria. If these choices were tuned on the evaluation distribution, the cross-domain and budget-constrained claims require re-validation with fixed, pre-specified hyperparameters.

minor comments (2)

[Method] Notation for Graph-RAE loss terms and the precise formulation of the closed-loop objective could be clarified with an explicit equation relating importance weights to the performance feedback signal.
[Figures] Figure captions and axis labels in the NavSim results plots should explicitly state the metric (e.g., success rate, collision rate) and the exact synthetic sample budget used for each curve.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating planned revisions where appropriate to improve rigor and transparency.

read point-by-point responses

Referee: [Abstract and Experimental Results] Abstract and Experimental Results: the reported gains on NavSim lack details on statistical significance testing, exact baseline implementations, data exclusion rules, error bars, or number of runs. Without these, it is difficult to assess whether the improvements with fewer samples are robust or could be explained by variance in training or evaluation.

Authors: We agree that additional statistical details would strengthen the presentation of results. In the revised manuscript we will report error bars computed over multiple independent training runs (specifying the exact number of runs and random seeds), include statistical significance tests (such as paired t-tests or Wilcoxon signed-rank tests) between AutoScale and the baselines, and expand the experimental section to describe exact baseline implementations, data exclusion criteria, and evaluation protocols. revision: yes
Referee: [Method (Cluster-GA and retrieval)] Method section on Cluster-GA: cluster importance weights and retrieval criteria appear derived from the same data distribution being optimized, and both gradient ascent steps and final performance measurement occur inside NavSim. This raises a load-bearing concern that reported gains may reflect fitting to simulator-specific artifacts rather than transferable improvements; an external hold-out (real data or alternate simulator) is needed to break potential circularity.

Authors: We acknowledge the validity of the circularity concern. The closed-loop design intentionally uses NavSim feedback to guide data mixture optimization for the real-synthetic co-training task, and cross-domain baselines are already included to probe generalization. To further mitigate simulator-specific artifacts, the revision will add an explicit discussion of potential biases together with results on an external hold-out (either a real-world driving dataset or an alternate simulator) to demonstrate that the selected mixtures transfer beyond the optimization loop. revision: partial
Referee: [Experimental Results] Experimental Results: cluster definitions and the number of gradient ascent steps are described without ablation on their sensitivity or post-hoc selection criteria. If these choices were tuned on the evaluation distribution, the cross-domain and budget-constrained claims require re-validation with fixed, pre-specified hyperparameters.

Authors: We clarify that cluster definitions are derived directly from the Graph-RAE latent space on the training distribution and that the number of gradient ascent steps was chosen according to convergence behavior observed in preliminary runs, not tuned on the final evaluation set. In the revision we will add sensitivity ablations on both the number of clusters and the number of ascent steps, performed with fixed, pre-specified hyperparameters, and will re-validate the budget-constrained and cross-domain claims under these fixed settings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; closed-loop optimization uses external simulator feedback as objective

full rationale

The paper's derivation chain conceptualizes data mixture as an iterative optimization guided by closed-loop NavSim evaluation feedback, then introduces Graph-RAE for scene representation, Cluster-GA for importance estimation, and vector retrieval for sample selection. These steps are algorithmic proposals whose outputs (mixture weights, selected samples) are not equivalent to their inputs by construction; the performance signal is generated by running the trained model in simulation, which is an independent measurement step rather than a renaming or self-definition. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no fitted parameter is relabeled as a prediction. Experiments compare against vanilla co-training and cross-domain baselines on the NavSim benchmark, providing an external anchor for the reported gains. The derivation remains self-contained; concerns about simulator-specific artifacts belong to generalization risk, not circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The method rests on domain assumptions about scene clustering and the value of gradient-based importance signals; no explicit free parameters or invented entities are named in the abstract, but the optimization implicitly fits cluster weights to performance feedback.

free parameters (1)

cluster importance weights
Derived via Cluster-GA from training feedback and used to reweight scene clusters

axioms (2)

domain assumption Driving scenes admit useful low-dimensional representations via graph-regularized autoencoders that preserve structural relationships
Invoked when introducing Graph-RAE for scene representation
domain assumption Closed-loop simulation feedback provides reliable guidance for selecting training data that improves real-world generalization
Central to the dynamic optimization process described

pith-pipeline@v0.9.0 · 5785 in / 1200 out tokens · 57720 ms · 2026-05-21T04:42:40.025517+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 5 internal anchors

[1]

arXiv preprint arXiv:2506.08228 , year=

Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, et al. Scaling laws of motion forecasting and planning–technical report.arXiv preprint arXiv:2506.08228, 2025

work page arXiv 2025
[2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[4]

Pseudo-simulation for autonomous driving

Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. InProceedings of the Conference on Robot Learning (CoRL), 2025

work page 2025
[5]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

work page 2024
[6]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InProceedings of the International conference on machine learning (ICML), 2020

work page 2020
[7]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

work page 2023
[8]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[9]

Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training

Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, et al. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[10]

Realgen: Retrieval augmented generation for controllable traffic scenarios

Wenhao Ding, Yulong Cao, Ding Zhao, Chaowei Xiao, and Marco Pavone. Realgen: Retrieval augmented generation for controllable traffic scenarios. InProceedings of the European Conference on Computer Vision (ECCV), 2024

work page 2024
[11]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the International Conference on Learning Representations (ICLR), 2020

work page 2020
[12]

Doge: Domain reweighting with generalization estimation

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. InProceedings of the International conference on machine learning (ICML), 2024

work page 2024
[13]

Magicdrive- v2: High-resolution long video generation for autonomous driving with adaptive control

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive- v2: High-resolution long video generation for autonomous driving with adaptive control. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025
[14]

MagicDrive: Street view generation with diverse 3d geometry control

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. MagicDrive: Street view generation with diverse 3d geometry control. InProceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024
[15]

Vista: A generalizable driving world model with high fidelity and versatile controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 10

work page 2024
[16]

Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of autonomous driving policies.arXiv preprint arXiv:2512.01993, 2025

Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson, Yuxiao Chen, Boris Ivanovic, and Marco Pavone. Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of autonomous driving policies.arXiv preprint arXiv:2512.01993, 2025

work page arXiv 2025
[17]

Unraveling the effects of synthetic data on end-to-end autonomous driving

Junhao Ge, Zuhong Liu, Longteng Fan, Yifan Jiang, Jiaqi Su, Yiming Li, Zhejun Zhang, and Siheng Chen. Unraveling the effects of synthetic data on end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025
[18]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2016

work page 2016
[19]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[21]

Étude comparative de la distribution florale dans une portion des alpes et des jura

Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bulletin de la Société V audoise des Sciences Naturelles, 1901

work page 1901
[22]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[23]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[24]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (ToG), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (ToG), 2023

work page 2023
[25]

Supervised contrastive learning

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[26]

Adam: a method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. InProceedings of the International Conference on Learning Representations (ICLR), 2015

work page 2015
[27]

Mtgs: Multi-traversal gaussian splatting.arXiv preprint arXiv:2503.12552, 2025

Tianyu Li, Yihang Qiu, Zhenhua Wu, Carl Lindström, Peng Su, Matthias Nießner, and Hongyang Li. Mtgs: Multi-traversal gaussian splatting.arXiv preprint arXiv:2503.12552, 2025

work page arXiv 2025
[28]

Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025
[29]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[30]

Model-based policy adaptation for closed-loop end-to-end autonomous driving

Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, and Ding Zhao. Model-based policy adaptation for closed-loop end-to-end autonomous driving. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[31]

Regmix: Data mixture as regression for language model pre-training

Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 11

work page 2025
[32]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Conference on Learning Representations (ICLR), 2019

work page 2019
[33]

Unleashing generalization of end-to-end autonomous driving with controllable long video generation.arXiv preprint arXiv:2406.01349, 2024

Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, et al. Unleashing generalization of end-to-end autonomous driving with controllable long video generation.arXiv preprint arXiv:2406.01349, 2024

work page arXiv 2024
[34]

Sim-and-real co-training: A simple recipe for vision-based robotic manipulation

Abhiram Maddukuri, Zhenyu Jiang, Lawrence Yunliang Chen, Soroush Nasiriany, Yuqi Xie, Yu Fang, Wenqi Huang, Zu Wang, Zhenjia Xu, Nikita Chernyadev, Scott Reed, Ken Goldberg, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Sim-and-real co-training: A simple recipe for vision-based robotic manipulation. InProceedings of Robotics: Science and Systems (RSS), 2025

work page 2025
[35]

Robocasa: Large-scale simulation of everyday tasks for generalist robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InProceedings of Robotics: Science and Systems (RSS), 2024

work page 2024
[36]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

NVIDIA, Johan Bjorck, Nikita Cherniadev Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You L...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InProceedings of the International conference on machine learning (ICML), 2021

work page 2021
[38]

Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

work page arXiv 2025
[39]

Sparsedrive: End-to-end autonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. InProceedings of the Interna- tional Conference on Robotics and Automation (ICRA), 2025

work page 2025
[40]

Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

work page arXiv 2025
[41]

Simscale: Learning to drive via real-world simulation at scale

Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, et al. Simscale: Learning to drive via real-world simulation at scale. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2026

work page 2026
[42]

Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.arXiv preprint arXiv:2511.16651, 2025

Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.arXiv preprint arXiv:2511.16651, 2025

work page arXiv 2025
[43]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems (NeurIPS), 2017

work page 2017
[45]

Drive- dreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drive- dreamer: Towards real-world-drive world models for autonomous driving. InProceedings of the European Conference on Computer Vision (ECCV), 2024. 12

work page 2024
[46]

Panacea: Panoramic and controllable video generation for autonomous driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2024

work page 2024
[47]

Data retrieval with importance weights for few-shot imitation learning

Amber Xie, Rahul Chand, Dorsa Sadigh, and Joey Hejna. Data retrieval with importance weights for few-shot imitation learning. InProceedings of the Conference on Robot Learning (CoRL), 2025

work page 2025
[48]

Doremi: Optimizing data mixtures speeds up language model pretraining

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[49]

Chameleon: A flexible data-mixing framework for language model pretraining and finetuning

Wanyun Xie, Francesco Tonin, and V olkan Cevher. Chameleon: A flexible data-mixing framework for language model pretraining and finetuning. InProceedings of the International conference on machine learning (ICML), 2025

work page 2025
[50]

Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2025

work page 2025
[51]

Data mixing laws: Optimizing data mixtures by predicting language modeling performance

Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025
[52]

Diffusion-based planning for autonomous driving with flexible guidance

Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Diffusion-based planning for autonomous driving with flexible guidance. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025
[53]

Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024

Yupeng Zheng, Pengxuan Yang, Zhongpu Xia, Qichao Zhang, Yuhang Zheng, Songen Gu, Bu Jin, Teng Zhang, Ben Lu, Chao Han, et al. Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024. 13 Technical appendices and supplementary material This supplementary material presents implementation and evaluati...

work page arXiv 2024

[1] [1]

arXiv preprint arXiv:2506.08228 , year=

Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, et al. Scaling laws of motion forecasting and planning–technical report.arXiv preprint arXiv:2506.08228, 2025

work page arXiv 2025

[2] [2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020

[4] [4]

Pseudo-simulation for autonomous driving

Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. InProceedings of the Conference on Robot Learning (CoRL), 2025

work page 2025

[5] [5]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

work page 2024

[6] [6]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InProceedings of the International conference on machine learning (ICML), 2020

work page 2020

[7] [7]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

work page 2023

[8] [8]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[9] [9]

Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training

Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, et al. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025

[10] [10]

Realgen: Retrieval augmented generation for controllable traffic scenarios

Wenhao Ding, Yulong Cao, Ding Zhao, Chaowei Xiao, and Marco Pavone. Realgen: Retrieval augmented generation for controllable traffic scenarios. InProceedings of the European Conference on Computer Vision (ECCV), 2024

work page 2024

[11] [11]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the International Conference on Learning Representations (ICLR), 2020

work page 2020

[12] [12]

Doge: Domain reweighting with generalization estimation

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. InProceedings of the International conference on machine learning (ICML), 2024

work page 2024

[13] [13]

Magicdrive- v2: High-resolution long video generation for autonomous driving with adaptive control

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive- v2: High-resolution long video generation for autonomous driving with adaptive control. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025

[14] [14]

MagicDrive: Street view generation with diverse 3d geometry control

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. MagicDrive: Street view generation with diverse 3d geometry control. InProceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024

[15] [15]

Vista: A generalizable driving world model with high fidelity and versatile controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 10

work page 2024

[16] [16]

Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of autonomous driving policies.arXiv preprint arXiv:2512.01993, 2025

Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson, Yuxiao Chen, Boris Ivanovic, and Marco Pavone. Road: Rollouts as demonstrations for closed-loop supervised fine-tuning of autonomous driving policies.arXiv preprint arXiv:2512.01993, 2025

work page arXiv 2025

[17] [17]

Unraveling the effects of synthetic data on end-to-end autonomous driving

Junhao Ge, Zuhong Liu, Longteng Fan, Yifan Jiang, Jiaqi Su, Yiming Li, Zhejun Zhang, and Siheng Chen. Unraveling the effects of synthetic data on end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025

[18] [18]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2016

work page 2016

[19] [19]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023

[21] [21]

Étude comparative de la distribution florale dans une portion des alpes et des jura

Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bulletin de la Société V audoise des Sciences Naturelles, 1901

work page 1901

[22] [22]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[23] [23]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[24] [24]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (ToG), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (ToG), 2023

work page 2023

[25] [25]

Supervised contrastive learning

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020

[26] [26]

Adam: a method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. InProceedings of the International Conference on Learning Representations (ICLR), 2015

work page 2015

[27] [27]

Mtgs: Multi-traversal gaussian splatting.arXiv preprint arXiv:2503.12552, 2025

Tianyu Li, Yihang Qiu, Zhenhua Wu, Carl Lindström, Peng Su, Matthias Nießner, and Hongyang Li. Mtgs: Multi-traversal gaussian splatting.arXiv preprint arXiv:2503.12552, 2025

work page arXiv 2025

[28] [28]

Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025

[29] [29]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[30] [30]

Model-based policy adaptation for closed-loop end-to-end autonomous driving

Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, and Ding Zhao. Model-based policy adaptation for closed-loop end-to-end autonomous driving. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025

[31] [31]

Regmix: Data mixture as regression for language model pre-training

Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 11

work page 2025

[32] [32]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Conference on Learning Representations (ICLR), 2019

work page 2019

[33] [33]

Unleashing generalization of end-to-end autonomous driving with controllable long video generation.arXiv preprint arXiv:2406.01349, 2024

Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, et al. Unleashing generalization of end-to-end autonomous driving with controllable long video generation.arXiv preprint arXiv:2406.01349, 2024

work page arXiv 2024

[34] [34]

Sim-and-real co-training: A simple recipe for vision-based robotic manipulation

Abhiram Maddukuri, Zhenyu Jiang, Lawrence Yunliang Chen, Soroush Nasiriany, Yuqi Xie, Yu Fang, Wenqi Huang, Zu Wang, Zhenjia Xu, Nikita Chernyadev, Scott Reed, Ken Goldberg, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Sim-and-real co-training: A simple recipe for vision-based robotic manipulation. InProceedings of Robotics: Science and Systems (RSS), 2025

work page 2025

[35] [35]

Robocasa: Large-scale simulation of everyday tasks for generalist robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InProceedings of Robotics: Science and Systems (RSS), 2024

work page 2024

[36] [36]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

NVIDIA, Johan Bjorck, Nikita Cherniadev Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You L...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InProceedings of the International conference on machine learning (ICML), 2021

work page 2021

[38] [38]

Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

work page arXiv 2025

[39] [39]

Sparsedrive: End-to-end autonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. InProceedings of the Interna- tional Conference on Robotics and Automation (ICRA), 2025

work page 2025

[40] [40]

Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

work page arXiv 2025

[41] [41]

Simscale: Learning to drive via real-world simulation at scale

Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, et al. Simscale: Learning to drive via real-world simulation at scale. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2026

work page 2026

[42] [42]

Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.arXiv preprint arXiv:2511.16651, 2025

Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.arXiv preprint arXiv:2511.16651, 2025

work page arXiv 2025

[43] [43]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems (NeurIPS), 2017

work page 2017

[45] [45]

Drive- dreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drive- dreamer: Towards real-world-drive world models for autonomous driving. InProceedings of the European Conference on Computer Vision (ECCV), 2024. 12

work page 2024

[46] [46]

Panacea: Panoramic and controllable video generation for autonomous driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2024

work page 2024

[47] [47]

Data retrieval with importance weights for few-shot imitation learning

Amber Xie, Rahul Chand, Dorsa Sadigh, and Joey Hejna. Data retrieval with importance weights for few-shot imitation learning. InProceedings of the Conference on Robot Learning (CoRL), 2025

work page 2025

[48] [48]

Doremi: Optimizing data mixtures speeds up language model pretraining

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[49] [49]

Chameleon: A flexible data-mixing framework for language model pretraining and finetuning

Wanyun Xie, Francesco Tonin, and V olkan Cevher. Chameleon: A flexible data-mixing framework for language model pretraining and finetuning. InProceedings of the International conference on machine learning (ICML), 2025

work page 2025

[50] [50]

Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2025

work page 2025

[51] [51]

Data mixing laws: Optimizing data mixtures by predicting language modeling performance

Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025

[52] [52]

Diffusion-based planning for autonomous driving with flexible guidance

Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Diffusion-based planning for autonomous driving with flexible guidance. InProceedings of the International Conference on Learning Representations (ICLR), 2025

work page 2025

[53] [53]

Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024

Yupeng Zheng, Pengxuan Yang, Zhongpu Xia, Qichao Zhang, Yuhang Zheng, Songen Gu, Bu Jin, Teng Zhang, Ben Lu, Chao Han, et al. Data scaling laws for imitation learning-based end-to-end autonomous driving.arXiv preprint arXiv:2412.02689, 2024. 13 Technical appendices and supplementary material This supplementary material presents implementation and evaluati...

work page arXiv 2024