Recognition: 2 theorem links
Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
Pith reviewed 2026-05-13 23:08 UTC · model grok-4.3
The pith
Hydra-MDP trains an end-to-end planner by distilling knowledge from both human demonstrations and rule-based experts into a multi-head decoder that outputs diverse trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hydra-MDP trains a student model in a teacher-student framework, distilling knowledge from both human and rule-based teachers. The student features a multi-head decoder that learns diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences planning in an end-to-end manner instead of resorting to non-differentiable post-processing.
What carries the argument
The multi-head decoder in the student model, which simultaneously absorbs distillation signals from human and rule-based teachers to generate multiple trajectory candidates tailored to different metrics.
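The multi-teacher distillation the review describes can be sketched in a few lines. This is a minimal illustration, not the paper's actual loss: the candidate-vocabulary scoring, the KL-divergence form, and the per-teacher weights are all assumptions made for the sketch.

```python
import math

# Hedged sketch of multi-target distillation: each decoder head keeps a
# score per candidate trajectory; each teacher (human imitation, rule-based
# metric) provides its own target scores; the per-head loss is
# KL(teacher || student) over the softmaxed scores.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(p || q); softmax guarantees q[i] > 0, so the ratio is safe
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_teacher_loss(head_scores, teacher_scores, weights=None):
    """Average weighted KL between each teacher's softened score
    distribution and the matching head's distribution over candidates."""
    weights = weights or [1.0] * len(head_scores)
    total = 0.0
    for w, student, teacher in zip(weights, head_scores, teacher_scores):
        total += w * kl_div(softmax(teacher), softmax(student))
    return total / len(head_scores)
```

Pairing each head with one teacher makes the per-teacher loss weighting explicit, which is exactly the knob the referee report below asks to see ablated.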
If this is right
- Planning becomes fully differentiable and end-to-end trainable.
- The model generalizes better across different driving environments and conditions.
- Diverse trajectory candidates are produced to match multiple evaluation metrics.
- Environment influences on planning are learned directly rather than applied through post-processing.
Where Pith is reading between the lines
- Similar multi-teacher distillation could apply to other sequential decision tasks like robotics navigation.
- Reducing reliance on post-processing may speed up inference in real-time systems.
- Adding more specialized teachers might further refine trajectory diversity without increasing model size.
Load-bearing premise
The multi-head decoder can absorb conflicting signals from human and rule-based teachers at the same time without collapsing modes or losing performance on any metric.
What would settle it
Training the same architecture with a single-head decoder using the combined teachers and observing no drop in any metric compared to the multi-head version would indicate the multi-head structure is not necessary.
read the original abstract
We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing. This method achieves the $1^{st}$ place in the Navsim challenge, demonstrating significant improvements in generalization across diverse driving environments and conditions. More details by visiting \url{https://github.com/NVlabs/Hydra-MDP}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Hydra-MDP, a teacher-student distillation framework for end-to-end multimodal planning in autonomous driving. It distills knowledge from both human trajectories and rule-based teachers into a student model equipped with a multi-head decoder that produces diverse trajectory candidates tailored to different evaluation metrics. The approach claims to learn environmental influences directly in a differentiable manner, avoiding non-differentiable post-processing, and reports achieving first place in the Navsim challenge along with improved generalization across diverse driving conditions.
Significance. If the performance claims are substantiated with detailed experiments, the work could advance end-to-end planning by showing how multiple conflicting teachers can be integrated via a multi-head architecture without post-hoc rules. The 1st-place Navsim result on an independent benchmark and the public GitHub repository would strengthen the case for practical impact and reproducibility in multimodal driving policies.
major comments (2)
- [Abstract and Results] Abstract and Results section: The claim of 1st place in the Navsim challenge and significant generalization improvements is not accompanied by quantitative tables, ablation studies on the multi-head decoder, or error analysis; without these the central performance claim cannot be verified from the manuscript.
- [Method (§3)] Method section (§3): The multi-head decoder is presented as simultaneously absorbing signals from human and rule-based teachers to produce diverse candidates, but no analysis of loss weighting between teachers, head specialization metrics, or per-metric scores when both teachers are active is provided; this directly bears on whether mode collapse or metric trade-offs occur, undermining the end-to-end advantage over post-processing baselines.
minor comments (2)
- [Introduction] The related-work discussion should cite additional prior multi-teacher distillation techniques in planning to clarify the specific contribution of Hydra-Distillation.
- [Figures] Figure captions and architecture diagrams would benefit from explicit labels indicating which heads correspond to which teacher signals and evaluation metrics.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We agree that additional quantitative evidence and analysis would strengthen the presentation of our results and method. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract and Results] Abstract and Results section: The claim of 1st place in the Navsim challenge and significant generalization improvements is not accompanied by quantitative tables, ablation studies on the multi-head decoder, or error analysis; without these the central performance claim cannot be verified from the manuscript.
Authors: We agree that the current manuscript would benefit from more explicit quantitative support for the Navsim ranking and generalization claims. In the revised version we will add (i) a main results table with all competing methods and metrics, (ii) dedicated ablation tables isolating the multi-head decoder, and (iii) an error-analysis section that breaks down failure modes across driving conditions. These additions will make the performance claims directly verifiable from the text.
revision: yes
-
Referee: [Method (§3)] Method section (§3): The multi-head decoder is presented as simultaneously absorbing signals from human and rule-based teachers to produce diverse candidates, but no analysis of loss weighting between teachers, head specialization metrics, or per-metric scores when both teachers are active is provided; this directly bears on whether mode collapse or metric trade-offs occur, undermining the end-to-end advantage over post-processing baselines.
Authors: We acknowledge the absence of explicit loss-weighting and specialization analysis. In the revision we will include (i) an ablation on teacher loss coefficients, (ii) quantitative metrics (e.g., trajectory diversity and per-head metric scores) that demonstrate head specialization, and (iii) a direct comparison of per-metric performance when both teachers are active versus single-teacher baselines. These experiments will show that mode collapse does not occur and that the multi-head design preserves the claimed end-to-end advantage.
revision: yes
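One concrete form the promised diversity analysis could take (illustrative only; the rebuttal does not commit to a specific metric) is the mean pairwise distance between candidate trajectory endpoints, where a value near zero would signal mode collapse:

```python
import math

# Hedged sketch: trajectories are lists of (x, y) waypoints; the metric
# name and endpoint-based formulation are illustrative assumptions.

def endpoint(traj):
    return traj[-1]

def mean_pairwise_diversity(trajectories):
    """Average Euclidean distance between final waypoints of all candidate
    pairs; near-zero values indicate the heads collapsed to one mode."""
    n = len(trajectories)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            (x1, y1) = endpoint(trajectories[i])
            (x2, y2) = endpoint(trajectories[j])
            total += math.hypot(x1 - x2, y1 - y2)
            pairs += 1
    return total / pairs
```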
Circularity Check
Low circularity: external benchmark validation and independent teachers prevent reduction to internal fits
full rationale
The paper trains a multi-head decoder student via distillation from external human trajectories and rule-based teachers, then reports 1st place on the independent Navsim challenge benchmark. No equations or steps in the abstract or description reduce the claimed generalization improvements to a fitted parameter or self-defined quantity inside the paper. The multi-head architecture is presented as learning diverse candidates, but performance is validated externally rather than by construction. Minor self-citations may exist in the full text but do not load-bear the central claims, keeping circularity at a low non-significant level.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Hydra-Distillation
no independent evidence
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear · "multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics... knowledge distillation from both human and rule-based teachers"
-
Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear · "Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing"
Forward citations
Cited by 22 Pith papers
-
VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Driving risk emerges from the required two-dimensional joint evasive acceleration
Evasive acceleration quantifies driving risk as the minimum 2D constant relative acceleration needed to avoid collision and outperforms time-to-collision on warning timing, discrimination, and information retention ac...
-
Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving
The SNG framework and SNG-VLA model enable end-to-end driving systems to better incorporate global navigation for state-of-the-art route following without auxiliary perception losses.
-
The DAWN of World-Action Interactive Models
DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
-
DriveFuture: Future-Aware Latent World Models for Autonomous Driving
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
Unified Map Prior Encoder for Mapping and Planning
UMPE fuses any subset of HD/SD vector maps, raster SD maps, and satellite imagery into BEV features via alignment-aware vector and raster branches, raising mapping mAP by 5.3-5.9 points and cutting planning L2 error b...
-
ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.
-
FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving
FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...
-
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
MOSAIC is a scaling-aware data selection framework that outperforms baselines in training end-to-end autonomous driving planners, achieving comparable or better EPDMS scores with up to 80% less data.
-
Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Orion-Lite uses latent feature distillation and trajectory supervision to create a vision-only model that surpasses its LLM-based teacher on closed-loop Bench2Drive evaluation, achieving a new SOTA driving score of 80.6.
-
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
-
REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer
REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.
-
CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...
-
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
-
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
-
CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation
CrowdVLA introduces vision-language-action agents for crowd simulation that reason about scene semantics, social norms, and action consequences using fine-tuned models and simulation rollouts.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
Do Open-Loop Metrics Predict Closed-Loop Driving? A Cross-Benchmark Correlation Study of NAVSIM and Bench2Drive
Cross-benchmark analysis of 8 methods shows NAVSIM PDM Score correlates with Bench2Drive Driving Score at Spearman ρ=0.90, with Ego Progress as the strongest single predictor and a simpler 3-metric formula matching th...
Reference graph
Works this paper leans on
-
[1]
Quad: Query-based interpretable neural motion planning for autonomous driving
Sourav Biswas, Sergio Casas, Quinlan Sykora, Ben Agro, Abbas Sadat, and Raquel Urtasun. Quad: Query-based interpretable neural motion planning for autonomous driving. arXiv preprint arXiv:2404.01486, 2024.
-
[3]
Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021.
-
[4]
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,
-
[5]
Transfuser: Imitation with transformer-based sensor fusion for autonomous driving
Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
-
[6]
Navsim: Data-driven non-reactive autonomous vehicle simulation
NAVSIM Contributors. Navsim: Data-driven non-reactive autonomous vehicle simulation. https://github.com/autonomousvision/navsim, 2024.
-
[7]
Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving
OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023.
-
[8]
Parting with misconceptions about learning-based vehicle motion planning
Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning, pages 1268–1281. PMLR, 2023.
-
[9]
Eva-02: A visual representation for neon genesis
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
-
[10]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-
[11]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
-
[12]
Vad: Vectorized scene representation for efficient autonomous driving
Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
-
[13]
An energy and gpu-computation efficient backbone network for real-time object detection
Youngwan Lee, Joong-won Hwang, Sangrok Lee, Yuseok Bae, and Jongyoul Park. An energy and gpu-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
-
[14]
Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? arXiv preprint arXiv:2312.03031, 2023.
-
[15]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
-
[16]
Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152,
-
[17]
Objects365: A large-scale, high-quality dataset for object detection
Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.
-
[18]
Congested traffic states in empirical observations and microscopic simulations
Martin Treiber, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical review E, 62(2):1805, 2000.
-
[19]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
-
[20]
Depth anything: Unleashing the power of large-scale unlabeled data
Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891, 2024.
discussion (0)