FleetAgent: Teleoperation Assistant for Autonomous Fleets via Vectorized V2N Messages

Deyuan Qu; Juntong Peng; Qi Chen; Takayuki Shimizu; Yaobin Chen; Ziran Wang

arxiv: 2606.21222 · v1 · pith:3RHEGUDEnew · submitted 2026-06-19 · 💻 cs.RO · cs.AI

FleetAgent: Teleoperation Assistant for Autonomous Fleets via Vectorized V2N Messages

Juntong Peng , Qi Chen , Deyuan Qu , Takayuki Shimizu , Yaobin Chen , Ziran Wang This is my paper

Pith reviewed 2026-06-26 14:12 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords autonomous fleetsteleoperationvectorized V2N messagesmultimodal LLMintervention prioritizationVecEvalVecFormer

0 comments

The pith

FleetAgent lets a cloud MLLM evaluate autonomous fleet plans from compact vectorized V2N messages instead of raw images or text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FleetAgent, a cloud-hosted multimodal LLM that takes vectorized messages carrying map elements, detected objects, and ego planned paths. It outputs structured natural-language narration, plan explanations, evaluations, and an intervention urgency score to help remote operators prioritize. The approach targets the bandwidth and operator limits of large autonomous fleets by avoiding raw sensor streams. VecFormer converts the vectors into embeddings with top-K selection to keep context manageable for batch processing.

Core claim

FleetAgent consumes compact vectorized V2N messages such as map elements, detected objects, and the ego planned path to generate structured natural-language responses including narration, explanation, and evaluation of the plan and scene, along with an intervention urgency score. VecFormer provides the vector-to-embedding interface with differentiable top-K context selection. On the VecEval dataset the system reduces uplink payload by up to 625 times versus raw images, reduces KV-cache memory by 16.54 times versus text descriptions, improves Lingo-Judge score by 16.8 percent, and lowers intervention failure rate by 19.9 percent versus Qwen2.5-VL-7B using language descriptions.

What carries the argument

VecFormer, a vector-to-embedding interface with differentiable top-K context selection that converts structured vector messages into token embeddings for MLLMs while bounding context length and KV-cache growth.

If this is right

Uplink payload is reduced by up to 625 times compared with raw images.
KV-cache memory is reduced by 16.54 times compared with original text descriptions.
Lingo-Judge score improves by 16.8 percent on VecEval.
Intervention failure rate drops by 19.9 percent compared with the language-description baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Urgency scores could let one operator oversee hundreds of vehicles by routing attention only to high-score cases.
The same vector messages might support post-hoc auditing or safety-case generation for fleet deployments.
VecFormer-style embedding could be reused for other V2X or infrastructure-to-vehicle AI tasks that need compact scene representations.

Load-bearing premise

The vectorized messages contain enough information for the MLLM to produce scene narrations, plan evaluations, and urgency scores that match human judgment.

What would settle it

A side-by-side test on held-out fleet scenarios in which human operators compare the system's urgency scores and explanations against their own assessments and measure resulting intervention success rates.

Figures

Figures reproduced from arXiv: 2606.21222 by Deyuan Qu, Juntong Peng, Qi Chen, Takayuki Shimizu, Yaobin Chen, Ziran Wang.

**Figure 1.** Figure 1: FleetAgent: A teleoperation assistant framework for large-scale driverless fleet, providing intervention urgency rating and natural language explanations to support operator situational awareness. plan variants with human-verified structured language labels and intervention urgency scores. • System- and model-level evaluation: We quantify bandwidth, latency, and memory footprint improvements and demonstra… view at source ↗

**Figure 2.** Figure 2: VecEval annotation pipeline and examples. We pair a human-driven plan from the original dataset with generated [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: VecFormer design. Multi-modal encoders (a) encode vectorized map, object, and ego-plan inputs into LLMcompatible continuous embeddings (b) and apply differentiable top-K context selection (c) to prioritize planningrelevant context. Base MLLM (d) consumes the multi-modal tokens and provides natural language responses. reduces throughput, especially in complex scenes that are more likely to require teleop… view at source ↗

**Figure 4.** Figure 4: Qualitative example. FleetAgent correctly interprets a safe yielding maneuver and assigns low urgency, while some other baselines, including Gemini 2.5-Flash, misinterpret the stop as unsafe. costly than conservative overestimation, which mainly increases operator workload; and (3) language metrics including BLEU (B),METEOR (M), and ROUGE (R), to measure presentation-level similarity and fluency, which a… view at source ↗

read the original abstract

Large-scale autonomous fleets rely on teleoperation to resolve rare failures, yet streaming raw sensor data from many vehicles is costly, and remote operators can only monitor a limited number of vehicles at a time. We introduce FleetAgent, a cloud-hosted multimodal large language model (MLLM) assistant that consumes compact vectorized vehicle-to-network (V2N) messages, such as map elements, detected objects, and the ego planned path. It provides a structured natural-language response (including narration, explanation, and evaluation of the plan and scene), along with an intervention urgency score for operator prioritization. To make structured messages compatible with token-based MLLMs, we propose VecFormer, a vector-to-embedding interface with differentiable top-K context selection that bounds context length and GPU KV-cache growth, enabling more efficient batch processing, which is important under the context of cloud-hosted large-scale fleet management. We also construct VecEval, a nuScenes-derived dataset with paired human and synthetic imperfect plans and human-verified language labels, to facilitate the training and evaluation of our proposed system. Our proposed system can reduce uplink payload by up to 625 times compared with raw images and reduce KV-cache memory by 16.54 times compared with original text descriptions. On VecEval, FleetAgent improves Lingo-Judge score by 16.8% and reduces intervention failure rate by 19.9%, compared with Qwen2.5-VL-7B using language descriptions. These results demonstrate that FleetAgent can utilize compact structured V2N messaging to enable efficient, explainable teleoperation monitoring for autonomous fleets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FleetAgent's vectorized V2N setup and VecFormer look like a practical step for fleet teleop scaling, but the performance numbers rest on thin methods detail and a new dataset whose labels need closer inspection.

read the letter

The core idea here is using compact vector messages (maps, objects, planned paths) fed through a new VecFormer interface into an MLLM for narration, plan checks, and urgency scoring. That directly targets the bandwidth and operator limits in large fleets, and the reported 625x payload cut plus KV-cache savings are the kind of numbers that matter for deployment.

What stands out as new is the differentiable top-K selection in VecFormer to keep context bounded, plus the VecEval benchmark built from nuScenes with human-verified labels on paired plans. The system claims a 16.8% Lingo-Judge lift and 19.9% lower failure rate versus Qwen2.5-VL-7B on language inputs. Those are concrete comparisons to an external baseline.

The soft spots are in the evaluation. The abstract gives the headline gains but skips error bars, ablations on the urgency scoring, and any breakdown of how the human-verified labels were collected or how synthetic imperfect plans were generated. Post-hoc dataset work often hides selection effects, and without seeing those steps it's hard to judge whether the vector inputs really preserve the relational and dynamic details needed for reliable urgency calls.

The stress-test concern holds on the abstract alone: if top-K pruning drops critical context, the efficiency wins could trade off against accuracy in intervention decisions. The paper would need to show that the structured vectors match or beat language descriptions on the same tasks.

This is aimed at people working on autonomous fleet operations and MLLM interfaces for robotics. It deserves a serious referee because the problem is real and the proposed interface is a clear technical move, even if the current evidence is preliminary and the claims will need tightening in revision.

Referee Report

3 major / 2 minor

Summary. The paper introduces FleetAgent, a cloud-hosted MLLM assistant for teleoperation of autonomous vehicle fleets. It consumes compact vectorized V2N messages (map elements, detected objects, ego planned path) via a proposed VecFormer vector-to-embedding interface with differentiable top-K selection, producing structured natural-language outputs (narration, plan evaluation, explanation) plus an intervention urgency score. A new VecEval dataset (nuScenes-derived, with paired human/synthetic plans and human-verified labels) is constructed for training and evaluation. The system is reported to achieve up to 625× uplink payload reduction versus raw images, 16.54× KV-cache reduction versus text descriptions, 16.8% higher Lingo-Judge score, and 19.9% lower intervention failure rate versus Qwen2.5-VL-7B on language descriptions.

Significance. If the central premise holds—that vectorized V2N messages plus VecFormer preserve sufficient relational and dynamic scene information for MLLM outputs to align with human judgment on narration, plan evaluation, and urgency—the work could meaningfully advance scalable fleet teleoperation by cutting bandwidth and compute costs while adding explainability. The efficiency metrics and new VecEval benchmark are concrete contributions; the focus on cloud-scale batching via bounded context is practically relevant. However, the absence of ablations on information loss and label reliability weakens the ability to assess whether the reported gains come at the expense of reliability.

major comments (3)

[Abstract, §3] Abstract and §3 (VecFormer): the central claim that vectorized messages (after top-K pruning) suffice for MLLM outputs to match human judgment on urgency scores and plan evaluation is load-bearing for all quantitative results, yet the manuscript provides no quantitative analysis of information loss (e.g., relational or dynamic details dropped by vectorization or top-K) relative to raw sensor streams or full language descriptions.
[§4] §4 (VecEval construction and metrics): the 19.9% failure-rate reduction and 16.8% Lingo-Judge lift rest on human-verified labels and the failure-rate definition, but the paper supplies no annotation protocol, number of annotators, inter-annotator agreement, or error bars on these metrics, making it impossible to judge whether the improvements are robust or artifactual.
[§4] §4 (baseline comparison): the gains are reported versus Qwen2.5-VL-7B using language descriptions, but the manuscript does not state whether FleetAgent employs the identical base MLLM or a different one; without this, it is unclear whether improvements are attributable to the structured V2N input or to other modeling differences.

minor comments (2)

[Abstract, §1] The abstract and introduction introduce VecFormer and VecEval without an explicit statement of their novelty relative to prior vector-to-sequence or V2X work; a short related-work paragraph would clarify contribution.
[Figure 2, §3] Figure 2 (system overview) and any accompanying equations for differentiable top-K would benefit from explicit notation for the selection operator and its gradient path to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on FleetAgent. The comments identify areas where additional clarity and analysis would strengthen the manuscript. We respond point-by-point below and will revise accordingly.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (VecFormer): the central claim that vectorized messages (after top-K pruning) suffice for MLLM outputs to match human judgment on urgency scores and plan evaluation is load-bearing for all quantitative results, yet the manuscript provides no quantitative analysis of information loss (e.g., relational or dynamic details dropped by vectorization or top-K) relative to raw sensor streams or full language descriptions.

Authors: We agree that explicit quantification of information loss would provide stronger support. The reported gains versus language descriptions offer indirect evidence of preserved utility, but a direct ablation is absent. In revision we will add a dedicated paragraph in §3 discussing the design choices for top-K selection and include qualitative examples of retained relational elements (e.g., object interactions and path constraints) to address this concern. revision: yes
Referee: [§4] §4 (VecEval construction and metrics): the 19.9% failure-rate reduction and 16.8% Lingo-Judge lift rest on human-verified labels and the failure-rate definition, but the paper supplies no annotation protocol, number of annotators, inter-annotator agreement, or error bars on these metrics, making it impossible to judge whether the improvements are robust or artifactual.

Authors: The omission of these details is a valid criticism. VecEval labels were produced via human verification, yet the protocol, annotator count, agreement statistics, and error bars were not reported. We will expand §4 with the annotation guidelines, number of annotators, inter-annotator agreement, and error bars on all metrics in the revised manuscript. revision: yes
Referee: [§4] §4 (baseline comparison): the gains are reported versus Qwen2.5-VL-7B using language descriptions, but the manuscript does not state whether FleetAgent employs the identical base MLLM or a different one; without this, it is unclear whether improvements are attributable to the structured V2N input or to other modeling differences.

Authors: FleetAgent employs the identical base model Qwen2.5-VL-7B; VecFormer replaces only the visual tokenization pathway while keeping the language model weights unchanged. The comparison is therefore to the same MLLM receiving language descriptions. We will add an explicit statement to this effect in §4 of the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains rest on external baseline and human-labeled dataset

full rationale

The paper reports payload reduction, KV-cache savings, Lingo-Judge improvement, and failure-rate reduction via direct comparison to Qwen2.5-VL-7B on language inputs, evaluated on VecEval (a newly constructed dataset with human-verified labels). No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce these outcomes to the inputs by construction. VecFormer and the vectorized V2N pipeline are presented as engineering contributions whose value is measured externally rather than defined into the metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the two named components are stated. VecFormer and VecEval are engineering artifacts rather than new physical entities.

invented entities (2)

VecFormer no independent evidence
purpose: vector-to-embedding interface with differentiable top-K context selection
Introduced to bound context length for cloud MLLM batch processing
VecEval no independent evidence
purpose: nuScenes-derived dataset with paired plans and human-verified language labels
Constructed to train and evaluate the proposed system

pith-pipeline@v0.9.1-grok · 5838 in / 1437 out tokens · 17812 ms · 2026-06-26T14:12:20.583917+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 7 linked inside Pith

[1]

Waymo, “Waymo,” https://waymo.com/, 2025, accessed: 2025-09-24

2025
[2]

Zoox: It’s not a car,

Zoox, “Zoox: It’s not a car,” https://zoox.com/, 2025, accessed: 2025- 09-24

2025
[3]

Tesla robotaxi,

Tesla, “Tesla robotaxi,” https://www.tesla.com/robotaxi, 2025, ac- cessed: 2025-09-23

2025
[4]

nuScenes: A Multimodal Dataset for Autonomous Driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuScenes: A Multimodal Dataset for Autonomous Driving,” 2020, pp. 11 621– 11 631

2020
[5]

Sensor and Actuator Latency during Teleoperation of Automated Vehicles,

J.-M. Georg, J. Feiler, S. Hoffmann, and F. Diermeyer, “Sensor and Actuator Latency during Teleoperation of Automated Vehicles,” in 2020 IEEE Intelligent Vehicles Symposium (IV), Oct. 2020, pp. 760– 766, iSSN: 2642-7214

2020
[6]

Data Rate Reduction for Video Streams in Teleoperated Driving,

S. Neumeier, V . Bajpai, M. Neumeier, C. Facchi, and J. Ott, “Data Rate Reduction for Video Streams in Teleoperated Driving,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 19 145–19 160, Oct. 2022

2022
[7]

Shared Autonomy for Teleoperated Driving: A Real-Time Interactive Path Planning Approach,

D. Schitz, S. Bao, D. Rieth, and H. Aschemann, “Shared Autonomy for Teleoperated Driving: A Real-Time Interactive Path Planning Approach,” in2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 999–1004, iSSN: 2577-087X

2021
[8]

A Generative Model-Based Predictive Display for Robotic Teleoperation,

B. Xie, M. Han, J. Jin, M. Barczyk, and M. J ¨agersand, “A Generative Model-Based Predictive Display for Robotic Teleoperation,” in2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 2407–2413, iSSN: 2577-087X

2021
[9]

A cost efficient multi remote driver selection for remote operated vehicles,

A. Gohar and S. Lee, “A cost efficient multi remote driver selection for remote operated vehicles,”Computer Networks, vol. 168, p. 107029, Feb. 2020

2020
[10]

Should Teleoperation Be like Driving in a Car? Comparison of Teleoperation HMIs,

M.-M. Wolf, R. Taupitz, and F. Diermeyer, “Should Teleoperation Be like Driving in a Car? Comparison of Teleoperation HMIs,” Apr. 2024, arXiv:2404.13697 [cs]. [Online]. Available: http://arxiv.org/abs/ 2404.13697

arXiv 2024
[11]

A Supervised Machine Learning Approach to Operator Intent Recognition for Teleoperated Mobile Robot Navigation,

E. Tsagkournis, D. Panagopoulos, G. Petousakis, G. Nikolaou, R. Stolkin, and M. Chiou, “A Supervised Machine Learning Approach to Operator Intent Recognition for Teleoperated Mobile Robot Navigation,” Apr. 2023, arXiv:2304.14003 [cs]. [Online]. Available: http://arxiv.org/abs/2304.14003

arXiv 2023
[12]

Assisted Mobile Robot Teleoperation with Intent-aligned Trajectories via Biased Incremental Action Sampling,

X. Yang and N. Michael, “Assisted Mobile Robot Teleoperation with Intent-aligned Trajectories via Biased Incremental Action Sampling,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2020, pp. 10 998–11 003, iSSN: 2153-0866

2020
[13]

GoonDAE: Denoising- Based Driver Assistance for Off-Road Teleoperation,

Y . Cho, H. Yun, J. Lee, A. Ha, and J. Yun, “GoonDAE: Denoising- Based Driver Assistance for Off-Road Teleoperation,”IEEE Robotics and Automation Letters, vol. 8, no. 4, pp. 2405–2412, Apr. 2023, arXiv:2209.03568 [cs]

arXiv 2023
[14]

A Systematic Review of Edge Case Detection in Automated Driving: Methods, Challenges and Future Directions,

S. Rahmani, S. Rieder, E. d. Gelder, M. Sonntag, J. L. Mallada, S. Kalisvaart, V . Hashemi, and S. C. Calvert, “A Systematic Review of Edge Case Detection in Automated Driving: Methods, Challenges and Future Directions,” Oct. 2024, arXiv:2410.08491 [cs]. [Online]. Available: http://arxiv.org/abs/2410.08491

Pith/arXiv arXiv 2024
[15]

Generative AI for Autonomous Driving: Frontiers and Opportunities,

Y . Wang, S. Xing, C. Can, R. Li, H. Hua, K. Tian, Z. Mo, X. Gao, K. Wu, S. Zhou, H. You, J. Peng, J. Zhang, Z. Wang, R. Song, M. Yan, W. Zimmer, X. Zhou, P. Li, Z. Lu, C.-J. Chen, Y . Huang, R. A. Rossi, L. Sun, H. Yu, Z. Fan, F. H. Yang, Y . Kang, R. Greer, C. Liu, E. H. Lee, X. Di, X. Ye, L. Ren, A. Knoll, X. Li, S. Ji, M. Tomizuka, M. Pavone, T. Yang,...

arXiv 2025
[16]

DriveLM: Driving with Graph Visual Question Answering,

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “DriveLM: Driving with Graph Visual Question Answering,” inComputer Vision – ECCV 2024, A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, Eds. Cham: Springer Nature Switzerland, 2025, pp. 256–274

2024
[17]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models,

X. Tian, J. Gu, B. Li, Y . Liu, Y . Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models,” Jun. 2024, arXiv:2402.12289 [cs]. [Online]. Available: http://arxiv.org/abs/2402. 12289

Pith/arXiv arXiv 2024
[18]

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models,

S.-Y . Park, C. Cui, Y . Ma, A. Moradipari, R. Gupta, K. Han, and Z. Wang, “NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models,” Aug. 2025, arXiv:2503.12772 [cs]. [Online]. Available: http://arxiv.org/abs/2503.12772

arXiv 2025
[19]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving,

Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, G. Chen, H. Ye, W. Liu, and X. Wang, “ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving,” Jun. 2025, arXiv:2506.08052 [cs]. [Online]. Available: http://arxiv.org/abs/2506.08052

Pith/arXiv arXiv 2025
[20]

EMMA: End-to-End Multimodal Model for Autonomous Driving,

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-End Multimodal Model for Autonomous Driving,” Oct. 2024, arXiv:2410.23262 [cs] version: 1. [Online]. Available: http://arxiv.org/abs/2410.23262

Pith/arXiv arXiv 2024
[21]

OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model,

X. Zhou, X. Han, F. Yang, Y . Ma, and A. C. Knoll, “OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model,” Mar. 2025, arXiv:2503.23463 [cs]. [Online]. Available: http://arxiv.org/abs/2503.23463

arXiv 2025
[22]

DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model,

Z. Xu, Y . Zhang, E. Xie, Z. Zhao, Y . Guo, K.-Y . K. Wong, Z. Li, and H. Zhao, “DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model,”IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8186–8193, Oct. 2024, conference Name: IEEE Robotics and Automation Letters. [Online]. Available: https://ieeexplore.ieee.org/document/10...

arXiv 2024
[23]

A Language Agent for Autonomous Driving,

J. Mao, J. Ye, Y . Qian, M. Pavone, and Y . Wang, “A Language Agent for Autonomous Driving,” Jul. 2024, arXiv:2311.10813 [cs]. [Online]. Available: http://arxiv.org/abs/2311.10813

arXiv 2024
[24]

Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving,

K. Ding, B. Chen, Y . Su, H.-a. Gao, B. Jin, C. Sima, W. Zhang, X. Li, P. Barsch, H. Li, and H. Zhao, “Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving,” Sep. 2024, arXiv:2409.06702 [cs]. [Online]. Available: http://arxiv.org/abs/2409.06702

arXiv 2024
[25]

ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving,

Y . Ma, B. Yaman, X. Ye, M. Yurt, J. Luo, A. Mallik, Z. Wang, and L. Ren, “ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving,” May 2025, arXiv:2505.15158 [cs]. [Online]. Available: http://arxiv.org/abs/2505. 15158

arXiv 2025
[26]

Interworking of dsrc and cellular network technologies for v2x communications: A survey,

K. Abboud, H. A. Omar, and W. Zhuang, “Interworking of dsrc and cellular network technologies for v2x communications: A survey,” IEEE transactions on vehicular technology, vol. 65, no. 12, pp. 9457– 9470, 2016

2016
[27]

5G-Enabled Teleoperated Driving: An Experimental Evaluation,

M. Testouri, G. Elghazaly, F. Hawlader, and R. Frank, “5G-Enabled Teleoperated Driving: An Experimental Evaluation,” Mar. 2025, arXiv:2503.14186 [cs]. [Online]. Available: http://arxiv.org/abs/2503. 14186

arXiv 2025
[28]

QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception,

S. Z. Zhao, H. Zhang, Z. Li, J. Peng, A. Chui, Z. Zhou, Z. Meng, H. Xiang, Z. Huang, F. Wang, R. Tian, C. Xu, B. Zhou, and J. Ma, “QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception,” Sep. 2025, arXiv:2509.03704 [cs]. [Online]. Available: http://arxiv.org/abs/2509.03704

arXiv 2025
[29]

Scalability in Perception for Autonomous Driving: Waymo Open Dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, V . Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y . Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in Perception for Autonomous Driving: Waymo Open Dataset,” 2020, pp. 2446–2454

2020
[30]

Vision meets robotics: The KITTI dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,”The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, Sep. 2013

2013
[31]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles,

H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari, “NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles,” Feb. 2022, arXiv:2106.11810 [cs]. [Online]. Available: http://arxiv.org/abs/2106.11810

Pith/arXiv arXiv 2022
[32]

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario,

T. Qian, J. Chen, L. Zhuo, Y . Jiao, and Y .-G. Jiang, “NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario,” Feb. 2024, arXiv:2305.14836 [cs]. [Online]. Available: http://arxiv.org/abs/2305.14836

arXiv 2024
[33]

Holistic autonomous driving understanding by bird’s-eye-view injected multi- modal large models,

X. Ding, J. Han, H. Xu, X. Liang, W. Zhang, and X. Li, “Holistic autonomous driving understanding by bird’s-eye-view injected multi- modal large models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 13 668–13 677

2024
[34]

Language Prompt for Autonomous Driving,

D. Wu, W. Han, Y . Liu, T. Wang, C.-Z. Xu, X. Zhang, and J. Shen, “Language Prompt for Autonomous Driving,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, pp. 8359–8367, Apr. 2025

2025
[35]

Textual Explanations for Self-Driving Vehicles,

J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual Explanations for Self-Driving Vehicles,” inComputer Vision – ECCV 2018, V . Ferrari, M. Hebert, C. Sminchisescu, and Y . Weiss, Eds. Cham: Springer International Publishing, 2018, vol. 11206, pp. 577– 593, series Title: Lecture Notes in Computer Science

2018
[36]

Qwen2.5-VL Technical Report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-VL Technical Report,” Feb. 2025, arXiv:2502.13923 [cs]. [Online]. Available: http://arxiv.org/abs/2502.13923

Pith/arXiv arXiv 2025
[37]

Categorical Reparameterization with Gumbel-Softmax,

E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax,” Aug. 2017, arXiv:1611.01144 [stat]. [Online]. Available: http://arxiv.org/abs/1611.01144

Pith/arXiv arXiv 2017
[38]

Lingoqa: Visual question answering for autonomous driving,

A.-M. Marcu, L. Chen, J. H ¨unermann, A. Karnsund, B. Hanotte, P. Chidananda, S. Nair, V . Badrinarayanan, A. Kendall, J. Shotton et al., “Lingoqa: Visual question answering for autonomous driving,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 252–269

2024

[1] [1]

Waymo, “Waymo,” https://waymo.com/, 2025, accessed: 2025-09-24

2025

[2] [2]

Zoox: It’s not a car,

Zoox, “Zoox: It’s not a car,” https://zoox.com/, 2025, accessed: 2025- 09-24

2025

[3] [3]

Tesla robotaxi,

Tesla, “Tesla robotaxi,” https://www.tesla.com/robotaxi, 2025, ac- cessed: 2025-09-23

2025

[4] [4]

nuScenes: A Multimodal Dataset for Autonomous Driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuScenes: A Multimodal Dataset for Autonomous Driving,” 2020, pp. 11 621– 11 631

2020

[5] [5]

Sensor and Actuator Latency during Teleoperation of Automated Vehicles,

J.-M. Georg, J. Feiler, S. Hoffmann, and F. Diermeyer, “Sensor and Actuator Latency during Teleoperation of Automated Vehicles,” in 2020 IEEE Intelligent Vehicles Symposium (IV), Oct. 2020, pp. 760– 766, iSSN: 2642-7214

2020

[6] [6]

Data Rate Reduction for Video Streams in Teleoperated Driving,

S. Neumeier, V . Bajpai, M. Neumeier, C. Facchi, and J. Ott, “Data Rate Reduction for Video Streams in Teleoperated Driving,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 19 145–19 160, Oct. 2022

2022

[7] [7]

Shared Autonomy for Teleoperated Driving: A Real-Time Interactive Path Planning Approach,

D. Schitz, S. Bao, D. Rieth, and H. Aschemann, “Shared Autonomy for Teleoperated Driving: A Real-Time Interactive Path Planning Approach,” in2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 999–1004, iSSN: 2577-087X

2021

[8] [8]

A Generative Model-Based Predictive Display for Robotic Teleoperation,

B. Xie, M. Han, J. Jin, M. Barczyk, and M. J ¨agersand, “A Generative Model-Based Predictive Display for Robotic Teleoperation,” in2021 IEEE International Conference on Robotics and Automation (ICRA), May 2021, pp. 2407–2413, iSSN: 2577-087X

2021

[9] [9]

A cost efficient multi remote driver selection for remote operated vehicles,

A. Gohar and S. Lee, “A cost efficient multi remote driver selection for remote operated vehicles,”Computer Networks, vol. 168, p. 107029, Feb. 2020

2020

[10] [10]

Should Teleoperation Be like Driving in a Car? Comparison of Teleoperation HMIs,

M.-M. Wolf, R. Taupitz, and F. Diermeyer, “Should Teleoperation Be like Driving in a Car? Comparison of Teleoperation HMIs,” Apr. 2024, arXiv:2404.13697 [cs]. [Online]. Available: http://arxiv.org/abs/ 2404.13697

arXiv 2024

[11] [11]

A Supervised Machine Learning Approach to Operator Intent Recognition for Teleoperated Mobile Robot Navigation,

E. Tsagkournis, D. Panagopoulos, G. Petousakis, G. Nikolaou, R. Stolkin, and M. Chiou, “A Supervised Machine Learning Approach to Operator Intent Recognition for Teleoperated Mobile Robot Navigation,” Apr. 2023, arXiv:2304.14003 [cs]. [Online]. Available: http://arxiv.org/abs/2304.14003

arXiv 2023

[12] [12]

Assisted Mobile Robot Teleoperation with Intent-aligned Trajectories via Biased Incremental Action Sampling,

X. Yang and N. Michael, “Assisted Mobile Robot Teleoperation with Intent-aligned Trajectories via Biased Incremental Action Sampling,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2020, pp. 10 998–11 003, iSSN: 2153-0866

2020

[13] [13]

GoonDAE: Denoising- Based Driver Assistance for Off-Road Teleoperation,

Y . Cho, H. Yun, J. Lee, A. Ha, and J. Yun, “GoonDAE: Denoising- Based Driver Assistance for Off-Road Teleoperation,”IEEE Robotics and Automation Letters, vol. 8, no. 4, pp. 2405–2412, Apr. 2023, arXiv:2209.03568 [cs]

arXiv 2023

[14] [14]

A Systematic Review of Edge Case Detection in Automated Driving: Methods, Challenges and Future Directions,

S. Rahmani, S. Rieder, E. d. Gelder, M. Sonntag, J. L. Mallada, S. Kalisvaart, V . Hashemi, and S. C. Calvert, “A Systematic Review of Edge Case Detection in Automated Driving: Methods, Challenges and Future Directions,” Oct. 2024, arXiv:2410.08491 [cs]. [Online]. Available: http://arxiv.org/abs/2410.08491

Pith/arXiv arXiv 2024

[15] [15]

Generative AI for Autonomous Driving: Frontiers and Opportunities,

Y . Wang, S. Xing, C. Can, R. Li, H. Hua, K. Tian, Z. Mo, X. Gao, K. Wu, S. Zhou, H. You, J. Peng, J. Zhang, Z. Wang, R. Song, M. Yan, W. Zimmer, X. Zhou, P. Li, Z. Lu, C.-J. Chen, Y . Huang, R. A. Rossi, L. Sun, H. Yu, Z. Fan, F. H. Yang, Y . Kang, R. Greer, C. Liu, E. H. Lee, X. Di, X. Ye, L. Ren, A. Knoll, X. Li, S. Ji, M. Tomizuka, M. Pavone, T. Yang,...

arXiv 2025

[16] [16]

DriveLM: Driving with Graph Visual Question Answering,

C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li, “DriveLM: Driving with Graph Visual Question Answering,” inComputer Vision – ECCV 2024, A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, Eds. Cham: Springer Nature Switzerland, 2025, pp. 256–274

2024

[17] [17]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models,

X. Tian, J. Gu, B. Li, Y . Liu, Y . Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models,” Jun. 2024, arXiv:2402.12289 [cs]. [Online]. Available: http://arxiv.org/abs/2402. 12289

Pith/arXiv arXiv 2024

[18] [18]

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models,

S.-Y . Park, C. Cui, Y . Ma, A. Moradipari, R. Gupta, K. Han, and Z. Wang, “NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models,” Aug. 2025, arXiv:2503.12772 [cs]. [Online]. Available: http://arxiv.org/abs/2503.12772

arXiv 2025

[19] [19]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving,

Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, G. Chen, H. Ye, W. Liu, and X. Wang, “ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving,” Jun. 2025, arXiv:2506.08052 [cs]. [Online]. Available: http://arxiv.org/abs/2506.08052

Pith/arXiv arXiv 2025

[20] [20]

EMMA: End-to-End Multimodal Model for Autonomous Driving,

J.-J. Hwang, R. Xu, H. Lin, W.-C. Hung, J. Ji, K. Choi, D. Huang, T. He, P. Covington, B. Sapp, J. Guo, D. Anguelov, and M. Tan, “EMMA: End-to-End Multimodal Model for Autonomous Driving,” Oct. 2024, arXiv:2410.23262 [cs] version: 1. [Online]. Available: http://arxiv.org/abs/2410.23262

Pith/arXiv arXiv 2024

[21] [21]

OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model,

X. Zhou, X. Han, F. Yang, Y . Ma, and A. C. Knoll, “OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model,” Mar. 2025, arXiv:2503.23463 [cs]. [Online]. Available: http://arxiv.org/abs/2503.23463

arXiv 2025

[22] [22]

DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model,

Z. Xu, Y . Zhang, E. Xie, Z. Zhao, Y . Guo, K.-Y . K. Wong, Z. Li, and H. Zhao, “DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model,”IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8186–8193, Oct. 2024, conference Name: IEEE Robotics and Automation Letters. [Online]. Available: https://ieeexplore.ieee.org/document/10...

arXiv 2024

[23] [23]

A Language Agent for Autonomous Driving,

J. Mao, J. Ye, Y . Qian, M. Pavone, and Y . Wang, “A Language Agent for Autonomous Driving,” Jul. 2024, arXiv:2311.10813 [cs]. [Online]. Available: http://arxiv.org/abs/2311.10813

arXiv 2024

[24] [24]

Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving,

K. Ding, B. Chen, Y . Su, H.-a. Gao, B. Jin, C. Sima, W. Zhang, X. Li, P. Barsch, H. Li, and H. Zhao, “Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving,” Sep. 2024, arXiv:2409.06702 [cs]. [Online]. Available: http://arxiv.org/abs/2409.06702

arXiv 2024

[25] [25]

ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving,

Y . Ma, B. Yaman, X. Ye, M. Yurt, J. Luo, A. Mallik, Z. Wang, and L. Ren, “ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving,” May 2025, arXiv:2505.15158 [cs]. [Online]. Available: http://arxiv.org/abs/2505. 15158

arXiv 2025

[26] [26]

Interworking of dsrc and cellular network technologies for v2x communications: A survey,

K. Abboud, H. A. Omar, and W. Zhuang, “Interworking of dsrc and cellular network technologies for v2x communications: A survey,” IEEE transactions on vehicular technology, vol. 65, no. 12, pp. 9457– 9470, 2016

2016

[27] [27]

5G-Enabled Teleoperated Driving: An Experimental Evaluation,

M. Testouri, G. Elghazaly, F. Hawlader, and R. Frank, “5G-Enabled Teleoperated Driving: An Experimental Evaluation,” Mar. 2025, arXiv:2503.14186 [cs]. [Online]. Available: http://arxiv.org/abs/2503. 14186

arXiv 2025

[28] [28]

QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception,

S. Z. Zhao, H. Zhang, Z. Li, J. Peng, A. Chui, Z. Zhou, Z. Meng, H. Xiang, Z. Huang, F. Wang, R. Tian, C. Xu, B. Zhou, and J. Ma, “QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception,” Sep. 2025, arXiv:2509.03704 [cs]. [Online]. Available: http://arxiv.org/abs/2509.03704

arXiv 2025

[29] [29]

Scalability in Perception for Autonomous Driving: Waymo Open Dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, V . Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y . Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in Perception for Autonomous Driving: Waymo Open Dataset,” 2020, pp. 2446–2454

2020

[30] [30]

Vision meets robotics: The KITTI dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,”The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, Sep. 2013

2013

[31] [31]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles,

H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari, “NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles,” Feb. 2022, arXiv:2106.11810 [cs]. [Online]. Available: http://arxiv.org/abs/2106.11810

Pith/arXiv arXiv 2022

[32] [32]

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario,

T. Qian, J. Chen, L. Zhuo, Y . Jiao, and Y .-G. Jiang, “NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario,” Feb. 2024, arXiv:2305.14836 [cs]. [Online]. Available: http://arxiv.org/abs/2305.14836

arXiv 2024

[33] [33]

Holistic autonomous driving understanding by bird’s-eye-view injected multi- modal large models,

X. Ding, J. Han, H. Xu, X. Liang, W. Zhang, and X. Li, “Holistic autonomous driving understanding by bird’s-eye-view injected multi- modal large models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 13 668–13 677

2024

[34] [34]

Language Prompt for Autonomous Driving,

D. Wu, W. Han, Y . Liu, T. Wang, C.-Z. Xu, X. Zhang, and J. Shen, “Language Prompt for Autonomous Driving,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, pp. 8359–8367, Apr. 2025

2025

[35] [35]

Textual Explanations for Self-Driving Vehicles,

J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual Explanations for Self-Driving Vehicles,” inComputer Vision – ECCV 2018, V . Ferrari, M. Hebert, C. Sminchisescu, and Y . Weiss, Eds. Cham: Springer International Publishing, 2018, vol. 11206, pp. 577– 593, series Title: Lecture Notes in Computer Science

2018

[36] [36]

Qwen2.5-VL Technical Report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-VL Technical Report,” Feb. 2025, arXiv:2502.13923 [cs]. [Online]. Available: http://arxiv.org/abs/2502.13923

Pith/arXiv arXiv 2025

[37] [37]

Categorical Reparameterization with Gumbel-Softmax,

E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax,” Aug. 2017, arXiv:1611.01144 [stat]. [Online]. Available: http://arxiv.org/abs/1611.01144

Pith/arXiv arXiv 2017

[38] [38]

Lingoqa: Visual question answering for autonomous driving,

A.-M. Marcu, L. Chen, J. H ¨unermann, A. Karnsund, B. Hanotte, P. Chidananda, S. Nair, V . Badrinarayanan, A. Kendall, J. Shotton et al., “Lingoqa: Visual question answering for autonomous driving,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 252–269

2024