Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

Hong Zhang; Tianle Zeng; Xueang Yu; Yanci Wen

arxiv: 2605.31066 · v1 · pith:AJEW2PKFnew · submitted 2026-05-29 · 💻 cs.RO


Pith reviewed 2026-06-28 22:20 UTC · model grok-4.3

classification 💻 cs.RO

keywords aerial VLA · air-ground coordination · CARLA-Air · UAV-UGV cooperation · closed-loop evaluation · vision-language-action models · multi-agent simulation


The pith

Aerial VLA models can track ground partners but fail to turn that into stable cooperative behavior in closed-loop tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARLA-Air, a unified simulation that combines CARLA and AirSim in one runtime so a UAV and UGV share the same world state, physics, and sensors. It runs two diagnostic tasks—moving-platform landing and occlusion-recovery escort—to check whether single-agent aerial VLA skills transfer to joint air-ground action. Results indicate that models often follow or track the ground partner yet cannot sustain coordinated team performance. Simple state prompts add little help, and basic bidirectional text exchange frequently worsens outcomes by spreading errors. The work concludes that zero-shot cooperation under current text interfaces needs explicit partner-state grounding, low-latency action links, and shared team objectives.

Core claim

Current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment.

What carries the argument

CARLA-Air, a single-process environment that unifies CARLA and AirSim to share world state, physics tick, and sensing pipeline for consistent UAV-UGV interaction and precise latency measurement.
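A minimal sketch of why a shared physics tick makes coordination latency directly measurable: when both agents read one tick counter, the delay between a cue being sent and the partner acting on it is a tick difference multiplied by the step size. The names `SharedClock`, `Cue`, and `coordination_latency`, the 0.05 s step, and the four-tick delay are illustrative assumptions, not CARLA-Air's API or the paper's measurement code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """A partner-state cue stamped with the tick on which it was sent."""
    sent_tick: int
    payload: str


class SharedClock:
    """One tick counter read by both agents (hypothetical helper)."""

    def __init__(self, dt: float) -> None:
        self.dt = dt   # simulated seconds per tick (assumed value)
        self.tick = 0

    def step(self) -> None:
        self.tick += 1


def coordination_latency(clock: SharedClock, cue: Cue) -> float:
    """Simulated seconds between sending a cue and acting on it."""
    return (clock.tick - cue.sent_tick) * clock.dt


clock = SharedClock(dt=0.05)
cue = Cue(sent_tick=clock.tick, payload="ugv at x=12.0, y=3.5")
for _ in range(4):     # assume the aerial policy needs four ticks to respond
    clock.step()
latency = coordination_latency(clock, cue)
print(f"latency: {latency:.2f} s")
```

With two separate simulator processes the send and receive stamps would come from different clocks, so the same subtraction would also include clock offset; a single counter removes that term.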

If this is right

Single-agent tracking skills do not automatically produce stable joint air-ground behavior.
Text-based state prompting yields only marginal gains in coordination tasks.
Naive bidirectional text interaction can increase error rates rather than reduce them.
Zero-shot cooperation needs explicit partner-state grounding, low-latency coordination, and team objective alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same interface limitations may appear in other multi-robot settings that rely on language for coordination.
Adding direct state sharing or visual grounding between agents could be tested as a direct extension of the current diagnostic tasks.
Hardware experiments on physical UAV-UGV pairs would be needed to check whether the observed cooperation gaps persist outside simulation.

Load-bearing premise

The two diagnostic tasks and the text-based cue interfaces used in CARLA-Air represent the coordination challenges that would appear in real air-ground deployments or with richer interaction modalities.

What would settle it

An experiment in which the same VLA baselines, given direct partner-state access and low-latency channels instead of text cues, achieve stable success on both the landing and escort tasks would falsify the claim that the current paradigm is insufficient.

Figures

Figures reproduced from arXiv: 2605.31066 by Hong Zhang, Tianle Zeng, Xueang Yu, Yanci Wen.

**Figure 1.** CARLA-Air runtime architecture. CARLA and AirSim are embedded in one engine-level runtime with a shared world state, physics tick, and sensor/rendering pipeline. Native CARLA and AirSim APIs are preserved, while both command streams are resolved inside the same runtime.

**Figure 2.** Landing process trajectory diagnosis. Top: representative visual sequence during moving-platform landing. Bottom: 3D UGV and UAV trajectories during moving-platform landing under (a) C0, (b) C1, and (c) C2. Colored curves denote aerial baselines; the gray line denotes the UGV trajectory; the black dashed line denotes Rule-Coop-State.

**Figure 3.** Cooperative Occlusion-Recovery Escort. The UAV escorts the UGV, loses visual contact under temporary occlusion, searches using partner-state cues, and re-acquires the UGV before resuming escort. For landing, the paper reports Tracking Success Rate (TSR) as the single-UAV primitive score and Landing Success Rate (LSR) as the final cooperative task score, and defines the Cooperative Conversion Rate as CCR = LSR / max(TSR, …
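The conversion metric in the Figure 3 excerpt can be sketched as follows. The source text is truncated inside `max(TSR, …`, so the second argument is unknown here; the sketch exposes it as a required `floor` parameter rather than guessing it. The function names, the per-episode boolean lists, and the `floor=1e-6` value in the usage lines are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of episodes marked successful."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


def cooperative_conversion_rate(lsr: float, tsr: float, *, floor: float) -> float:
    """CCR = LSR / max(TSR, floor).

    `floor` stands in for the argument that is elided in the source;
    the caller must supply it explicitly.
    """
    if floor <= 0:
        raise ValueError("floor must be positive to avoid division by zero")
    return lsr / max(tsr, floor)


# Made-up episode outcomes, for illustration only.
tracked = [True, True, True, False]    # UAV kept the UGV in track
landed = [True, False, False, False]   # UAV completed the landing

tsr = success_rate(tracked)
lsr = success_rate(landed)
ccr = cooperative_conversion_rate(lsr, tsr, floor=1e-6)
print(f"TSR={tsr:.2f} LSR={lsr:.2f} CCR={ccr:.2f}")
```

Read this way, a low CCR alongside a high TSR is the pattern the review describes: the model tracks the partner but rarely converts tracking into a completed cooperative task.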

**Figure 4.** The core integration mechanism: CARLA-Air resolves the single-GameMode constraint by keeping CARLA as the authoritative world manager and composing the AirSim aerial subsystem as an actor-level component. Appendix A.1 (Coordinate Frame Unification and Single-Tick Execution): all cross-agent states, observations, and cooperation metrics are expressed in a unified metric frame. CARLA uses a left-handed Unrea…

**Figure 5.** Coordinate-frame alignment. CARLA uses the Unreal Engine frame in centimetres with Z-up, while AirSim uses a metre-scale NED frame with Z-down. CARLA-Air applies a deterministic scale conversion and Z-axis sign flip to express UAV and UGV states in one metric frame.

**Figure 6.** Example of synchronized aerial-ground sensing. Vehicle-side and UAV-side sensor streams are sampled from the same simulation tick in CARLA-Air. The top row shows vehicle-side modalities and the bottom row shows UAV-side modalities, including RGB, semantic segmentation, depth, LiDAR BEV, surface normals, and instance segmentation. This provides paired multi-modal observations without cross-process timestamp…
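The coordinate-frame alignment in the Figure 5 caption (a scale conversion plus a Z-axis sign flip) can be sketched as below. This assumes both frames share an origin and have aligned X and Y axes, which the caption does not state; `Vec3`, `unreal_cm_to_ned_m`, and `ned_m_to_unreal_cm` are hypothetical names, not CARLA-Air functions.

```python
from typing import NamedTuple


class Vec3(NamedTuple):
    x: float
    y: float
    z: float


CM_PER_M = 100.0


def unreal_cm_to_ned_m(p: Vec3) -> Vec3:
    """Unreal frame (centimetres, Z-up) to a metre-scale Z-down frame."""
    return Vec3(p.x / CM_PER_M, p.y / CM_PER_M, -p.z / CM_PER_M)


def ned_m_to_unreal_cm(p: Vec3) -> Vec3:
    """Inverse mapping: metres with Z-down back to centimetres with Z-up."""
    return Vec3(p.x * CM_PER_M, p.y * CM_PER_M, -p.z * CM_PER_M)


# A UAV hovering 30 m above the origin, expressed in Unreal units.
uav_ue = Vec3(1200.0, -350.0, 3000.0)
uav_ned = unreal_cm_to_ned_m(uav_ue)
print(uav_ned)   # altitude shows up as a negative z in the Z-down frame
```

Because the mapping is a fixed scale and sign change, it is its own inverse up to the scale factor, so UAV and UGV states can be moved between frames without accumulating error from a fitted transform.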

Original abstract

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV-UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CARLA-Air is a practical new simulator for air-ground testing, and the results show single-agent VLA tracking does not reliably become cooperative behavior under text cues.


The paper's main contribution is CARLA-Air, which runs CARLA and AirSim inside one Unreal process so the UAV and UGV share the same world state, physics ticks, and sensing. That setup lets them measure simulation alignment and coordination latency directly, which is useful for closed-loop experiments.

They evaluate several aerial VLA and planning baselines on two tasks: landing on a moving platform and escorting through occlusion. The pattern they report is that models often succeed at following or tracking the ground partner but fail to produce stable joint behavior. Adding state prompts gives little gain, and naive bidirectional text exchange sometimes increases errors. They list three missing pieces—partner-state grounding, low-latency coordination, and team objective alignment—while keeping the claim scoped to the text-based interfaces they tested.

This is new because earlier VLA work stayed with single agents; the unified environment and the two diagnostic tasks do not appear in the cited prior literature. The public code is a plus for anyone who wants to reproduce or extend the setup.

The soft spot is that the abstract gives no trial counts, variance numbers, or exact prompting templates, so it is hard to judge how sensitive the failures are to implementation details. The stress-test note is right that the scoping prevents overgeneralization, but a reader still needs the full quantitative tables to assess robustness.

This paper is for people working on multi-agent embodied AI or VLA coordination. It supplies a concrete benchmark and a clear negative result rather than another positive demo. I would send it to peer review because the simulator is a usable addition and the evaluation stays within its stated limits.

Referee Report

0 major / 3 minor

Summary. The paper introduces CARLA-Air, a single-process simulation environment unifying CARLA and AirSim to enable physically consistent closed-loop evaluation of air-ground coordination. It evaluates representative aerial VLA and planning baselines on two diagnostic tasks—moving-platform landing and occlusion-recovery escort—finding that single-agent tracking succeeds but stable cooperative behavior does not. State prompting yields limited benefit, while naive bidirectional interaction often fails to improve performance and can amplify errors. The authors conclude that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. The code is released at https://github.com/louiszengCN/CarlaAir.

Significance. If the empirical results hold under the stated scoping, the work is significant for providing the first unified closed-loop benchmark for aerial VLA cooperation with ground agents. The shared world state, physics tick, and sensing pipeline enable precise latency and alignment measurements, which is a clear methodological strength. The open-source release supports reproducibility. The findings identify concrete gaps in current VLA paradigms for multi-agent settings and suggest targeted requirements for future work.

minor comments (3)

Abstract: The term 'naive bidirectional interaction' is used without a brief definition or example of the prompting format, which could reduce clarity for readers unfamiliar with the exact interface implementation.
The manuscript should include a short paragraph in the methods or appendix summarizing the number of trials, random seeds, and variance reporting for the performance metrics to aid verification of the reported differences.
Figure captions (e.g., those showing task trajectories or performance bars) would benefit from explicit mention of the metrics plotted and any statistical annotations used.
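As an illustration of the trial and variance reporting requested in the second comment, per-seed success rates could be summarized as a seed count, a mean, and a sample standard deviation. This is a sketch only: the `summarize` helper is hypothetical and the three per-seed rates are made-up placeholders, not values from the paper.

```python
import statistics
from typing import Dict, Sequence


def summarize(per_seed_rates: Sequence[float]) -> Dict[str, float]:
    """Seed count, mean, and sample standard deviation of success rates."""
    if not per_seed_rates:
        raise ValueError("need at least one seed")
    std = statistics.stdev(per_seed_rates) if len(per_seed_rates) > 1 else 0.0
    return {
        "n_seeds": float(len(per_seed_rates)),
        "mean": statistics.mean(per_seed_rates),
        "std": std,
    }


# Placeholder landing success rates keyed by random seed (illustrative only).
rates_by_seed = {0: 0.40, 1: 0.50, 2: 0.30}
summary = summarize(list(rates_by_seed.values()))
print(f"LSR = {summary['mean']:.2f} +/- {summary['std']:.2f} "
      f"over {int(summary['n_seeds'])} seeds")
```

Reporting the seed count next to the spread lets a reader judge whether a difference between two prompting conditions exceeds run-to-run variation.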

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity


This paper is an empirical evaluation study that introduces the CARLA-Air simulator and reports performance measurements of existing VLA models on two diagnostic tasks. The abstract and provided text contain no derivation chain, equations, fitted parameters presented as predictions, or load-bearing self-citations. Central claims rest on experimental results obtained in the new environment rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper; no mathematical derivations, fitted parameters, or new physical entities are introduced. The central claims rest on the representativeness of the chosen tasks and interfaces rather than on axioms or free parameters.

pith-pipeline@v0.9.1-grok · 5792 in / 1119 out tokens · 16552 ms · 2026-06-28T22:20:43.575859+00:00 · methodology



[1] [1]

Cooperative motion planning and control for aerial-ground autonomous systems: Methods and applications.Progress in Aerospace Sciences, 146:101005, 2024

Runqi Chai, Yunlong Guo, Zongyu Zuo, Kaiyuan Chen, Hyo-Sang Shin, and Antonios Tsour- dos. Cooperative motion planning and control for aerial-ground autonomous systems: Methods and applications.Progress in Aerospace Sciences, 146:101005, 2024

2024

[2] [2]

Ezreal: Enhancing zero-shot outdoor robot naviga- tion toward distant targets under varying visibility,

Tianle Zeng, Jianwei Peng, Hanjing Ye, Guangcheng Chen, Senzi Luo, and Hong Zhang. Ezreal: Enhancing zero-shot outdoor robot navigation toward distant targets under varying visibility.arXiv preprint arXiv:2509.13720, 2025. 9

work page arXiv 2025

[3] [3]

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, et al. Vision-and-language navigation for uavs: Progress, challenges, and a research roadmap.arXiv preprint arXiv:2604.13654, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, and Yonglin Tian. Uav-track vla: Embodied aerial tracking via vision- language-action models.arXiv preprint arXiv:2604.02241, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Openfly: A comprehensive platform for aerial vision-language navigation.arXiv preprint arXiv:2502.18041, 2025

Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, et al. Openfly: A comprehensive platform for aerial vision-language navigation.arXiv preprint arXiv:2502.18041, 2025

work page arXiv 2025

[6] [6]

Uav-vla: Vision-language-action system for large scale aerial mission generation

Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov, Muhammad Ahsan Mustafa, Grik Tade- vosyan, Aibek Akhmetkazy, Miguel Altamirano Cabrera, Mikhail Martynov, Sausar Karaf, and Dzmitry Tsetserukou. Uav-vla: Vision-language-action system for large scale aerial mission generation. In2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pag...

2025

[7] [7]

Aeri- alvln: Vision-and-language navigation for uavs

Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, and Qi Wu. Aeri- alvln: Vision-and-language navigation for uavs. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15384–15394, 2023

2023

[8] [8]

Uav3d: A large-scale 3d perception bench- mark for unmanned aerial vehicles.Advances in Neural Information Processing Systems, 37: 55425–55442, 2024

Hui Ye, Rajshekhar Sunderraman, and Shihao Ji. Uav3d: A large-scale 3d perception bench- mark for unmanned aerial vehicles.Advances in Neural Information Processing Systems, 37: 55425–55442, 2024

2024

[9] [9]

Transimhub: A unified air-ground simulation platform for multi-modal perception and decision-making,

Maonan Wang, Yirong Chen, Yuxin Cai, Aoyu Pang, Yuejiao Xie, Zian Ma, Chengcheng Xu, Kemou Jiang, Ding Wang, Laurent Roullet, et al. Transimhub: A unified air-ground simulation platform for multi-modal perception and decision-making.arXiv preprint arXiv:2510.15365, 2025

work page arXiv 2025

[10] [10]

Airsimag: A high-fidelity simulation platform for air-ground collaborative robotics.arXiv preprint arXiv:2603.23079, 2026

Yangjie Cui, Xin Dong, Boyang Gao, Jinwu Xiang, Daochun Li, and Zhan Tu. Airsimag: A high-fidelity simulation platform for air-ground collaborative robotics.arXiv preprint arXiv:2603.23079, 2026

work page arXiv 2026

[11] [11]

CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

Tianle Zeng, Yanci Wen, and Hong Zhang. Carla-air: Fly drones inside a carla world–a unified infrastructure for air-ground embodied intelligence.arXiv preprint arXiv:2603.28032, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Simworld-robotics: Synthesizing photo- realistic and dynamic urban environments for multimodal robot navigation and collaboration

Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, et al. Simworld-robotics: Synthesizing photo- realistic and dynamic urban environments for multimodal robot navigation and collaboration. arXiv preprint arXiv:2512.10046, 2025

work page arXiv 2025

[13] [13]

Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai

Fangwei Zhong, Kui Wu, Churan Wang, Hao Chen, Hai Ci, Zhoujun Li, and Yizhou Wang. Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5769–5779, 2025

2025

[14] [14]

Omnidrones: An ef- ficient and flexible platform for reinforcement learning in drone control.IEEE Robotics and Automation Letters, 9(3):2838–2844, 2024

Botian Xu, Feng Gao, Chao Yu, Ruize Zhang, Yi Wu, and Yu Wang. Omnidrones: An ef- ficient and flexible platform for reinforcement learning in drone control.IEEE Robotics and Automation Letters, 9(3):2838–2844, 2024

2024

[15] [15]

Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi- agent quadcopter control

Jacopo Panerati, Hehui Zheng, SiQi Zhou, James Xu, Amanda Prorok, and Angela P Schoellig. Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi- agent quadcopter control. In2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7512–7519. IEEE, 2021

2021

[16] [16]

Rotors—a modular gazebo mav simulator framework

Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. Rotors—a modular gazebo mav simulator framework. InRobot Operating System (ROS) The Complete Refer- ence (Volume 1), pages 595–625. Springer, 2016

2016

[17] [17]

Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark

Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Zeyu Han, Haibao Yu, Chuang Zhang, Lei He, Shaobing Xu, and Jianqiang Wang. Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark. InProceedings of the AAAI Conference on Artificial Intelli- gence, volume 40, pages 9867–9875, 2026. 10

2026

[18] Xiangyu Wang, Donglin Yang, Yue Liao, Wenhao Zheng, Wenjun Wu, Bin Dai, Hongsheng Li, and Si Liu. Uav-flow colosseo: A real-world benchmark for flying-on-a-word uav imitation learning. arXiv preprint arXiv:2505.15725, 2025.

[19] Yiming Li, Juexiao Zhang, Dekun Ma, Yue Wang, and Chen Feng. Multi-robot scene completion: Towards task-agnostic collaborative perception. In Conference on Robot Learning, pages 2062–2072. PMLR, 2023.

[20] Xiangbo Gao, Yuheng Wu, Fengze Yang, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, and Zhengzhong Tu. Airv2x: Unified air-ground vehicle-to-everything collaboration. arXiv preprint arXiv:2506.19283, 2025.

[21] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Conference on Robot Learning (CoRL), pages 1–16. PMLR, 2017.

[22] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics (FSR), pages 621–635. Springer, 2018.

[23] Fernando Cladera, Zachary Ravichandran, Jason Hughes, Varun Murali, Carlos Nieto-Granda, M Ani Hsieh, George J Pappas, Camillo J Taylor, and Vijay Kumar. Air-ground collaboration for language-specified missions in unknown environments. IEEE Transactions on Field Robotics, 2025.

[24] Yue Hu, Shaoheng Fang, Zixing Lei, Yiqi Zhong, and Siheng Chen. Where2comm: Communication-efficient collaborative perception via spatial confidence maps. Advances in Neural Information Processing Systems, 35:4874–4886, 2022.

[25] Chunhui Zhao, Xirui Kao, Yilin Lu, and Yang Lyu. A bi-directional adaptive framework for agile uav landing. arXiv preprint arXiv:2601.03037, 2026.

[26] Seth Farrell, Chenghao Li, Hongzhan Yu, Hesam Mojtahedi, Sicun Gao, and Henrik I Christensen. Glide: A coordinated aerial-ground framework for search and rescue in unknown environments. arXiv preprint arXiv:2509.14210, 2025.

[27] Enguang Fan, Yifan Chen, Zihan Shan, Matthew Caesar, and Jae Kim. Communication-aware multi-agent reinforcement learning for decentralized cooperative uav deployment. arXiv preprint arXiv:2603.16141, 2026.

[28] Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, and Shaohua Wan. Aerialvla: A vision-language-action model for uav navigation via minimalist end-to-end control. arXiv preprint arXiv:2603.14363, 2026.

[29] Oleksandra Sobchyshak, Santiago Berrezueta-Guzman, and Stefan Wagner. Pushing the boundaries of immersion and storytelling: A technical review of unreal engine. Displays, page 103268, 2025.

[30] Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, and Si Liu. Towards realistic uav vision-language navigation: Platform, benchmark, and methodology. arXiv preprint arXiv:2410.07087, 2024.

[31] Chih Yao Hu, Yang-Sen Lin, Yuna Lee, Chih-Hai Su, Jie-Ying Lee, Shr-Ruei Tsai, Chin-Yang Lin, Kuan-Wen Chen, Tsung-Wei Ke, and Yu-Lun Liu. See, point, fly: A learning-free vlm framework for universal unmanned aerial navigation. In Conference on Robot Learning, pages 4697–4708. PMLR, 2025.

[32] CodexLabsLLC. Colosseum: An open-source simulator for autonomous robotics research. https://github.com/CodexLabsLLC/Colosseum, 2024. Community fork of AirSim.

Appendix

This appendix provides additional platform and evaluation details that support the main paper. Appendix A describes the CARLA-Air runtime design, coordinate-frame unification, sensing s...