pith. sign in

arxiv: 2605.31066 · v1 · pith:AJEW2PKFnew · submitted 2026-05-29 · 💻 cs.RO

Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

Pith reviewed 2026-06-28 22:20 UTC · model grok-4.3

classification 💻 cs.RO
keywords aerial VLAair-ground coordinationCARLA-AirUAV-UGV cooperationclosed-loop evaluationvision-language-action modelsmulti-agent simulation
0
0 comments X

The pith

Aerial VLA models can track ground partners but fail to turn that into stable cooperative behavior in closed-loop tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARLA-Air, a unified simulation that combines CARLA and AirSim in one runtime so a UAV and UGV share the same world state, physics, and sensors. It runs two diagnostic tasks—moving-platform landing and occlusion-recovery escort—to check whether single-agent aerial VLA skills transfer to joint air-ground action. Results indicate that models often follow or track the ground partner yet cannot sustain coordinated team performance. Simple state prompts add little help, and basic bidirectional text exchange frequently worsens outcomes by spreading errors. The work concludes that zero-shot cooperation under current text interfaces needs explicit partner-state grounding, low-latency action links, and shared team objectives.

Core claim

Current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment.

What carries the argument

CARLA-Air, a single-process environment that unifies CARLA and AirSim to share world state, physics tick, and sensing pipeline for consistent UAV-UGV interaction and precise latency measurement.

If this is right

  • Single-agent tracking skills do not automatically produce stable joint air-ground behavior.
  • Text-based state prompting yields only marginal gains in coordination tasks.
  • Naive bidirectional text interaction can increase error rates rather than reduce them.
  • Zero-shot cooperation needs explicit partner-state grounding, low-latency coordination, and team objective alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interface limitations may appear in other multi-robot settings that rely on language for coordination.
  • Adding direct state sharing or visual grounding between agents could be tested as a direct extension of the current diagnostic tasks.
  • Hardware experiments on physical UAV-UGV pairs would be needed to check whether the observed cooperation gaps persist outside simulation.

Load-bearing premise

The two diagnostic tasks and the text-based cue interfaces used in CARLA-Air represent the coordination challenges that would appear in real air-ground deployments or with richer interaction modalities.

What would settle it

An experiment in which the same VLA baselines, given direct partner-state access and low-latency channels instead of text cues, achieve stable success on both the landing and escort tasks would falsify the claim that the current paradigm is insufficient.

Figures

Figures reproduced from arXiv: 2605.31066 by Hong Zhang, Tianle Zeng, Xueang Yu, Yanci Wen.

Figure 1
Figure 1. Figure 1: CARLA-AIR runtime architecture. CARLA and AirSim are embedded in one engine￾level runtime with a shared world state, physics tick, and sensor/rendering pipeline. Native CARLA and AirSim APIs are preserved, while both command streams are resolved inside the same runtime [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Landing process trajectory diagnosis. Top: representative visual sequence during moving-platform landing. Bottom: 3D UGV and UAV trajectories during moving-platform land￾ing under (a) C0, (b) C1, and (c) C2. Colored curves denote aerial baselines; gray line denotes the UGV trajectory; black dashed line denotes Rule-Coop-State. 4 Diagnostic Evaluation Building on the physically consistent and closed-loop ev… view at source ↗
Figure 3
Figure 3. Figure 3: Cooperative Occlusion-Recovery Escort. The UAV escorts the UGV, loses visual con￾tact under temporary occlusion, searches using partner-state cues, and re-acquires the UGV before resuming escort. For landing, we report Tracking Success Rate (TSR) as the single-UAV primitive score and Landing Success Rate (LSR) as the final cooperative task score. We define Cooperative Conversion Rate as CCR = LSR/ max(TSR,… view at source ↗
Figure 4
Figure 4. Figure 4: illustrates the core integration mechanism: CARLA-AIR resolves the single-GameMode constraint by keeping CARLA as the authoritative world manager and composing the AirSim aerial subsystem as an actor-level component. A.1 Coordinate Frame Unification and Single-Tick Execution All cross-agent states, observations, and cooperation metrics are expressed in a unified metric frame. CARLA uses a left-handed Unrea… view at source ↗
Figure 5
Figure 5. Figure 5: Coordinate-frame alignment. CARLA uses the Unreal Engine frame in centimetres with Z-up, while AirSim uses a metre-scale NED frame with Z-down. CARLA-AIR applies a deterministic scale conversion and Z-axis sign flip to express UAV and UGV states in one metric frame [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example of synchronized aerial-ground sensing. Vehicle-side and UAV-side sensor streams are sampled from the same simulation tick in CARLA-AIR. The top row shows vehicle-side modalities and the bottom row shows UAV-side modalities, including RGB, semantic segmentation, depth, LiDAR BEV, surface normals, and instance segmentation. This provides paired multi-modal observations without cross-process timestamp… view at source ↗
read the original abstract

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV--UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces CARLA-Air, a single-process simulation environment unifying CARLA and AirSim to enable physically consistent closed-loop evaluation of air-ground coordination. It evaluates representative aerial VLA and planning baselines on two diagnostic tasks—moving-platform landing and occlusion-recovery escort—finding that single-agent tracking succeeds but stable cooperative behavior does not. State prompting yields limited benefit, while naive bidirectional interaction often fails to improve performance and can amplify errors. The authors conclude that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. The code is released at https://github.com/louiszengCN/CarlaAir.

Significance. If the empirical results hold under the stated scoping, the work is significant for providing the first unified closed-loop benchmark for aerial VLA cooperation with ground agents. The shared world state, physics tick, and sensing pipeline enable precise latency and alignment measurements, which is a clear methodological strength. The open-source release supports reproducibility. The findings identify concrete gaps in current VLA paradigms for multi-agent settings and suggest targeted requirements for future work.

minor comments (3)
  1. Abstract: The term 'naive bidirectional interaction' is used without a brief definition or example of the prompting format, which could reduce clarity for readers unfamiliar with the exact interface implementation.
  2. The manuscript should include a short paragraph in the methods or appendix summarizing the number of trials, random seeds, and variance reporting for the performance metrics to aid verification of the reported differences.
  3. Figure captions (e.g., those showing task trajectories or performance bars) would benefit from explicit mention of the metrics plotted and any statistical annotations used.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This paper is an empirical evaluation study that introduces the CARLA-Air simulator and reports performance measurements of existing VLA models on two diagnostic tasks. The abstract and provided text contain no derivation chain, equations, fitted parameters presented as predictions, or load-bearing self-citations. Central claims rest on experimental results obtained in the new environment rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper; no mathematical derivations, fitted parameters, or new physical entities are introduced. The central claims rest on the representativeness of the chosen tasks and interfaces rather than on axioms or free parameters.

pith-pipeline@v0.9.1-grok · 5792 in / 1119 out tokens · 16552 ms · 2026-06-28T22:20:43.575859+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 15 canonical work pages · 4 internal anchors

  1. [1]

    Cooperative motion planning and control for aerial-ground autonomous systems: Methods and applications.Progress in Aerospace Sciences, 146:101005, 2024

    Runqi Chai, Yunlong Guo, Zongyu Zuo, Kaiyuan Chen, Hyo-Sang Shin, and Antonios Tsour- dos. Cooperative motion planning and control for aerial-ground autonomous systems: Methods and applications.Progress in Aerospace Sciences, 146:101005, 2024

  2. [2]

    Ezreal: Enhancing zero-shot outdoor robot naviga- tion toward distant targets under varying visibility,

    Tianle Zeng, Jianwei Peng, Hanjing Ye, Guangcheng Chen, Senzi Luo, and Hong Zhang. Ezreal: Enhancing zero-shot outdoor robot navigation toward distant targets under varying visibility.arXiv preprint arXiv:2509.13720, 2025. 9

  3. [3]

    Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

    Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, et al. Vision-and-language navigation for uavs: Progress, challenges, and a research roadmap.arXiv preprint arXiv:2604.13654, 2026

  4. [4]

    UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

    Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, and Yonglin Tian. Uav-track vla: Embodied aerial tracking via vision- language-action models.arXiv preprint arXiv:2604.02241, 2026

  5. [5]

    Openfly: A comprehensive platform for aerial vision-language navigation.arXiv preprint arXiv:2502.18041, 2025

    Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, et al. Openfly: A comprehensive platform for aerial vision-language navigation.arXiv preprint arXiv:2502.18041, 2025

  6. [6]

    Uav-vla: Vision-language-action system for large scale aerial mission generation

    Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov, Muhammad Ahsan Mustafa, Grik Tade- vosyan, Aibek Akhmetkazy, Miguel Altamirano Cabrera, Mikhail Martynov, Sausar Karaf, and Dzmitry Tsetserukou. Uav-vla: Vision-language-action system for large scale aerial mission generation. In2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pag...

  7. [7]

    Aeri- alvln: Vision-and-language navigation for uavs

    Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, and Qi Wu. Aeri- alvln: Vision-and-language navigation for uavs. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15384–15394, 2023

  8. [8]

    Uav3d: A large-scale 3d perception bench- mark for unmanned aerial vehicles.Advances in Neural Information Processing Systems, 37: 55425–55442, 2024

    Hui Ye, Rajshekhar Sunderraman, and Shihao Ji. Uav3d: A large-scale 3d perception bench- mark for unmanned aerial vehicles.Advances in Neural Information Processing Systems, 37: 55425–55442, 2024

  9. [9]

    Transimhub: A unified air-ground simulation platform for multi-modal perception and decision-making,

    Maonan Wang, Yirong Chen, Yuxin Cai, Aoyu Pang, Yuejiao Xie, Zian Ma, Chengcheng Xu, Kemou Jiang, Ding Wang, Laurent Roullet, et al. Transimhub: A unified air-ground simulation platform for multi-modal perception and decision-making.arXiv preprint arXiv:2510.15365, 2025

  10. [10]

    Airsimag: A high-fidelity simulation platform for air-ground collaborative robotics.arXiv preprint arXiv:2603.23079, 2026

    Yangjie Cui, Xin Dong, Boyang Gao, Jinwu Xiang, Daochun Li, and Zhan Tu. Airsimag: A high-fidelity simulation platform for air-ground collaborative robotics.arXiv preprint arXiv:2603.23079, 2026

  11. [11]

    CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

    Tianle Zeng, Yanci Wen, and Hong Zhang. Carla-air: Fly drones inside a carla world–a unified infrastructure for air-ground embodied intelligence.arXiv preprint arXiv:2603.28032, 2026

  12. [12]

    Simworld-robotics: Synthesizing photo- realistic and dynamic urban environments for multimodal robot navigation and collaboration

    Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, et al. Simworld-robotics: Synthesizing photo- realistic and dynamic urban environments for multimodal robot navigation and collaboration. arXiv preprint arXiv:2512.10046, 2025

  13. [13]

    Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai

    Fangwei Zhong, Kui Wu, Churan Wang, Hao Chen, Hai Ci, Zhoujun Li, and Yizhou Wang. Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5769–5779, 2025

  14. [14]

    Omnidrones: An ef- ficient and flexible platform for reinforcement learning in drone control.IEEE Robotics and Automation Letters, 9(3):2838–2844, 2024

    Botian Xu, Feng Gao, Chao Yu, Ruize Zhang, Yi Wu, and Yu Wang. Omnidrones: An ef- ficient and flexible platform for reinforcement learning in drone control.IEEE Robotics and Automation Letters, 9(3):2838–2844, 2024

  15. [15]

    Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi- agent quadcopter control

    Jacopo Panerati, Hehui Zheng, SiQi Zhou, James Xu, Amanda Prorok, and Angela P Schoellig. Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi- agent quadcopter control. In2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7512–7519. IEEE, 2021

  16. [16]

    Rotors—a modular gazebo mav simulator framework

    Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. Rotors—a modular gazebo mav simulator framework. InRobot Operating System (ROS) The Complete Refer- ence (Volume 1), pages 595–625. Springer, 2016

  17. [17]

    Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark

    Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Zeyu Han, Haibao Yu, Chuang Zhang, Lei He, Shaobing Xu, and Jianqiang Wang. Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark. InProceedings of the AAAI Conference on Artificial Intelli- gence, volume 40, pages 9867–9875, 2026. 10

  18. [18]

    Uav-flow colosseo: A real-world benchmark for flying-on-a-word uav imitation learning.arXiv preprint arXiv:2505.15725, 2025

    Xiangyu Wang, Donglin Yang, Yue Liao, Wenhao Zheng, Wenjun Wu, Bin Dai, Hongsheng Li, and Si Liu. Uav-flow colosseo: A real-world benchmark for flying-on-a-word uav imitation learning.arXiv preprint arXiv:2505.15725, 2025

  19. [19]

    Multi-robot scene comple- tion: Towards task-agnostic collaborative perception

    Yiming Li, Juexiao Zhang, Dekun Ma, Yue Wang, and Chen Feng. Multi-robot scene comple- tion: Towards task-agnostic collaborative perception. InConference on Robot Learning, pages 2062–2072. PMLR, 2023

  20. [20]

    Airv2x: Unified air-ground vehicle-to- everything collaboration.arXiv preprint arXiv:2506.19283, 2025

    Xiangbo Gao, Yuheng Wu, Fengze Yang, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, and Zhengzhong Tu. Airv2x: Unified air-ground vehicle-to- everything collaboration.arXiv preprint arXiv:2506.19283, 2025

  21. [21]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InProceedings of the 1st Conference on Robot Learning (CoRL), pages 1–16. PMLR, 2017

  22. [22]

    AirSim: High-fidelity visual and physical simulation for autonomous vehicles

    Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. InField and Service Robotics (FSR), pages 621–635. Springer, 2018

  23. [23]

    Air-ground collabora- tion for language-specified missions in unknown environments.IEEE Transactions on Field Robotics, 2025

    Fernando Cladera, Zachary Ravichandran, Jason Hughes, Varun Murali, Carlos Nieto-Granda, M Ani Hsieh, George J Pappas, Camillo J Taylor, and Vijay Kumar. Air-ground collabora- tion for language-specified missions in unknown environments.IEEE Transactions on Field Robotics, 2025

  24. [24]

    Where2comm: Communication-efficient collaborative perception via spatial confidence maps.Advances in neural information processing systems, 35:4874–4886, 2022

    Yue Hu, Shaoheng Fang, Zixing Lei, Yiqi Zhong, and Siheng Chen. Where2comm: Communication-efficient collaborative perception via spatial confidence maps.Advances in neural information processing systems, 35:4874–4886, 2022

  25. [25]

    A bi-directional adaptive framework for agile uav landing.arXiv preprint arXiv:2601.03037, 2026

    Chunhui Zhao, Xirui Kao, Yilin Lu, and Yang Lyu. A bi-directional adaptive framework for agile uav landing.arXiv preprint arXiv:2601.03037, 2026

  26. [26]

    GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments

    Seth Farrell, Chenghao Li, Hongzhan Yu, Hesam Mojtahedi, Sicun Gao, and Henrik I Chris- tensen. Glide: A coordinated aerial-ground framework for search and rescue in unknown environments.arXiv preprint arXiv:2509.14210, 2025

  27. [27]

    Communication- aware multi-agent reinforcement learning for decentralized cooperative uav deployment.arXiv preprint arXiv:2603.16141, 2026

    Enguang Fan, Yifan Chen, Zihan Shan, Matthew Caesar, and Jae Kim. Communication- aware multi-agent reinforcement learning for decentralized cooperative uav deployment.arXiv preprint arXiv:2603.16141, 2026

  28. [28]

    AerialVLA: A vision- language-action model for UA V navigation via minimalist end-to-end control,

    Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, and Shaohua Wan. Aerialvla: A vision- language-action model for uav navigation via minimalist end-to-end control.arXiv preprint arXiv:2603.14363, 2026

  29. [29]

    Pushing the boundaries of immersion and storytelling: A technical review of unreal engine.Displays, page 103268, 2025

    Oleksandra Sobchyshak, Santiago Berrezueta-Guzman, and Stefan Wagner. Pushing the boundaries of immersion and storytelling: A technical review of unreal engine.Displays, page 103268, 2025

  30. [30]

    Towards realistic UA V vision- language navigation: Platform, benchmark, and methodology,

    Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hong- sheng Li, Yue Liao, and Si Liu. Towards realistic uav vision-language navigation: Platform, benchmark, and methodology.arXiv preprint arXiv:2410.07087, 2024

  31. [31]

    See, point, fly: A learning-free vlm framework for universal unmanned aerial navigation

    Chih Yao Hu, Yang-Sen Lin, Yuna Lee, Chih-Hai Su, Jie-Ying Lee, Shr-Ruei Tsai, Chin-Yang Lin, Kuan-Wen Chen, Tsung-Wei Ke, and Yu-Lun Liu. See, point, fly: A learning-free vlm framework for universal unmanned aerial navigation. InConference on Robot Learning, pages 4697–4708. PMLR, 2025

  32. [32]

    Colosseum: An open-source simulator for autonomous robotics research

    CodexLabsLLC. Colosseum: An open-source simulator for autonomous robotics research. https://github.com/CodexLabsLLC/Colosseum, 2024. Community fork of AirSim. 11 Appendix This appendix provides additional platform and evaluation details that support the main paper. Ap- pendix A describes the CARLA-AIRruntime design, coordinate-frame unification, sensing s...