PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

Erdem Uysal; Sebastiano Panichella; Timo Kehrer

arxiv: 2606.00104 · v1 · pith:G6LHEWRPnew · submitted 2026-05-26 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-29 16:50 UTC · model grok-4.3

keywords: planner-executor agent, UAV control, LLM planning, constraint enforcement, PX4, ROS 2, geofencing, explainable robotics

The pith

A decoupled planner-executor agent lets an LLM handle single-pass UAV mission planning while a separate layer enforces altitude and geofence constraints during execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PEACE, an architecture that keeps the large language model out of the tight control loop for PX4 drones. Instead, the LLM produces a mission plan once, which a structured ROS 2 interface then executes through MAVLink while continuously checking safety bounds. A world model built from off-the-shelf 2D detectors plus pinhole depth projection supplies the 3D positions needed for those checks, and bounded replanning handles occasional execution failures. The result is presented as more explainable and constraint-compliant than end-to-end LLM control, with fewer model calls required.

Core claim

By separating high-level task planning (single LLM pass) from low-level execution (ROS 2 tool calls bridged to MAVLink), and inserting an explicit constraint-enforcement layer that uses modular 2D detection plus pinhole projection for 3D localization, the system achieves reliable geofencing and altitude limits plus recovery from action failures without repeated LLM invocations or domain-specific fine-tuning.

What carries the argument

The planner-executor split with a constraint-enforcement layer on a 3D world model constructed from 2D detectors and pinhole depth projection, executed via ROS 2 tool-calling to MAVLink.
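As a rough illustration of the pinhole step, the center of a detection's bounding box can be back-projected into the camera frame once a depth value for that pixel is available. The following is a minimal Python sketch under stated assumptions: the function names, the intrinsics values, and the choice of the box center as the query pixel are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (the values used below are made up)."""
    fx: float
    fy: float
    cx: float
    cy: float


def bbox_center(x_min: float, y_min: float, x_max: float, y_max: float) -> tuple[float, float]:
    """Return the pixel at the center of an axis-aligned bounding box."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> tuple[float, float, float]:
    """Map pixel (u, v) with depth along the optical axis to a 3D point
    (X, Y, Z) in the camera frame using the pinhole model."""
    if depth <= 0.0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


# Assumed intrinsics and a hypothetical detection box with 10 m depth.
k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
u, v = bbox_center(400.0, 200.0, 440.0, 280.0)
point = backproject(u, v, depth=10.0, k=k)
print(point)  # (2.0, 0.0, 10.0)
```

The result is expressed in the camera frame; turning it into a world-frame position for geofence checks would additionally require the drone's pose, which this sketch leaves out.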

If this is right

Planning remains human-readable because it is produced in a single explicit step rather than generated continuously inside the control loop.
Safety constraints are enforced at execution time independently of the language model output.
The number of LLM calls drops because replanning is bounded and most execution occurs without further model involvement.
Perception modules can be swapped (YOLO or VLMs) without retraining the planner or changing the constraint layer.
The same separation pattern can be applied to other MAVLink-compatible autopilots without new end-to-end training data.
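The execution-time check described in the list above can be sketched as a small gate that every commanded setpoint passes through before it reaches the autopilot. This is a minimal Python sketch that assumes a circular horizontal geofence around the home position and a simple altitude band; the class names, field names, and the clamp-and-project policy are illustrative assumptions, and the paper's actual enforcement logic may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    min_alt_m: float       # lowest allowed altitude above home
    max_alt_m: float       # highest allowed altitude above home
    fence_radius_m: float  # horizontal geofence radius around home (assumed circular)


@dataclass(frozen=True)
class Setpoint:
    x: float  # metres east of home
    y: float  # metres north of home
    z: float  # metres above home


def enforce(sp: Setpoint, lim: Limits) -> tuple[Setpoint, list[str]]:
    """Return a setpoint that satisfies the limits, plus a log of every change made."""
    log: list[str] = []
    # Clamp altitude into the allowed band.
    z = min(max(sp.z, lim.min_alt_m), lim.max_alt_m)
    if z != sp.z:
        log.append(f"altitude clamped from {sp.z} to {z}")
    # Pull the horizontal position back onto the fence if it lies outside.
    x, y = sp.x, sp.y
    dist = math.hypot(x, y)
    if dist > lim.fence_radius_m:
        scale = lim.fence_radius_m / dist
        x, y = x * scale, y * scale
        log.append("horizontal position projected onto geofence boundary")
    return Setpoint(x, y, z), log


limits = Limits(min_alt_m=1.0, max_alt_m=30.0, fence_radius_m=50.0)
safe, log = enforce(Setpoint(60.0, 80.0, 40.0), limits)
print(safe)  # Setpoint(x=30.0, y=40.0, z=30.0)
print(log)
```

Because the gate returns a log alongside the corrected setpoint, each intervention leaves a record that is independent of whatever the language model produced.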

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The architecture may reduce latency and cost in field deployments where each LLM call carries network or compute overhead.
Because the world model is built from modular detectors, the system could incorporate additional sensor modalities without redesigning the planning interface.
Explicit constraint enforcement creates an auditable log of safety decisions that could support regulatory review of autonomous UAV operations.

Load-bearing premise

The 3D positions obtained from 2D object detectors and pinhole projection are accurate enough for reliable geofence and altitude enforcement, and single-pass planning plus limited replanning is sufficient to recover from failures.

What would settle it

In PX4-Gazebo simulations, repeated geofence or altitude violations occur when the same missions are flown with the reported perception stack, or recovery from injected execution failures requires more than one additional LLM call in a majority of trials.
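The second half of this criterion depends on counting model calls, which is straightforward when replanning is bounded by an explicit budget. The sketch below is a minimal Python illustration; the planner and executor are stub callables standing in for the LLM and the ROS 2 tool calls, and all names and the default budget of one replan are assumptions rather than details from the paper.

```python
from typing import Callable

Plan = list[str]


def run_mission(
    plan_fn: Callable[[str, str], Plan],  # stand-in for the LLM planner: (goal, failure note) -> plan
    execute_fn: Callable[[str], bool],    # stand-in for one tool call: True on success
    goal: str,
    max_replans: int = 1,
) -> dict:
    """Execute a plan step by step, replanning at most max_replans times."""
    llm_calls = 0
    note = ""
    for _ in range(max_replans + 1):
        plan = plan_fn(goal, note)
        llm_calls += 1
        failed = None
        for step in plan:
            if not execute_fn(step):
                failed = step
                break
        if failed is None:
            return {"success": True, "llm_calls": llm_calls}
        # Report the failure back to the planner on the next attempt.
        note = f"step {failed!r} failed"
    return {"success": False, "llm_calls": llm_calls}


def stub_planner(goal: str, note: str) -> Plan:
    # Hypothetical planner: detours via B once a failure has been reported.
    return ["takeoff", "goto_B", "land"] if note else ["takeoff", "goto_A", "land"]


def stub_execute(step: str) -> bool:
    # Hypothetical executor: only the step "goto_A" fails.
    return step != "goto_A"


result = run_mission(stub_planner, stub_execute, "inspect tower", max_replans=1)
print(result)  # {'success': True, 'llm_calls': 2}
```

With the budget set to zero the same mission fails after a single call, which is the kind of per-trial observable the criterion would need.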

Figures

Figures reproduced from arXiv: 2606.00104 by Erdem Uysal, Sebastiano Panichella, Timo Kehrer.

**Figure 1.** System architecture overview. The system comprises three modules. (Left) The simulation environment runs a drone model with IMU, GPS, depth

Original abstract

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain, constraint and require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper describes a planner-executor split for LLM-driven PX4 drones with an added constraint layer, but supplies no numbers to support its claims about better explainability or fewer LLM calls.

The main thing to know is that PEACE splits high-level LLM planning from low-level execution on PX4 drones, using ROS 2 tool calls to MAVLink, a world model from 2D detectors plus pinhole projection, and a constraint layer for altitude and geofencing with bounded replanning on failures. They position it against three existing patterns and say it improves explainability while cutting LLM calls.

What is new is the concrete combination of single-pass planning, explicit constraint enforcement, and the ROS 2/MAVLink bridge for this hardware stack. The code and materials are public, which makes the implementation details usable for someone trying to replicate or extend it.

The description is clear on the architecture. The soft spot is the complete absence of quantitative results. The abstract mentions PX4 SITL simulations in Gazebo and highlights improved metrics, yet no error rates, baselines, LLM call counts, or success rates appear. The 3D localization from 2D detections and pinhole projection is central to the constraint enforcement, but there are no accuracy numbers or sensitivity checks against ground truth. Without those, it is impossible to tell whether projection noise stays inside the enforcement margins or causes false triggers.
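To make the concern about projection noise concrete, a first-order sensitivity estimate for the pinhole model shows how pixel-level jitter scales with range. The symbols here are standard pinhole notation introduced for illustration (focal length $f_x$ in pixels, principal point $c_x$, depth $Z$, pixel error $\delta u$, depth error $\delta Z$); the paper does not report these quantities.

```latex
X = \frac{(u - c_x)\,Z}{f_x}
\quad\Longrightarrow\quad
\delta X \;\approx\; \frac{Z}{f_x}\,\delta u \;+\; \frac{u - c_x}{f_x}\,\delta Z
```

With purely illustrative numbers, $f_x = 500$ px, $Z = 20$ m, and $\delta u = 5$ px give a lateral error of about $\tfrac{20}{500}\cdot 5 = 0.2$ m from the first term alone, so whether an enforcement margin absorbs the noise depends directly on operating range and detector jitter.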

This is for robotics groups already working on LLM integration with UAVs who need a starting template rather than a new theoretical result. The work shows honest engagement with the patterns it cites and avoids overclaiming in the text itself.

I would send it for peer review so the authors can add the missing evaluations; the idea is straightforward enough that referees could give useful feedback on the implementation choices.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes PEACE, a decoupled planner-executor architecture for PX4-based UAVs. An LLM performs single-pass high-level mission planning while a structured ROS 2/MAVLink executor handles low-level control; a world model is built from modular 2D detectors (YOLO or VLMs) plus pinhole depth projection for 3D localization, a constraint-enforcement layer handles altitude and geofence limits, and bounded replanning recovers from failures. Feasibility is demonstrated via PX4 SITL simulations in Gazebo, with claims of improved explainability, constraint enforcement, and fewer LLM calls relative to tightly coupled LLM control. Code, dataset, and videos are provided.

Significance. If the quantitative claims hold, the architecture could provide a reproducible template for safer, more interpretable foundation-model integration in robotics by separating planning from execution and making constraints explicit. The open release of implementation artifacts is a clear strength that supports verification and extension.

major comments (3)

[System Architecture / World Model] World-model construction (pinhole depth projection from 2D detections): the central feasibility claim for reliable geofencing and altitude enforcement rests on the accuracy of the constructed 3D positions, yet no error statistics, ground-truth comparisons, camera-intrinsic sensitivity analysis, or bounding-box jitter quantification appear in the simulation results.
[Evaluation / Results] Results section: the abstract asserts 'improved explainability, constraint enforcement, and reduced LLM calls' relative to tightly coupled baselines, but supplies no numerical metrics, baseline comparisons, success rates, latency figures, or statistical tests, so the performance claims cannot be evaluated.
[Constraint Enforcement Layer] Constraint enforcement layer: the description states that altitude limits and horizontal geofences are enforced, but does not specify tolerance margins, how localization noise is propagated into violation decisions, or recovery behavior under realistic projection error, which is load-bearing for the safety argument.
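One form the requested specification could take is a conservative margin that tightens each limit by a multiple of the estimated localization uncertainty. The Python sketch below is only an illustration of that idea: the 1-sigma error estimate, the multiplier `k`, and the function name are hypothetical and do not come from the manuscript.

```python
def violates_with_margin(value: float, limit: float, sigma: float, k: float = 2.0) -> bool:
    """Flag a violation when an estimate comes within k * sigma of an upper limit."""
    if sigma < 0.0 or k < 0.0:
        raise ValueError("sigma and k must be non-negative")
    return value > limit - k * sigma


# Altitude limit 30 m, assumed 1-sigma error 0.5 m, k = 2: effective limit 29 m.
print(violates_with_margin(28.5, 30.0, 0.5))  # False
print(violates_with_margin(29.5, 30.0, 0.5))  # True
```

The same function applies to a horizontal geofence if `value` is the estimated distance from the fence center and `limit` is the fence radius; a larger `k` trades more false triggers for fewer missed violations.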

minor comments (1)

[Abstract] Abstract contains the ungrammatical phrase 'constraint and require domain-specific datasets'; rephrase for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, proposing targeted revisions to improve clarity and rigor while preserving the manuscript's focus on architectural feasibility.

Referee: [System Architecture / World Model] World-model construction (pinhole depth projection from 2D detections): the central feasibility claim for reliable geofencing and altitude enforcement rests on the accuracy of the constructed 3D positions, yet no error statistics, ground-truth comparisons, camera-intrinsic sensitivity analysis, or bounding-box jitter quantification appear in the simulation results.

Authors: We agree that the absence of quantitative error analysis for the 3D world model limits evaluation of the safety claims. The original manuscript prioritizes end-to-end architectural demonstration over component-level metrology. In revision we will add a dedicated subsection reporting localization error statistics computed against Gazebo ground truth, including mean Euclidean position error and sensitivity to bounding-box perturbations. revision: yes

Referee: [Evaluation / Results] Results section: the abstract asserts 'improved explainability, constraint enforcement, and reduced LLM calls' relative to tightly coupled baselines, but supplies no numerical metrics, baseline comparisons, success rates, latency figures, or statistical tests, so the performance claims cannot be evaluated.

Authors: The work is framed as a feasibility study rather than a comparative benchmark. The abstract's phrasing does overstate the comparative aspect. We will revise the abstract and results to remove unsubstantiated comparative language, explicitly state that no baseline implementations were executed, and report only the concrete simulation observables that are available (LLM call counts per mission, constraint-violation events, and replanning triggers). revision: yes

Referee: [Constraint Enforcement Layer] Constraint enforcement layer: the description states that altitude limits and horizontal geofences are enforced, but does not specify tolerance margins, how localization noise is propagated into violation decisions, or recovery behavior under realistic projection error, which is load-bearing for the safety argument.

Authors: We accept that the enforcement logic requires additional specification. The revised manuscript will document the exact tolerance margins employed, the threshold logic applied to projected 3D positions, and the bounded-replanning recovery policy, with references to the corresponding open-source implementation. revision: yes
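The localization statistic promised in the first response could be computed along the following lines. This is a minimal Python sketch; the function name is hypothetical and the sample coordinates are invented placeholders, not results from the paper.

```python
import math
from statistics import mean


def mean_euclidean_error(estimates, ground_truth):
    """Average Euclidean distance between paired estimated and true 3D positions."""
    if len(estimates) != len(ground_truth) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(math.dist(e, g) for e, g in zip(estimates, ground_truth))


# Placeholder data: two estimated positions and their simulated ground truth.
est = [(1.0, 2.0, 3.0), (4.0, 0.0, 0.0)]
gt = [(1.0, 2.0, 0.0), (0.0, 3.0, 0.0)]
print(mean_euclidean_error(est, gt))  # 4.0
```

Repeating the computation after perturbing the bounding boxes by a few pixels would give the sensitivity figure mentioned in the same response.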

Circularity Check

0 steps flagged

No circularity: architectural proposal with external simulation support

The paper presents a system architecture (planner-executor decoupling, modular 2D detectors + pinhole projection for world model, constraint layer, bounded replanning) whose claims rest on Gazebo/PX4 SITL demonstrations rather than any derivation, equation, or fitted parameter. No self-definitional steps, no fitted-input predictions, and no load-bearing self-citations appear in the description. The contribution is self-contained as an engineering design whose validity is assessed externally via simulation results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard robotics assumptions (accurate simulation-to-reality transfer for perception and control, reliable LLM single-pass planning) with no new free parameters, axioms beyond domain norms, or invented physical entities.

axioms (1)

Domain assumption: The PX4 SITL Gazebo environment and modular detectors produce representative behavior for the claimed constraint enforcement and replanning.
Feasibility demonstration in simulation is taken to support the overall approach without additional validation steps stated in the abstract.
