PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

2026 · cs.AI · arXiv 2604.08987

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve a superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability, with 86–89% instruction-following, at the cost of precision (MAE of 11–14). Phase-stratified analysis further exposes a Dynamic Complexity Gap: LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.

representative citing papers

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

cs.AI · 2026-07-02 · accept · novelty 7.0

Pre-Flight is a new 300-question benchmark where top LLMs reach 82.7% accuracy against an informal expert reference of ~95%, leaving a persistent gap.
