Recognition: no theorem link
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
Pith reviewed 2026-05-13 04:20 UTC · model grok-4.3
The pith
Large language models can classify air traffic safety risks from radio communications and weather data at non-towered airports with macro F1 scores above 0.85.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a general vision-language model approach to analyze the transcribed CTAF radio communications in natural language, METAR weather data, ADS-B flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real-world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. Even limited to CTAF and METAR inputs and open-source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task.
What carries the argument
The multimodal LLM framework that processes natural language transcriptions of pilot self-announcements along with weather reports to perform hazard classification and safety assessment.
If this is right
- Post-flight safety reviews at non-towered airports can be automated without constant human monitoring.
- Open-source LLMs are sufficient for initial binary classification of safety risks from limited inputs.
- The 12-category hazard taxonomy offers a structured approach for labeling and training models on common airport hazards.
- Combining text inputs with trajectory and chart data in future extensions could provide more comprehensive analysis.
- Qualitative success on real data indicates the framework can identify specific violations like right-of-way issues.
Where Pith is reading between the lines
- Real-time deployment could warn pilots during flight if processing speeds improve.
- Training on actual incident data might better handle ambiguous communications that the synthetic set approximates.
- Wider use could inform airport design or pilot training by revealing frequent communication patterns that lead to danger.
- Similar techniques might apply to other domains with self-coordinated traffic, such as maritime or drone operations.
Load-bearing premise
The synthetic dataset derived from real examples sufficiently captures the variability, ambiguity, and edge cases of actual CTAF communications and hazard scenarios so that model performance generalizes to real-world use.
What would settle it
Collecting and labeling a dataset of several hundred real CTAF radio transcripts from non-towered airports, then measuring whether the models maintain a macro F1 score above 0.7 on the nominal/danger task when evaluated against expert annotations.
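The settling criterion is mechanical enough to sketch. Below is a minimal, dependency-free implementation of the macro F1 metric for the binary nominal/danger task; the labels and predictions are illustrative stand-ins, not data from the paper:

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro F1: the unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

# 0 = nominal, 1 = danger; hypothetical expert annotations vs. model output.
expert = [0, 0, 1, 1, 0, 1, 0, 1]
model = [0, 0, 1, 0, 0, 1, 0, 1]
print(round(macro_f1(expert, model), 3))  # 0.873
```

Macro averaging weights the rarer danger class equally with the nominal class, which matters when hazardous scenarios make up only a small fraction of real traffic.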
read the original abstract
We investigate frameworks for post-flight safety analysis at non-towered airports using large language models (LLMs). Non-towered airports rely on the Common Traffic Advisory Frequency (CTAF) for air traffic coordination and experience frequent near mid-air collisions due to the pilot self-announcement communication protocol. We propose a general vision-language model (VLM) approach to analyze the transcribed CTAF radio communications in natural language, METeorological Aerodrome Report (METAR) weather data, Automatic Dependent Surveillance-Broadcast (ADS-B) flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. We qualitatively evaluate our framework on real flight data using Gemini 2.5 Pro, demonstrating accurate identification of a right-of-way violation. The synthetic dataset is derived from real examples and includes a 12-category hazard taxonomy, and is used to benchmark three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLM models on the subset of inputs related to CTAF and METAR. Even limited to CTAF and METAR inputs and open source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task. Future work includes a quantitative evaluation across all modalities and a larger number of real world examples. Taken together, our results suggest that VLM analysis of safety at non-towered airports may be a valuable future capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a vision-language model framework for post-flight safety analysis at non-towered airports by processing transcribed CTAF radio communications, METAR weather reports, ADS-B trajectories, and sectional charts. It presents a qualitative case study on real data from Half Moon Bay Airport using Gemini 2.5 Pro to detect a right-of-way violation, alongside quantitative benchmarking of six LLMs (three open-source, three closed-source) on a new synthetic CTAF+METAR dataset derived from real examples and organized by a 12-category hazard taxonomy. The central quantitative result is that even open-source LLMs restricted to CTAF and METAR inputs achieve macro F1 scores above 0.85 on a binary nominal/danger classification task. The work is framed as preliminary, with future directions including full multimodal quantitative evaluation and more real-world examples.
Significance. If the synthetic results generalize and the multimodal components prove effective, the approach could enable scalable, automated post-hoc safety assessment at the many non-towered airports where near mid-air collisions remain a documented risk due to self-announcement protocols. The introduction of a structured hazard taxonomy and systematic benchmarking across model families are positive contributions to an emerging application area. However, the current evidence base is limited to synthetic data for the quantitative claims and a single unquantified real-world example, so the practical significance for operational safety assessment remains prospective rather than demonstrated.
major comments (3)
- [§4] §4 (Synthetic Dataset): The construction of the synthetic CTAF+METAR dataset is described only at a high level as 'derived from real examples.' No details are provided on how variability in pilot phrasing, clipped transmissions, overlapping calls, radio artifacts, or combined hazards are (or are not) modeled. Because the macro F1 > 0.85 result on the binary classification task rests entirely on this dataset, the absence of these specifics prevents assessment of whether the reported performance supports generalization to actual CTAF traffic.
- [§5] §5 (Quantitative Evaluation): All reported F1 scores are obtained on CTAF and METAR inputs only, despite the abstract and introduction framing the contribution as a general VLM approach that also incorporates ADS-B trajectories and sectional charts. No ablation studies, full-modality results, or even preliminary metrics on the additional modalities are presented, so the broader claim that 'VLM analysis of safety at non-towered airports may be a valuable future capability' is not yet quantitatively supported.
- [Real-world case study] Real-world case study (Section 6 or equivalent): Only a single qualitative example is shown, with no quantitative metrics, no error analysis, and no description of the exact input transcription or model prompt used. This single instance cannot substantiate robustness claims when real CTAF data contain the very ambiguities (overlaps, variable phrasing, artifacts) that the synthetic dataset may omit.
minor comments (3)
- [Abstract and §5] The model names 'GPT-5.4' and 'Claude Sonnet 4.6' appear non-standard; clarify whether these are specific versions, placeholders, or typos.
- [§5] Prompt templates and exact classification instructions used for the LLMs are not reproduced in the main text or appendix, hindering reproducibility of the F1 results.
- [Figures] Figure captions for any qualitative examples should explicitly state the input modalities shown and the ground-truth hazard label.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our preliminary study. We address each major comment below, clarifying the scope of the current work while committing to revisions that strengthen the manuscript.
read point-by-point responses
-
Referee: §4 (Synthetic Dataset): The construction of the synthetic CTAF+METAR dataset is described only at a high level as 'derived from real examples.' No details are provided on how variability in pilot phrasing, clipped transmissions, overlapping calls, radio artifacts, or combined hazards are (or are not) modeled. Because the macro F1 > 0.85 result on the binary classification task rests entirely on this dataset, the absence of these specifics prevents assessment of whether the reported performance supports generalization to actual CTAF traffic.
Authors: We agree that additional details on dataset construction are needed to support evaluation of generalization. In the revised manuscript, Section 4 will be expanded to describe the generation process explicitly: variability in pilot phrasing was introduced by paraphrasing real CTAF examples while preserving intent; clipped transmissions and radio artifacts were simulated via text truncation and insertion of common noise markers (e.g., 'static', partial words); overlapping calls were modeled by concatenating multiple transmissions with explicit speaker indicators; and combined hazards appear in a subset of examples to test multi-hazard detection. These choices were made to balance realism with controlled evaluation. We will also release the full generation script and a sample of the dataset to enable independent assessment. revision: yes
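The augmentation recipe the authors describe can be made concrete with a small sketch. The function names, the '[static]' marker, and the speaker-tag format below are illustrative assumptions, not the authors' released generation script:

```python
import random

def clip_transmission(text, rng, min_keep=0.5):
    """Simulate a clipped transmission by truncating a random suffix of words."""
    words = text.split()
    keep = rng.randint(max(1, int(len(words) * min_keep)), len(words))
    return " ".join(words[:keep])

def add_radio_artifacts(text, rng, rate=0.15):
    """Insert a '[static]' noise marker between words at a given rate."""
    out = []
    for w in text.split():
        out.append(w)
        if rng.random() < rate:
            out.append("[static]")
    return " ".join(out)

def overlap_calls(calls):
    """Model overlapping calls by concatenating with explicit speaker tags."""
    return " ".join(f"[Speaker {i+1}] {c}" for i, c in enumerate(calls))

rng = random.Random(0)
call = "Half Moon Bay traffic Skyhawk one two three turning left base runway three zero"
print(clip_transmission(call, rng))
print(add_radio_artifacts(call, rng))
print(overlap_calls([call, "Half Moon Bay traffic Cirrus on final runway three zero"]))
```

Seeding the generator keeps the corruption reproducible, which is what would allow independent assessment if the generation script is released as promised.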
-
Referee: §5 (Quantitative Evaluation): All reported F1 scores are obtained on CTAF and METAR inputs only, despite the abstract and introduction framing the contribution as a general VLM approach that also incorporates ADS-B trajectories and sectional charts. No ablation studies, full-modality results, or even preliminary metrics on the additional modalities are presented, so the broader claim that 'VLM analysis of safety at non-towered airports may be a valuable future capability' is not yet quantitatively supported.
Authors: The manuscript is explicitly positioned as preliminary, with the abstract and Section 5 stating that quantitative results are limited to CTAF+METAR inputs to establish a strong baseline. The broader VLM framing (including ADS-B and charts) is proposed as a general framework whose full quantitative evaluation is listed as future work. The reported macro F1 > 0.85 even with restricted inputs is intended to demonstrate feasibility of the text-based core. To address the concern, we will add explicit language in the introduction and conclusion reiterating that no full-modality quantitative claims are made and that multimodal benchmarking remains future work. No new experiments are added at this stage. revision: partial
-
Referee: Real-world case study (Section 6 or equivalent): Only a single qualitative example is shown, with no quantitative metrics, no error analysis, and no description of the exact input transcription or model prompt used. This single instance cannot substantiate robustness claims when real CTAF data contain the very ambiguities (overlaps, variable phrasing, artifacts) that the synthetic dataset may omit.
Authors: We acknowledge that the real-world case study is qualitative and limited to one instance, consistent with the preliminary framing of the paper. In the revision, we will append the exact Gemini 2.5 Pro prompt used and the full transcribed CTAF excerpt (with METAR) for the Half Moon Bay right-of-way example. We will also include a short error analysis paragraph discussing observed ambiguities in the real transmission and how the model handled them. Quantitative metrics on real data are not feasible without a larger labeled corpus, which is identified as future work. revision: partial
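For reference, a zero-shot prompt for the binary task might be assembled as below. The wording is an illustrative assumption, since the manuscript does not reproduce the actual Gemini 2.5 Pro prompt; the METAR string is likewise a plausible example for KHAF, not data from the paper:

```python
def build_prompt(ctaf_transcript, metar):
    """Assemble a zero-shot nominal/danger classification prompt (illustrative)."""
    return (
        "You are an aviation safety analyst reviewing a non-towered airport.\n"
        f"CTAF transcript:\n{ctaf_transcript}\n\n"
        f"METAR:\n{metar}\n\n"
        "Classify the overall situation as exactly one word: "
        "NOMINAL or DANGER."
    )

prompt = build_prompt(
    "Skyhawk 123 turning left base 30. Cirrus 456 short final 30.",
    "KHAF 131753Z 29008KT 10SM FEW012 14/11 A3012",
)
print(prompt)
```

Constraining the output to a single token-like word simplifies scoring against ground-truth labels when computing F1.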
Circularity Check
No circularity: empirical benchmarking on synthetic dataset with no derivations or fitted predictions
full rationale
The paper is an empirical study that constructs a synthetic CTAF/METAR dataset from real examples, defines a 12-category hazard taxonomy, and directly measures macro F1 scores of several LLMs on a binary nominal/danger task. No equations, first-principles derivations, parameter fitting, or predictions appear; the reported performance numbers are computed outputs on the explicitly constructed test set rather than quantities forced by construction from fitted inputs. The single qualitative real-world case study is presented separately and carries no quantitative claim. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claims therefore remain independent of any internal reduction and are self-contained as a benchmarking exercise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large language models can accurately interpret and classify aviation-specific natural language communications and weather reports for hazard detection.
Reference graph
Works this paper leans on
-
[1]
The Complexity Construct in Air Traffic Control: A Review and Synthesis of the Literature,
Mogford, R. H., Guttman, J. A., Morrow, S. L., and Kopardekar, P., “The Complexity Construct in Air Traffic Control: A Review and Synthesis of the Literature,” Tech. Rep. DOT/FAA/CT-TN95/22, U.S. Department of Transportation, Federal Aviation Administration, 1995
-
[2]
Advisory Circular 90-66C: Non-Towered Airport Flight Operations,
Federal Aviation Administration, “Advisory Circular 90-66C: Non-Towered Airport Flight Operations,” Tech. rep., U.S. Department of Transportation, Federal Aviation Administration, June 2023. URL https://www.faa.gov/documentlibrary/media/advisory_circular/ac_90-66c.pdf
-
[3]
ASRS Database Report Set: Non-Tower Airport Incidents,
National Aeronautics and Space Administration (NASA), Ames Research Center, “ASRS Database Report Set: Non-Tower Airport Incidents,” Technical Memorandum, ASRS Report Set No. 39 TH: 262-7, NASA Ames Research Center, Jun. 2025. URL https://asrs.arc.nasa.gov/docs/rpsts/non_twr.pdf, update No. 39, June 12, 2025
-
[4]
In-Time Aviation Safety Management: Challenges and Research for an Evolving Aviation System,
National Academies of Sciences, Engineering, and Medicine, “In-Time Aviation Safety Management: Challenges and Research for an Evolving Aviation System,” Tech. rep., The National Academies Press, 2022. URL https://www.faa.gov/sites/faa.gov/files/2022-05/508.NationalAcademyReport.pdf, section “System Monitoring – Data Fusion, Completeness, and Quality” points out...
-
[5]
Attention is All You Need,
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I., “Attention is All You Need,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Curran Associates, Inc., 2017, pp. 5998–6008. URL https://proc...
-
[6]
Recent Advances in Speech Language Models: A Survey,
Cui, W., Yu, D., Jiao, X., Meng, Z., Zhang, G., Wang, Q., Guo, Y., and King, I., “Recent Advances in Speech Language Models: A Survey,” Proceedings of ACL (Long Paper), 2025. URL https://aclanthology.org/2025.acl-long.682.pdf, preprint available at arXiv
-
[7]
A Survey on Multimodal Large Language Models,
Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., and Chen, E., “A Survey on Multimodal Large Language Models,” National Science Review, Vol. 11, No. 12, 2024, p. nwae403. https://doi.org/10.1093/nsr/nwae403
-
[8]
Neural Machine Translation: A Review,
Stahlberg, F., “Neural Machine Translation: A Review,” Journal of Artificial Intelligence Research (JAIR), 2020. URL https://jair.org/index.php/jair/article/download/12007/26611/24616
-
[9]
Multi-Step Reasoning with Large Language Models, a Survey,
Plaat, A., Wong, A., Verberne, S., Broekens, J., van Stein, N., and Bäck, T., “Multi-Step Reasoning with Large Language Models, a Survey,” arXiv preprint arXiv:2407.11511, 2024. URL https://arxiv.org/abs/2407.11511
-
[10]
Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments,
Cheng, S., Zhuang, Z., Xu, Y., Yang, F., Zhang, C., Qin, X., Huang, X., Chen, L., Lin, Q., Zhang, D., Rajmohan, S., and Zhang, Q., “Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments,” Findings of the Association for Computational Linguistics: ACL 2024, edited by L.-W. Ku, A. Martins, and V. Srikumar, Associatio...
-
[11]
The Traffic Alert and Collision Avoidance System (TCAS): Past, Present and Future,
Kuchar, J., and Wan, L. M., “The Traffic Alert and Collision Avoidance System (TCAS): Past, Present and Future,” Lincoln Laboratory Journal, Vol. 16, No. 2, 2007, pp. 14–32. Survey of airborne automated collision avoidance systems
-
[12]
Probability of conflict analysis of 3D aircraft flight based on two-level Markov chain approximation approach,
Al-Basman, M., and Hu, J., “Probability of conflict analysis of 3D aircraft flight based on two-level Markov chain approximation approach,” 2010 International Conference on Networking, Sensing and Control (ICNSC), IEEE, 2010, pp. 608–613
-
[13]
Next-generation airborne collision avoidance system,
Kochenderfer, M. J., Holland, J. E., and Chryssanthacopoulos, J. P., “Next-generation airborne collision avoidance system,” Lincoln Laboratory Journal, Vol. 19, No. 1, 2012, pp. 17–33
-
[14]
Kaona: Deep Searching and Curating Aviation Safety Reporting Systems,
Paradis, C. V., Hong, C., Matthews, B., Davies, M. D., and Hooey, B., “Kaona: Deep Searching and Curating Aviation Safety Reporting Systems,” AIAA SCITECH 2025 Forum, 2025, p. 2152. https://doi.org/10.2514/6.2025-2152, URL https://arc.aiaa.org/doi/abs/10.2514/6.2025-2152
-
[15]
CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management,
Abdulhak, S., Hubbard, W., Gopalakrishnan, K., and Li, M. Z., “CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management,” arXiv preprint arXiv:2402.14850, 2024. URL https://arxiv.org/abs/2402.14850
-
[16]
Autonomous Air Traffic Control for Non-Towered Airports,
Mahboubi, Z., and Kochenderfer, M. J., “Autonomous Air Traffic Control for Non-Towered Airports,” Proc. USA/Eur. Air Traffic Manage. Res. Develop. Seminar, 2015, pp. 1–6
-
[17]
Examining the Potential of Generative Language Models for Aviation Safety Analysis: Case Study and Insights Using the Aviation Safety Reporting System (ASRS),
Tikayat Ray, A., Bhat, A. P., White, R. T., Nguyen, V. M., Pinon Fischer, O. J., and Mavris, D. N., “Examining the Potential of Generative Language Models for Aviation Safety Analysis: Case Study and Insights Using the Aviation Safety Reporting System (ASRS),” Aerospace, Vol. 10, No. 9, 2023, p. 770
-
[18]
Towards an Aviation Large Language Model by Fine-tuning and Evaluating Transformers,
Nielsen, D., Clarke, S. S., and Kalyanam, K. M., “Towards an Aviation Large Language Model by Fine-tuning and Evaluating Transformers,” 2024 AIAA DATC/IEEE 43rd Digital Avionics Systems Conference (DASC), IEEE, 2024, pp. 1–5
-
[19]
AviationGPT: A Large Language Model for the Aviation Domain,
Wang, L., Chou, J., Tien, A., Zhou, X., and Baumgartner, D., “AviationGPT: A Large Language Model for the Aviation Domain,” AIAA AVIATION FORUM AND ASCEND 2024, 2024, p. 4250
-
[20]
LLaMA: Open and Efficient Foundation Language Models,
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al., “LLaMA: Open and Efficient Foundation Language Models,” arXiv preprint arXiv:2302.13971, 2023
-
[21]
WorkloadGPT: A Large Language Model Approach to Real-Time Detection of Pilot Workload,
Gao, Y., Yue, L., Sun, J., Shan, X., Liu, Y., and Wu, X., “WorkloadGPT: A Large Language Model Approach to Real-Time Detection of Pilot Workload,” Applied Sciences, Vol. 14, No. 18, 2024, p. 8274
-
[22]
Chain-of-Thought Flight Planner: End-to-End LLM Routing Under Wind Hazards,
Tabrizian, A., Ghazanfari, M., and Wei, P., “Chain-of-Thought Flight Planner: End-to-End LLM Routing Under Wind Hazards,” AIAA AVIATION FORUM AND ASCEND 2025, 2025, p. 3711
-
[23]
Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents,
Andriuškevičius, J., and Sun, J., “Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents,” arXiv preprint arXiv:2409.09717, 2024
-
[24]
A Flight Simulator Software for Visualization of 3-Dimensional Airspaces and Air Traffic Management,
Su, W., Kam, J., and Bulusu, D. V., “A Flight Simulator Software for Visualization of 3-Dimensional Airspaces and Air Traffic Management,” 2025 Regional Student Conferences, 2025, p. 97940
-
[25]
Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Untowered Airspace,
Sangeetha, S. V., Chiu, C.-Y., Li, S. H., and Kousik, S., “Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Untowered Airspace,” arXiv preprint arXiv:2509.14063, 2025
-
[26]
Aviation Weather Center, “AviationWeather.gov,” https://aviationweather.gov/, 2025. Accessed: 2025-11-07
-
[27]
Qwen2 Technical Report,
Team, Q., et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024
-
[28]
Mistral 7B,
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., and El Sayed, W., “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023
-
[29]
Gemma 2: Improving Open Language Models at a Practical Size,
Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al., “Gemma 2: Improving Open Language Models at a Practical Size,” arXiv preprint arXiv:2408.00118, 2024
-
[30]
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al., “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,” arXiv preprint arXiv:2507.06261, 2025
-
[31]
Chain-of-thought prompting elicits reasoning in large language models,
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 35, 2022, pp. 24824–24837
-
[32]
LLMs on a budget: System-level approaches to power-efficient and scalable fine-tuning,
Gogineni, K., Suvizi, A., and Venkataramani, G., “LLMs on a budget: System-level approaches to power-efficient and scalable fine-tuning,” IEEE Open Journal of the Computer Society, 2025
-
[33]
ASR Quality: We re-transcribe each scenario’s clean audio with three Whisper sizes, base (74M parameters), medium (769M), and large-v3 (1.55B; the default in the main experiments), and re-evaluate each open-source LLM on each transcript under all three prompting strategies. Figure 11 reports macro-F1 as a function of Whisper size for each open-source LLM under ze...
-
[34]
Audio Noise Robustness: We inject additive white Gaussian noise into each scenario’s clean audio at five noise-to-signal ratios (NSR): 5%, 10%, 25%, 50%, and 75%. The corrupted audio is transcribed with Whisper-Large-v3 (held fixed) and classified under zero-shot prompting only. Figure 13 reports macro-F1 as a function of noise-to-signal ratio (NSR) for ea...
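The NSR-controlled corruption in [34] can be sketched in plain Python; the power-based definition of NSR and the function name here are assumptions for illustration, since the paper excerpt does not specify them:

```python
import math
import random

def add_awgn(samples, nsr, rng):
    """Add white Gaussian noise scaled so noise power / signal power = nsr."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(nsr * signal_power)
    return [s + rng.gauss(0.0, noise_std) for s in samples]

# Corrupt a synthetic 440 Hz tone at the five NSR levels used in the sweep.
clean = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
for nsr in (0.05, 0.10, 0.25, 0.50, 0.75):
    noisy = add_awgn(clean, nsr, random.Random(0))
    print(f"NSR {nsr:.2f}: first sample {noisy[0]:+.4f}")
```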
-
[35]
Random fraction of words replaced with a fixed mask token
Transcript Text Masking: To probe tolerance for partial transcript loss directly, without running the audio chain through Whisper, we mask the ground-truth transcript at five rates (r ∈ {10, 20, 40, 60, 80}%) under two schemes. Random: a fraction of words replaced with a fixed mask token; LLMs still see the conversational structure but with gappy content. And utter...
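The random masking scheme in [35] reduces to a few lines; the '<mask>' token below is an assumption for illustration, since the excerpt does not specify the fixed mask token used:

```python
import random

def mask_words(transcript, rate, rng, mask="<mask>"):
    """Replace a random fraction of words with a fixed mask token."""
    words = transcript.split()
    n_masked = round(len(words) * rate)
    for i in rng.sample(range(len(words)), n_masked):
        words[i] = mask
    return " ".join(words)

transcript = "Half Moon Bay traffic Skyhawk one two three left downwind runway three zero"
for r in (0.10, 0.20, 0.40, 0.60, 0.80):
    print(f"{int(r * 100):2d}%:", mask_words(transcript, r, random.Random(0)))
```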