pith. machine review for the scientific record.

arxiv: 2605.12332 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

Alexandre Bayen, Amin Tabrizian, Jordan Kam, Mahyar Ghazanfari, Peng Wei, Torsten Darrell

Pith reviewed 2026-05-13 04:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords non-towered airports · CTAF communications · large language models · air traffic safety · hazard classification · METAR data · synthetic dataset evaluation · multimodal analysis

The pith

Large language models can classify air traffic safety risks from radio communications and weather data at non-towered airports with macro F1 scores above 0.85.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models might automate safety checks at airports without control towers, where pilots coordinate by self-announcing on a shared radio frequency. It combines these communications with weather data and other sources to spot potential hazards such as near mid-air collisions. Tests on a new synthetic dataset show that even open-source models perform strongly at distinguishing normal from dangerous situations, and a real-world example demonstrates the framework spotting a right-of-way violation. This suggests a path toward scalable post-flight analysis that could help reduce risk in these settings.

Core claim

We propose a general vision-language model approach to analyze the transcribed CTAF radio communications in natural language, METAR weather data, ADS-B flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. Even limited to CTAF and METAR inputs and open source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task.

What carries the argument

The multimodal LLM framework that processes natural language transcriptions of pilot self-announcements along with weather reports to perform hazard classification and safety assessment.
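As an illustration of what a text-only instance of that framework could look like, here is a minimal sketch of the plumbing around one classification call. The prompt wording, label set, parsing rule, transcript, and METAR string are all our own assumptions for illustration, not the authors' published setup; the actual LLM call is elided.

```python
# Hypothetical sketch: combine a CTAF transcript and a METAR report into one
# prompt, then map the model's free-text reply onto the binary label used in
# the paper's nominal/danger task. All strings here are invented examples.

NOMINAL, DANGER = "nominal", "danger"

def build_prompt(ctaf_transcript: str, metar: str) -> str:
    """Format the two text modalities into a single classification prompt."""
    return (
        "You are an aviation safety analyst reviewing a non-towered airport.\n"
        f"CTAF transcript:\n{ctaf_transcript}\n\n"
        f"METAR:\n{metar}\n\n"
        "Classify the situation as 'nominal' or 'danger'. Answer with one word."
    )

def parse_label(model_reply: str) -> str:
    """Map a free-text reply onto the binary label, defaulting to nominal."""
    return DANGER if DANGER in model_reply.lower() else NOMINAL

prompt = build_prompt(
    "N123AB, Half Moon Bay traffic, short final runway three zero. "
    "N456CD, Half Moon Bay traffic, departing runway one two.",
    "KHAF 121750Z 29008KT 10SM CLR 18/09 A3012",
)
# An LLM call on `prompt` would go here; we show only the surrounding plumbing.
print(parse_label("Danger: opposite-direction runway conflict."))  # danger
```

The point of the sketch is how little scaffolding the text-only core needs: everything model-specific lives in one prompt string and one parse rule.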

If this is right

  • Post-flight safety reviews at non-towered airports can be automated without constant human monitoring.
  • Open-source LLMs are sufficient for initial binary classification of safety risks from limited inputs.
  • The 12-category hazard taxonomy offers a structured approach for labeling and training models on common airport hazards.
  • Combining text inputs with trajectory and chart data in future extensions could provide more comprehensive analysis.
  • Qualitative success on real data indicates the framework can identify specific violations like right-of-way issues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Real-time deployment could warn pilots during flight if processing speeds improve.
  • Training on actual incident data might better handle ambiguous communications that the synthetic set approximates.
  • Wider use could inform airport design or pilot training by revealing frequent communication patterns that lead to danger.
  • Similar techniques might apply to other domains with self-coordinated traffic, such as maritime or drone operations.

Load-bearing premise

The synthetic dataset derived from real examples sufficiently captures the variability, ambiguity, and edge cases of actual CTAF communications and hazard scenarios so that model performance generalizes to real-world use.

What would settle it

Collecting and labeling a dataset of several hundred real CTAF radio transcripts from non-towered airports, then measuring whether the models maintain a macro F1 score above 0.7 on the nominal/danger task when evaluated against expert annotations.
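The settlement test above hinges on macro F1, which for the binary task is just the unweighted mean of the per-class F1 scores, so the rarer "danger" class counts as much as the common "nominal" class. A minimal stdlib sketch (the gold and predicted labels below are invented):

```python
# Macro-F1 for the binary nominal/danger classification task.

def f1(labels, preds, positive):
    """Per-class F1: harmonic mean of precision and recall for one class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(labels, preds):
    # Unweighted average over both classes.
    return (f1(labels, preds, "nominal") + f1(labels, preds, "danger")) / 2

gold = ["nominal", "nominal", "danger", "danger", "nominal", "danger"]
pred = ["nominal", "danger",  "danger", "danger", "nominal", "nominal"]
print(round(macro_f1(gold, pred), 3))  # → 0.667
```

Against expert annotations on real transcripts, this is the quantity that would need to stay above 0.7.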

read the original abstract

We investigate frameworks for post-flight safety analysis at non-towered airports using large language models (LLMs). Non-towered airports rely on the Common Traffic Advisory Frequency (CTAF) for air traffic coordination and experience frequent near mid-air collisions due to the pilot self-announcement communication protocol. We propose a general vision-language model (VLM) approach to analyze the transcribed CTAF radio communications in natural language, METeorological Aerodrome Report (METAR) weather data, Automatic Dependent Surveillance-Broadcast (ADS-B) flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. We qualitatively evaluate our framework on real flight data using Gemini 2.5 Pro, demonstrating accurate identification of a right-of-way violation. The synthetic dataset is derived from real examples and includes a 12-category hazard taxonomy, and is used to benchmark three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLM models on the subset of inputs related to CTAF and METAR. Even limited to CTAF and METAR inputs and open source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task. Future work includes a quantitative evaluation across all modalities and a larger number of real world examples. Taken together, our results suggest that VLM analysis of safety at non-towered airports may be a valuable future capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes a vision-language model framework for post-flight safety analysis at non-towered airports by processing transcribed CTAF radio communications, METAR weather reports, ADS-B trajectories, and sectional charts. It presents a qualitative case study on real data from Half Moon Bay Airport using Gemini 2.5 Pro to detect a right-of-way violation, alongside quantitative benchmarking of six LLMs (three open-source, three closed-source) on a new synthetic CTAF+METAR dataset derived from real examples and organized by a 12-category hazard taxonomy. The central quantitative result is that even open-source LLMs restricted to CTAF and METAR inputs achieve macro F1 scores above 0.85 on a binary nominal/danger classification task. The work is framed as preliminary, with future directions including full multimodal quantitative evaluation and more real-world examples.

Significance. If the synthetic results generalize and the multimodal components prove effective, the approach could enable scalable, automated post-hoc safety assessment at the many non-towered airports where near mid-air collisions remain a documented risk due to self-announcement protocols. The introduction of a structured hazard taxonomy and systematic benchmarking across model families are positive contributions to an emerging application area. However, the current evidence base is limited to synthetic data for the quantitative claims and a single unquantified real-world example, so the practical significance for operational safety assessment remains prospective rather than demonstrated.

major comments (3)
  1. [§4] §4 (Synthetic Dataset): The construction of the synthetic CTAF+METAR dataset is described only at a high level as 'derived from real examples.' No details are provided on how variability in pilot phrasing, clipped transmissions, overlapping calls, radio artifacts, or combined hazards are (or are not) modeled. Because the macro F1 > 0.85 result on the binary classification task rests entirely on this dataset, the absence of these specifics prevents assessment of whether the reported performance supports generalization to actual CTAF traffic.
  2. [§5] §5 (Quantitative Evaluation): All reported F1 scores are obtained on CTAF and METAR inputs only, despite the abstract and introduction framing the contribution as a general VLM approach that also incorporates ADS-B trajectories and sectional charts. No ablation studies, full-modality results, or even preliminary metrics on the additional modalities are presented, so the broader claim that 'VLM analysis of safety at non-towered airports may be a valuable future capability' is not yet quantitatively supported.
  3. [Real-world case study] Real-world case study (Section 6 or equivalent): Only a single qualitative example is shown, with no quantitative metrics, no error analysis, and no description of the exact input transcription or model prompt used. This single instance cannot substantiate robustness claims when real CTAF data contain the very ambiguities (overlaps, variable phrasing, artifacts) that the synthetic dataset may omit.
minor comments (3)
  1. [Abstract and §5] The model names 'GPT-5.4' and 'Claude Sonnet 4.6' appear non-standard; clarify whether these are specific versions, placeholders, or typos.
  2. [§5] Prompt templates and exact classification instructions used for the LLMs are not reproduced in the main text or appendix, hindering reproducibility of the F1 results.
  3. [Figures] Figure captions for any qualitative examples should explicitly state the input modalities shown and the ground-truth hazard label.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our preliminary study. We address each major comment below, clarifying the scope of the current work while committing to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: §4 (Synthetic Dataset): The construction of the synthetic CTAF+METAR dataset is described only at a high level as 'derived from real examples.' No details are provided on how variability in pilot phrasing, clipped transmissions, overlapping calls, radio artifacts, or combined hazards are (or are not) modeled. Because the macro F1 > 0.85 result on the binary classification task rests entirely on this dataset, the absence of these specifics prevents assessment of whether the reported performance supports generalization to actual CTAF traffic.

    Authors: We agree that additional details on dataset construction are needed to support evaluation of generalization. In the revised manuscript, Section 4 will be expanded to describe the generation process explicitly: variability in pilot phrasing was introduced by paraphrasing real CTAF examples while preserving intent; clipped transmissions and radio artifacts were simulated via text truncation and insertion of common noise markers (e.g., 'static', partial words); overlapping calls were modeled by concatenating multiple transmissions with explicit speaker indicators; and combined hazards appear in a subset of examples to test multi-hazard detection. These choices were made to balance realism with controlled evaluation. We will also release the full generation script and a sample of the dataset to enable independent assessment. revision: yes

  2. Referee: §5 (Quantitative Evaluation): All reported F1 scores are obtained on CTAF and METAR inputs only, despite the abstract and introduction framing the contribution as a general VLM approach that also incorporates ADS-B trajectories and sectional charts. No ablation studies, full-modality results, or even preliminary metrics on the additional modalities are presented, so the broader claim that 'VLM analysis of safety at non-towered airports may be a valuable future capability' is not yet quantitatively supported.

    Authors: The manuscript is explicitly positioned as preliminary, with the abstract and Section 5 stating that quantitative results are limited to CTAF+METAR inputs to establish a strong baseline. The broader VLM framing (including ADS-B and charts) is proposed as a general framework whose full quantitative evaluation is listed as future work. The reported macro F1 > 0.85 even with restricted inputs is intended to demonstrate feasibility of the text-based core. To address the concern, we will add explicit language in the introduction and conclusion reiterating that no full-modality quantitative claims are made and that multimodal benchmarking remains future work. No new experiments are added at this stage. revision: partial

  3. Referee: Real-world case study (Section 6 or equivalent): Only a single qualitative example is shown, with no quantitative metrics, no error analysis, and no description of the exact input transcription or model prompt used. This single instance cannot substantiate robustness claims when real CTAF data contain the very ambiguities (overlaps, variable phrasing, artifacts) that the synthetic dataset may omit.

    Authors: We acknowledge that the real-world case study is qualitative and limited to one instance, consistent with the preliminary framing of the paper. In the revision, we will append the exact Gemini 2.5 Pro prompt used and the full transcribed CTAF excerpt (with METAR) for the Half Moon Bay right-of-way example. We will also include a short error analysis paragraph discussing observed ambiguities in the real transmission and how the model handled them. Quantitative metrics on real data are not feasible without a larger labeled corpus, which is identified as future work. revision: partial
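The corruption operations described in the first response — clipped transmissions, inserted noise markers, and overlapping calls with explicit speaker indicators — reduce to simple text transforms. The sketch below is our own illustration of that scheme; the function names, the '[static]' marker, and the example call are assumptions, not the authors' released generator.

```python
# Toy perturbation operators for synthetic CTAF transcripts.
import random

def clip(transmission: str, keep_fraction: float) -> str:
    """Simulate a clipped transmission by dropping trailing words."""
    words = transmission.split()
    keep = max(1, int(len(words) * keep_fraction))
    return " ".join(words[:keep])

def add_static(transmission: str, rate: float, rng: random.Random) -> str:
    """Insert '[static]' noise markers between words at the given rate."""
    out = []
    for word in transmission.split():
        out.append(word)
        if rng.random() < rate:
            out.append("[static]")
    return " ".join(out)

def overlap(first: str, second: str) -> str:
    """Model overlapping calls via explicit speaker indicators."""
    return f"[SPEAKER 1] {first} [SPEAKER 2, simultaneous] {second}"

rng = random.Random(0)
call = ("Half Moon Bay traffic, Skyhawk one two three alpha bravo, "
        "left downwind runway three zero")
print(clip(call, 0.5))
print(add_static(call, 0.3, rng))
print(overlap("short final runway three zero", "departing runway one two"))
```

Whether such transforms capture real radio artifacts is exactly the referee's open question; the sketch only makes the authors' described recipe concrete.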

Circularity Check

0 steps flagged

No circularity: empirical benchmarking on synthetic dataset with no derivations or fitted predictions

full rationale

The paper is an empirical study that constructs a synthetic CTAF/METAR dataset from real examples, defines a 12-category hazard taxonomy, and directly measures macro F1 scores of several LLMs on a binary nominal/danger task. No equations, first-principles derivations, parameter fitting, or predictions appear; the reported performance numbers are computed outputs on the explicitly constructed test set rather than quantities forced by construction from fitted inputs. The single qualitative real-world case study is presented separately and carries no quantitative claim. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claims therefore remain independent of any internal reduction and are self-contained as a benchmarking exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the domain assumption that current LLMs can reliably parse and classify domain-specific aviation language and weather data without fine-tuning or additional grounding.

axioms (1)
  • domain assumption Large language models can accurately interpret and classify aviation-specific natural language communications and weather reports for hazard detection.
    The reported F1 scores and qualitative identification rest on this capability holding for the given inputs.

pith-pipeline@v0.9.0 · 5635 in / 1385 out tokens · 154096 ms · 2026-05-13T04:20:42.871217+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 5 internal anchors

  1. [1]

    The Complexity Construct in Air Traffic Control: A Review and Synthesis of the Literature,

    Mogford, R. H., Guttman, J. A., Morrow, S. L., and Kopardekar, P., “The Complexity Construct in Air Traffic Control: A Review and Synthesis of the Literature,” Tech. Rep. DOT/FAA/CT-TN95/22, U.S. Department of Transportation, Federal Aviation Administration, 1995

  2. [2]

    Advisory Circular 90-66C: Non-Towered Airport Flight Operations,

    Federal Aviation Administration, “Advisory Circular 90-66C: Non-Towered Airport Flight Operations,” Tech. rep., U.S. Department of Transportation, Federal Aviation Administration, June 2023. URL https://www.faa.gov/documentlibrary/media/advisory_circular/ac_90-66c.pdf

  3. [3]

    ASRS Database Report Set: Non-Tower Airport Incidents,

    National Aeronautics and Space Administration (NASA), Ames Research Center, “ASRS Database Report Set: Non-Tower Airport Incidents,” Technical Memorandum, ASRS Report Set No. 39 TH: 262-7, NASA Ames Research Center, Jun. 2025. URL https://asrs.arc.nasa.gov/docs/rpsts/non_twr.pdf, update No. 39, June 12, 2025

  4. [4]

    In-Time Aviation Safety Management: Challenges and Research for an Evolving Aviation System,

    National Academies of Sciences, Engineering, and Medicine, “In-Time Aviation Safety Management: Challenges and Research for an Evolving Aviation System,” Tech. rep., The National Academies Press, 2022. URL https://www.faa.gov/sites/faa.gov/files/2022-05/508.NationalAcademyReport.pdf; the section “System Monitoring – Data Fusion, Completeness, and Quality” points out...

  5. [5]

    Attention Is All You Need,

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I., “Attention Is All You Need,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Curran Associates, Inc., 2017, pp. 5998–6008. URL https://proc...

  6. [6]

    Recent Advances in Speech Language Models: A Survey,

    Cui, W., Yu, D., Jiao, X., Meng, Z., Zhang, G., Wang, Q., Guo, Y., and King, I., “Recent Advances in Speech Language Models: A Survey,” Proceedings of ACL (Long Paper), 2025. URL https://aclanthology.org/2025.acl-long.682.pdf, preprint available at arXiv

  7. [7]

    A Survey on Multimodal Large Language Models,

    Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., and Chen, E., “A Survey on Multimodal Large Language Models,” National Science Review, Vol. 11, No. 12, 2024, p. nwae403. https://doi.org/10.1093/nsr/nwae403

  8. [8]

    Neural Machine Translation: A Review,

    Stahlberg, F., “Neural Machine Translation: A Review,” Journal of Artificial Intelligence Research (JAIR), 2020. URL https://jair.org/index.php/jair/article/download/12007/26611/24616

  9. [9]

    Multi-Step Reasoning with Large Language Models, a Survey,

    Plaat, A., Wong, A., Verberne, S., Broekens, J., van Stein, N., and Bäck, T., “Multi-Step Reasoning with Large Language Models, a Survey,” arXiv preprint arXiv:2407.11511, 2024. URL https://arxiv.org/abs/2407.11511

  10. [10]

    Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments,

    Cheng, S., Zhuang, Z., Xu, Y., Yang, F., Zhang, C., Qin, X., Huang, X., Chen, L., Lin, Q., Zhang, D., Rajmohan, S., and Zhang, Q., “Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments,” Findings of the Association for Computational Linguistics: ACL 2024, edited by L.-W. Ku, A. Martins, and V. Srikumar, Associatio...

  11. [11]

    The Traffic Alert and Collision Avoidance System (TCAS): Past, Present and Future,

    Kuchar, J., and Wan, L. M., “The Traffic Alert and Collision Avoidance System (TCAS): Past, Present and Future,” Lincoln Laboratory Journal, Vol. 16, No. 2, 2007, pp. 14–32. Survey of airborne automated collision avoidance systems

  12. [12]

    Probability of conflict analysis of 3D aircraft flight based on two-level Markov chain approximation approach,

    Al-Basman, M., and Hu, J., “Probability of conflict analysis of 3D aircraft flight based on two-level Markov chain approximation approach,” 2010 International Conference on Networking, Sensing and Control (ICNSC), IEEE, 2010, pp. 608–613

  13. [13]

    Next-generation airborne collision avoidance system,

    Kochenderfer, M. J., Holland, J. E., and Chryssanthacopoulos, J. P., “Next-generation airborne collision avoidance system,” Lincoln Laboratory Journal, Vol. 19, No. 1, 2012, pp. 17–33

  14. [14]

    Kaona: Deep Searching and Curating Aviation Safety Reporting Systems,

    Paradis, C. V., Hong, C., Matthews, B., Davies, M. D., and Hooey, B., “Kaona: Deep Searching and Curating Aviation Safety Reporting Systems,” AIAA SCITECH 2025 Forum, 2025, p. 2152. https://doi.org/10.2514/6.2025-2152, URL https://arc.aiaa.org/doi/abs/10.2514/6.2025-2152

  15. [15]

    CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management,

    Abdulhak, S., Hubbard, W., Gopalakrishnan, K., and Li, M. Z., “CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management,” arXiv preprint arXiv:2402.14850, 2024. URL https://arxiv.org/abs/2402.14850

  16. [16]

    Autonomous Air Traffic Control for Non-Towered Airports,

    Mahboubi, Z., and Kochenderfer, M. J., “Autonomous Air Traffic Control for Non-Towered Airports,” Proc. USA/Eur. Air Traffic Manage. Res. Develop. Seminar, 2015, pp. 1–6

  17. [17]

    Examining the Potential of Generative Language Models for Aviation Safety Analysis: Case Study and Insights Using the Aviation Safety Reporting System (ASRS),

    Tikayat Ray, A., Bhat, A. P., White, R. T., Nguyen, V. M., Pinon Fischer, O. J., and Mavris, D. N., “Examining the Potential of Generative Language Models for Aviation Safety Analysis: Case Study and Insights Using the Aviation Safety Reporting System (ASRS),” Aerospace, Vol. 10, No. 9, 2023, p. 770

  18. [18]

    Towards an Aviation Large Language Model by Fine-tuning and Evaluating Transformers,

    Nielsen, D., Clarke, S. S., and Kalyanam, K. M., “Towards an Aviation Large Language Model by Fine-tuning and Evaluating Transformers,” 2024 AIAA DATC/IEEE 43rd Digital Avionics Systems Conference (DASC), IEEE, 2024, pp. 1–5

  19. [19]

    AviationGPT: A Large Language Model for the Aviation Domain,

    Wang, L., Chou, J., Tien, A., Zhou, X., and Baumgartner, D., “AviationGPT: A Large Language Model for the Aviation Domain,” AIAA AVIATION FORUM AND ASCEND 2024, 2024, p. 4250

  20. [20]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al., “LLaMA: Open and Efficient Foundation Language Models,” arXiv preprint arXiv:2302.13971, 2023

  21. [21]

    WorkloadGPT: A Large Language Model Approach to Real-Time Detection of Pilot Workload,

    Gao, Y., Yue, L., Sun, J., Shan, X., Liu, Y., and Wu, X., “WorkloadGPT: A Large Language Model Approach to Real-Time Detection of Pilot Workload,” Applied Sciences, Vol. 14, No. 18, 2024, p. 8274

  22. [22]

    Chain-of-Thought Flight Planner: End-to-End LLM Routing Under Wind Hazards,

    Tabrizian, A., Ghazanfari, M., and Wei, P., “Chain-of-Thought Flight Planner: End-to-End LLM Routing Under Wind Hazards,” AIAA AVIATION FORUM AND ASCEND 2025, 2025, p. 3711

  23. [23]

    Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents,

    Andriuškevičius, J., and Sun, J., “Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents,” arXiv preprint arXiv:2409.09717, 2024

  24. [24]

    A Flight Simulator Software for Visualization of 3-Dimensional Airspaces and Air Traffic Management,

    Su, W., Kam, J., and Bulusu, D. V., “A Flight Simulator Software for Visualization of 3-Dimensional Airspaces and Air Traffic Management,” 2025 Regional Student Conferences, 2025, p. 97940

  25. [25]

    Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Untowered Airspace,

    Sangeetha, S. V., Chiu, C.-Y., Li, S. H., and Kousik, S., “Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Untowered Airspace,” arXiv preprint arXiv:2509.14063, 2025

  26. [26]

    AviationWeather.gov,

    Aviation Weather Center, “AviationWeather.gov,” https://aviationweather.gov/, 2025. Accessed: 2025-11-07

  27. [27]

    Qwen2 Technical Report

    Team, Q., et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024

  28. [28]

    Mistral 7B

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., and El Sayed, W., “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023

  29. [29]

    Gemma 2: Improving Open Language Models at a Practical Size

    Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al., “Gemma 2: Improving Open Language Models at a Practical Size,” arXiv preprint arXiv:2408.00118, 2024

  30. [30]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al., “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,” arXiv preprint arXiv:2507.06261, 2025

  31. [31]

    Chain-of-thought prompting elicits reasoning in large language models,

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 35, 2022, pp. 24824–24837

  32. [32]

    LLMs on a budget: System-level approaches to power-efficient and scalable fine-tuning,

    Gogineni, K., Suvizi, A., and Venkataramani, G., “LLMs on a budget: System-level approaches to power-efficient and scalable fine-tuning,” IEEE Open Journal of the Computer Society, 2025

  33. [33]

    Figure 11 reports macro-F1 as a function of Whisper size for each open-source LLM under zero-shot, one-shot, and few-shot prompting

    ASR Quality. We re-transcribe each scenario’s clean audio with three Whisper sizes—base (74M parameters), medium (769M), and large-v3 (1.55B; the default in the main experiments)—and re-evaluate each open-source LLM on each transcript under all three prompting strategies. Figure 11 reports macro-F1 as a function of Whisper size for each open-source LLM under ze...

  34. [34]

    The corrupted audio is transcribed with Whisper-Large-v3 (held fixed) and classified under zero-shot prompting only

    Audio Noise Robustness. We inject additive white Gaussian noise into each scenario’s clean audio at five noise-to-signal ratios (NSR): 5%, 10%, 25%, 50%, and 75%. The corrupted audio is transcribed with Whisper-Large-v3 (held fixed) and classified under zero-shot prompting only. Figure 13 reports macro-F1 as a function of noise-to-signal ratio (NSR) for ea...

  35. [35]

    Random fraction of words replaced with a fixed mask token

    Transcript Text Masking. To probe tolerance for partial transcript loss directly—without running the audio chain through Whisper—we mask the ground-truth transcript at five rates (r ∈ {10, 20, 40, 60, 80}%) under two schemes. Random: fraction of words replaced with a fixed mask token, so LLMs still see the conversational structure but with gappy content. And utter...