pith. machine review for the scientific record.

arxiv: 2604.21715 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

Automated LTL Specification Generation from Industrial Aerospace Requirements

Bin Gu, Cheng Wen, Cong Tian, Mengfei Yang, Rui Chen, Shengchao Qin, Xiao Liang, Zhi Ma

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:09 UTC · model grok-4.3

classification 💻 cs.SE
keywords LTL generation · aerospace requirements · large language models · formal specification · requirements engineering · safety-critical systems · natural language to logic · verification automation

The pith

AeroReq2LTL uses a data dictionary and templates to let LLMs turn real aerospace requirements into correct LTL specifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that a targeted LLM system can reliably translate complex natural-language aerospace requirements into Linear Temporal Logic formulas that verification tools can use immediately. Manual translation currently demands rare expertise in both domain engineering and formal methods and often introduces errors that affect safety-critical software. By first normalizing technical terms into atomic propositions and making implicit timing and logic explicit through templates, the framework reduces the expertise barrier. If the reported performance holds, it would let engineers produce verifiable properties directly from existing requirement documents.

Core claim

AeroReq2LTL automates LTL property generation for aerospace requirements by feeding normalized inputs to large language models. A data dictionary converts domain jargon into precise atomic propositions, while a template-based requirement language surfaces temporal cues and logical relations that would otherwise remain implicit. On an industrial dataset the system produces LTL formulas at 85% precision and 88% recall, and the outputs are directly consumable by existing verification tools without further manual editing.

What carries the argument

The two-stage normalization step that first replaces technical terms via a data dictionary and then rewrites requirements in a template language to expose timing and logical structure before LLM translation.
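The two normalization stages can be sketched in miniature. Everything below is invented for illustration: the dictionary entries, the template pattern, and the example requirement are not from the paper, and a regex stands in for the LLM that performs the actual translation step.

```python
import re

# Stage 1 (assumed shape): a data dictionary mapping domain jargon to
# atomic propositions. These entries are hypothetical.
DATA_DICTIONARY = {
    "the attitude control thruster fires": "thruster_firing",
    "the safe-hold mode": "safe_hold",
}

# Stage 2 (assumed shape): templates that surface implicit temporal
# structure, e.g. "Whenever p, q shall eventually hold" -> G(p -> F q).
TEMPLATES = [
    (re.compile(r"whenever (\w+), (\w+) shall eventually hold"),
     r"G(\1 -> F \2)"),
]

def normalize(requirement):
    """Stage 1: substitute dictionary terms with atomic propositions."""
    text = requirement.lower()
    for term, prop in DATA_DICTIONARY.items():
        text = text.replace(term, prop)
    return text

def translate(requirement):
    """Stages 1+2, then translation; a regex stands in for the LLM."""
    text = normalize(requirement)
    for pattern, ltl in TEMPLATES:
        match = pattern.search(text)
        if match:
            return match.expand(ltl)
    return None  # requirement not covered by any template

req = ("Whenever the attitude control thruster fires, "
       "the safe-hold mode shall eventually hold.")
print(translate(req))  # G(thruster_firing -> F safe_hold)
```

The point of the sketch is the ordering: jargon is grounded before structure is exposed, so the final translation step only ever sees explicit propositions and explicit temporal cues.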

If this is right

  • Generated LTL formulas can be passed unchanged to standard model checkers and verification tools used in aerospace certification.
  • The volume of manual effort and the risk of transcription errors in safety-property formalization both decrease.
  • General-purpose NL-to-LTL systems become unnecessary once domain-specific normalization is applied first.
  • Requirement documents written in the template language become reusable assets that support both human review and automated verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same dictionary-plus-template pattern could be replicated in other regulated domains such as automotive or medical device software where natural-language requirements must feed into formal analysis.
  • Extending the template set to cover probabilistic or continuous-time relations would test whether the approach scales beyond discrete LTL.
  • Automatic extraction of dictionary entries from existing glossaries might further reduce the manual setup cost.
  • Integration with requirement-management platforms could allow the translation step to run whenever a requirement is updated.

Load-bearing premise

The data dictionary and templates are assumed to capture every relevant temporal or logical relation present in industrial aerospace requirements without omission or distortion.

What would settle it

Apply the framework to a fresh collection of aerospace requirements whose timing constraints are not covered by the existing templates and check whether the generated LTL still matches independent expert translations at comparable precision.
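A minimal scoring harness for that experiment might look like the following. Whitespace-insensitive string equality is an assumed stand-in for the paper's unspecified matching criterion (a real study would use semantic equivalence checking on the LTL formulas), and the formulas and counts are invented.

```python
def canon(formula):
    """Whitespace-insensitive canonical form; an assumed stand-in for a
    genuine semantic-equivalence check on LTL formulas."""
    return "".join(formula.split())

def score(generated, ground_truth):
    """Precision/recall over aligned lists; None marks 'no formula produced'."""
    pairs = list(zip(generated, ground_truth))
    correct = sum(1 for g, t in pairs
                  if g is not None and canon(g) == canon(t))
    produced = sum(1 for g, _ in pairs if g is not None)
    precision = correct / produced if produced else 0.0
    recall = correct / len(pairs) if pairs else 0.0
    return precision, recall

# Invented example: 4 requirements, 3 formulas produced, 2 correct.
generated    = ["G(p -> F q)", "G (p -> F q)", None,  "F r"]
ground_truth = ["G(p -> F q)", "G(p -> F q)",  "G s", "G r"]
precision, recall = score(generated, ground_truth)
print(precision, recall)  # precision 2/3, recall 1/2
```

Whether the 85%/88% figures survive on out-of-template requirements then reduces to running this comparison against independent expert translations.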

Figures

Figures reproduced from arXiv: 2604.21715 by Bin Gu, Cheng Wen, Cong Tian, Mengfei Yang, Rui Chen, Shengchao Qin, Xiao Liang, Zhi Ma.

Figure 1: The complexity and accuracy in NL requirement and its specification.
Figure 2: Workflow of AeroReq2LTL.
Figure 3: Examples of requirement statements, corresponding …
Figure 4: Accuracy of SpaceKG (confusion matrix of prediction vs. ground truth over the requirement categories State Change, Bound Check, Work-Mode Change, Command Process, State Respond, State Maintenance, Mode Invariant, and Non-sequential).
read the original abstract

In the development and verification of safety-critical aerospace software, Linear Temporal Logic (LTL) has been widely used to specify complex system properties derived from requirements. However, a significant gap remains in industrial practice: translating natural language (NL) requirements into formal LTL properties is a labor-intensive and error-prone process that requires rare expertise in both aerospace control engineering and formal methods. While recent NL-to-LTL tools (e.g., NL2SPEC, NL2TL, NL2LTL) are capable of automating parts of this process, they often fail on real requirement documents in industrial settings, due to complex domain terminology or implicit temporal and logical structure. To address these challenges, we present AeroReq2LTL, a framework that automates LTL property generation for aerospace requirements using large language models (LLMs), with two key industrial innovations: (i) a data dictionary that normalizes technical jargon into precise atomic propositions; and (ii) a template-based requirement language that makes temporal cues and logical relations explicit before translation. On a real aerospace dataset, AeroReq2LTL achieves 85% precision and 88% recall in LTL generation, and its outputs can be directly consumed by existing verification tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents AeroReq2LTL, an LLM-based framework for automatically generating LTL specifications from natural-language aerospace requirements. It introduces two industrial adaptations—a data dictionary to normalize domain jargon into atomic propositions and a template-based requirement language to make temporal cues and logical relations explicit—before performing the LLM translation step. The central empirical claim is that the framework achieves 85% precision and 88% recall on a real aerospace dataset and produces outputs directly consumable by existing verification tools.

Significance. If the reported metrics are shown to hold under a transparent evaluation protocol that includes raw industrial inputs, the work would meaningfully reduce the expertise barrier for formalizing safety-critical aerospace requirements, extending beyond prior NL-to-LTL tools that struggle with domain terminology and implicit structure.

major comments (2)
  1. [Abstract] The 85% precision and 88% recall figures are presented without any information on dataset size, the procedure used to obtain ground-truth LTL formulas, the precise evaluation protocol (e.g., matching criteria, handling of partial matches), or comparisons against baselines such as NL2SPEC or NL2TL. These omissions make the central empirical claim impossible to assess.
  2. [Abstract and Evaluation section] The framework description states that the template-based language is used to make temporal and logical relations explicit 'before translation.' It is therefore unclear whether the reported precision/recall were measured on raw natural-language requirements or on inputs that had already been rewritten into the template language. If the latter, the metrics only validate the LLM translation step under idealized conditions and do not substantiate the claim that the full pipeline correctly normalizes implicit relations without loss.
minor comments (1)
  1. [Abstract] The abstract refers to 'existing verification tools' without naming them or showing an example of direct consumption; adding a brief concrete illustration would improve clarity.
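As an illustration of what "direct consumption" could mean in practice, the sketch below wraps a generated formula in a NuSMV-style module skeleton via an LTLSPEC declaration. The module, variable names, and formula are invented for this example; the paper's actual target tools are not named in the material above.

```python
def to_smv(atoms, ltl_formula):
    """Emit a minimal NuSMV-style module that declares the atomic
    propositions as free booleans and attaches the generated property
    as an LTLSPEC declaration."""
    lines = ["MODULE main", "VAR"]
    lines += [f"  {atom} : boolean;" for atom in atoms]
    lines.append(f"LTLSPEC {ltl_formula}")
    return "\n".join(lines)

print(to_smv(["thruster_firing", "safe_hold"],
             "G (thruster_firing -> F safe_hold)"))
```

A concrete artifact of this shape, checked by the named tool, would answer the minor comment directly.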

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to improve clarity and completeness of the evaluation description.

read point-by-point responses
  1. Referee: [Abstract] The 85% precision and 88% recall figures are presented without any information on dataset size, the procedure used to obtain ground-truth LTL formulas, the precise evaluation protocol (e.g., matching criteria, handling of partial matches), or comparisons against baselines such as NL2SPEC or NL2TL. These omissions make the central empirical claim impossible to assess.

    Authors: We agree that the abstract, due to length constraints, omits key details needed to assess the empirical results. The Evaluation section provides the dataset size, the ground-truth creation process involving domain experts, the exact matching criteria for LTL formulas, and the handling of partial matches. However, to make the central claim assessable from the abstract alone, we will expand it with a concise summary of the dataset and protocol. We will also ensure the Evaluation section includes direct comparisons against NL2SPEC and NL2TL on the same dataset. revision: yes

  2. Referee: [Abstract and Evaluation section] The framework description states that the template-based language is used to make temporal and logical relations explicit 'before translation.' It is therefore unclear whether the reported precision/recall were measured on raw natural-language requirements or on inputs that had already been rewritten into the template language. If the latter, the metrics only validate the LLM translation step under idealized conditions and do not substantiate the claim that the full pipeline correctly normalizes implicit relations without loss.

    Authors: The reported metrics reflect the full AeroReq2LTL pipeline applied directly to raw natural-language requirements from the industrial dataset. The data dictionary and template-based language are automated components of the framework that process the raw inputs to normalize terminology and surface implicit relations before the LLM translation step; no manual pre-rewriting of requirements into templates was performed for the evaluation. We will revise the abstract and Evaluation section to state this explicitly and describe the end-to-end application on raw inputs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluation on external dataset with no self-referential reductions.

full rationale

The paper describes an LLM-based pipeline (AeroReq2LTL) that incorporates a data dictionary and template language as preprocessing steps before translation to LTL. The central claims consist of reported precision/recall metrics on a real aerospace dataset plus compatibility with existing verification tools. These are empirical outcomes, not mathematical derivations, fitted parameters renamed as predictions, or results forced by self-citation chains. No equations appear, and the evaluation does not reduce by construction to the inputs; the pipeline remains independently testable against the stated dataset and external tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the empirical effectiveness of LLMs after domain preprocessing rather than new mathematical axioms or invented entities.

axioms (1)
  • domain assumption: Large language models can reliably produce correct LTL once domain jargon is normalized and temporal cues are made explicit via templates.
    Central to the translation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1215 out tokens · 39393 ms · 2026-05-10T03:09:32.834740+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1] Kristin Y. Rozier. Linear temporal logic symbolic model checking. Computer Science Review, 5(2):163–203, 2011.

  2. [2] Shmuel Merhav. Aerospace Sensor Systems and Applications. Springer Science & Business Media, 1998.

  3. [3] Andreas Bauer, Martin Leucker, and Christian Schallhart. Runtime verification for LTL and TLTL. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(4):1–64, 2011.

  4. [4] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.

  5. [5] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025.

  6. [6] Cheng Wen, Yuandao Cai, Bin Zhang, Jie Su, Zhiwu Xu, Dugang Liu, Shengchao Qin, Zhong Ming, and Tian Cong. Automatically inspecting thousands of static bug warnings with large language model: How far are we? ACM Transactions on Knowledge Discovery from Data, 18(7):1–34, 2024.

  7. [7] Jie Su, Liansai Deng, Cheng Wen, Shengchao Qin, and Cong Tian. CFStra: Enhancing configurable program analysis through LLM-driven strategy selection based on code features. In International Symposium on Theoretical Aspects of Software Engineering, pages 374–391. Springer, 2024.

  8. [8] Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. nl2spec: Interactively translating unstructured natural language to temporal logics with large language models. In Proceedings of the 35th International Conference on Computer Aided Verification, volume 13965 of Lecture Notes in Computer Science, pages 383–396, …

  9. [9] Yongchao Chen, Rujul Gandhi, Yang Zhang, and Chuchu Fan. NL2TL: Transforming natural languages to temporal logics using large language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 15880–15903, Singapore, 2023. Association for Computational Linguistics.

  10. [10] Francesco Fuggitti and Tathagata Chakraborti. NL2LTL – a Python package for converting natural language (NL) instructions to linear temporal logic (LTL) formulas. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, the 35th Conference on Innovative Applications of Artificial Intelligence, the 13th Symposium on Educational Advances i…

  11. [11] Joshua Ian Bernstein. Design methods in the aerospace industry: looking for evidence of set-based practices. PhD thesis, Massachusetts Institute of Technology, 1998.

  12. [12] Jialun Cao, Yaojie Lu, Meiziniu Li, Haoyang Ma, Haokun Li, Mengda He, Cheng Wen, Le Sun, Hongyu Zhang, Shengchao Qin, et al. From informal to formal – incorporating and evaluating LLMs on natural language requirements to verifiable formal proofs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper…

  13. [13] Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. Enchanting program specification synthesis by large language models using static analysis and program verification. In International Conference on Computer Aided Verification, pages 302–328. Springer, 2024.

  14. [14] Zhi Ma, Cheng Wen, Bin Yu, and Jie Su. Integrating ensemble learning and large language models for efficient formal verification of IP-based aerospace systems. Information Fusion, 125:103466, 2026.

  15. [15] Shaolin Zhang, HongJin Liu, Zhi Ma, Xiao Liang, and Cheng Wen. Formal verification of aerospace software IP components: A multi-tool case study. In Proceedings of the 2026 6th International Conference on Computer Network Security and Software Engineering, pages 1–9, 2026.

  16. [16] Zhi Ma, Cheng Wen, Jie Su, Ming Zhao, Bin Yu, Xu Lu, and Cong Tian. Towards practical requirement analysis and verification: A case study on software IP components in aerospace embedded systems. arXiv preprint arXiv:2404.00795, 2024.

  17. [17] OpenAI. GPT-4o technical report. https://openai.com/index/gpt-4o, 2024. Accessed: May 2025.

  18. [18] Shuai Zhang, Haisen Zhao, et al. DeepSeek-VL: Scaling vision-language models with vision token learner. arXiv preprint arXiv:2405.07927, 2024.

  19. [19] Martijn Hendriks, Marc Geilen, Amir Reza Baghban Behrouzian, Twan Basten, Hadi Alizadeh Ara, and Dip Goswami. Checking metric temporal logic with TRACE. 2016 16th International Conference on Application of Concurrency to System Design (ACSD), pages 19–24, 2016.

  20. [20] Cristian Cadar, Daniel Dunbar, Dawson R. Engler, et al. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, volume 8, pages 209–224, 2008.

  21. [21] Cristian Cadar and Martin Nowack. KLEE symbolic execution engine in 2019. International Journal on Software Tools for Technology Transfer, 23:867–870, 2021.

  22. [22] Jo Van Hoey. Program analysis with a debugger: GDB. In Beginning x64 Assembly Programming: From Novice to AVX Professional, pages 21–33. Springer, 2019.

  23. [23] William English, Dominic Simon, Sumit Kumar Jha, and Rickard Ewetz. Grammar-forced translation of natural language to temporal logic using LLMs. In International Conference on Machine Learning, 2025.

  24. [25] Yilongfei Xu, Jincao Feng, and Weikai Miao. Learning from failures: Translation of natural language requirements into linear temporal logic with large language models. In 2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS), pages 204–215. IEEE, 2024.

  25. [26] Jason Xinyu Liu, Ziyi Yang, Benjamin Schornstein, Sam Liang, Ifrah Idrees, Stefanie Tellex, and Ankit Shah. Lang2LTL: Translating natural language commands to temporal specification with large language models. In Workshop on Language and Robotics at CoRL 2022, 2022.

  26. [27] Jason Xinyu Liu, Ankit Shah, George Dimitri Konidaris, Stefanie Tellex, and David Paulius. Lang2LTL-2: Grounding spatiotemporal navigation commands using large language and vision-language models. 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2325–2332, 2024.

  27. [28] Shaojun Xu, Xusheng Luo, Yutong Huang, Letian Leng, Ruixuan Liu, and Changliu Liu. NL2HLTL2PLAN: Scaling up natural language understanding for multi-robots through hierarchical temporal logic task specifications. IEEE Robotics and Automation Letters, 10:10482–10489, 2024.

  28. [29] Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Demystifying GPT self-repair for code generation. arXiv, abs/2306.09896, 2023.

  29. [30] Miao Zhang, Runhan Feng, Hongbo Tang, Yu Zhao, Jie Yang, Hang Qiu, and Qi Liu. Automated extraction of protocol state machines from 3GPP specifications with domain-informed prompts and LLM ensembles. arXiv, abs/2510.14348, 2025.

  30. [31] Dines Bjørner. Software Engineering 3: Domains, Requirements, and Software Design. Springer, 2006.

  31. [32] Dines Bjørner. From domain descriptions to requirements prescriptions.

  32. [33] Zhujun Wu, Xiaohong Chen, Zhi Jin, Ming Hu, and Dongming Jin. Unlocking the silent needs: Business-logic-driven iterative requirements auto-completion. In 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE '26), page 12, 2026.

  33. [34] Yu-An Shih, Annie Lin, Aarti Gupta, and Sharad Malik. FLAG: Formal and LLM-assisted SVA generation for formal specifications of on-chip communication protocols. arXiv, abs/2504.17226, 2025.

  34. [35] Regan Meloche, Daniel Amyot, and John Mylopoulos. Towards legal contract formalization with controlled natural language templates. In 2023 IEEE 31st International Requirements Engineering Conference (RE), pages 317–322. IEEE, 2023.

  35. [36] Zhi Ma, Cheng Wen, Zhexin Su, Xiao Liang, Cong Tian, Shengchao Qin, and Mengfei Yang. Bridging natural language and formal specification – automated translation of software requirements to LTL via hierarchical semantics decomposition using LLMs. arXiv preprint arXiv:2512.17334, 2025.

  36. [37] Dines Bjørner. Domain engineering: A software engineering discipline in need of research. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 1–17. Springer, 2000.

  37. [38] Na Wu, Yifan Li, Hongyang Yang, Hao Chen, Shouyi Dai, and Chenghao Hao. Survey of machine learning for software-assisted hardware design verification: Past, present, and prospect. ACM Transactions on Design Automation of Electronic Systems, 2024.

  38. [39] Meisam Abdollahi, S. Faegheh Yeganli, Mohammad Baharloo, and Amirali Baniasadi. Hardware design and verification with large language models: A scoping review, challenges, and open issues. Electronics, 2024.

  39. [40] Yike Huang, Ming Hu, Xiaohong Chen, Zhi Jin, and Shuyuan Xiao. Modeling like peeling an onion: Layerwise analysis-driven automatic behavioral model generation. In 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE '26), page 12, 2026.

  40. [41] William English, Chase Walker, Dominic Simon, Sumit Kumar Jha, and Rickard Ewetz. Verifiable natural language to linear temporal logic translation: A benchmark dataset and evaluation suite. arXiv, abs/2507.00877, 2025.

  41. [42] Daniel Mendoza, Christopher Hahn, and Caroline Trippel. Translating natural language to temporal logics with large language models and model checkers. 2024 Formal Methods in Computer-Aided Design (FMCAD), pages 1–11, 2024.

  42. [43] Jun Wang, David Smith Sundarsingh, Jyotirmoy V. Deshmukh, and Yiannis Kantaros. ConformalNL2LTL: Translating natural language instructions into temporal logic formulas with conformal correctness guarantees. arXiv, abs/2504.21022, 2025.