pith. machine review for the scientific record.

arxiv: 2605.01209 · v1 · submitted 2026-05-02 · 💻 cs.SE · cs.FL


ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification


Pith reviewed 2026-05-09 15:14 UTC · model grok-4.3

classification 💻 cs.SE cs.FL
keywords Signal Temporal Logic · LLM agents · requirements clarification · vagueness detection · ambiguity resolution · cyber-physical systems · formal specification

The pith

An interactive LLM agent framework detects vagueness and ambiguity in natural language requirements before converting them into accurate Signal Temporal Logic specifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an interactive process where LLM agents first scan natural language requirements for vague expressions that leave information underspecified. When vagueness is found, the agents generate specific questions for users to supply missing details, repeating until the requirement is complete. Next the agents identify ambiguities and pose focused questions to resolve them, updating the requirement with each answer. Only after these steps does the framework use LLMs to produce the final Signal Temporal Logic formulas. This matters for cyber-physical systems because imprecise requirements lead to incorrect formal specifications that cannot reliably verify real-time safety properties.
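The staged process described above can be sketched as a loop. The callables below (`detect_vagueness`, `ask_user`, `translate_to_stl`, and so on) are hypothetical stand-ins for the paper's agents, not its actual API; this is an illustration of the control flow, under the assumption that each agent returns a list of problematic spans or an empty list when none remain.

```python
# A minimal sketch of the staged clarification loop; the agent callables
# are hypothetical stand-ins, not the paper's implementation.

def clarify_and_translate(requirement, detect_vagueness, detect_ambiguity,
                          make_query, ask_user, translate_to_stl,
                          max_rounds=5):
    """Clarify vagueness, then ambiguity, then translate to STL."""
    # Stage 1: vagueness (underspecified information). Loop until the
    # detector finds nothing or the round budget is exhausted.
    for _ in range(max_rounds):
        spans = detect_vagueness(requirement)
        if not spans:
            break
        answer = ask_user(make_query("vagueness", requirement, spans))
        requirement = f"{requirement} [{answer}]"  # fold the answer back in

    # Stage 2: ambiguity (multiple possible readings), same loop shape.
    for _ in range(max_rounds):
        spans = detect_ambiguity(requirement)
        if not spans:
            break
        answer = ask_user(make_query("ambiguity", requirement, spans))
        requirement = f"{requirement} [{answer}]"

    # Stage 3: only the clarified requirement reaches the LLM translator.
    return translate_to_stl(requirement)
```

With stub agents, a vague requirement like "keep speed low" would be refined with a user-supplied threshold (say, "speed < 30 km/h") before any STL is produced.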

Core claim

ClarifySTL first detects vague expressions that indicate underspecified information in a requirement and generates targeted clarification queries to guide users in supplementing the requirement until all necessary details are provided. If ambiguities are detected, it formulates focused ambiguity clarification queries and updates the requirements based on user feedback until all ambiguities are resolved. Finally, the clarified requirements are transformed into STL specifications using LLMs. This interactive framework ensures that the resulting STL formulas faithfully capture user intent while reducing the burden on the user.

What carries the argument

The ClarifySTL interactive LLM-agent framework, which sequences vagueness detection and query generation, ambiguity detection and resolution, and final LLM-based transformation to STL.

If this is right

  • STL formulas produced after clarification more accurately reflect the user's original intent than those generated from raw natural language.
  • Targeted queries reduce the total user effort compared with asking users to rewrite entire requirements.
  • The approach improves results on existing benchmarks such as DeepSTL and STL-DivEn by handling incomplete inputs.
  • A new benchmark, AmbiEval, measures the agent's ability to detect vagueness and ambiguity and to generate useful queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clarification loop could be applied to other formal languages for specifying system behaviors beyond STL.
  • Repeated use of the framework on similar domains might allow the agent to anticipate common ambiguities and reduce the number of queries over time.
  • Pairing the clarified requirements with simulation or model-checking tools could provide immediate feedback on whether the resulting STL matches expected system traces.

Load-bearing premise

LLM agents can reliably detect vague expressions and ambiguities in requirements and generate clarification queries that, when answered by users, produce complete and unambiguous requirements suitable for accurate STL transformation.

What would settle it

A test on the AmbiEval benchmark in which the agent misses key vague expressions or produces STL formulas that still violate user intent after clarification queries are answered would falsify the central claim.
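"Produces STL formulas that still violate user intent" is checkable mechanically: monitor the candidate formula against traces the user endorses and look for negative robustness. The following is a minimal discrete-time robustness sketch covering only atomic predicates and bounded G/F; it reflects standard STL semantics, not the paper's tooling.

```python
# Minimal discrete-time STL robustness sketch (not the paper's tooling).
# A trace is a list of (time, value) samples; positive robustness means
# the property holds at that step, negative means it is violated.

def rob_pred(trace, f):
    """Pointwise robustness of an atomic predicate f(value) -> float,
    e.g. 'speed < 30' becomes f = lambda v: 30 - v."""
    return [f(v) for _, v in trace]

def rob_globally(rho, a, b):
    """G[a,b]: at each step i, min of rho over window [i+a, i+b].
    Windows past the trace end are treated as vacuously satisfied."""
    n = len(rho)
    return [min(rho[i + a:min(i + b, n - 1) + 1] or [float("inf")])
            for i in range(n)]

def rob_eventually(rho, a, b):
    """F[a,b]: at each step i, max of rho over window [i+a, i+b]."""
    n = len(rho)
    return [max(rho[i + a:min(i + b, n - 1) + 1] or [float("-inf")])
            for i in range(n)]
```

For "G[0,3](speed < 30)" over a speed trace, a negative value at step 0 means the formula is violated on that trace; a clarified requirement whose STL still comes out negative on traces the user considers correct would be exactly the falsifying outcome described above.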

Figures

Figures reproduced from arXiv: 2605.01209 by Hongshen Chen, Jie An, Naijun Zhan, Xiaohong Chen, Yue Fang, Zhi Jin.

Figure 1. User-written natural language requirements may fail to accurately convey their intent.
Figure 2. The Brief Structure of ClarifySTL.
Figure 3. The Overview of ClarifySTL.
Figure 4. Fine-tuning the Vagueness Detector for Identification.
Figure 5. Prompts for the Vagueness Inquirer to Generate Clarification Queries.
Figure 6. Training and Inference Phases of the Ambiguity Detector.
Figure 7. Workflow and Prompts for the Ambiguity Inquirer to Generate Queries about Ambiguity Clarification.
Figure 8. Overlap of Vague Requirements Detected by Different Models.
Figure 9. Overlap of Ambiguous Requirements Detected by Different Models.
Figure 10. Comparison of Different Ambiguity Identification Methods.
Figure 11. Impact of the Number of Candidate STL Formulas.
Figure 12. Impact of Iterations on Vagueness Clarification.
Figure 14. Temperature Sensitivity of the Ambiguity Inquirer.
Original abstract

Signal Temporal Logic (STL) is a formal language for specifying real-time behaviors of cyber-physical systems (CPS). Automatically transforming natural language requirements into STL specifications has received growing attention. Recent efforts leveraging large language models (LLMs) have demonstrated impressive performance, but some natural language requirements in practice contain vague or ambiguous information, which remains challenging for LLMs to handle. To address these challenges, we propose ClarifySTL, an interactive LLM-agent framework that enhances STL transformation through requirements clarification. ClarifySTL first detects vague expressions that indicate underspecified information in a requirement. If any vagueness is detected, it generates targeted clarification queries to guide users in supplementing the requirement until all necessary details are provided. Subsequently, if ClarifySTL detects ambiguities, it formulates focused ambiguity clarification queries and updates the requirements based on user feedback until all ambiguities are resolved. Finally, the requirements with vagueness and ambiguity clarified are transformed into STL specifications using LLMs. This interactive framework ensures that the resulting STL formulas faithfully capture user intent while reducing the burden on the user. We evaluate ClarifySTL on the representative benchmarks DeepSTL and STL-DivEn, as well as our newly introduced AmbiEval benchmark, which is specifically designed to assess the performance of the agents in handling vagueness and ambiguity, including both detection and query generation. The experimental results show that ClarifySTL is effective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ClarifySTL, an interactive LLM-agent framework for transforming natural language requirements into Signal Temporal Logic (STL) specifications. The framework detects vague expressions in requirements, generates targeted user clarification queries to resolve underspecification, detects and resolves ambiguities via additional queries, and finally performs the NL-to-STL translation on the clarified input. It introduces the AmbiEval benchmark focused on vagueness and ambiguity handling and evaluates the overall approach on DeepSTL, STL-DivEn, and AmbiEval, claiming that the interactive process ensures STL formulas faithfully capture user intent while demonstrating effectiveness.

Significance. If the results hold with proper quantitative support, the work addresses a practical gap in automated formal specification by using targeted interaction to mitigate vagueness and ambiguity that direct LLM translations often mishandle. The new AmbiEval benchmark is a constructive contribution that could enable systematic evaluation of clarification components in future NL-to-STL research. The design's emphasis on reducing user burden while improving fidelity is relevant for safety-critical CPS applications.

major comments (2)
  1. [Experiments] Experiments section: The paper states that results on DeepSTL, STL-DivEn, and AmbiEval demonstrate effectiveness, but reports no component-level metrics (precision/recall/F1 for vagueness detection, query generation success rate, or STL accuracy with vs. without clarification steps) or ablations. This is load-bearing for the central claim that the interactive loop improves fidelity over direct translation.
  2. [AmbiEval benchmark] AmbiEval benchmark description: The benchmark is presented as assessing detection and query generation, yet no details on its size, construction method, ground-truth annotations, or inter-annotator agreement are provided, making it impossible to interpret any reported performance numbers or compare against baselines.
minor comments (2)
  1. The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy gain or detection F1) to support the effectiveness claim.
  2. Notation for the pipeline stages (vagueness detection, ambiguity resolution, final transformation) could be made more consistent across figures and text.
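For reference, the component-level metrics requested in major comment 1 reduce to standard binary-classification bookkeeping over per-requirement labels; a minimal sketch:

```python
# Sketch of precision/recall/F1 for vagueness (or ambiguity) detection,
# computed over per-requirement binary labels (1 = vague/ambiguous).

def precision_recall_f1(gold, pred):
    """gold, pred: equal-length iterables of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

An ablation of the kind the referee asks for would pair these detection scores with STL-accuracy numbers computed with and without the clarification stages on the same requirement set.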

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative support and benchmark transparency. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and the AmbiEval benchmark.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The paper states that results on DeepSTL, STL-DivEn, and AmbiEval demonstrate effectiveness, but reports no component-level metrics (precision/recall/F1 for vagueness detection, query generation success rate, or STL accuracy with vs. without clarification steps) or ablations. This is load-bearing for the central claim that the interactive loop improves fidelity over direct translation.

    Authors: We agree that component-level metrics and ablations are essential to substantiate the contribution of the interactive clarification loop. In the revised manuscript, we will report precision, recall, and F1 scores for vagueness detection and ambiguity detection. We will also include query generation success rates and an ablation comparing STL translation accuracy (e.g., semantic similarity or syntactic correctness metrics) with versus without the clarification steps on DeepSTL, STL-DivEn, and AmbiEval. These additions will directly quantify the fidelity gains from the interactive process. revision: yes

  2. Referee: [AmbiEval benchmark] AmbiEval benchmark description: The benchmark is presented as assessing detection and query generation, yet no details on its size, construction method, ground-truth annotations, or inter-annotator agreement are provided, making it impossible to interpret any reported performance numbers or compare against baselines.

    Authors: We acknowledge that the current AmbiEval description is incomplete for reproducibility and evaluation. In the revision, we will expand this section to specify the benchmark size (total requirements and breakdown by vagueness/ambiguity categories), construction method (sourcing from existing datasets and synthetic generation of underspecified examples), ground-truth annotations (labels for detection and expected clarification outcomes), and inter-annotator agreement (e.g., Cohen's kappa scores from multiple annotators). This will allow proper interpretation of results and baseline comparisons. revision: yes

Circularity Check

0 steps flagged

No significant circularity: linear pipeline with independent evaluation

Full rationale

The paper presents ClarifySTL as a sequential, non-recursive pipeline (vagueness detection, query generation, user update, ambiguity resolution, then NL-to-STL transformation) whose steps are described procedurally without equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the new AmbiEval benchmark is introduced as an external assessment tool rather than a quantity derived from the framework itself. The central effectiveness claim rests on experimental results across DeepSTL, STL-DivEn, and AmbiEval rather than reducing to its own inputs by construction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven capability of LLMs to perform accurate vagueness/ambiguity detection and query generation; no free parameters or invented entities are described.

axioms (1)
  • domain assumption Large language models can be effectively prompted to detect vague expressions indicating underspecified information and to generate targeted clarification queries that resolve them.
    The entire interactive process depends on this LLM capability, which is assumed rather than demonstrated or bounded in the abstract.

pith-pipeline@v0.9.0 · 5558 in / 1204 out tokens · 51305 ms · 2026-05-09T15:14:13.086894+00:00 · methodology

