Pith · machine review for the scientific record

arxiv: 2604.07321 · v1 · submitted 2026-04-08 · 💻 cs.LO · cs.AI

Recognition: no theorem link

Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation

Mohammad Saqib Hasan, Niranjan Balasubramanian, Omar Chowdhury, Priscilla Kyei Danso

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3

classification 💻 cs.LO cs.AI
keywords: LLM evaluation · LTL translation · natural language to formal logic · syntax versus semantics · prompt engineering · code completion

The pith

Large language models translate English sentences to LTL formulas more accurately on syntax than semantics and improve further when the task is framed as Python code completion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well several representative large language models turn assertive English sentences into propositional linear temporal logic formulas. It separates the evaluation into syntactic correctness of the generated formulas and semantic fidelity to the original meaning, using both human-written and synthetic reference formulas. Models show stronger results on syntax than on semantics, gain from more detailed prompts, and achieve their largest improvements when the translation is recast as completing Python code that encodes the LTL formula. The work highlights practical difficulties in creating unambiguous test data for this kind of translation task.

Core claim

LLMs perform better on syntactic aspects of LTL than on semantic ones, generally benefit from more detailed prompts, and achieve substantially higher overall performance when the task is reformulated as a Python code-completion problem.

What carries the argument

Comparison of LLM outputs against human-generated and synthetic ground-truth LTL formulas along separate syntactic and semantic metrics, with controlled variations in prompt detail and task framing as direct formula generation versus Python code completion.
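The contrast between direct formula generation and the code-completion framing can be sketched with hypothetical prompt templates; the paper does not reproduce its actual prompts on this page, so both templates below are illustrative assumptions.

```python
# Illustrative prompt templates for the two task framings the paper compares.
# These are assumptions for exposition, not the authors' actual prompts.

def direct_prompt(sentence: str) -> str:
    """Ask the model to emit an LTL formula directly."""
    return (
        "Translate the following English sentence into a propositional "
        "LTL formula using the operators G, F, X, U.\n"
        f"Sentence: {sentence}\n"
        "LTL formula:"
    )

def code_completion_prompt(sentence: str) -> str:
    """Recast the same task as completing a Python snippet, so the model
    can draw on its code-training data."""
    return (
        f"# Sentence: {sentence}\n"
        "# Complete the assignment with the LTL formula as a string.\n"
        'formula: str = "'
    )
```

The intuition behind the reformulation is that code completion is a far more common pattern in LLM training data than raw logic generation, so the same underlying translation task lands closer to the model's training distribution.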

If this is right

  • More detailed prompts raise the quality of both syntactic and semantic aspects of the generated LTL formulas.
  • Reformulating the translation task as Python code completion produces the largest measured gains in overall accuracy.
  • Syntactic correctness remains easier for current LLMs to achieve than correct capture of intended meaning.
  • LTL-based security and privacy analysis tools could become usable by a wider set of developers through such LLM interfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The code-completion framing could be tested on other temporal or modal logics to see whether similar gains appear.
  • Integration of these models into requirement-specification tools might reduce the need for manual LTL writing in software projects.
  • Future work could measure whether fine-tuning on code representations of LTL further narrows the syntax-semantics gap.

Load-bearing premise

The human-generated and synthetic ground-truth LTL formulas accurately capture the intended semantics of the natural language sentences without ambiguity or multiple valid interpretations.

What would settle it

A new collection of English sentences paired with multiple distinct but semantically equivalent LTL formulas where current models consistently produce only one of the valid translations.
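The single-ground-truth problem is easy to exhibit concretely. A minimal finite-trace LTL evaluator (an illustrative sketch under finite-trace semantics, not the paper's automata-based machinery) shows two syntactically distinct formulas agreeing on every trace, so an exact-match metric against either one would misjudge the other.

```python
# Sketch: two syntactically different LTL formulas that are semantically
# equivalent. Finite-trace semantics is used here for simplicity; the paper's
# setting is standard infinite-trace LTL.
from itertools import product

def holds(formula, trace, i=0):
    """Evaluate a small LTL fragment at position i of a finite trace,
    where a trace is a sequence of sets of atomic propositions."""
    op = formula[0]
    if op == "atom":
        return i < len(trace) and formula[1] in trace[i]
    if op == "true":
        return True
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "F":   # eventually
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "G":   # always
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "U":   # until
        return any(
            holds(formula[2], trace, j)
            and all(holds(formula[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")

f1 = ("F", ("atom", "p"))              # F p
f2 = ("U", ("true",), ("atom", "p"))   # true U p  (a standard rewriting of F p)

# The two formulas agree on every trace over {p} up to length 4 ...
traces = [t for n in range(1, 5) for t in product([set(), {"p"}], repeat=n)]
assert all(holds(f1, t) == holds(f2, t) for t in traces)
# ... yet an exact-match metric against a single ground truth would
# score one of them as wrong.
```

A benchmark of the kind proposed above would pair each sentence with the full set of such equivalent targets, so that a model producing `true U p` against a ground truth of `F p` is credited rather than penalized.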

Figures

Figures reproduced from arXiv: 2604.07321 by Mohammad Saqib Hasan, Niranjan Balasubramanian, Omar Chowdhury, Priscilla Kyei Danso.

Figure 1. Performance per model and interface (precision, …).
Figure 2. Detailed vs. Minimal Interface performance.
Figure 4. Trace generation results on Little Tricky Logic.
Figure 5. Semantic performance of NL→LTL (Past) translation evaluated via bidirectional entailment. Equiv. denotes semantic equivalence (𝜑_GT ⇔ 𝜑_Pred), Sound. denotes ground-truth-to-prediction entailment (𝜑_GT ⇒ 𝜑_Pred), Compl. denotes prediction-to-ground-truth entailment (𝜑_Pred ⇒ 𝜑_GT), and Syntax reports well-formed LTL outputs.
Original abstract

Propositional Linear Temporal Logic (LTL) is a popular formalism for specifying desirable requirements and security and privacy policies for software, networks, and systems. Yet expressing such requirements and policies in LTL remains challenging because of its intricate semantics. Since many security and privacy analysis tools require LTL formulas as input, this difficulty places them out of reach for many developers and analysts. Large Language Models (LLMs) could broaden access to such tools by translating natural language fragments into LTL formulas. This paper evaluates that premise by assessing how effectively several representative LLMs translate assertive English sentences into LTL formulas. Using both human-generated and synthetic ground-truth data, we evaluate effectiveness along syntactic and semantic dimensions. The results reveal three findings: (1) in line with prior findings, LLMs perform better on syntactic aspects of LTL than on semantic ones; (2) they generally benefit from more detailed prompts; and (3) reformulating the task as a Python code-completion problem substantially improves overall performance. We also discuss challenges in conducting a fair evaluation on this task and conclude with recommendations for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates several representative LLMs on translating assertive English sentences into LTL formulas, using both human-generated and synthetic ground-truth datasets. It assesses performance along syntactic and semantic dimensions and reports three main findings: LLMs perform better on syntactic aspects than semantic ones, benefit from more detailed prompts, and show substantial improvement when the task is reformulated as a Python code-completion problem. The paper also discusses challenges in fair evaluation and offers recommendations for future work.

Significance. If the results hold under more robust semantic equivalence checking, the work provides concrete evidence on LLM limitations in formal specification generation and identifies practical prompting and reformulation strategies that could make LTL-based verification tools more accessible to non-experts. The use of both human and synthetic data is a strength, as is the explicit discussion of evaluation challenges.

major comments (3)
  1. [§3 and §4] §3 (Evaluation Methodology) and §4 (Results): Semantic correctness appears to be measured against a single ground-truth LTL formula per natural-language sentence. LTL admits many semantically equivalent formulas (via rewriting rules, operator nesting, or derived operators) and many input sentences are under-specified; without a comprehensive equivalence oracle or enumeration of valid targets, reported semantic error rates and the syntax-vs-semantics gap may be inflated artifacts of the chosen ground truth rather than intrinsic LLM limitations.
  2. [§4] §4: The three headline findings rest on quantitative comparisons, yet the manuscript provides no details on exact metrics (e.g., exact-match vs. AST-based syntactic score, semantic equivalence procedure), dataset sizes, statistical tests, or error analysis. This absence makes it impossible to verify the robustness or statistical significance of the reported improvements from detailed prompts and code-completion reformulation.
  3. [§3.3] §3.3 (Task Reformulation): The claim that reformulating as Python code completion 'substantially improves overall performance' is load-bearing for the third finding, but the paper does not report the precise prompt template, the Python library used for LTL, or an ablation isolating the effect of the code-completion framing from other prompt changes.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the exact number of models, sentences, and prompt variants evaluated.
  2. [§4] Notation for syntactic vs. semantic accuracy metrics should be defined explicitly before the results section.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate clarifications, additional details, and expanded discussion where appropriate.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Evaluation Methodology) and §4 (Results): Semantic correctness appears to be measured against a single ground-truth LTL formula per natural-language sentence. LTL admits many semantically equivalent formulas (via rewriting rules, operator nesting, or derived operators) and many input sentences are under-specified; without a comprehensive equivalence oracle or enumeration of valid targets, reported semantic error rates and the syntax-vs-semantics gap may be inflated artifacts of the chosen ground truth rather than intrinsic LLM limitations.

    Authors: We agree that LTL admits multiple semantically equivalent formulas and that some natural-language sentences may be under-specified. Our semantic evaluation checked equivalence to the provided ground-truth formula using an LTL-to-Büchi automata translation followed by language equivalence checking. We acknowledge that this approach does not enumerate all possible valid alternatives. In the revision we will expand §3 and §4 with a fuller description of the equivalence procedure, a discussion of its limitations for under-specified inputs, and explicit caveats on how this may affect the reported syntax-versus-semantics gap. These additions will contextualize rather than overturn the comparative findings. revision: yes
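The equivalence check the authors describe splits naturally into the two entailment directions reported in Figure 5 (Sound.: 𝜑_GT ⇒ 𝜑_Pred; Compl.: 𝜑_Pred ⇒ 𝜑_GT). The full procedure uses LTL-to-Büchi translation and language comparison; the sketch below approximates it by enumerating bounded finite traces, which is illustrative only (it can miss counterexamples beyond the bound) and is not the authors' implementation.

```python
# Sketch of bidirectional-entailment checking (Equiv. = Sound. and Compl.),
# approximated over bounded finite traces instead of Büchi automata.
from itertools import product

def models(f, trace, i=0):
    """Evaluate a tiny LTL fragment (atoms, G, F) on a finite trace."""
    op = f[0]
    if op == "atom":
        return i < len(trace) and f[1] in trace[i]
    if op == "G":
        return all(models(f[1], trace, j) for j in range(i, len(trace)))
    if op == "F":
        return any(models(f[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

def bounded_traces(max_len=5):
    """All nonempty traces over the single atom p, up to max_len."""
    for n in range(1, max_len + 1):
        yield from product([set(), {"p"}], repeat=n)

def entails(a, b):
    """a => b on every bounded trace (incomplete approximation)."""
    return all(models(b, t) for t in bounded_traces() if models(a, t))

g_p, f_p = ("G", ("atom", "p")), ("F", ("atom", "p"))

sound = entails(g_p, f_p)     # G p => F p : holds (Sound. only)
complete = entails(f_p, g_p)  # F p => G p : fails on e.g. trace p, {}
assert sound and not complete  # sound but not semantically equivalent
```

The one-directional case shown (a prediction that is entailed by the ground truth but does not entail it back) is exactly the gap between the Sound. and Equiv. columns in Figure 5.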

  2. Referee: [§4] §4: The three headline findings rest on quantitative comparisons, yet the manuscript provides no details on exact metrics (e.g., exact-match vs. AST-based syntactic score, semantic equivalence procedure), dataset sizes, statistical tests, or error analysis. This absence makes it impossible to verify the robustness or statistical significance of the reported improvements from detailed prompts and code-completion reformulation.

    Authors: We accept that the current presentation lacks sufficient detail for full verification. The syntactic metric combined exact string match with an AST-based similarity score; semantic equivalence followed the automata procedure noted above. The datasets comprise both the human-generated and synthetic collections described in §3. We did not perform formal statistical hypothesis tests, as the study is exploratory and trend-oriented. In the revision we will move all metric definitions, exact dataset sizes, and a concise error analysis into the main text of §4, and we will add a short note explaining the absence of statistical tests together with basic robustness indicators such as per-model variance. revision: yes
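A combined syntactic metric of the kind described (exact match plus AST-based similarity) could look roughly like the sketch below; the tuple-based AST encoding and the Dice-coefficient scoring are illustrative assumptions, not the authors' implementation.

```python
# Illustrative syntactic scoring: exact match on a canonical rendering plus
# an AST-overlap similarity. The scoring scheme is an assumption for
# exposition, not the paper's actual metric.
from collections import Counter

def render(ast):
    """Canonical string form of an AST given as nested tuples,
    e.g. ("G", ("->", ("atom", "p"), ("F", ("atom", "q"))))."""
    if ast[0] == "atom":
        return ast[1]
    if len(ast) == 2:                       # unary operator
        return f"{ast[0]}({render(ast[1])})"
    return f"({render(ast[1])} {ast[0]} {render(ast[2])})"  # binary

def subtrees(ast):
    """Yield every subtree of the AST, including the root."""
    yield ast
    for child in ast[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def ast_similarity(a, b):
    """Dice coefficient over the multisets of rendered subtrees."""
    ca = Counter(map(render, subtrees(a)))
    cb = Counter(map(render, subtrees(b)))
    overlap = sum((ca & cb).values())
    return 2 * overlap / (sum(ca.values()) + sum(cb.values()))

gt   = ("G", ("->", ("atom", "p"), ("F", ("atom", "q"))))  # G(p -> F q)
pred = ("G", ("->", ("atom", "p"), ("X", ("atom", "q"))))  # G(p -> X q)

exact = render(gt) == render(pred)   # False: F was replaced by X
score = ast_similarity(gt, pred)     # partial credit, strictly < 1.0
assert not exact and 0 < score < 1.0
```

The point of the AST component is precisely this partial credit: a prediction that confuses one temporal operator still shares structure with the ground truth, which an exact string match would score as a total miss.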

  3. Referee: [§3.3] §3.3 (Task Reformulation): The claim that reformulating as Python code completion 'substantially improves overall performance' is load-bearing for the third finding, but the paper does not report the precise prompt template, the Python library used for LTL, or an ablation isolating the effect of the code-completion framing from other prompt changes.

    Authors: We will include the exact prompt templates used for the code-completion setting in the revised §3.3. The implementation relied on the Spot Python bindings for LTL formula construction and manipulation. While the comparison was performed with otherwise comparable prompt structures, we did not conduct a dedicated ablation that isolates the code-completion framing from other prompt variations. In the revision we will state this limitation explicitly, provide the full templates, and add a brief qualitative discussion of why the code-completion framing may leverage LLMs' code-training data. These changes will make the third finding more transparent without altering the reported performance numbers. revision: partial

Circularity Check

0 steps flagged

Empirical evaluation against independent ground-truth shows no circularity

Full rationale

The paper evaluates LLM translations of natural language to LTL by direct comparison against separately generated human and synthetic ground-truth formulas. No derivation chain exists that reduces predictions or findings to fitted parameters, self-definitions, or self-citation load-bearing premises; the three headline findings are empirical measurements on external benchmarks. The evaluation is self-contained against those benchmarks, with no equations or claims that equate outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study with no mathematical derivations, fitted parameters, or new postulated entities. It relies on standard practices for LLM prompting and benchmark creation.

pith-pipeline@v0.9.0 · 5506 in / 1197 out tokens · 97248 ms · 2026-05-10T17:33:21.264290+00:00 · methodology

