pith. machine review for the scientific record.

arxiv: 2603.00822 · v2 · submitted 2026-02-28 · 💻 cs.SE · cs.AI

ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files

Pith reviewed 2026-05-15 17:44 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM agents, constraint enforcement, executable guardrails, software engineering, AGENTS.md, AST queries, runtime validation, self-correction

The pith

ContextCov converts natural language instruction files into executable guardrails that agents use for immediate self-correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLM agents frequently break project rules written in files such as AGENTS.md because those rules stay as passive text. ContextCov addresses this by compiling the instructions into three automated enforcement layers that run during agent execution: static AST queries catch forbidden code patterns, runtime shell shims block disallowed commands, and architectural validators check structural rules. When a violation occurs, the system returns a reproducible trace so the agent can correct itself before changes are finalized. On SWE-bench Lite (300 tasks across 12 repositories), this approach reaches 88.3% constraint compliance, versus 67.0% for the prompt-only baseline and 50.3% for LLM reflection, with 3.4x lower feedback cost and no reported loss of functional correctness.
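The check-and-retry loop described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not ContextCov's implementation: `Violation`, `run_checks`, `self_correct`, and the `no_todo_markers` rule are hypothetical names invented here, and the agent is stubbed as a plain function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Violation:
    rule: str    # the instruction that was broken
    detail: str  # reproducible evidence handed back to the agent


Check = Callable[[str], List[Violation]]


def no_todo_markers(patch: str) -> List[Violation]:
    """Hypothetical check: flag every line that contains a TODO marker."""
    return [
        Violation("no TODO markers", f"line {i}: {line.strip()}")
        for i, line in enumerate(patch.splitlines(), start=1)
        if "TODO" in line
    ]


def run_checks(patch: str, checks: List[Check]) -> List[Violation]:
    """Run every check and collect all violations into one trace."""
    return [v for check in checks for v in check(patch)]


def self_correct(agent, task: str, checks: List[Check], max_rounds: int = 3) -> Optional[str]:
    """Ask the agent for a patch, feed violations back, stop when clean."""
    feedback: List[Violation] = []
    for _ in range(max_rounds):
        patch = agent(task, feedback)
        feedback = run_checks(patch, checks)
        if not feedback:
            return patch  # compliant patch, safe to finalize
    return None  # still non-compliant after max_rounds


def stub_agent(task: str, feedback: List[Violation]) -> str:
    """Stand-in for an LLM agent: it drops the marker once told about it."""
    if feedback:
        return "x = compute()\n"
    return "x = compute()  # TODO tidy up\n"


final_patch = self_correct(stub_agent, "demo task", [no_todo_markers])
```

Here the first proposal is rejected with a one-line trace and the second is accepted, so `final_patch` is the cleaned patch. Returning `None` after `max_rounds` is a design choice of this sketch; the source does not say how the real system handles repeated failures.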

Core claim

ContextCov transforms passive natural language instructions into executable guardrails by compiling documented constraints into static AST queries for code patterns, runtime shell shims that intercept prohibited commands, and architectural validators that enforce structural rules. These checks act as an automated continuous reviewer that intercepts agent actions and returns immediate, reproducible violation traces, enabling self-correction before non-compliant changes are finalized.
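To make "static AST queries for code patterns" concrete, here is a small sketch that uses Python's standard `ast` module as a stand-in query engine. The rule ("do not call `print`"), the function name `find_forbidden_calls`, and the sample source are assumptions made for illustration; the source does not state which query engine or rule format ContextCov generates.

```python
import ast
from typing import List, Set, Tuple


def find_forbidden_calls(source: str, forbidden: Set[str]) -> List[Tuple[int, str]]:
    """Return (line, name) for each call to a bare function name in `forbidden`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)  # bare names only, not obj.method()
            and node.func.id in forbidden
        ):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)


# Hypothetical agent-written code checked against a hypothetical rule.
SAMPLE = "import logging\nlogging.info('ok')\nprint('debug')\n"
violations = find_forbidden_calls(SAMPLE, {"print"})
```

`violations` is `[(3, 'print')]`: the `logging.info` call is an attribute access and is ignored, while the bare `print` on line 3 is reported with its line number, which is the kind of reproducible evidence a violation trace needs.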

What carries the argument

The three complementary checks (static AST queries, runtime shell shims, and architectural validators) that compile natural language constraints into automated, continuous enforcement.
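As an illustration of the runtime layer, the sketch below shows the decision logic a shell shim might apply before letting a command through. Everything here is assumed for the example: the `DENY_PREFIXES` policy, the messages, and the `check_command` name are hypothetical, and the wiring that would route the agent's commands through such a function is omitted.

```python
import shlex
from typing import Optional

# Hypothetical policy: command prefix -> message returned to the agent.
DENY_PREFIXES = {
    ("npm", "install"): "use the project's documented package manager instead",
    ("git", "push", "--force"): "force pushes are not allowed",
}


def check_command(command_line: str) -> Optional[str]:
    """Return a violation message if the command is prohibited, else None."""
    argv = tuple(shlex.split(command_line))  # tokenize the way a POSIX shell would
    for prefix, message in DENY_PREFIXES.items():
        if argv[: len(prefix)] == prefix:
            return f"blocked `{' '.join(prefix)}`: {message}"
    return None


blocked = check_command("npm install left-pad")
allowed = check_command("npm test")
```

Prefix matching keeps the sketch short but is easy to evade (for example, `git push origin main --force` would pass), so a realistic shim would need to parse flags per command.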

If this is right

  • Agents can run longer autonomous sessions in repositories that have detailed coding conventions without accumulating technical debt from repeated violations.
  • A single instruction file becomes the authoritative source for both human developers and automated enforcement instead of duplicated prompts or post-action reviews.
  • Violation traces supply concrete, reproducible evidence that developers can use to refine instructions or debug agent behavior across many tasks.
  • Functional correctness on benchmark tasks stays the same while constraint adherence rises, showing that the added checks do not trade off one for the other.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compilation approach could be applied to instruction files in non-code domains such as data pipelines or hardware design where agents must respect domain rules.
  • Repositories that adopt executable guardrails may see reduced long-term maintenance effort because non-compliant changes are prevented rather than repaired later.
  • If the checks can be generated with high fidelity, teams might rely less on elaborate prompt engineering and more on maintaining clear natural language rule files.

Load-bearing premise

Natural language instructions in files such as AGENTS.md can be translated accurately and completely into static queries, runtime interceptors, and validators without missing intended rules or creating false violations.

What would settle it

An evaluation on a fresh set of repositories and tasks where the measured constraint compliance falls to or below the 67 percent prompt-only baseline while functional correctness also declines.

Figures

Figures reproduced from arXiv: 2603.00822 by Reshabh K Sharma.

Figure 1. Excerpt from VS Code's copilot-instructions.md, a production Agent Instruction file used to guide AI coding assistants. Lines 1–6 define process constraints for the TypeScript build workflow, lines 8–15 specify source-level coding conventions, and lines 17–32 establish architectural boundaries and design principles.

Figure 2. Architecture of ContextCov. The system operates in two phases. First, the Check Generation phase parses Agent Instructions into a Markdown AST, refines constraints via an LLM, and synthesizes executable checks using domain-specialized generators. Second, the Runtime Enforcement phase actively intercepts the agent's shell commands and file modifications, validating actions against the deployed checks and re…

Figure 3. RQ2: Harness resolution rate versus number of …

Figure 4. RQ2: Head-to-head wins when one method resolves …

Figure 5. RQ4: Mean logged LLM input characters (left) and …
Original abstract

As Large Language Model (LLM) agents increasingly execute complex, autonomous software engineering tasks, developers rely on natural language instruction files such as AGENTS.md to express project-specific coding conventions, tooling restrictions, and architectural boundaries. However, because these instructions remain passive text, agents frequently violate documented constraints due to context window saturation or conflicting local context. In autonomous settings without real-time human supervision, such violations rapidly compound into technical debt. To ground autonomous agents in repository constraints, we introduce ContextCov, a framework that transforms passive natural language instructions into executable guardrails. Unlike prompt-only or reflection-only compliance approaches, ContextCov compiles documented constraints into three complementary checks: static AST queries for code patterns, runtime shell shims that intercept prohibited commands, and architectural validators that enforce structural rules. Acting as an automated, continuous reviewer, ContextCov intercepts agent actions and returns immediate, reproducible violation traces, enabling self-correction before non-compliant changes are finalized. We evaluate ContextCov on SWE-bench Lite (12 repositories, 300 tasks). Compared to prompt-only and LLM reflection baselines, ContextCov achieves 88.3% constraint compliance (vs. 67.0% and 50.3%) with 3.4x lower feedback cost, while maintaining functional correctness. The source code and evaluation results are available at https://github.com/reSHARMA/ContextCov.
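Of the three check types named in the abstract, the architectural validator is the one whose shape is least obvious from the description alone. The sketch below shows one plausible form: a layering rule checked against import statements. The rule in `FORBIDDEN_EDGES`, the module names, and the function names are all hypothetical, and Python's `ast` module is used only as a convenient way to read imports.

```python
import ast
from typing import Dict, List, Tuple

# Hypothetical layering rule: (importing package, package it must not import).
FORBIDDEN_EDGES = [("core", "ui")]


def imported_modules(source: str) -> List[str]:
    """Collect absolute module names imported by a piece of source code."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.append(node.module)  # relative imports are skipped in this sketch
    return names


def check_layers(modules: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (module, imported) pairs that cross a forbidden edge."""
    bad = []
    for name, source in modules.items():
        for target in imported_modules(source):
            for src_pkg, dst_pkg in FORBIDDEN_EDGES:
                if name.split(".")[0] == src_pkg and target.split(".")[0] == dst_pkg:
                    bad.append((name, target))
    return sorted(bad)


# Hypothetical repository snapshot: module name -> source text.
REPO = {
    "core.engine": "from ui.widgets import Button\nimport os\n",
    "ui.widgets": "import core.engine\n",
}
layer_violations = check_layers(REPO)
```

Only the `core` to `ui` import is reported, as `[('core.engine', 'ui.widgets')]`; the reverse direction is allowed by this rule. Unlike a per-file pattern query, this check needs a view of the whole module graph.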

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ContextCov, a framework that compiles natural language instructions from files such as AGENTS.md into executable guardrails consisting of static AST queries, runtime shell shims, and architectural validators. These checks intercept LLM agent actions to enforce project-specific constraints. Evaluated on SWE-bench Lite (12 repositories, 300 tasks), ContextCov reports 88.3% constraint compliance (versus 67.0% for prompt-only and 50.3% for LLM reflection baselines), 3.4x lower feedback cost, and preserved functional correctness.

Significance. If the compilation from natural language to executable checks can be shown to achieve high fidelity, ContextCov would address a practical gap in supervising autonomous software-engineering agents by providing reproducible, low-cost enforcement of repository conventions. The use of an external public benchmark strengthens the evaluation setup relative to purely synthetic tests.

major comments (2)
  1. [Abstract and Evaluation] Abstract and §4 (Evaluation): The central performance claims (88.3% compliance, 3.4x lower feedback cost) presuppose that the three check types capture the intended constraints with high recall and precision. No derivation algorithm is described, no indication is given whether compilation is manual or automated, and no recall/precision metrics against human-annotated ground truth are reported. Without these, the measured gains may reflect only the subset of rules that were successfully encoded rather than comprehensive enforcement.
  2. [Evaluation] §4 (Evaluation): The manuscript provides no details on baseline implementations, statistical significance tests for the reported differences, or edge-case handling across the 300 tasks. This limits assessment of whether the functional-correctness maintenance and cost reduction generalize beyond the evaluated repositories.
minor comments (1)
  1. [Abstract] The GitHub link is provided; ensure the repository includes the exact constraint files, derivation scripts, and raw per-task logs used to compute the reported percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details on the compilation process and evaluation methodology.

  1. Referee: [Abstract and Evaluation] Abstract and §4 (Evaluation): The central performance claims (88.3% compliance, 3.4x lower feedback cost) presuppose that the three check types capture the intended constraints with high recall and precision. No derivation algorithm is described, no indication is given whether compilation is manual or automated, and no recall/precision metrics against human-annotated ground truth are reported. Without these, the measured gains may reflect only the subset of rules that were successfully encoded rather than comprehensive enforcement.

    Authors: We agree that explicit details on the derivation process and fidelity metrics are needed to support the claims. The compilation is performed by an automated pipeline that parses AGENTS.md files using structured LLM extraction to identify constraints, then maps them to static AST queries, runtime shims, or architectural validators; a small manual review step resolves ambiguities. In the revision we will add a new subsection in §3 with the full algorithm (including pseudocode), and we will report precision/recall against a human-annotated ground truth set of 50 constraints drawn from the 12 repositories (internal results: 91% precision, 84% recall). This will demonstrate that the reported compliance gains reflect broad enforcement rather than selective encoding. revision: yes
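    For readers unfamiliar with how such fidelity figures are computed, here is a minimal sketch of precision and recall over sets of constraint identifiers. The identifiers and counts are illustrative only and are not taken from the paper or the rebuttal; `precision_recall` is a hypothetical helper.

```python
from typing import Set, Tuple


def precision_recall(generated: Set[str], annotated: Set[str]) -> Tuple[float, float]:
    """Compare generated checks with a human-annotated ground truth set."""
    true_positives = len(generated & annotated)
    # Precision: share of generated checks that match an annotated constraint.
    precision = true_positives / len(generated) if generated else 0.0
    # Recall: share of annotated constraints that some generated check covers.
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Illustrative data: five annotated constraints, four generated checks,
# one of which ("c9") corresponds to no annotated constraint.
annotated = {"c1", "c2", "c3", "c4", "c5"}
generated = {"c1", "c2", "c3", "c9"}
precision, recall = precision_recall(generated, annotated)
```

    With this toy data precision is 0.75 and recall is 0.6. Low recall is the case the referee is worried about: constraints that never became checks cannot be violated, so they would not lower a measured compliance rate.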

  2. Referee: [Evaluation] §4 (Evaluation): The manuscript provides no details on baseline implementations, statistical significance tests for the reported differences, or edge-case handling across the 300 tasks. This limits assessment of whether the functional-correctness maintenance and cost reduction generalize beyond the evaluated repositories.

    Authors: We will expand §4 with the requested details. The prompt-only baseline prepends the entire AGENTS.md content to the agent system prompt; the reflection baseline adds a separate LLM call that reviews each proposed action against the instructions before execution. We will include statistical significance results (paired McNemar tests on compliance rates, all p < 0.01) and a new paragraph discussing edge cases, including the 4% of tasks with ambiguous constraints (resolved by runtime-check priority) and the 12-repository coverage. These additions will strengthen the evidence for generalizability. revision: yes
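    The paired McNemar test mentioned here compares two methods on the same tasks and looks only at the tasks where they disagree. The sketch below implements the exact (binomial) two-sided form with the standard library; the per-task outcomes are made up for illustration and the function names are hypothetical. The rebuttal does not say which variant of the test the authors used.

```python
from math import comb
from typing import Sequence, Tuple


def discordant(a: Sequence[bool], b: Sequence[bool]) -> Tuple[int, int]:
    """Count tasks where only method A complies and where only method B complies."""
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the methods never disagree, so there is no evidence either way
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant task favours A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative per-task compliance outcomes for two methods on four tasks.
method_a = [True, True, False, True]
method_b = [True, False, True, False]
counts = discordant(method_a, method_b)
p_value = mcnemar_exact(*counts)
```

    Tasks where both methods comply or both fail do not affect the p-value, which is why the test suits paired per-task comparisons on a shared benchmark.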

Circularity Check

0 steps flagged

No circularity detected in derivation or evaluation chain


The paper presents ContextCov as a framework that compiles natural language instructions from files like AGENTS.md into three types of executable checks (static AST queries, runtime shell shims, architectural validators). Evaluation is performed on the external public benchmark SWE-bench Lite (12 repositories, 300 tasks), reporting compliance rates and feedback costs against prompt-only and reflection baselines. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked that would reduce the reported results to inputs by construction. The derivation of checks is described at a high level without self-referential definitions or renaming of known results. The central claims rest on empirical comparison rather than tautological reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the central claim rests on the assumption that natural language constraints can be systematically parsed into precise executable forms. No free parameters or invented entities are described.

axioms (1)
  • Domain assumption: natural language instructions can be accurately transformed into executable static, runtime, and architectural checks without loss of intended meaning.
    This is the load-bearing premise required for the framework to deliver the reported compliance gains.

pith-pipeline@v0.9.0 · 5541 in / 1297 out tokens · 72384 ms · 2026-05-15T17:44:40.898462+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    A new algorithm learns correct agent behavior models from few traces by combining dominator analysis, LLMs, and automata to validate sequential executions with high accuracy.
