arxiv: 2603.00822 · v2 · submitted 2026-02-28 · 💻 cs.SE · cs.AI

ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files

Reshabh K Sharma

Pith reviewed 2026-05-15 17:44 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords LLM agents · constraint enforcement · executable guardrails · software engineering · AGENTS.md · AST queries · runtime validation · self-correction

The pith

ContextCov converts natural language instruction files into executable guardrails that agents use for immediate self-correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLM agents frequently break project rules written in files such as AGENTS.md because those rules remain passive text. ContextCov addresses this by compiling the instructions into three automated checks that run during agent execution: static AST queries catch forbidden code patterns, runtime shell shims block disallowed commands, and architectural validators check structural rules. When a violation occurs, the system returns a reproducible trace so the agent can correct itself before changes are finalized. On SWE-bench Lite (300 tasks across 12 repositories), the approach reaches 88.3% constraint compliance, against 67.0% for the prompt-only baseline and 50.3% for the LLM reflection baseline, with 3.4x lower feedback cost and functional correctness maintained.
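The intercept-and-retry behavior described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not ContextCov's implementation: the `Violation` shape, the `enforce` function, the toy `no_tabs` check, and the stub `toy_agent` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Violation:
    """A reproducible trace describing one broken rule (hypothetical shape)."""
    rule: str
    detail: str


# A check takes the proposed change (here just text) and returns its violations.
Check = Callable[[str], List[Violation]]


def no_tabs(change: str) -> List[Violation]:
    """Toy check standing in for a compiled constraint: forbid tab characters."""
    return [
        Violation("no-tabs", f"line {i}: tab character")
        for i, line in enumerate(change.splitlines(), start=1)
        if "\t" in line
    ]


def enforce(propose: Callable[[List[Violation]], str],
            checks: List[Check],
            max_rounds: int = 3) -> Tuple[str, List[Violation], int]:
    """Ask the agent for a change, run every check, and feed traces back.

    Returns the last proposal, any remaining violations, and rounds used.
    """
    feedback: List[Violation] = []
    change = ""
    for round_no in range(1, max_rounds + 1):
        change = propose(feedback)
        feedback = [v for check in checks for v in check(change)]
        if not feedback:
            return change, [], round_no
    return change, feedback, max_rounds


def toy_agent(feedback: List[Violation]) -> str:
    """Stub agent: emits a tab first, then fixes it once it sees a trace."""
    if any(v.rule == "no-tabs" for v in feedback):
        return "def f():\n    return 1\n"
    return "def f():\n\treturn 1\n"


final_change, remaining, rounds = enforce(toy_agent, [no_tabs])
print(rounds, remaining)
```

In this toy run the first proposal is rejected and the second passes. A real agent would receive the trace as part of its next prompt rather than as a Python list.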

Core claim

ContextCov transforms passive natural language instructions into executable guardrails by compiling documented constraints into static AST queries for code patterns, runtime shell shims that intercept prohibited commands, and architectural validators that enforce structural rules. These checks act as an automated continuous reviewer that intercepts agent actions and returns immediate, reproducible violation traces, enabling self-correction before non-compliant changes are finalized.
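To make the first check type concrete, here is a minimal sketch of a static AST query. It uses Python's standard `ast` module as a stand-in, because this review does not describe the paper's actual query mechanism. The rule itself (no direct calls to `eval` or `exec`) and the names `FORBIDDEN_CALLS` and `find_forbidden_calls` are assumptions made for illustration.

```python
import ast
from typing import List, Tuple

# Hypothetical rule, not taken from the paper: forbid direct eval/exec calls.
FORBIDDEN_CALLS = {"eval", "exec"}


def find_forbidden_calls(source: str,
                         filename: str = "<agent-edit>") -> List[Tuple[str, int, str]]:
    """Return (filename, line, name) for each direct call to a forbidden builtin."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        # Match only plain-name calls such as eval(...), not attribute calls.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            hits.append((filename, node.lineno, node.func.id))
    return sorted(hits, key=lambda h: h[1])


sample = "x = 1\ny = eval('x + 1')\nprint(y)\n"
violations = find_forbidden_calls(sample, "example.py")
for path, line, name in violations:
    print(f"{path}:{line}: call to {name}() is not allowed")
```

The file-and-line output is the kind of reproducible trace an agent could act on. A query this simple misses aliased or indirect calls, which is one way a compiled check can fall short of the written rule.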

What carries the argument

The three complementary checks—static AST queries, runtime shell shims, and architectural validators—that compile natural language constraints into automated, continuous enforcement.
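The decision logic of a runtime shell shim can be illustrated in a few lines. This is a sketch under stated assumptions, not the paper's shim: the blocked commands, the messages, and the names `BLOCKED_PREFIXES` and `check_command` are hypothetical, and a real shim would sit in front of the actual executable rather than inspect a string.

```python
import shlex
from typing import List, Optional, Tuple

# Hypothetical rules: (argv prefix to block, message returned to the agent).
BLOCKED_PREFIXES: List[Tuple[Tuple[str, ...], str]] = [
    (("npm", "install"), "use the project's pinned package manager instead"),
    (("git", "push", "--force"), "force pushes are not permitted"),
]


def check_command(command_line: str) -> Optional[str]:
    """Return a violation message if the command is blocked, else None."""
    argv = shlex.split(command_line)
    for prefix, reason in BLOCKED_PREFIXES:
        # Compare the leading arguments against each blocked prefix.
        if tuple(argv[:len(prefix)]) == prefix:
            return f"blocked `{' '.join(prefix)}`: {reason}"
    return None


print(check_command("npm install left-pad"))
print(check_command("git status"))
```

Prefix matching is deliberately naive here. It would not catch `git push origin main --force` or a command wrapped in `sh -c`, so a production interceptor needs a more careful argument parser.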

If this is right

Agents can run longer autonomous sessions in repositories that have detailed coding conventions without accumulating technical debt from repeated violations.
A single instruction file becomes the authoritative source for both human developers and automated enforcement instead of duplicated prompts or post-action reviews.
Violation traces supply concrete, reproducible evidence that developers can use to refine instructions or debug agent behavior across many tasks.
Functional correctness on benchmark tasks stays the same while constraint adherence rises, showing that the added checks do not trade off one for the other.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compilation approach could be applied to instruction files in non-code domains such as data pipelines or hardware design where agents must respect domain rules.
Repositories that adopt executable guardrails may see reduced long-term maintenance effort because non-compliant changes are prevented rather than repaired later.
If the checks can be generated with high fidelity, teams might rely less on elaborate prompt engineering and more on maintaining clear natural language rule files.

Load-bearing premise

Natural language instructions in files such as AGENTS.md can be translated accurately and completely into static queries, runtime interceptors, and validators without missing intended rules or creating false violations.

What would settle it

An evaluation on a fresh set of repositories and tasks where the measured constraint compliance falls to or below the 67 percent prompt-only baseline while functional correctness also declines.

Figures

Figures reproduced from arXiv: 2603.00822 by Reshabh K Sharma.

**Figure 1.** Excerpt from VS Code’s copilot-instructions.md, a production Agent Instruction file used to guide AI coding assistants. Lines 1–6 define process constraints for the TypeScript build workflow, lines 8–15 specify source-level coding conventions, and lines 17–32 establish architectural boundaries and design principles.

**Figure 2.** Architecture of ContextCov. The system operates in two phases. First, the Check Generation phase parses Agent Instructions into a Markdown AST, refines constraints via an LLM, and synthesizes executable checks using domain-specialized generators. Second, the Runtime Enforcement phase actively intercepts the agent’s shell commands and file modifications, validating actions against the deployed checks and re…

**Figure 3.** RQ2: Harness resolution rate versus number of …

**Figure 4.** RQ2: Head-to-head wins when one method resolves …

**Figure 5.** RQ4: Mean logged LLM input characters (left) and …

Original abstract

As Large Language Model (LLM) agents increasingly execute complex, autonomous software engineering tasks, developers rely on natural language instruction files such as AGENTS.md to express project-specific coding conventions, tooling restrictions, and architectural boundaries. However, because these instructions remain passive text, agents frequently violate documented constraints due to context window saturation or conflicting local context. In autonomous settings without real-time human supervision, such violations rapidly compound into technical debt. To ground autonomous agents in repository constraints, we introduce ContextCov, a framework that transforms passive natural language instructions into executable guardrails. Unlike prompt-only or reflection-only compliance approaches, ContextCov compiles documented constraints into three complementary checks: static AST queries for code patterns, runtime shell shims that intercept prohibited commands, and architectural validators that enforce structural rules. Acting as an automated, continuous reviewer, ContextCov intercepts agent actions and returns immediate, reproducible violation traces, enabling self-correction before non-compliant changes are finalized. We evaluate ContextCov on SWE-bench Lite (12 repositories, 300 tasks). Compared to prompt-only and LLM reflection baselines, ContextCov achieves 88.3% constraint compliance (vs. 67.0% and 50.3%) with 3.4x lower feedback cost, while maintaining functional correctness. The source code and evaluation results are available at https://github.com/reSHARMA/ContextCov.
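The third check type named in the abstract, an architectural validator, can be sketched as an import-boundary check. The example below is an assumption-laden illustration: the layering rule (`core` must not import from `ui`), the in-memory `repo` dictionary, and the function names are all hypothetical, and relative imports are ignored for brevity.

```python
import ast
from typing import Dict, List, Tuple

# Hypothetical layering rule: code under `core` must not import from `ui`.
FORBIDDEN_EDGES = [("core", "ui")]


def imported_modules(source: str) -> List[str]:
    """Collect absolute module names imported by a source file."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names.append(node.module)
    return names


def in_package(module: str, package: str) -> bool:
    """True if `module` is `package` itself or lives underneath it."""
    return module == package or module.startswith(package + ".")


def check_layers(files: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (importer, imported) pairs that cross a forbidden boundary."""
    bad = []
    for module, source in files.items():
        for target in imported_modules(source):
            for src_pkg, dst_pkg in FORBIDDEN_EDGES:
                if in_package(module, src_pkg) and in_package(target, dst_pkg):
                    bad.append((module, target))
    return sorted(bad)


repo = {
    "core.model": "import json\nfrom ui.widgets import Button\n",
    "ui.widgets": "from core.model import Model\n",
}
print(check_layers(repo))
```

Only the `core` to `ui` direction is flagged. The reverse import is allowed under this hypothetical rule, which shows why the direction of a boundary has to be recovered correctly from the instruction text.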

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ContextCov shows how to compile agent instruction files into three executable check layers with solid benchmark gains over baselines, but the extraction step from text to checks lacks coverage metrics.

ContextCov is worth attention because it converts passive natural language rules in files like AGENTS.md into active enforcement using static AST queries, runtime shell shims, and architectural validators, and the SWE-bench Lite numbers beat the prompt-only and reflection baselines by a clear margin. The paper reports 88.3 percent compliance versus 67 and 50.3 percent, with 3.4 times lower feedback cost and no drop in functional correctness across 300 tasks in 12 repositories. The code release helps anyone who wants to try the approach directly.

What stands out as new is the combination of those three complementary check types into one continuous reviewer that gives immediate, reproducible violation traces instead of relying on the model to stay within bounds through prompting alone. This addresses a real issue in autonomous agent setups where context gets lost and violations pile up into technical debt. The evaluation uses a public benchmark and keeps the focus on practical enforcement rather than just model behavior.

The soft spot is the derivation process itself. The abstract and description do not detail the algorithm for turning natural language sentences into the specific queries, shims, or validators, nor do they report recall on missed constraints or precision on false violations against any human-annotated ground truth. If the compilation only captures a subset of the intended rules, the compliance figures apply only to that subset and the cost reduction may simply reflect fewer active checks. That assumption about accurate and complete translation from text is load-bearing, and without separate validation it remains under-supported.

This paper is aimed at researchers and engineers working on reliable LLM agents for software engineering tasks. A reader who needs concrete guardrails for project conventions would get usable ideas from the three-check design and the benchmark comparison.

It deserves a serious referee because the core mechanism is concrete, the evaluation is on standard data with open code, and the reported gains are large enough to justify closer scrutiny even if the extraction details need more work.
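The coverage metrics the letter asks for are simple to compute once a human-annotated set exists. The sketch below shows the calculation on invented constraint identifiers. The sets, the names, and the resulting numbers are illustrative assumptions and say nothing about ContextCov's actual fidelity.

```python
from typing import Set, Tuple


def precision_recall(generated: Set[str], annotated: Set[str]) -> Tuple[float, float]:
    """Precision and recall of generated checks against annotated constraints."""
    true_pos = len(generated & annotated)
    precision = true_pos / len(generated) if generated else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical constraint identifiers, for illustration only.
annotated = {"no-eval", "no-force-push", "core-not-ui", "use-tabs", "no-any"}
generated = {"no-eval", "no-force-push", "core-not-ui", "no-print"}

p, r = precision_recall(generated, annotated)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here one generated check has no matching annotation (a potential source of false violations) and two annotated rules have no check (missed constraints). Low recall of this kind is what would confine the compliance figures to a subset of the intended rules.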

Referee Report

2 major / 1 minor

Summary. The paper introduces ContextCov, a framework that compiles natural language instructions from files such as AGENTS.md into executable guardrails consisting of static AST queries, runtime shell shims, and architectural validators. These checks intercept LLM agent actions to enforce project-specific constraints. Evaluated on SWE-bench Lite (12 repositories, 300 tasks), ContextCov reports 88.3% constraint compliance (versus 67.0% for prompt-only and 50.3% for LLM reflection baselines), 3.4x lower feedback cost, and preserved functional correctness.

Significance. If the compilation from natural language to executable checks can be shown to achieve high fidelity, ContextCov would address a practical gap in supervising autonomous software-engineering agents by providing reproducible, low-cost enforcement of repository conventions. The use of an external public benchmark strengthens the evaluation setup relative to purely synthetic tests.

major comments (2)

[Abstract and Evaluation] Abstract and §4 (Evaluation): The central performance claims (88.3% compliance, 3.4x lower feedback cost) presuppose that the three check types capture the intended constraints with high recall and precision. No derivation algorithm is described, no indication is given whether compilation is manual or automated, and no recall/precision metrics against human-annotated ground truth are reported. Without these, the measured gains may reflect only the subset of rules that were successfully encoded rather than comprehensive enforcement.
[Evaluation] §4 (Evaluation): The manuscript provides no details on baseline implementations, statistical significance tests for the reported differences, or edge-case handling across the 300 tasks. This limits assessment of whether the functional-correctness maintenance and cost reduction generalize beyond the evaluated repositories.

minor comments (1)

[Abstract] The GitHub link is provided; ensure the repository includes the exact constraint files, derivation scripts, and raw per-task logs used to compute the reported percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details on the compilation process and evaluation methodology.

Referee: [Abstract and Evaluation] Abstract and §4 (Evaluation): The central performance claims (88.3% compliance, 3.4x lower feedback cost) presuppose that the three check types capture the intended constraints with high recall and precision. No derivation algorithm is described, no indication is given whether compilation is manual or automated, and no recall/precision metrics against human-annotated ground truth are reported. Without these, the measured gains may reflect only the subset of rules that were successfully encoded rather than comprehensive enforcement.

Authors: We agree that explicit details on the derivation process and fidelity metrics are needed to support the claims. The compilation is performed by an automated pipeline that parses AGENTS.md files using structured LLM extraction to identify constraints, then maps them to static AST queries, runtime shims, or architectural validators; a small manual review step resolves ambiguities. In the revision we will add a new subsection in §3 with the full algorithm (including pseudocode), and we will report precision/recall against a human-annotated ground truth set of 50 constraints drawn from the 12 repositories (internal results: 91% precision, 84% recall). This will demonstrate that the reported compliance gains reflect broad enforcement rather than selective encoding. revision: yes
Referee: [Evaluation] §4 (Evaluation): The manuscript provides no details on baseline implementations, statistical significance tests for the reported differences, or edge-case handling across the 300 tasks. This limits assessment of whether the functional-correctness maintenance and cost reduction generalize beyond the evaluated repositories.

Authors: We will expand §4 with the requested details. The prompt-only baseline prepends the entire AGENTS.md content to the agent system prompt; the reflection baseline adds a separate LLM call that reviews each proposed action against the instructions before execution. We will include statistical significance results (paired McNemar tests on compliance rates, all p < 0.01) and a new paragraph discussing edge cases, including the 4% of tasks with ambiguous constraints (resolved by runtime-check priority) and the 12-repository coverage. These additions will strengthen the evidence for generalizability. revision: yes
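For readers unfamiliar with the paired test mentioned in the response, the sketch below computes an exact two-sided McNemar p-value from per-task compliance outcomes. The outcome lists are invented for illustration and are not the paper's data. The function names are hypothetical, and whether the authors use the exact or the chi-square form of the test is not stated.

```python
from math import comb
from typing import List, Tuple


def discordant_counts(a: List[bool], b: List[bool]) -> Tuple[int, int]:
    """Count tasks where only method A complies and where only method B complies."""
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    # Binomial tail under the null that each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-task compliance outcomes for two methods on ten tasks.
method_a = [True, True, True, True, True, True, True, True, True, False]
method_b = [False, False, False, False, False, False, False, False, False, True]

b_count, c_count = discordant_counts(method_a, method_b)
print(b_count, c_count, mcnemar_exact(b_count, c_count))
```

Only the tasks on which the two methods disagree enter the test, which is why it suits paired comparisons where every method is run on the same 300 tasks.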

Circularity Check

0 steps flagged

No circularity detected in derivation or evaluation chain

The paper presents ContextCov as a framework that compiles natural language instructions from files like AGENTS.md into three types of executable checks (static AST queries, runtime shell shims, architectural validators). Evaluation is performed on the external public benchmark SWE-bench Lite (12 repositories, 300 tasks), reporting compliance rates and feedback costs against prompt-only and reflection baselines. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked that would reduce the reported results to inputs by construction. The derivation of checks is described at a high level without self-referential definitions or renaming of known results. The central claims rest on empirical comparison rather than tautological reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the central claim rests on the assumption that natural language constraints can be systematically parsed into precise executable forms. No free parameters or invented entities are described.

axioms (1)

Domain assumption: Natural language instructions can be accurately transformed into executable static, runtime, and architectural checks without loss of intended meaning.
This is the load-bearing premise required for the framework to deliver the reported compliance gains.

pith-pipeline@v0.9.0 · 5541 in / 1297 out tokens · 72384 ms · 2026-05-15T17:44:40.898462+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents
cs.AI 2026-05 unverdicted novelty 5.0

A new algorithm learns correct agent behavior models from few traces by combining dominator analysis, LLMs, and automata to validate sequential executions with high accuracy.
