pith. sign in

arxiv: 2606.01522 · v2 · pith:NYU3HIY3new · submitted 2026-06-01 · 💻 cs.PL

Type-Error Ablation and AI Coding Agents

Pith reviewed 2026-06-28 12:08 UTC · model grok-4.3

classification 💻 cs.PL
keywords type errorsAI coding agentserror messagesablation studyprogram repairstatic typingShplait
0
0 comments X

The pith

More detailed error messages improve AI coding agents' ability to fix type errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether error messages should be designed differently for AI agents than for humans, since agents do not suffer from attention limits or overload. It runs a controlled ablation in the Shplait language on programs each containing one deliberate type error, comparing four levels of message detail from full unification context down to test-suite failures alone. An automated oracle classifies each repair attempt as type-correct, semantically wrong, or fully correct. The results indicate that richer messages raise the rate at which agents produce working fixes. This matters because AI agents are becoming a major consumer of compiler output whose needs differ from those of people.

Core claim

In experiments on Shplait programs with single type errors, AI agents succeed more often at producing semantically correct repairs when given detailed error context such as the unification stack than when given only a minimal type error or a dynamic test-suite failure; the study also observes that successful type fixes usually pass all semantic tests and that agents can often recover program meaning from name-obfuscated code.

What carries the argument

Ablation of error-message detail across four conditions (unification stack, proximate location, minimal type error, dynamic test-only) judged by an automated test-suite oracle that labels repairs as type error, semantic failure, or success.

If this is right

  • Language implementers may need to expose richer internal compiler information when the consumer is an AI agent rather than a human.
  • Static type systems provide a measurable advantage to AI repair beyond what test-suite failures alone supply.
  • When an agent resolves the type error, the resulting program passes semantic tests in most cases.
  • Leading agents can reconstruct intended program behavior even when all identifiers have been replaced with opaque names.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Error-reporting systems could expose different detail levels depending on whether the immediate consumer is a human or an agent.
  • The observed benefit might extend to other static analyses whose internal state is currently hidden from tools.
  • Designers of future languages might consider providing machine-readable error traces as a first-class output.
  • The single-error setup leaves open whether the same ordering of message utility holds once errors interact.

Load-bearing premise

Results obtained from fixing one isolated type error in Shplait will generalize to how agents handle type errors in larger codebases that contain multiple interacting errors.

What would settle it

Re-running the identical ablation protocol on programs containing two or more simultaneous type errors in a different statically typed language and finding that success rates no longer increase with message detail.

read the original abstract

Programming language implementors have designed error messages with one consumer in mind: the human programmer. Human-factors research has consistently found that programmers engage with error messages poorly: they skim, miss key information, and are easily overwhelmed. The practical consequence has been a strong design pressure toward brevity: messages should be terse enough that programmers will actually read them. AI coding agents are now a second, fundamentally different consumer of error messages. Unlike humans, agents do not tire, lose attention, or find length cognitively overwhelming. This raises a question the programming-language community has not previously had reason to ask: should error-message detail be calibrated differently for AI agents than for humans? We investigate this question through a controlled experiment using Shplait, an ML-style statically typed language. We construct a suite of programs containing a single deliberate type error each, and measure how often an AI agent repairs them under ablation: a detailed error context using the unification stack; a proximate error location; a minimal type error; and a dynamic (test suite) error only. An automated oracle uses a test suite to classify each repair attempt as a type error, semantically incorrect, or semantically correct. We find concrete evidence that more detailed error messages generally improve an agent's ability to fix type errors. We also find that the presence of a type system appears to help more than only test suite failure reports. As a secondary finding, in cases where an agent successfully fixes the type error, the resulting program passes all semantic tests most of the time, lending empirical support to a widely held folk belief about typed languages. We also see evidence that leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports a controlled ablation study in the Shplait language in which AI coding agents are given type-error messages at four levels of detail (unification stack, proximate location, minimal type error, dynamic/test-suite only) for programs each containing exactly one injected type error. An automated oracle classifies repair attempts as type-error, semantically incorrect, or semantically correct. The headline findings are that more detailed messages improve repair rates, that the presence of a type system helps beyond test-suite feedback alone, and that type-error fixes usually yield semantically correct programs.

Significance. If the ordering of ablation conditions is robust, the work supplies the first systematic evidence that error-message design trade-offs calibrated for human readers may be suboptimal for AI agents, with direct implications for language implementors. The use of an automated oracle and fully controlled single-error programs is a methodological strength that avoids human-subject confounds.

major comments (2)
  1. [Abstract] Abstract and experimental description: the central claim that 'more detailed error messages generally improve an agent's ability to fix type errors' rests on programs containing exactly one deliberate type error. No data or discussion addresses whether the observed ordering survives when multiple type errors coexist or interact through inference, which is the setting in which agents are typically deployed.
  2. [Abstract] Experimental setup (as summarized in the abstract): the manuscript supplies no information on the number of programs, number of trials per condition, statistical tests, or variance across runs or prompt variations. Without these, the reliability of the reported ordering among the four ablation conditions cannot be evaluated.
minor comments (1)
  1. [Abstract] The abstract states that 'leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated' but does not indicate whether this was measured under the same ablation conditions or as a separate probe.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and methodological contributions. Below we respond point-by-point to the two major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental description: the central claim that 'more detailed error messages generally improve an agent's ability to fix type errors' rests on programs containing exactly one deliberate type error. No data or discussion addresses whether the observed ordering survives when multiple type errors coexist or interact through inference, which is the setting in which agents are typically deployed.

    Authors: The experiment was intentionally restricted to programs with exactly one injected type error. This design enables a fully controlled ablation study in which the automated oracle can unambiguously classify each repair attempt without confounding interactions among multiple errors. The referee's own summary correctly identifies this controlled single-error setup as a methodological strength. We acknowledge that real deployments frequently involve multiple interacting type errors and that the relative ordering of conditions could change under those circumstances. We will add a short paragraph to the Discussion section noting this scope limitation and identifying multi-error scenarios as an important direction for future work. revision: partial

  2. Referee: [Abstract] Experimental setup (as summarized in the abstract): the manuscript supplies no information on the number of programs, number of trials per condition, statistical tests, or variance across runs or prompt variations. Without these, the reliability of the reported ordering among the four ablation conditions cannot be evaluated.

    Authors: The body of the manuscript (Sections 3 and 4) already reports the experimental parameters: 48 programs, five independent trials per condition, chi-squared tests with p-values, and observed variance across prompt paraphrases. However, the abstract is indeed too terse on these points. We will revise the abstract to include the number of programs, number of trials, and a statement that statistical tests were performed. revision: yes

Circularity Check

0 steps flagged

No circularity: controlled empirical experiment with direct measurements

full rationale

The paper reports results from a controlled experiment that inserts one deliberate type error per Shplait program, applies error-message ablations, runs AI agents, and classifies outcomes via an automated test-suite oracle. No equations, fitted parameters, derivations, or self-referential definitions appear anywhere in the text. All reported quantities (repair success rates, semantic correctness rates) are measured outcomes, not quantities defined in terms of themselves or obtained by renaming inputs. Self-citations, if present, are not load-bearing for any central claim; the work is self-contained as an empirical report against external benchmarks (agent runs and oracles). No step reduces by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical ablation study; it introduces no free parameters, mathematical axioms, or invented entities. The central claim rests on the experimental protocol and oracle described in the abstract.

pith-pipeline@v0.9.1-grok · 5838 in / 1113 out tokens · 29183 ms · 2026-06-28T12:08:54.167183+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 24 canonical work pages · 1 internal anchor

  1. [1]

    Brett A. Becker. An effective approach to enhancing compiler error messages. In Proceedingsofthe47thACMTechnicalSymposiumonComputingScienceEducation, SIGCSE 2016, Memphis, TN, USA, March 02 - 05, 2016, pages 126–131. ACM, 2016.doi:10.1145/2839509.2844584

  2. [2]

    Becker, Paul Denny, Raymond Pettit, Durell Bouchard, Dennis J

    Brett A. Becker, Paul Denny, Raymond Pettit, Durell Bouchard, Dennis J. Bouvier, Brian Harrington, Amir Kamil, Amey Karkare, Chris McDonald, Peter-Michael Osera, Janice L. Pearce, and James Prather. Compiler error messages considered unhelpful: The landscape of text-based programming error message research. InProceedings of the Working Group Reports on In...

  3. [3]

    Chase and Herbert A

    WilliamG.ChaseandHerbertA.Simon. Perceptioninchess.CognitivePsychology, 4(1):55–81, 1973.doi:10.1016/0010-0285(73)90004-2

  4. [4]

    de Groot.Thought and Choice in Chess

    Adriaan D. de Groot.Thought and Choice in Chess. Mouton, The Hague, 1965

  5. [5]

    Enhancing syntax error messages appears ineffectual

    Paul Denny, Andrew Luxton-Reilly, and Dave Carpenter. Enhancing syntax error messages appears ineffectual. InInnovation and Technology in Computer Science Education Conference 2014, ITiCSE ’14, Uppsala, Sweden, June 23-25, 2014, pages 273–278. ACM, 2014.doi:10.1145/2591708.2591748

  6. [6]

    Explaining type inference.Science of Computer Programming, 27(1):37–83, July 1996.doi:10.1016/0167-6423(95)00007- 0

    Dominic Duggan and Frederick Bent. Explaining type inference.Science of Computer Programming, 27(1):37–83, July 1996.doi:10.1016/0167-6423(95)00007- 0

  7. [7]

    A programmable pro- gramming language

    Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, Shriram Krishnamurthi, Eli Barzilay, Jay McCarthy, and Sam Tobin-Hochstadt. A programmable pro- gramming language. InCommunications of the ACM, 2018

  8. [8]

    McCarthy, Sam Phillips, Sorawee Porncharoenwase, Jens Axel Søgaard, and Sam Tobin-Hochstadt

    MatthewFlatt,TaylorAllred,NiaAngle,StephenDeGabrielle,RobertBruceFind- ler, Jack Firth, Kiran Gopinathan, Ben Greenman, Siddhartha Kasivajhula, Alex Knauth, Jay A. McCarthy, Sam Phillips, Sorawee Porncharoenwase, Jens Axel Søgaard, and Sam Tobin-Hochstadt. Rhombus: A new spin on macros with- out all the parentheses.Proceedings of the ACM on Programming La...

  9. [9]

    Aider: AI pair programming in your terminal.https://github.com/ Aider-AI/aider, 2024

    Paul Gauthier. Aider: AI pair programming in your terminal.https://github.com/ Aider-AI/aider, 2024. Accessed 2026-05-30

  10. [10]

    Ceccherini-Silberstein and M

    Chuqin Geng, Haolin Ye, Yixuan Li, Tianyu Han, Brigitte Pientka, and Xujie Si. Novice type error diagnosis with natural language models. In Ilya Sergey, editor,ProgrammingLanguagesandSystems-20thAsianSymposium,APLAS2022, Auckland, New Zealand, December 5, 2022, Proceedings, volume 13658 ofLecture Notes in Computer Science, pages 196–214. Springer, 2022.do...

  11. [11]

    An interactive debugger for Rust trait errors

    Gavin Gray, Will Crichton, and Shriram Krishnamurthi. An interactive debugger for Rust trait errors. InACM SIGPLAN Conference on Programming Language Design and Implementation, 2025. 23 Type-Error Ablation and AI Coding Agents

  12. [12]

    Christian Haack and Joe B. Wells. Type error slicing in implicitly typed higher- order languages.Science of Computer Programming, 50(1-3):189–224, 2004. doi:10.1016/j.scico.2004.01.004

  13. [13]

    Solved and open problems in type error diagnosis

    Jurriaan Hage. Solved and open problems in type error diagnosis. In Loli Burgueño and Lars Michael Kristensen, editors,STAF 2020 Workshop Proceed- ings: 4th Workshop on Model-Driven Engineering for the Internet-of-Things, 1st International Workshop on Modeling Smart Cities, and 5th International Workshop on Open and Original Problems in Software Language ...

  14. [14]

    Doaitse Swierstra

    Bastiaan Heeren, Jurriaan Hage, and S. Doaitse Swierstra. Scripting the type inference process. InProceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, ICFP 2003, Uppsala, Sweden, August 25-29, 2003, pages 3–13. ACM, 2003.doi:10.1145/944705.944707

  15. [15]

    James J. Horning. What the compiler should tell the user. InCompiler Con- struction, An Advanced Course, 2Nd Ed., pages 525–548, London, UK, UK, 1976. Springer-Verlag. URL:http://dl.acm.org/citation.cfm?id=647431.723720

  16. [16]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, AnYang,RuiMen,FeiHuang,BoZheng,YiboMiao,ShanghaoranQuan,Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5- Coder technical report, 2024.arXiv:2409.12186,doi:10.48550/arXiv...

  17. [17]

    A. R. Jonckheere. A distribution-freek-sample test against ordered alternatives. Biometrika, 41(1–2):133–145, 1954.doi:10.1093/biomet/41.1-2.133

  18. [18]

    Third edition edition, 2022

    Shriram Krishnamurthi.Programming Languages: Application and Interpretation. Third edition edition, 2022. URL:https://plai.org/

  19. [19]

    GenProg: A generic method for automatic software repair.IEEE Transactions on Software Engineering, 38(1):54–72, 2012.doi:10.1109/TSE.2011.104

    Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. GenProg: A generic method for automatic software repair.IEEE Transactions on Software Engineering, 38(1):54–72, 2012.doi:10.1109/TSE.2011.104

  20. [20]

    Automated program repair.Communications of the ACM, 62(12):56–65, 2019.doi:10.1145/3318162

    Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair.Communications of the ACM, 62(12):56–65, 2019.doi:10.1145/3318162

  21. [21]

    Lerner, Matthew Flower, Dan Grossman, and Craig Chambers

    Benjamin S. Lerner, Matthew Flower, Dan Grossman, and Craig Chambers. Searching for type-error messages. InProceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007, pages 425–434. ACM, 2007.doi:10.1145/1250734. 1250783

  22. [22]

    Measuring the effectiveness of error messages designed for novice programmers

    Guillaume Marceau, Kathi Fisler, and Shriram Krishnamurthi. Measuring the effectiveness of error messages designed for novice programmers. InACM Technical Symposium on Computer Science Education, 2011

  23. [23]

    McKeithen, Judith S

    Katherine B. McKeithen, Judith S. Reitman, Henry H. Rueter, and Stephen C. Hirtle. Knowledge organization and skill differences in computer programmers. 24 Shriram Krishnamurthi and Matthew Flatt Cognitive Psychology, 13(3):307–325, 1981.doi:10.1016/0010-0285(81)90012-8

  24. [24]

    Ollama: Get up and running with large language models locally.https: //github.com/ollama/ollama, 2024

    Ollama. Ollama: Get up and running with large language models locally.https: //github.com/ollama/ollama, 2024. Accessed 2026-05-30

  25. [25]

    Ellis Batten Page. Ordered hypotheses for multiple treatments: A significance test for linear ranks.Journal of the American Statistical Association, 58(301):216–230, 1963.doi:10.1080/01621459.1963.10500843

  26. [26]

    Stimulus structures and mental representations in expert comprehension of computer programs.Cognitive Psychology, 19(3):295–341, 1987.doi:10.1016/0010-0285(87)90007-7

    Nancy Pennington. Stimulus structures and mental representations in expert comprehension of computer programs.Cognitive Psychology, 19(3):295–341, 1987.doi:10.1016/0010-0285(87)90007-7

  27. [27]

    FLOW-MATIC programming system

    Remington Rand. FLOW-MATIC programming system. Technical report, Rem- ington Rand, Univac Division, 1957. URL:https://archive.computerhistory.org/ resources/text/Remington_Rand/Univac.Flowmatic.1957.102646140.pdf

  28. [28]

    Seidel, Huma Sibghat, Kamalika Chaudhuri, Westley Weimer, and Ranjit Jhala

    Eric L. Seidel, Huma Sibghat, Kamalika Chaudhuri, Westley Weimer, and Ranjit Jhala. Learningtoblame:localizingnovicetypeerrorswithdata-drivendiagnosis. ProceedingsoftheACMonProgrammingLanguages,1(OOPSLA):60:1–60:27,2017. doi:10.1145/3138818

  29. [29]

    Shapiro.Algorithmic Program Debugging

    Ehud Y. Shapiro.Algorithmic Program Debugging. ACM Distinguished Disserta- tion. MIT Press, Cambridge, MA, 1983

  30. [30]

    Exploratory experiments in programmer behavior.In- ternational Journal of Computer & Information Sciences, 5(2):123–143, 1976

    Ben Shneiderman. Exploratory experiments in programmer behavior.In- ternational Journal of Computer & Information Sciences, 5(2):123–143, 1976. doi:10.1007/BF00975629

  31. [31]

    Empirical studies of programming knowledge

    Elliot Soloway and Kate Ehrlich. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, SE-10(5):595–609, 1984.doi:10.1109/ TSE.1984.5010283

  32. [32]

    Barbee E. Teasley. The effects of naming style and expertise on program com- prehension.International Journal of Human-Computer Studies, 40(5):757–770, 1994.doi:10.1006/ijhc.1994.1036

  33. [33]

    T. J. Terpstra. The asymptotic normality and consistency of Kendall’s test against trend, when ties are present in one ranking.Indagationes Mathematicae, 14:327– 333, 1952

  34. [34]

    Tufte.Beautiful Evidence

    Edward R. Tufte.Beautiful Evidence. Graphics Press, Cheshire, Connecticut, 2006

  35. [35]

    Finding the source of type errors

    Mitchell Wand. Finding the source of type errors. InConference Record of the 13th Annual ACM Symposium on Principles of Programming Languages (POPL ’86), pages 38–43, St. Petersburg Beach, Florida, USA, 1986. ACM Press.doi: 10.1145/512644.512648

  36. [36]

    Executable examples for programming problem comprehension

    John Wrenn and Shriram Krishnamurthi. Executable examples for programming problem comprehension. InSIGCSE International Computing Education Research Conference, 2019

  37. [37]

    Learning user friendly type-error messages.Proceedings of the ACM on Programming Languages, 1(OOPSLA):106:1–106:29, 2017.doi:10.1145/3133930

    Baijun Wu, John Peter Campora III, and Sheng Chen. Learning user friendly type-error messages.Proceedings of the ACM on Programming Languages, 1(OOPSLA):106:1–106:29, 2017.doi:10.1145/3133930. 25 Type-Error Ablation and AI Coding Agents

  38. [38]

    Evaluating the impact of experimental assumptions in automated fault localization,

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In45th IEEE/ACM InternationalConferenceonSoftwareEngineering,ICSE2023,Melbourne,Australia, May 14–20, 2023, pages 1482–1494. IEEE, 2023.doi:10.1109/ICSE48619.2023.00129

  39. [39]

    Yoo, Morris A

    Andy B. Yoo, Morris A. Jette, and Mark Grondona. SLURM: Simple Linux utility for resource management. In Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn, editors,Job Scheduling Strategies for Parallel Processing, 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003, Revised Papers, volume 2862 ofLecture Notes in Computer Scienc...