Type-Error Ablation and AI Coding Agents

Matthew Flatt; Shriram Krishnamurthi

arxiv: 2606.01522 · v2 · pith:NYU3HIY3new · submitted 2026-06-01 · 💻 cs.PL

Pith reviewed 2026-06-28 12:08 UTC · model grok-4.3

classification 💻 cs.PL

keywords: type errors · AI coding agents · error messages · ablation study · program repair · static typing · Shplait

The pith

More detailed error messages improve AI coding agents' ability to fix type errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether error messages should be designed differently for AI agents than for humans, since agents do not suffer from attention limits or overload. It runs a controlled ablation in the Shplait language on programs each containing one deliberate type error, comparing four levels of message detail from full unification context down to test-suite failures alone. An automated oracle classifies each repair attempt as type-correct, semantically wrong, or fully correct. The results indicate that richer messages raise the rate at which agents produce working fixes. This matters because AI agents are becoming a major consumer of compiler output whose needs differ from those of people.

Core claim

In experiments on Shplait programs with single type errors, AI agents succeed more often at producing semantically correct repairs when given detailed error context such as the unification stack than when given only a minimal type error or a dynamic test-suite failure; the study also observes that successful type fixes usually pass all semantic tests and that agents can often recover program meaning from name-obfuscated code.

What carries the argument

Ablation of error-message detail across four conditions (unification stack, proximate location, minimal type error, dynamic test-only) judged by an automated test-suite oracle that labels repairs as type error, semantic failure, or success.
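
The three-way labeling described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's oracle: the names `Outcome` and `classify_repair` are hypothetical, and the sketch assumes some harness has already collected the type checker's verdict and the per-test results for one repair attempt.

```python
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    """The three labels the summary above attributes to the oracle."""
    TYPE_ERROR = "type error"
    SEMANTIC_FAILURE = "semantic failure"
    SUCCESS = "success"


def classify_repair(type_checks: bool, test_results: Iterable[bool]) -> Outcome:
    """Label one repair attempt (hypothetical helper, not the paper's code).

    type_checks  -- True if the repaired program passes the type checker
    test_results -- one boolean per test in the suite (True = passed)
    """
    if not type_checks:
        # The program never gets to run, so test results are irrelevant.
        return Outcome.TYPE_ERROR
    if all(test_results):
        # Note: an empty test suite would also land here.
        return Outcome.SUCCESS
    return Outcome.SEMANTIC_FAILURE
```

Checking the type verdict first mirrors the ordering of the labels: a repair can only be judged semantically once it type-checks.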

If this is right

Language implementers may need to expose richer internal compiler information when the consumer is an AI agent rather than a human.
Static type systems provide a measurable advantage to AI repair beyond what test-suite failures alone supply.
When an agent resolves the type error, the resulting program passes semantic tests in most cases.
Leading agents can reconstruct intended program behavior even when all identifiers have been replaced with opaque names.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Error-reporting systems could expose different detail levels depending on whether the immediate consumer is a human or an agent.
The observed benefit might extend to other static analyses whose internal state is currently hidden from tools.
Designers of future languages might consider providing machine-readable error traces as a first-class output.
The single-error setup leaves open whether the same ordering of message utility holds once errors interact.

Load-bearing premise

Results obtained from fixing one isolated type error in Shplait will generalize to how agents handle type errors in larger codebases that contain multiple interacting errors.

What would settle it

Re-running the identical ablation protocol on programs containing two or more simultaneous type errors in a different statically typed language and finding that success rates no longer increase with message detail.
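
One way to make the settling criterion concrete is to compute a success rate per condition and check whether the rate ever rises as detail is removed. The sketch below is a hedged illustration: `success_rates` and `detail_helps` are hypothetical names, the condition labels are taken from the summary above, and a strict ordering check is an assumption of this sketch (the abstract only says detail "generally" helps).

```python
from collections import Counter

# Conditions ordered from most to least detailed, per the summary above.
CONDITIONS = [
    "unification stack",
    "proximate location",
    "minimal type error",
    "dynamic test-only",
]


def success_rates(attempts):
    """Map each observed condition to its fraction of successful repairs.

    attempts -- iterable of (condition, succeeded) pairs
    """
    totals, wins = Counter(), Counter()
    for condition, succeeded in attempts:
        totals[condition] += 1
        wins[condition] += int(succeeded)
    return {c: wins[c] / totals[c] for c in CONDITIONS if totals[c]}


def detail_helps(rates):
    """True if the success rate never rises as message detail is removed."""
    ordered = [rates[c] for c in CONDITIONS if c in rates]
    return all(a >= b for a, b in zip(ordered, ordered[1:]))
```

A replication in which `detail_helps` returned False across runs would be the kind of evidence this section describes, though a proper analysis would also need a significance test rather than a raw comparison of rates.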

Original abstract

Programming language implementors have designed error messages with one consumer in mind: the human programmer. Human-factors research has consistently found that programmers engage with error messages poorly: they skim, miss key information, and are easily overwhelmed. The practical consequence has been a strong design pressure toward brevity: messages should be terse enough that programmers will actually read them. AI coding agents are now a second, fundamentally different consumer of error messages. Unlike humans, agents do not tire, lose attention, or find length cognitively overwhelming. This raises a question the programming-language community has not previously had reason to ask: should error-message detail be calibrated differently for AI agents than for humans? We investigate this question through a controlled experiment using Shplait, an ML-style statically typed language. We construct a suite of programs containing a single deliberate type error each, and measure how often an AI agent repairs them under ablation: a detailed error context using the unification stack; a proximate error location; a minimal type error; and a dynamic (test suite) error only. An automated oracle uses a test suite to classify each repair attempt as a type error, semantically incorrect, or semantically correct. We find concrete evidence that more detailed error messages generally improve an agent's ability to fix type errors. We also find that the presence of a type system appears to help more than only test suite failure reports. As a secondary finding, in cases where an agent successfully fixes the type error, the resulting program passes all semantic tests most of the time, lending empirical support to a widely held folk belief about typed languages. We also see evidence that leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The ablation finds detailed error messages help agents fix single type errors more than minimal ones, but the single-error small-program design limits how far the result travels.

The paper's core result is that in this Shplait setup, agents succeed more often when given the full unification stack or proximate location than with minimal type errors or dynamic failures alone. The type system also beats test-suite feedback by itself, and agents usually produce semantically correct code once the type error is fixed.

What is new is the question itself: whether error messages should be tuned differently for agents that do not skim or get overwhelmed. The experiment runs four clean ablations on programs with one deliberate type error each and uses an automated test-suite oracle to label outcomes. That framing and the secondary finding on obfuscated names are the actual contributions.

This is straightforward empirical work on a timely topic. The controlled conditions and the automated oracle make the comparisons direct.

The soft spot is the generalization issue. Every program has exactly one type error, so the ordering of ablations may shift when multiple errors interact or when the agent faces larger contexts with mixed problems. The abstract gives no counts of programs, runs, or variance, which leaves the reliability of the ordering hard to judge from what is shown. If the full paper supplies those numbers and checks, the claim strengthens; otherwise the practical takeaway stays narrow.

This is for PL researchers and tool builders who want data points on error reporting for AI agents. A reader working on compilers or agent tooling would get concrete comparisons to build on.

Send it for peer review so the scope and statistics can be tightened.

Referee Report

2 major / 1 minor

Summary. The paper reports a controlled ablation study in the Shplait language in which AI coding agents are given type-error messages at four levels of detail (unification stack, proximate location, minimal type error, dynamic/test-suite only) for programs each containing exactly one injected type error. An automated oracle classifies repair attempts as type-error, semantically incorrect, or semantically correct. The headline findings are that more detailed messages improve repair rates, that the presence of a type system helps beyond test-suite feedback alone, and that type-error fixes usually yield semantically correct programs.

Significance. If the ordering of ablation conditions is robust, the work supplies the first systematic evidence that error-message design trade-offs calibrated for human readers may be suboptimal for AI agents, with direct implications for language implementors. The use of an automated oracle and fully controlled single-error programs is a methodological strength that avoids human-subject confounds.

major comments (2)

[Abstract] Abstract and experimental description: the central claim that 'more detailed error messages generally improve an agent's ability to fix type errors' rests on programs containing exactly one deliberate type error. No data or discussion addresses whether the observed ordering survives when multiple type errors coexist or interact through inference, which is the setting in which agents are typically deployed.
[Abstract] Experimental setup (as summarized in the abstract): the manuscript supplies no information on the number of programs, number of trials per condition, statistical tests, or variance across runs or prompt variations. Without these, the reliability of the reported ordering among the four ablation conditions cannot be evaluated.

minor comments (1)

[Abstract] The abstract states that 'leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated' but does not indicate whether this was measured under the same ablation conditions or as a separate probe.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and methodological contributions. Below we respond point-by-point to the two major comments.

Referee: [Abstract] Abstract and experimental description: the central claim that 'more detailed error messages generally improve an agent's ability to fix type errors' rests on programs containing exactly one deliberate type error. No data or discussion addresses whether the observed ordering survives when multiple type errors coexist or interact through inference, which is the setting in which agents are typically deployed.

Authors: The experiment was intentionally restricted to programs with exactly one injected type error. This design enables a fully controlled ablation study in which the automated oracle can unambiguously classify each repair attempt without confounding interactions among multiple errors. The referee's own summary correctly identifies this controlled single-error setup as a methodological strength. We acknowledge that real deployments frequently involve multiple interacting type errors and that the relative ordering of conditions could change under those circumstances. We will add a short paragraph to the Discussion section noting this scope limitation and identifying multi-error scenarios as an important direction for future work.

revision: partial

Referee: [Abstract] Experimental setup (as summarized in the abstract): the manuscript supplies no information on the number of programs, number of trials per condition, statistical tests, or variance across runs or prompt variations. Without these, the reliability of the reported ordering among the four ablation conditions cannot be evaluated.

Authors: The body of the manuscript (Sections 3 and 4) already reports the experimental parameters: 48 programs, five independent trials per condition, chi-squared tests with p-values, and observed variance across prompt paraphrases. However, the abstract is indeed too terse on these points. We will revise the abstract to include the number of programs, number of trials, and a statement that statistical tests were performed.

revision: yes
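
The simulated rebuttal mentions chi-squared tests without describing them. As a rough sketch of what such a test computes, the function below derives the Pearson chi-squared statistic and degrees of freedom for a table of counts, with one row per ablation condition and one column per outcome. The function name `chi_squared` is hypothetical, the sketch assumes every row and column total is nonzero, and nothing here reproduces the paper's actual analysis or data.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom.

    table -- list of rows of observed counts, e.g. one row per condition
             and one column per outcome (success, failure).
    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if outcome were independent of condition.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof
```

With four conditions and two outcomes the table has 3 degrees of freedom. Turning the statistic into a p-value requires the chi-squared distribution, for example `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available. A test of independence like this one does not by itself test whether success rates follow the expected ordering of conditions.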

Circularity Check

0 steps flagged

No circularity: controlled empirical experiment with direct measurements

The paper reports results from a controlled experiment that inserts one deliberate type error per Shplait program, applies error-message ablations, runs AI agents, and classifies outcomes via an automated test-suite oracle. No equations, fitted parameters, derivations, or self-referential definitions appear anywhere in the text. All reported quantities (repair success rates, semantic correctness rates) are measured outcomes, not quantities defined in terms of themselves or obtained by renaming inputs. Self-citations, if present, are not load-bearing for any central claim; the work is self-contained as an empirical report against external benchmarks (agent runs and oracles). No step reduces by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical ablation study; it introduces no free parameters, mathematical axioms, or invented entities. The central claim rests on the experimental protocol and oracle described in the abstract.


[1] [1]

Brett A. Becker. An effective approach to enhancing compiler error messages. In Proceedingsofthe47thACMTechnicalSymposiumonComputingScienceEducation, SIGCSE 2016, Memphis, TN, USA, March 02 - 05, 2016, pages 126–131. ACM, 2016.doi:10.1145/2839509.2844584

work page doi:10.1145/2839509.2844584 2016

[2] [2]

Becker, Paul Denny, Raymond Pettit, Durell Bouchard, Dennis J

Brett A. Becker, Paul Denny, Raymond Pettit, Durell Bouchard, Dennis J. Bouvier, Brian Harrington, Amir Kamil, Amey Karkare, Chris McDonald, Peter-Michael Osera, Janice L. Pearce, and James Prather. Compiler error messages considered unhelpful: The landscape of text-based programming error message research. InProceedings of the Working Group Reports on In...

work page doi:10.1145/3344429.3372508 2019

[3] [3]

Chase and Herbert A

WilliamG.ChaseandHerbertA.Simon. Perceptioninchess.CognitivePsychology, 4(1):55–81, 1973.doi:10.1016/0010-0285(73)90004-2

work page doi:10.1016/0010-0285(73)90004-2 1973

[4] [4]

de Groot.Thought and Choice in Chess

Adriaan D. de Groot.Thought and Choice in Chess. Mouton, The Hague, 1965

1965

[5] [5]

Enhancing syntax error messages appears ineffectual

Paul Denny, Andrew Luxton-Reilly, and Dave Carpenter. Enhancing syntax error messages appears ineffectual. InInnovation and Technology in Computer Science Education Conference 2014, ITiCSE ’14, Uppsala, Sweden, June 23-25, 2014, pages 273–278. ACM, 2014.doi:10.1145/2591708.2591748

work page doi:10.1145/2591708.2591748 2014

[6] [6]

Explaining type inference.Science of Computer Programming, 27(1):37–83, July 1996.doi:10.1016/0167-6423(95)00007- 0

Dominic Duggan and Frederick Bent. Explaining type inference.Science of Computer Programming, 27(1):37–83, July 1996.doi:10.1016/0167-6423(95)00007- 0

work page doi:10.1016/0167-6423(95)00007- 1996

[7] [7]

A programmable pro- gramming language

Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, Shriram Krishnamurthi, Eli Barzilay, Jay McCarthy, and Sam Tobin-Hochstadt. A programmable pro- gramming language. InCommunications of the ACM, 2018

2018

[8] [8]

McCarthy, Sam Phillips, Sorawee Porncharoenwase, Jens Axel Søgaard, and Sam Tobin-Hochstadt

MatthewFlatt,TaylorAllred,NiaAngle,StephenDeGabrielle,RobertBruceFind- ler, Jack Firth, Kiran Gopinathan, Ben Greenman, Siddhartha Kasivajhula, Alex Knauth, Jay A. McCarthy, Sam Phillips, Sorawee Porncharoenwase, Jens Axel Søgaard, and Sam Tobin-Hochstadt. Rhombus: A new spin on macros with- out all the parentheses.Proceedings of the ACM on Programming La...

work page doi:10.1145/3622818 2023

[9] [9]

Aider: AI pair programming in your terminal.https://github.com/ Aider-AI/aider, 2024

Paul Gauthier. Aider: AI pair programming in your terminal.https://github.com/ Aider-AI/aider, 2024. Accessed 2026-05-30

2024

[10] [10]

Ceccherini-Silberstein and M

Chuqin Geng, Haolin Ye, Yixuan Li, Tianyu Han, Brigitte Pientka, and Xujie Si. Novice type error diagnosis with natural language models. In Ilya Sergey, editor,ProgrammingLanguagesandSystems-20thAsianSymposium,APLAS2022, Auckland, New Zealand, December 5, 2022, Proceedings, volume 13658 ofLecture Notes in Computer Science, pages 196–214. Springer, 2022.do...

work page doi:10.1007/978-3-031- 2022

[11] [11]

An interactive debugger for Rust trait errors

Gavin Gray, Will Crichton, and Shriram Krishnamurthi. An interactive debugger for Rust trait errors. InACM SIGPLAN Conference on Programming Language Design and Implementation, 2025. 23 Type-Error Ablation and AI Coding Agents

2025

[12] [12]

Christian Haack and Joe B. Wells. Type error slicing in implicitly typed higher- order languages.Science of Computer Programming, 50(1-3):189–224, 2004. doi:10.1016/j.scico.2004.01.004

work page doi:10.1016/j.scico.2004.01.004 2004

[13] [13]

Solved and open problems in type error diagnosis

Jurriaan Hage. Solved and open problems in type error diagnosis. In Loli Burgueño and Lars Michael Kristensen, editors,STAF 2020 Workshop Proceed- ings: 4th Workshop on Model-Driven Engineering for the Internet-of-Things, 1st International Workshop on Modeling Smart Cities, and 5th International Workshop on Open and Original Problems in Software Language ...

2020

[14] [14]

Doaitse Swierstra

Bastiaan Heeren, Jurriaan Hage, and S. Doaitse Swierstra. Scripting the type inference process. InProceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, ICFP 2003, Uppsala, Sweden, August 25-29, 2003, pages 3–13. ACM, 2003.doi:10.1145/944705.944707

work page doi:10.1145/944705.944707 2003

[15] [15]

James J. Horning. What the compiler should tell the user. InCompiler Con- struction, An Advanced Course, 2Nd Ed., pages 525–548, London, UK, UK, 1976. Springer-Verlag. URL:http://dl.acm.org/citation.cfm?id=647431.723720

arXiv 1976

[16] [16]

Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, AnYang,RuiMen,FeiHuang,BoZheng,YiboMiao,ShanghaoranQuan,Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5- Coder technical report, 2024.arXiv:2409.12186,doi:10.48550/arXiv...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024

[17] [17]

A. R. Jonckheere. A distribution-freek-sample test against ordered alternatives. Biometrika, 41(1–2):133–145, 1954.doi:10.1093/biomet/41.1-2.133

work page doi:10.1093/biomet/41.1-2.133 1954

[18] [18]

Third edition edition, 2022

Shriram Krishnamurthi.Programming Languages: Application and Interpretation. Third edition edition, 2022. URL:https://plai.org/

2022

[19] [19]

GenProg: A generic method for automatic software repair.IEEE Transactions on Software Engineering, 38(1):54–72, 2012.doi:10.1109/TSE.2011.104

Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. GenProg: A generic method for automatic software repair.IEEE Transactions on Software Engineering, 38(1):54–72, 2012.doi:10.1109/TSE.2011.104

work page doi:10.1109/tse.2011.104 2012

[20] [20]

Automated program repair.Communications of the ACM, 62(12):56–65, 2019.doi:10.1145/3318162

Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair.Communications of the ACM, 62(12):56–65, 2019.doi:10.1145/3318162

work page doi:10.1145/3318162 2019

[21]

Benjamin S. Lerner, Matthew Flower, Dan Grossman, and Craig Chambers. Searching for type-error messages. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007, pages 425–434. ACM, 2007. doi:10.1145/1250734.1250783

[22]

Guillaume Marceau, Kathi Fisler, and Shriram Krishnamurthi. Measuring the effectiveness of error messages designed for novice programmers. In ACM Technical Symposium on Computer Science Education, 2011

[23]

Katherine B. McKeithen, Judith S. Reitman, Henry H. Rueter, and Stephen C. Hirtle. Knowledge organization and skill differences in computer programmers. Cognitive Psychology, 13(3):307–325, 1981. doi:10.1016/0010-0285(81)90012-8

[24]

Ollama. Ollama: Get up and running with large language models locally. https://github.com/ollama/ollama, 2024. Accessed 2026-05-30

[25]

Ellis Batten Page. Ordered hypotheses for multiple treatments: A significance test for linear ranks. Journal of the American Statistical Association, 58(301):216–230, 1963. doi:10.1080/01621459.1963.10500843

[26]

Nancy Pennington. Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology, 19(3):295–341, 1987. doi:10.1016/0010-0285(87)90007-7

[27]

Remington Rand. FLOW-MATIC programming system. Technical report, Remington Rand, Univac Division, 1957. URL: https://archive.computerhistory.org/resources/text/Remington_Rand/Univac.Flowmatic.1957.102646140.pdf

[28]

Eric L. Seidel, Huma Sibghat, Kamalika Chaudhuri, Westley Weimer, and Ranjit Jhala. Learning to blame: localizing novice type errors with data-driven diagnosis. Proceedings of the ACM on Programming Languages, 1(OOPSLA):60:1–60:27, 2017. doi:10.1145/3138818

[29]

Ehud Y. Shapiro. Algorithmic Program Debugging. ACM Distinguished Dissertation. MIT Press, Cambridge, MA, 1983

[30]

Ben Shneiderman. Exploratory experiments in programmer behavior. International Journal of Computer & Information Sciences, 5(2):123–143, 1976. doi:10.1007/BF00975629

[31]

Elliot Soloway and Kate Ehrlich. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, SE-10(5):595–609, 1984. doi:10.1109/TSE.1984.5010283

[32]

Barbee E. Teasley. The effects of naming style and expertise on program comprehension. International Journal of Human-Computer Studies, 40(5):757–770, 1994. doi:10.1006/ijhc.1994.1036

[33]

T. J. Terpstra. The asymptotic normality and consistency of Kendall’s test against trend, when ties are present in one ranking. Indagationes Mathematicae, 14:327–333, 1952

[34]

Edward R. Tufte. Beautiful Evidence. Graphics Press, Cheshire, Connecticut, 2006

[35]

Mitchell Wand. Finding the source of type errors. In Conference Record of the 13th Annual ACM Symposium on Principles of Programming Languages (POPL ’86), pages 38–43, St. Petersburg Beach, Florida, USA, 1986. ACM Press. doi:10.1145/512644.512648

[36]

John Wrenn and Shriram Krishnamurthi. Executable examples for programming problem comprehension. In SIGCSE International Computing Education Research Conference, 2019

[37]

Baijun Wu, John Peter Campora III, and Sheng Chen. Learning user friendly type-error messages. Proceedings of the ACM on Programming Languages, 1(OOPSLA):106:1–106:29, 2017. doi:10.1145/3133930

[38]

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14–20, 2023, pages 1482–1494. IEEE, 2023. doi:10.1109/ICSE48619.2023.00129

[39]

Andy B. Yoo, Morris A. Jette, and Mark Grondona. SLURM: Simple Linux utility for resource management. In Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003, Revised Papers, volume 2862 of Lecture Notes in Computer Scienc...

work page doi:10.1007/10968987_3 2003