ChatGPT: Friend or Foe When Comprehending and Changing Unfamiliar Code
Pith reviewed 2026-05-12 04:21 UTC · model grok-4.3
The pith
AI assistance leaves all detailed coding problem-solving steps intact while changing how developers get stuck and recover.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Developers in the AI group repeatedly turned to the tool to offload aspects of the process, yet every one of the twenty-five detailed problem-solving behaviors appeared in both the AI and non-AI groups. Nine of the ten participants became stuck during the task, but the groups differed in how participants became stuck and how they recovered. The authors catalog seven distinct causes of getting stuck and note specific instances in which AI support either aided or impeded recovery.
What carries the argument
Polya's four problem-solving phases combined with twenty-five inductively generated behavior codes, applied to triangulated data from think-aloud sessions, code changes, web searches, and LLM prompts.
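The headline finding is, at bottom, a set-coverage check: the union of behavior codes observed in each group must equal the full inventory of 25. A minimal sketch, with hypothetical participants and placeholder code labels (the paper's actual code names are not listed here):

```python
# Stand-ins for the 25 inductively generated behavior codes.
ALL_CODES = {f"B{i:02d}" for i in range(1, 26)}

# Hypothetical observations: participant -> (group, codes observed).
observations = {
    "P1": ("AI", {"B01", "B02", "B03"}),
    "P2": ("AI", ALL_CODES - {"B03"}),
    "P6": ("noAI", ALL_CODES),
}

def codes_by_group(obs):
    """Union of behavior codes observed across all participants in each group."""
    groups = {}
    for group, codes in obs.values():
        groups.setdefault(group, set()).update(codes)
    return groups

# The paper's claim corresponds to both unions covering all 25 codes.
for group, seen in codes_by_group(observations).items():
    missing = ALL_CODES - seen
    print(group, "covers all codes" if not missing else f"missing {sorted(missing)}")
```

Note that coverage at the group level (the union) is a much weaker property than per-participant coverage, which is one reason the small-n caveat below matters.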
If this is right
- All twenty-five detailed problem-solving behaviors remain present even when developers have access to AI for offloading work.
- Developers encounter seven distinct causes of becoming stuck while extending unfamiliar code.
- AI can either facilitate or obstruct recovery from stuck states depending on the cause.
- Stuck and unstuck patterns differ between developers who use AI and those who do not.
Where Pith is reading between the lines
- Tool designers could add features that support recovery from each of the seven stuck causes without introducing new ones.
- Training for developers might emphasize strategies for using AI that avoid the hindering stuck patterns observed here.
- The same comparison of behaviors could be repeated on larger, multi-file changes to test whether the preservation of all steps continues.
- Teams adopting AI might track which stuck causes appear most often in their workflow and adjust prompts or processes accordingly.
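The tracking suggestion in the last bullet amounts to tallying tagged stuck events over time. A minimal sketch, where the cause labels are placeholders rather than the paper's actual seven causes:

```python
from collections import Counter

# Hypothetical log of stuck events: (participant, cause tag).
stuck_events = [
    ("P1", "missing context"),
    ("P2", "misleading AI suggestion"),
    ("P1", "missing context"),
    ("P4", "unfamiliar API"),
]

# Tally causes so a team can see which ones dominate their workflow.
cause_counts = Counter(cause for _, cause in stuck_events)
for cause, n in cause_counts.most_common():
    print(f"{cause}: {n}")
```

Ranking by frequency is what would let a team target prompt or process changes at the most common cause first.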
Load-bearing premise
That the problem-solving behaviors and stuck patterns seen in advanced students on one lab task will hold for professional developers on real projects.
What would settle it
A study that records professional developers working on actual codebases and checks whether the same twenty-five behaviors appear in both AI and non-AI conditions along with the same seven stuck causes and AI effects on recovery.
read the original abstract
A rapidly growing body of research is examining how LLMs influence developers when they code. To date, this research has tended to focus on productivity and code quality outcomes, rather than the underlying cognitive processes involved in programming. To address this gap, we report on the results of an exploratory laboratory study of ten advanced student developers (five with support from AI and five without) who had to make a non-trivial extension to a sizable software system. Leveraging Polya's four problem-solving phases and 25 inductively-generated codes detailing distinct problem-solving behaviors as the primary lenses, we examined: (1) how AI impacted the problem-solving approach the developers used to solve the programming task, and (2) how AI impacted their progress when they became stuck. For the analysis, we triangulated data across multiple sources (e.g., think-aloud, code changes, web searches, and LLM prompts). Unexpectedly, while developers in the AI group repeatedly turned to the AI tool to offload certain aspects of the process, all detailed problem-solving behaviors appeared in both groups. We also found that nine out of ten participants found themselves stuck in their work, but with key differences in how they became stuck and unstuck. We highlight seven distinct causes for being stuck and highlight how AI in some cases helped and in other cases hindered becoming unstuck.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an exploratory laboratory study with ten advanced student developers (five with ChatGPT support and five without) performing a non-trivial extension task on a sizable software system. Using Polya's four problem-solving phases and 25 inductively generated codes for distinct behaviors, the authors triangulate think-aloud protocols, code changes, web searches, and LLM prompts to examine (1) AI's impact on problem-solving approaches and (2) how AI affects progress when stuck. Key claims are that all 25 behaviors appeared in both groups (with AI used to offload aspects of the process) and that nine of ten participants became stuck, with differences in the seven identified causes and in how they became unstuck.
Significance. If the patterns hold, the work fills a gap by focusing on cognitive processes rather than productivity or code quality metrics alone. Strengths include data triangulation across multiple sources and the inductive generation of a behavior inventory grounded in Polya's framework. This could inform the design of AI coding tools by suggesting they supplement rather than replace core problem-solving behaviors. The exploratory design and small sample, however, limit the strength of claims about invariance of the behavioral repertoire.
major comments (3)
- [Methods and Results] Methods and Results sections: The central claim that all 25 problem-solving behaviors appeared in both the AI and non-AI groups rests on a sample of only five participants per arm. With this n, the absence of observed differences may reflect limited opportunity to surface variations rather than true equivalence of the behavioral repertoires.
- [Discussion] Discussion: The interpretation that AI merely offloads without altering underlying problem-solving behaviors is load-bearing for the 'friend or foe' framing, yet the design uses advanced students on one controlled lab task. This does not secure generalizability to professional developers working on real, unfamiliar codebases, as a different task or participant pool could yield different distributions of behaviors or stuck triggers.
- [Analysis] Analysis: Full details on the inductive generation of the 25 codes, their application, and any inter-rater reliability metrics are not provided. This is load-bearing for the validity of the behavior inventory that underpins the equivalence finding across groups.
minor comments (2)
- [Abstract] Abstract: Specify whether the single participant who did not become stuck was in the AI or non-AI group, as this detail could illuminate the reported differences in stuck/unstuck patterns.
- The manuscript would benefit from a dedicated limitations subsection that explicitly addresses sample size, participant expertise, and task specificity in relation to the generalizability of the seven stuck causes.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and for recognizing the value of our exploratory focus on cognitive processes and data triangulation. We respond point-by-point to the major comments below, agreeing where the critique is valid and outlining specific revisions to address each concern while preserving the integrity of our findings.
read point-by-point responses
- Referee: [Methods and Results] Methods and Results sections: The central claim that all 25 problem-solving behaviors appeared in both the AI and non-AI groups rests on a sample of only five participants per arm. With this n, the absence of observed differences may reflect limited opportunity to surface variations rather than true equivalence of the behavioral repertoires.
Authors: We agree that the small sample (n=5 per group) means we cannot claim equivalence of behavioral repertoires; the absence of group-specific behaviors in this study may simply reflect limited opportunity to observe variations. Our original phrasing reported an observation from the data rather than a general claim, but we will revise the Methods and Results sections (and update the abstract) to explicitly frame this as an exploratory finding, noting that larger samples could surface differences not seen here. This change will be made without altering the reported observations. revision: yes
- Referee: [Discussion] Discussion: The interpretation that AI merely offloads without altering underlying problem-solving behaviors is load-bearing for the 'friend or foe' framing, yet the design uses advanced students on one controlled lab task. This does not secure generalizability to professional developers working on real, unfamiliar codebases, as a different task or participant pool could yield different distributions of behaviors or stuck triggers.
Authors: We acknowledge the limitation in generalizability. The study involved advanced students on a single controlled extension task, so patterns of offloading, stuck points, and recovery may differ for professionals on real, large-scale codebases. We will revise the Discussion to add an explicit limitations subsection that discusses the participant pool, task constraints, and the need for future work with industry developers. We will also temper the 'friend or foe' framing to present the results as contextual insights that can still inform AI tool design, rather than broad generalizations. revision: yes
- Referee: [Analysis] Analysis: Full details on the inductive generation of the 25 codes, their application, and any inter-rater reliability metrics are not provided. This is load-bearing for the validity of the behavior inventory that underpins the equivalence finding across groups.
Authors: We will expand the Analysis section to include full details on the inductive process. The 25 codes were developed through iterative open coding of think-aloud protocols, code changes, web searches, and LLM prompts, using Polya's phases as an organizing lens; two authors independently coded initial data and met to refine the codebook through discussion until consensus. We did not compute formal inter-rater reliability metrics given the exploratory qualitative design, but we will describe the multi-author review and consensus process in detail and add a supplementary table with code definitions and examples to support transparency and allow assessment of the inventory. revision: yes
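The consensus process described in the response could be complemented by a standard agreement statistic. As a sketch of how Cohen's kappa would be computed over two coders' labels for the same transcript segments (the segment labels here are hypothetical, not drawn from the study):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same segments."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both coders independently pick the same code.
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two coders to six transcript segments.
coder1 = ["plan", "plan", "execute", "review", "execute", "plan"]
coder2 = ["plan", "execute", "execute", "review", "execute", "plan"]
print(round(cohens_kappa(coder1, coder2), 3))  # → 0.739
```

Reporting such a statistic for an initial independently coded subset, before the consensus rounds, is a common way to let readers assess a codebook's reliability without abandoning the consensus-based design.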
Circularity Check
No circularity: purely empirical qualitative study grounded in participant data
full rationale
The paper is an exploratory laboratory study that collects think-aloud, code-change, search, and prompt logs from ten participants, then applies Polya's phases and 25 inductively generated codes to describe observed behaviors. No equations, fitted parameters, derivations, or self-referential definitions exist. All claims (e.g., all 25 behaviors appearing in both groups, differences in stuck causes) are direct summaries of the collected data rather than reductions to prior inputs or self-citations. Standard qualitative triangulation does not meet any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Polya's four problem-solving phases apply to software development tasks involving unfamiliar code
- domain assumption Inductively generated codes from think-aloud protocols can reliably distinguish distinct problem-solving behaviors