pith. machine review for the scientific record.

arxiv: 2604.18538 · v1 · submitted 2026-04-20 · 💻 cs.HC

Recognition: unknown

Fast and Forgettable: A Controlled Study of Novices' Performance, Learning, Workload, and Emotion in AI-Assisted and Human Pair Programming Paradigms

Nicholas Gardella, James Prather, Juho Leinonen, Paul Denny, Raymond Pettit, Sara L. Riggs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:36 UTC · model grok-4.3

classification 💻 cs.HC
keywords pair programming · AI-assisted coding · novice performance · workload · emotion · learning retention · programming education

The pith

Novices performed significantly better and with lower workload using AI than with a human programming partner, but experienced more positive emotions with the human and showed a smaller (nonsignificant) performance drop at one-week retest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the effects of AI code generation on novice programmers by comparing it directly to human pair programming in a controlled setting. Participants completed timed coding tasks both in human pairs and with AI assistance, then were retested individually after one week, with performance, workload, emotion, and learning retention measured throughout. The results indicate better immediate performance and lower workload with AI, but more positive emotional responses with human partners and a smaller (nonsignificant) retest decrement in the human condition. A reader would care because this informs whether AI can effectively substitute for human collaboration in programming education without sacrificing learning outcomes. The study concludes that both approaches have distinct benefits worth integrating.

Core claim

Participants in the study wrote Python code under time pressure for 20 minutes each with a human teammate and individually with an AI code generation tool. They achieved significantly higher performance with the AI tool and reported lower workload across several dimensions, yet the human teammate produced significantly more positive and arousing emotional effects. Retest performance after one week showed a larger, though nonsignificant, decrement in the AI condition, hinting at differences in how the two paradigms support learning.

What carries the argument

The experimental setup where each participant completed coding tasks with both a human partner and an AI tool, comparing performance, workload, emotion, and one-week retest results.

If this is right

  • AI assistance enables higher coding performance for novices in short sessions.
  • Several aspects of perceived workload decrease when using AI for coding tasks.
  • Human partners elicit stronger positive emotional responses than AI tools.
  • Retention of programming knowledge may suffer more after using AI compared to human collaboration.
  • Pair programming should be reconsidered as a tool alongside AI in educational settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emotional advantage of human partners might lead to greater long-term engagement in programming courses.
  • Combining AI and human pair programming could potentially maximize both performance and retention benefits.
  • The observed effects might vary with different types of programming tasks or student experience levels.

Load-bearing premise

That differences observed in a single 20-minute session with only 22 participants after a one-week retention interval reflect meaningful and generalizable effects on learning in novice programming education.

What would settle it

Finding no significant difference in one-week retest performance between the AI and human conditions in a replication with more participants and sessions would undermine the claim of differential learning impacts.

Figures

Figures reproduced from arXiv: 2604.18538 by James Prather, Juho Leinonen, Nicholas Gardella, Paul Denny, Raymond Pettit, Sara L. Riggs.

Figure 1: Experimental design and counterbalancing permutations.
Figure 2: Procedure—from left to right, (a) shows the procedure for study Session 1, the first exposure to the programming tasks (in …)
Figure 3: From left to right, (a) shows a scatter plot of actual Session 1 team performance score and self-grade for one’s own contribution …
Figure 4: From left to right, (a) shows box plots of four workload sub-scales from the six-item NASA Task Load Index, excluding the …
read the original abstract

Code-generating Artificial Intelligence has gained popularity within both professional and educational programming settings over the past several years. While research and pedagogy are beginning to cope with this change, computing students are left to bear the unforeseen consequences of AI amidst a dearth of empirical evidence about its effects. Though pair programming between students is well studied and known to be beneficial to self-efficacy and academic achievement, it remains underutilized and further threatened by the proposition that AI can replace a human programming partner. In this paper, we present a controlled pair programming study with 22 participants who wrote Python code under time pressure in teams of two and individually with GitHub Copilot for 20 minutes each. They were incentivized by bonus compensation to balance performance with understanding and were retested individually on the programming tasks after a retention interval of one week. Subjective measures of workload and emotion as well as objective measures of performance and learning (retest performance) were collected. Results showed that participants performed significantly better with GitHub Copilot than their human teammate, and several dimensions of their workload were significantly reduced. However, the emotional effect of the human teammate was significantly more positive and arousing as compared to working with Copilot. Furthermore, there was a nonsignificant absolute retest performance reduction in the AI condition and a larger retest performance decrement in the AI condition. We recommend that educators strongly consider revisiting pair programming as an educational tool in addition to embracing modern AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a within-subjects controlled study with 22 novice participants comparing GitHub Copilot-assisted programming to human pair programming in 20-minute Python coding sessions under time pressure. Participants were retested after one week. The study finds significantly higher performance and lower workload with Copilot, but more positive and arousing emotions with human partners. Retest performance showed a nonsignificant reduction that was larger in the AI condition. The authors suggest educators should reconsider pair programming in light of AI tools.

Significance. If the findings are robust, this work offers valuable insights into the performance benefits and emotional costs of AI assistance versus human collaboration in programming education. The combination of objective performance data and subjective measures (NASA-TLX and SAM) is a strength, as is the focus on retention to address learning. However, the limited sample and task duration constrain the strength of conclusions about long-term learning effects.

major comments (2)
  1. [Results] Results section (and abstract): The claim of a 'larger retest performance decrement in the AI condition' is presented despite being nonsignificant; without effect sizes, confidence intervals, or a power analysis for the retention scores, this interpretation risks overstating evidence for differential learning impacts.
  2. [Methods] Methods section: The design uses only 22 participants for within-subjects comparisons across two 20-minute sessions with a one-week retention interval; given that learning/retention is central to the educational recommendations, the manuscript should include a power calculation or sensitivity analysis to justify that the sample can reliably detect the reported effects.
minor comments (2)
  1. [Abstract] Abstract: Significant results (e.g., performance and workload differences) are described without accompanying statistics such as t-values, p-values, or effect sizes; adding these would allow quicker assessment of the strength of the evidence.
  2. [Discussion] Discussion: The final recommendation to 'strongly consider revisiting pair programming' could be tempered to better reflect the mixed findings, specifically the performance advantage of Copilot alongside the emotional advantages of human partners.
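The effect-size reporting requested in major comment 1 could look like the following minimal sketch: Cohen's d_z for paired within-subjects scores, with an approximate 95% confidence interval on the mean difference. The scores below are made-up placeholders, not the study's data, and the normal quantile stands in for the exact t critical value (a slight underestimate at this sample size).

```python
# Hypothetical sketch (placeholder data, not the study's): Cohen's d_z
# for paired scores plus an approximate 95% CI on the mean difference.
from math import sqrt
from statistics import NormalDist, mean, stdev

def paired_dz_with_ci(before, after, conf=0.95):
    diffs = [a - b for a, b in zip(before, after)]
    n = len(diffs)
    m, sd = mean(diffs), stdev(diffs)
    dz = m / sd                               # Cohen's d_z = mean diff / SD of diffs
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # normal approx. to t critical value
    half = z * sd / sqrt(n)                   # CI half-width for the mean diff
    return dz, (m - half, m + half)

# illustrative percent-correct scores at Session 1 and one-week retest
s1 = [80, 75, 90, 60, 85, 70, 95, 65, 78, 88, 72, 82]
rt = [70, 72, 85, 55, 80, 68, 90, 60, 70, 84, 66, 75]
dz, ci = paired_dz_with_ci(s1, rt)  # positive dz = performance dropped at retest
```

A CI that excludes zero would support a real decrement; with the study's reported nonsignificance, the interval for the condition difference presumably straddles zero.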

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We have carefully considered the concerns regarding the interpretation of the nonsignificant retention results and the justification for the sample size. Below we respond point by point and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Results] Results section (and abstract): The claim of a 'larger retest performance decrement in the AI condition' is presented despite being nonsignificant; without effect sizes, confidence intervals, or a power analysis for the retention scores, this interpretation risks overstating evidence for differential learning impacts.

    Authors: We agree that the phrasing in the abstract and results could be interpreted as suggesting a meaningful differential effect on learning despite the nonsignificant result. In the revised version we will explicitly state that the retest performance reduction was nonsignificant in both conditions, with only a numerically larger (but nonsignificant) decrement observed in the AI condition. We will add Cohen's d effect sizes and 95% confidence intervals for the retention performance scores to allow readers to assess the magnitude and precision of the observed differences. We will also remove any language implying differential learning impacts and instead frame the retention data as exploratory and preliminary, consistent with the small sample and short retention interval. revision: yes

  2. Referee: [Methods] Methods section: The design uses only 22 participants for within-subjects comparisons across two 20-minute sessions with a one-week retention interval; given that learning/retention is central to the educational recommendations, the manuscript should include a power calculation or sensitivity analysis to justify that the sample can reliably detect the reported effects.

    Authors: We acknowledge that n=22 limits statistical power for detecting small effects on retention, even with the efficiency of the within-subjects design. We will add a sensitivity analysis to the Methods section that reports the minimum detectable effect size (with 80% power and alpha=0.05) for the paired retention comparison given our sample size and observed variance. This will provide transparent context for the retention findings. We note that the primary hypotheses concerned immediate performance, workload, and emotion, for which the within-subjects design yielded significant effects; the retention measure was included as an initial exploration of longer-term implications rather than the central outcome. revision: yes
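The sensitivity analysis the authors propose reduces to a short calculation. A sketch using the common normal approximation for the minimum detectable paired effect size follows; the exact noncentral-t computation (e.g. in G*Power) gives a slightly larger value at n = 22.

```python
# Sketch of the proposed sensitivity analysis: minimum detectable
# effect size (Cohen's d_z) for a paired t-test, via the normal
# approximation d ≈ (z_{1-α/2} + z_{power}) / sqrt(n).
from math import sqrt
from statistics import NormalDist

def min_detectable_dz(n, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) / sqrt(n)

# With the study's n = 22, only roughly medium-to-large paired effects
# are detectable at 80% power and alpha = 0.05:
d_min = min_detectable_dz(22)  # ≈ 0.60
```

This is why the referee's concern has teeth: a genuinely small differential retention effect would likely go undetected at this sample size.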

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurements

full rationale

The paper reports results from a controlled user study with 22 participants performing timed programming tasks under two conditions (human pair programming vs. GitHub Copilot), followed by subjective workload/emotion scales and a one-week retest. All claims derive from direct statistical comparisons of collected performance metrics, NASA-TLX scores, SAM emotion ratings, and retest deltas. No equations, derivations, fitted parameters, predictions, or first-principles arguments appear; the design contains no self-referential reductions or load-bearing self-citations that collapse claims to inputs by construction. This is a standard empirical HCI study whose validity rests on external data collection rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of experimental psychology and HCI research rather than new postulates.

axioms (1)
  • standard math — Standard statistical assumptions for significance testing (normality, independence of observations) in behavioral experiments.
    Invoked when reporting significant differences in performance and workload scores.

pith-pipeline@v0.9.0 · 5590 in / 1221 out tokens · 30827 ms · 2026-05-10T03:36:24.284517+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1]

    Kazunori Akizuki and Yukari Ohashi. 2015. Measurement of functional task difficulty during motor learning: What level of difficulty corresponds to the optimal challenge point?Human Movement Science43 (Oct. 2015), 107–117. doi:10.1016/j.humov.2015.07.007

  2. [2]

    Matin Amoozadeh, Daye Nam, Daniel Prol, Ali Alfageeh, James Prather, Michael Hilton, Sruti Srinivasa Ragavan, and Amin Alipour. 2024. Student-AI Interaction: A Case Study of CS1 students. InProceedings of the 24th Koli Calling International Conference on Computing Education Research (Koli Calling ’24). Association for Computing Machinery, New York, NY, US...

  3. [3]

    Lisanne Bainbridge. 1983. Ironies of automation. InAnalysis, design and evaluation of man–machine systems. Elsevier, 129–135

  4. [4]

    André Barcaui. 2025. ChatGPT as a cognitive crutch: Evidence from a randomized controlled trial on knowledge retention.Social Sciences & Humanities Open12 (2025), 102287. doi:10.1016/j.ssaho.2025.102287

  5. [5]

    James, and Nadia Polikarpova

    Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages7, OOPSLA1 (2023), 85–111. Number: OOPSLA1

  6. [6]

    Basawapatna, Alexander Repenning, Kyu Han Koh, and Hilarie Nickerson

    Ashok R. Basawapatna, Alexander Repenning, Kyu Han Koh, and Hilarie Nickerson. 2013. The zones of proximal flow: guiding students through a space of computational thinking skills and challenges. InProceedings of the ninth annual international ACM conference on International computing education research. ACM, San Diego San California USA, 67–74. doi:10.114...

  7. [7]

    2004.Extreme Programming Explained: Embrace Change

    Kent Beck, Cynthia Andres, and O’Reilly Online Learning: Academic/Public Library Edition. 2004.Extreme Programming Explained: Embrace Change. Addison Wesley Professional, Boston. http://RE5QY4SB7X.search.serialssolutions.com/?V=1.0&L=RE5QY4SB7X&S=JCs&C=TC0000074886&T=marc

  8. [8]

    and Denny, Paul and Finnie-Ansley, James and Luxton-Reilly, Andrew and Prather, James and Santos, Eddie Antonio , title =

    Brett A. Becker, Paul Denny, James Finnie-Ansley, Andrew Luxton-Reilly, James Prather, and Eddie Antonio Santos. 2023. Programming Is Hard - Or at Least It Used to Be: Educational Opportunities and Challenges of AI Code Generation. InProceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE 2023). Association for Computing...

  9. [9]

    Ben-Shachar, Daniel Lüdecke, and Dominique Makowski

    Mattan S. Ben-Shachar, Daniel Lüdecke, and Dominique Makowski. 2020. effectsize: Estimation of Effect Size Indices and Standardized Parameters. Journal of Open Source Software5, 56 (2020), 2815. doi:10.21105/joss.02815

  10. [10]

    Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing.Journal of the Royal statistical society: series B (Methodological)57, 1 (1995), 289–300. Number: 1

  11. [11]

    Simon Berger and Julian Voigt. 2025. My Way or the AI Way? How Autonomy Frustration Undermines Creative Performance Through Editing AI Advice.How Autonomy Frustration Undermines Creative Performance Through Editing AI Advice (April 28, 2025)(2025)

  12. [12]

    Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2023. Taking flight with copilot.Commun. ACM66, 6 (2023), 56–62

  13. [13]

    Robert A. Bjork. 1994. Memory and Metamemory Considerations in the Training of Human Beings. InMetacognition: Knowing About Knowing, Janet Metcalfe and Arthur P. Shimamura (Eds.). MIT Press, Cambridge, MA, 185–205

  14. [14]

    Colin Cameron, Jonah B

    A. Colin Cameron, Jonah B. Gelbach, and Douglas L. Miller. 2008. Bootstrap-based improvements for inference with clustered errors.The review of economics and statistics90, 3 (2008), 414–427. https://direct.mit.edu/rest/article-abstract/90/3/414/57731

  15. [15]

    Colin Cameron and Douglas L

    A. Colin Cameron and Douglas L. Miller. 2015. A practitioner’s guide to cluster-robust inference.Journal of human resources50, 2 (2015), 317–372. https://jhr.uwpress.org/content/50/2/317.short

  16. [16]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, and Greg Brockman. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  17. [17]

    Myra Cheng, Cinoo Lee, Pranav Khadpe, Sunny Yu, Dyllan Han, and Dan Jurafsky. 2025. Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence. doi:10.48550/arXiv.2510.01395 arXiv:2510.01395 [cs]

  18. [18]

    2013.Statistical power analysis for the behavioral sciences

    Jacob Cohen. 2013.Statistical power analysis for the behavioral sciences. Academic press

  19. [19]

    Livesey, and Justin A

    Ben Colagiuri, Evan J. Livesey, and Justin A. Harris. 2011. Can expectancies produce placebo effects for implicit learning?Psychonomic Bulletin & Review18, 2 (April 2011), 399–405. doi:10.3758/s13423-010-0041-1

  20. [20]

    2013.Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis

    Geoff Cumming. 2013.Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge. https://api.taylorfrancis.com/ content/books/mono/download?identifierName=doi&identifierValue=10.4324/9780203807002&type=googlepdf Manuscript submitted to ACM 20 Gardella et al

  21. [21]

    Elise Deitrick, R Benjamin Shapiro, and Brian Gravel. 2016. How do we assess equity in programming pairs? Singapore: International Society of the Learning Sciences

  22. [22]

    Becker, and Brent N

    Paul Denny, Juho Leinonen, James Prather, Andrew Luxton-Reilly, Thezyrie Amarouche, Brett A. Becker, and Brent N. Reeves. 2024. Prompt Problems: A New Programming Exercise for the Generative AI Era. InProceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE 2024). Association for Computing Machinery, New York, NY, USA, 29...

  23. [23]

    Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N

    Paul Denny, James Prather, Brett A. Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N. Reeves, Eddie Antonio Santos, and Sami Sarsa. 2024. Computing Education in the Era of Generative AI.Commun. ACM67, 2 (Jan. 2024), 56–67. doi:10.1145/3624720

  24. [24]

    Maria TM Dijkstra and Astrid C. Homan. 2016. Engaging in rather than disengaging from stress: Effective coping and perceived control.Frontiers in psychology7 (2016), 1415. https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2016.01415/full

  25. [25]

    Guangrui Fan, Dandan Liu, Rui Zhang, and Lihu Pan. 2025. The impact of AI-assisted pair programming on student motivation, programming anxiety, collaborative learning, and programming performance: a comparative study with traditional pair programming and individual approaches. International Journal of STEM Education12, 1 (March 2025), 16. doi:10.1186/s405...

  26. [26]

    James Finnie-Ansley, Paul Denny, Andrew Luxton-Reilly, Eddie Antonio Santos, James Prather, and Brett A. Becker. 2023. My AI Wants to Know if This Will Be on the Exam: Testing OpenAI’s Codex on CS2 Programming Exercises. InProceedings of the 25th Australasian Computing Education Conference (ACE ’23). Association for Computing Machinery, New York, NY, USA,...

  27. [27]

    2004.The Midnight Disease: The Drive to Write, Writer’s Block, and the Creative Brain

    Alice Weaver Flaherty. 2004.The Midnight Disease: The Drive to Write, Writer’s Block, and the Creative Brain. Houghton Mifflin, Boston

  28. [28]

    Nat Friedman. 2021. Introducing GitHub Copilot: your AI pair programmer. https://github.blog/2021-06-29-introducing-github-copilot-ai-pair- programmer/

  29. [29]

    Nicholas Gardella, Raymond Pettit, and Sara L. Riggs. 2024. Performance, Workload, Emotion, and Self-Efficacy of Novice Programmers Using AI Code Generation. InProceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1. ACM, Milan Italy, 290–296. doi:10.1145/3649217.3653615

  30. [30]

    Nicholas Gardella, Joseph Shelton, Isabella Graßl, and Sara Riggs. 2025. HBCU Student Perspectives on Identity, Persistence, and Code-Generating AI in CS Education: A Case Study. InProceedings of the 25th Koli Calling International Conference on Computing Education Research. ACM, Koli Finland, 1–12. doi:10.1145/3769994.3770015

  31. [31]

    Gian Luca Scoccia. 2023. Exploring Early Adopters’ Perceptions of ChatGPT as a Code Generation Tool. (Sept. 2023). doi:10.1109/asew60602.2023.00016 MAG ID: 4388214696

  32. [32]

    Brian Hanks, Sue Fitzgerald, Renée McCauley, Laurie Murphy, and Carol Zander. 2011. Pair programming in education: A literature review.Computer Science Education21, 2 (2011), 135–173

  33. [33]

    Hannay, Tore Dybå, Erik Arisholm, and Dag I

    Jo E. Hannay, Tore Dybå, Erik Arisholm, and Dag I. K. Sjøberg. 2009. The effectiveness of pair programming: A meta-analysis.Information and Software Technology51, 7 (July 2009), 1110–1122. doi:10.1016/j.infsof.2009.02.001

  34. [34]

    Hart and Lowell E

    Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in psychology. Vol. 52. Elsevier, 139–183

  35. [35]

    Anja Hawlitschek, Sarah Berndt, and Sandra Schulz. 2023. Empirical research on pair programming in higher education: a literature review. Computer Science Education33, 3 (July 2023), 400–428. doi:10.1080/08993408.2022.2039504

  36. [36]

    Toni Honicke and Jaclyn Broadbent. 2016. The influence of academic self-efficacy on academic performance: A systematic review.Educational research review17 (2016), 63–84. https://www.sciencedirect.com/science/article/pii/S1747938X15000639

  37. [37]

    Irene Hou, Owen Man, Kate Hamilton, Srishty Muthusekaran, Jeffin Johnykutty, Leili Zadeh, and Stephen MacNeil. 2025. ’All Roads Lead to ChatGPT’: How Generative AI is Eroding Social Interactions and Student Learning Communities. InProceedings of the 30th ACM Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2025). Associat...

  38. [38]

    Irene Hou, Sophia Mettille, Owen Man, Zhuo Li, Cynthia Zastudil, and Stephen MacNeil. 2024. The Effects of Generative AI on Computing Students’ Help-Seeking Preferences. InProceedings of the 26th Australasian Computing Education Conference (ACE ’24). Association for Computing Machinery, New York, NY, USA, 39–48. doi:10.1145/3636243.3636248

  39. [39]

    Ross Ihaka and Robert Gentleman. 1996. R: A Language for Data Analysis and Graphics.Journal of Computational and Graphical Statistics5, 3 (Sept. 1996), 299–314. doi:10.1080/10618600.1996.10474713

  40. [40]

    Sumeet Jeswani, Akshay Mittal, and Soni Sewlani. 2025. Premature Trust: How Student Overreliance on GenAI is Seeding Tomorrow’s Security Gaps. InProceedings of the 26th ACM Annual Conference on Cybersecurity & Information Technology Education (SIGCITE ’25). Association for Computing Machinery, New York, NY, USA, 107–112. doi:10.1145/3769694.3771124

  41. [41]

    Martin Jonsson and Jakob Tholander. 2022. Cracking the code: Co-coding with AI in creative programming education. InProceedings of the 14th Conference on Creativity and Cognition (C&C ’22). Association for Computing Machinery, New York, NY, USA, 5–14. doi:10.1145/3527927.3532801

  42. [42]

    Gregor Jošt, Viktor Taneski, and Sašo Karakatič. 2024. The Impact of Large Language Models on Programming Education and Student Learning Outcomes.Applied Sciences14, 10 (Jan. 2024), 4115. doi:10.3390/app14104115 Number: 10

  43. [43]

    Maria Kallia. 2025. To Be, or to Be Otherwise? Silicon Souls in Search of Dasein and Authentic Human Engagement in the Age of Generative AI. In Proceedings of the 2025 ACM Conference on International Computing Education Research V.1 (ICER ’25). Association for Computing Machinery, New York, NY, USA, 240–255. doi:10.1145/3702652.3744223 Manuscript submitte...

  44. [44]

    Ericson, David Weintrop, and Tovi Grossman

    Majeed Kazemitabaar, Justin Chow, Carl Ka To Ma, Barbara J. Ericson, David Weintrop, and Tovi Grossman. 2023. Studying the effect of AI Code Generators on Supporting Novice Learners in Introductory Programming. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, ...

  45. [45]

    Majeed Kazemitabaar, Xinying Hou, Austin Henley, Barbara Jane Ericson, David Weintrop, and Tovi Grossman. 2024. How Novices Use LLM-based Code Generators to Solve CS1 Coding Tasks in a Self-Paced Learning Environment. InProceedings of the 23rd Koli Calling International Conference on Computing Education Research (Koli Calling ’23). Association for Computi...

  46. [46]

    Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, Austin Zachary Henley, Paul Denny, Michelle Craig, and Tovi Grossman. 2024. CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Comp...

  47. [47]

    András Komócsi, Gergely Csaba, Eszter Veronika Csöngei, Krisztina Fischer, Kristóf Filipánits, László Czopf, and Andrea Tamás. 2026. Academic performance and progression among near-peer tutors: A comparative analysis in undergraduate medical education.Medical Teacher48, 3 (March 2026), 394–404. doi:10.1080/0142159X.2025.2553658 _eprint: https://doi.org/10...

  48. [48]

    Sandeep Kaur Kuttal, Bali Ong, Kate Kwasny, and Peter Robe. 2021. Trade-offs for Substituting a Human with an Agent in a Pair Programming Context: The Good, the Bad, and the Ugly. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, USA, 1–20. doi:10.1145/3411764.3445659

  49. [49]

    Liang, Chenyang Yang, and Brad A

    Jenny T. Liang, Chenyang Yang, and Brad A. Myers. 2024. A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24). Association for Computing Machinery, New York, NY, USA, 1–13. doi:10.1145/3597503.3608128

  50. [50]

    Mark Liffiton, Brad E Sheese, Jaromir Savelka, and Paul Denny. 2023. CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes. InProceedings of the 23rd Koli Calling International Conference on Computing Education Research. ACM, Koli Finland, 1–11. doi:10.1145/3631802.3631830

  51. [51]

    Wenhan Lyu, Yimeng Wang, Yifan Sun, and Yixuan Zhang. 2025. Will Your Next Pair Programming Partner Be Human? An Empirical Evaluation of Generative AI as a Collaborative Teammate in a Semester-Long Classroom Setting. InProceedings of the Twelfth ACM Conference on Learning @ Scale. ACM, Palermo Italy, 83–94. doi:10.1145/3698205.3729544

  52. [52]

    MacKinnon, Morten Ørregaard Nielsen, and Matthew D

    James G. MacKinnon, Morten Ørregaard Nielsen, and Matthew D. Webb. 2023. Fast and reliable jackknife and bootstrap methods for cluster-robust inference.Journal of Applied Econometrics38, 5 (Aug. 2023), 671–694. doi:10.1002/jae.2969

  53. [53]

    Margulieux, James Prather, Brent N

    Lauren E. Margulieux, James Prather, Brent N. Reeves, Brett A. Becker, Gozde Cetin Uzun, Dastyni Loksa, Juho Leinonen, and Paul Denny. 2024. Self-Regulation, Self-Efficacy, and Fear of Failure Interactions with How Novices Use LLMs to Solve Programming Problems. InProceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1. ACM...

  54. [54]

    Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help With Code Understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ACM, Lisbon Portugal, 1–13. doi:10.1145/3597503.3639187

  55. [55]

    Akilan, P

    Palanichamy Naveen, T. Akilan, P. Manikandan, C. Swedheetha, and M. Saravanan. 2025. Comparative Review of Large Language Models. In2025 5th International Conference on Soft Computing for Security Applications (ICSCSA). IEEE, 2120–2124. https://ieeexplore.ieee.org/abstract/document/ 11170906/?casa_token=8DcrOMItU_cAAAAA:3qj-JPYir4hqrlJDkCm4GQL6QwDEiDgQh0v...

  56. [56]

    2021.The Extended Mind: The Power of Thinking Outside the Brain

    Annie Murphy Paul. 2021.The Extended Mind: The Power of Thinking Outside the Brain. Houghton Mifflin Harcourt, Boston, MA

  57. [57]

    Reinhard Pekrun. 2006. The Control-Value Theory of Achievement Emotions: Assumptions, Corollaries, and Implications for Educational Research and Practice.Educational Psychology Review18, 4 (Nov. 2006), 315–341. doi:10.1007/s10648-006-9029-9

  58. [58]

    Reinhard Pekrun. 2024. Control-Value Theory: From Achievement Emotion to a General Theory of Human Emotions.Educational Psychology Review36, 3 (Sept. 2024), 83. doi:10.1007/s10648-024-09909-7

  59. [59]

    Jacob Penney, Pawan Acharya, Peter Hilbert, Priyanka Parekh, Anita Sarma, Igor Steinmacher, and Marco Gerosa. 2025. Understanding Programming Students’ Help-Seeking Preferences in the Era of Generative AI. InProceedings of the ACM Global Computing Education Conference 2025 - Volume 1. ACM, Gaborone Botswana, 15–21. doi:10.1145/3736181.3747165

  60. [60]

    Reeves, Jaromir Savelka, IV Smith, David H., Sven Strickroth, and Daniel Zingaro

    James Prather, Juho Leinonen, Natalie Kiesler, Jamie Gorson Benario, Sam Lau, Stephen MacNeil, Narges Norouzi, Simone Opel, Vee Pettit, Leo Porter, Brent N. Reeves, Jaromir Savelka, IV Smith, David H., Sven Strickroth, and Daniel Zingaro. 2025. Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Too...

  61. [61]

    It’s Weird That it Knows What I Want

    James Prather, Brent N. Reeves, Paul Denny, Brett A. Becker, Juho Leinonen, Andrew Luxton-Reilly, Garrett Powell, James Finnie-Ansley, and Eddie Antonio Santos. 2024. “It’s Weird That it Knows What I Want”: Usability and Interactions with Copilot for Novice Programmers.ACM Transactions on Computer-Human Interaction31, 1 (Feb. 2024), 1–31. doi:10.1145/3617367

  62. [62]

    James Prather, Brent N. Reeves, Juho Leinonen, Stephen MacNeil, Arisoa S. Randrianasolo, Brett A. Becker, Bailey Kimmel, Jared Wright, and Ben Briggs. 2024. The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research - Volume 1. ACM, Melbourne VIC Au...

  63. [63]

    Ira J. Roseman. 1996. Appraisal Determinants of Emotions: Constructing a More Accurate and Comprehensive Theory. Cognition & Emotion 10, 3 (May 1996), 241–278. doi:10.1080/026999396380240

  64. [64]

    Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D. Weisz. 2023. The Programmer’s Assistant: Conversational Interaction with a Large Language Model for Software Development. In Proceedings of the 28th International Conference on Intelligent User Interfaces. ACM, Sydney, NSW, Australia, 491–514. doi:10.1145/3581641.3584037

  65. [65]

    James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology 39, 6 (1980), 1161. https://psycnet.apa.org/journals/psp/39/6/1161/

  66. [66]

    Norsaremah Salleh, Emilia Mendes, and John Grundy. 2011. Empirical Studies of Pair Programming for CS/SE Teaching in Higher Education: A Systematic Literature Review. IEEE Transactions on Software Engineering 37, 4 (July 2011), 509–525. doi:10.1109/TSE.2010.59

  67. [67]

    Benjamin T. Sharpe and Ian Tyndall. 2025. The Sustained Attention Paradox: A Critical Commentary on the Theoretical Impossibility of Perfect Vigilance. Cognitive Science 49, 4 (2025), e70061. doi:10.1111/cogs.70061

  68. [68]

    Abdulhadi Shoufan. 2023. Can Students without Prior Knowledge Use ChatGPT to Answer Test Questions? An Empirical Study. ACM Trans. Comput. Educ. 23, 4 (Dec. 2023), 45:1–45:29. doi:10.1145/3628162

  69. [69]

    Nancy Sommers. 1980. Revision Strategies of Student Writers and Experienced Adult Writers. College Composition & Communication 31, 4 (1980), 378–388. doi:10.58680/ccc198015930

  70. [70]

    Nancy Sommers. 1982. Responding to Student Writing. College Composition & Communication 33, 2 (1982), 148–156. doi:10.58680/ccc198215854

  71. [71]

    Lisa K. Son and Rajiv Sethi. 2006. Metacognitive Control and Optimal Learning. Cognitive Science 30, 4 (July 2006), 759–774. doi:10.1207/s15516709cog0000_74

  72. [72]

    Phil Steinhorst, Andrew Petersen, and Jan Vahrenhold. 2020. Revisiting Self-Efficacy in Introductory Programming. In Proceedings of the 2020 ACM Conference on International Computing Education Research. ACM, Virtual Event, New Zealand, 158–169. doi:10.1145/3372782.3406281

  73. [73]

    John Sweller. 2011. Cognitive load theory. In Psychology of Learning and Motivation. Vol. 55. Elsevier, 37–76. https://www.sciencedirect.com/science/article/pii/B9780123876911000028

  74. [74]

    Karthikeyan Umapathy and Albert D. Ritzhaupt. 2017. A meta-analysis of pair-programming in computer programming courses: Implications for educational practice. ACM Transactions on Computing Education (TOCE) 17, 4 (2017), 1–13

  75. [75]

    Annapurna Vadaparty, Daniel Zingaro, David H. Smith IV, Mounika Padala, Christine Alvarado, Jamie Gorson Benario, and Leo Porter. 2024. CS1-LLM: Integrating LLMs into CS1 Instruction. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2024). Association for Computing Machinery, New York, NY, USA, 297–303. doi...

  76. [76]

    Marcel Valový and Alena Buchalcevova. 2023. The Psychological Effects of AI-Assisted Programming on Students and Professionals. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME). 385–390. doi:10.1109/ICSME58846.2023.00050

  77. [77]

    Lev S. Vygotsky. 1978. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge, MA

  78. [78]

    Matthew D. Webb. 2023. Reworking wild bootstrap-based inference for clustered errors. Canadian Journal of Economics/Revue canadienne d’économique 56, 3 (Aug. 2023), 839–858. doi:10.1111/caje.12661

  79. [79]

    Justin D. Weisz, Michael Muller, Stephanie Houde, John T. Richards, Steven I. Ross, Fernando Martinez, Mayank Agarwal, and Kartik Talamadupula. 2021. Perfection Not Required? Human-AI Partnerships in Code Translation. 402–412. doi:10.1145/3397481.3450656

  80. [80]

    Yuankai Xue, Hanlin Chen, Gina R. Bai, Robert Tairas, and Yu Huang. 2024. Does ChatGPT Help With Introductory Programming? An Experiment of Students Using ChatGPT in CS1. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET ’24). Association for Computing Machinery, New York, NY,...
