
arxiv: 2606.30549 · v1 · pith:UA7TRYQPnew · submitted 2026-06-29 · 💻 cs.HC · cs.AI · cs.SE

To Tab or Not to Tab: Measuring Critical Engagement in AI Code Completion Tools Using Behavioral Signals and Attention Checks

Pith reviewed 2026-06-30 04:45 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.SE
keywords AI code completion · critical engagement · attention checks · behavioral metrics · programming education · tab accept · dwell time · reflective practice

The pith

Higher tab-accept rates in AI code completion tools are associated with weaker performance on attention checks that probe critical evaluation of suggestions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Clover, a logging tool that records how students interact with AI code suggestions and inserts attention checks to test whether they are reflecting on those suggestions. It defines a set of behavioral metrics drawn from existing literature, including how often users press tab to accept a suggestion and how long they dwell on a suggestion before acting. Analysis of student sessions shows that frequent tab accepts correlate with lower scores on the attention checks, while longer dwell times correlate with higher scores. The work argues that these patterns can indicate whether students are critically reviewing AI output rather than accepting it at face value during programming tasks.

Core claim

Higher rates of tab accepts were associated with lower attention check performance, while increased dwell time was associated with higher attention check performance, suggesting that behavioral interaction data can serve as signals of reflective engagement with AI code suggestions.
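As a rough illustration of how an association like this can be quantified, the sketch below computes a Pearson correlation between per-session tab-accept rates and attention-check scores. The abstract does not name the statistic behind the claim, so the choice of Pearson's r is an assumption, and the numbers are invented toy values, not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Invented toy values for five sessions (NOT the paper's data).
accept_rates = [0.2, 0.4, 0.5, 0.7, 0.9]   # fraction of suggestions accepted with Tab
check_scores = [0.9, 0.8, 0.6, 0.5, 0.3]   # fraction of attention checks answered correctly

r = pearson_r(accept_rates, check_scores)
print(f"r = {r:.2f}")  # negative here: higher accept rate, lower check score
```

A negative coefficient on data like this matches the direction of the reported association. It says nothing about cause, which is the concern the referee report raises.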

What carries the argument

Clover, a code completion tool that logs interactions with suggestions and deploys attention checks, paired with a taxonomy of behavioral metrics such as tab-accept rate and dwell time.
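The two metrics named here can be made concrete with a small sketch. The event schema below (`SuggestionEvent`, its fields, and the action labels) is hypothetical and chosen only for illustration. It is not Clover's actual log format, which this summary does not describe.

```python
from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    """One suggestion shown to a student (hypothetical schema, not Clover's)."""
    shown_at: float     # seconds since session start when the suggestion appeared
    resolved_at: float  # seconds since session start when the student acted on it
    action: str         # assumed labels: "tab_accept", "reject", or "ignore"

def tab_accept_rate(events):
    """Fraction of shown suggestions that were accepted with Tab."""
    if not events:
        return 0.0
    accepted = sum(1 for e in events if e.action == "tab_accept")
    return accepted / len(events)

def mean_dwell_time(events):
    """Average seconds between a suggestion appearing and the student acting."""
    if not events:
        return 0.0
    return sum(e.resolved_at - e.shown_at for e in events) / len(events)

# Invented example session with four suggestions.
session = [
    SuggestionEvent(10.0, 10.5, "tab_accept"),
    SuggestionEvent(25.0, 31.0, "reject"),
    SuggestionEvent(40.0, 40.5, "tab_accept"),
    SuggestionEvent(58.0, 61.0, "tab_accept"),
]
rate = tab_accept_rate(session)    # 3 of 4 suggestions accepted
dwell = mean_dwell_time(session)   # mean of 0.5, 6.0, 0.5, and 3.0 seconds
```

Both functions return 0.0 for an empty session so that sessions without suggestions do not raise. Whether such sessions should instead be excluded is an analysis decision that this sketch leaves open.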

If this is right

  • Interaction logs from AI coding tools can be used to identify students who may not be critically evaluating suggestions.
  • Attention checks embedded in the coding workflow can surface moments of low engagement during live programming sessions.
  • Designers of AI coding assistants could surface dwell-time or accept-rate feedback to encourage more deliberate review of suggestions.
  • Programming educators could incorporate process data alongside final code to assess how students are using AI assistance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the correlations hold across settings, real-time dashboards could alert instructors when a student's accept rate spikes without corresponding dwell time.
  • The same logging approach might be adapted to non-code AI tools, such as document editors, to track whether users pause to review generated text.
  • A follow-up could test whether prompting students with their own behavioral metrics changes how often they accept suggestions without review.
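The dashboard idea in the first bullet could start from a rule as simple as the one sketched below. The function name and both thresholds are hypothetical placeholders. Neither the paper nor this summary proposes specific cutoffs, so any real values would need to be calibrated on data.

```python
def needs_attention(accept_rate, mean_dwell_s, max_accept_rate=0.8, min_dwell_s=1.0):
    """True when suggestions are accepted often AND acted on quickly.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return accept_rate > max_accept_rate and mean_dwell_s < min_dwell_s

# Hypothetical per-student summaries: (tab-accept rate, mean dwell in seconds).
sessions = {
    "student_a": (0.95, 0.4),  # high accept rate, very short dwell
    "student_b": (0.90, 3.2),  # high accept rate, but pauses before acting
    "student_c": (0.30, 0.5),  # short dwell, but mostly rejecting suggestions
}

flagged = sorted(
    name for name, (rate, dwell) in sessions.items() if needs_attention(rate, dwell)
)
print(flagged)  # ['student_a']
```

Requiring both conditions avoids flagging students who accept many suggestions but visibly pause on them. That matches the "accept rate spikes without corresponding dwell time" wording above.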

Load-bearing premise

Attention checks provide a valid measure of reflective engagement with code suggestions rather than reflecting task difficulty, prior knowledge, or other unrelated factors.

What would settle it

An experiment that directly measures whether students can explain or correctly modify an accepted suggestion and checks whether that measure aligns with the reported correlations between tab accepts, dwell time, and attention-check scores.

Figures

Figures reproduced from arXiv: 2606.30549 by Antonio Lazaro, Ian Tyler Applebaum, Jessica Hutchison, Kenneth Angelikas, Kush Rakesh Patel, Nicholas Rucinski, Phuoc Nguyen, Rahad Arman Nabid, Stephen MacNeil.

Figure 1: Distribution of session-level interaction rates.
Original abstract

AI code completion tools, such as GitHub Copilot, provide students with code suggestions to help them write programs. However, recent qualitative studies suggest that students fail to critically evaluate these suggestions. We present Clover, a code completion tool that logs students' interactions with code suggestions and additionally offers attention checks to probe reflective engagement during programming tasks. We also develop a taxonomy of behavioral interaction metrics for AI-assisted programming, informed by literature. We analyzed relationships between interaction patterns, engagement with attention checks, and task performance. We observed that higher rates of tab accept were associated with lower attention check performance, while increased dwell time was associated with higher attention check performance. We conclude by discussing how programming process data and attention checks might support reflective engagement in AI-assisted programming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Clover, a code completion tool that logs student interactions with AI suggestions (e.g., tab accepts) and administers attention checks during programming tasks. It develops a literature-informed taxonomy of behavioral interaction metrics and analyzes relationships among these metrics, attention-check performance, and task outcomes. Key observations are that higher tab-accept rates correlate with lower attention-check performance while longer dwell times correlate with higher attention-check performance. The authors conclude that such process data and checks can help measure and support reflective engagement with AI-assisted code completion.

Significance. If the reported associations hold after appropriate controls, the work supplies an empirical basis and practical instrumentation for studying critical evaluation of AI suggestions in CS education, an area of growing importance. The taxonomy of behavioral metrics, grounded in prior literature, is a reusable contribution that could improve comparability across studies. The paper does not ship machine-checked proofs or parameter-free derivations, but the logging tool and attention-check approach represent a concrete methodological step forward.

major comments (2)
  1. [Abstract] Abstract: The central claim treats attention-check performance as a proxy for reflective engagement with the specific AI code suggestions. No details are supplied on how the checks were constructed to be independent of task difficulty, problem-statement comprehension, or participants' prior domain knowledge. This is load-bearing because the negative association with tab-accept rate could be driven by expertise differences rather than differences in scrutiny of the AI suggestions.
  2. [Abstract] Abstract (results paragraph): The directional associations are reported without reference to sample size, statistical tests, controls for expertise or task difficulty, or exclusion criteria. These omissions prevent evaluation of whether the data actually support the claimed relationships between behavioral signals and attention-check performance.
minor comments (1)
  1. The taxonomy section would benefit from explicit operational definitions and example log entries for each metric to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. We agree that the abstract requires expansion to include methodological details and statistical information, and we will revise it accordingly in the next version. We address each major comment below.

  1. Referee: [Abstract] Abstract: The central claim treats attention-check performance as a proxy for reflective engagement with the specific AI code suggestions. No details are supplied on how the checks were constructed to be independent of task difficulty, problem-statement comprehension, or participants' prior domain knowledge. This is load-bearing because the negative association with tab-accept rate could be driven by expertise differences rather than differences in scrutiny of the AI suggestions.

    Authors: We acknowledge the abstract lacks these details. The Methods section describes attention checks as embedded multiple-choice items on problem requirements and suggestion relevance, piloted for independence from syntax knowledge and administered at fixed points. Analyses include controls for self-reported prior experience and task difficulty via regression. We will revise the abstract to summarize check construction and note the controls, addressing the potential expertise confound. revision: yes

  2. Referee: [Abstract] Abstract (results paragraph): The directional associations are reported without reference to sample size, statistical tests, controls for expertise or task difficulty, or exclusion criteria. These omissions prevent evaluation of whether the data actually support the claimed relationships between behavioral signals and attention-check performance.

    Authors: The abstract is a concise summary; the Results section reports sample size (N=48), Pearson correlations and regressions with expertise/task controls, and exclusion criteria (e.g., incomplete tasks). We will expand the abstract to include sample size, mention of statistical tests and controls, and a note on exclusion criteria to support evaluation of the associations. revision: yes
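To make the expertise-confound concern concrete, the sketch below compares a raw correlation with a first-order partial correlation that holds a third variable fixed. Partial correlation is used here as a simpler stand-in for the regression controls mentioned in the responses. It is not the authors' analysis, and all values are invented toy numbers.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    denom_sq = (1 - r_xz ** 2) * (1 - r_yz ** 2)
    if denom_sq <= 0:
        raise ValueError("control variable is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / math.sqrt(denom_sq)

# Invented toy values for six participants (NOT the paper's data).
accept_rate = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
check_score = [0.3, 0.5, 0.5, 0.7, 0.8, 0.9]
experience = [1, 1, 2, 3, 3, 4]  # hypothetical self-reported experience level

raw = pearson_r(accept_rate, check_score)
adjusted = partial_r(accept_rate, check_score, experience)
print(f"raw r = {raw:.2f}, controlling for experience = {adjusted:.2f}")
```

If the adjusted value shrinks toward zero relative to the raw one, the control variable accounts for much of the association, which is the scenario the first major comment describes.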

Circularity Check

0 steps flagged

No circularity: empirical correlations from observed data


The paper reports an empirical study using a custom tool (Clover) to log interactions and attention checks during programming tasks. The central findings are observed associations (higher tab-accept rates linked to lower attention-check scores; higher dwell time linked to higher attention-check scores) derived from participant data analysis. No equations, fitted parameters, derivations, or predictions are described that reduce to inputs by construction. The taxonomy is informed by literature (standard practice), and no self-citation chains or uniqueness claims underpin the results. The findings rest on straightforward analysis of behavioral data, so no step of the argument depends on its own conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; all elements appear drawn from standard HCI methods and prior literature on code completion.


