pith. machine review for the scientific record.

arxiv: 2605.01104 · v1 · submitted 2026-05-01 · 💻 cs.SE · cs.CL · cs.HC

RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions

Pith reviewed 2026-05-09 18:35 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.HC
keywords AI-assisted programming · interaction capture · session replay · behavioral analysis · code editing · VS Code · developer workflow

The pith

RECAP merges AI chat logs with fine-grained code edits into one replayable timeline to support analyses impossible from isolated data sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Developers' interactions with AI coding tools span prompts, discarded attempts, and evolving code changes that no single log captures fully. RECAP records these elements passively inside VS Code, stitches them into a single timeline, and adds replay plus modular analysis tools. The result lets researchers trace exactly which prompt produced which edit and how strategies shift over time. A deployment in a university course gathered over two thousand prompts and eight thousand edits, demonstrating pattern detection that separate chat histories or version control data cannot achieve.

Core claim

RECAP is an open-source platform that passively records AI chat sessions and fine-grained code edits inside VS Code without disrupting workflow, merges the streams into a unified timeline for interactive replay, and supplies an extensible analysis layer that includes modules for behavioral classification and AI reliance measurement.

What carries the argument

The unified timeline that links each AI prompt directly to the code edits it triggered and to the developer's subsequent accept, reject, or revise actions.
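
To make that linkage concrete, here is one minimal way such a timeline could be represented and queried, written in TypeScript since RECAP lives inside VS Code. This is a hypothetical sketch, not RECAP's actual schema: the event shapes, field names, and the last-prompt-wins attribution heuristic are all assumptions for illustration.

```typescript
// Minimal sketch of a unified interaction timeline.
// All type and field names are hypothetical, not RECAP's schema.

type ChatEvent = {
  kind: "prompt" | "response";
  ts: number;   // epoch milliseconds
  text: string;
};

type EditEvent = {
  kind: "edit";
  ts: number;
  file: string;
  diff: string;
  action: "accept" | "reject" | "revise" | "manual";
};

type TimelineEvent = ChatEvent | EditEvent;

// Merge the two capture streams into one chronological timeline.
function mergeStreams(chat: ChatEvent[], edits: EditEvent[]): TimelineEvent[] {
  return [...chat, ...edits].sort((a, b) => a.ts - b.ts);
}

// Attribute each edit to the most recent preceding prompt: a naive
// last-prompt-wins heuristic, standing in for whatever linkage RECAP uses.
function linkEditsToPrompts(
  timeline: TimelineEvent[],
): Map<EditEvent, ChatEvent | null> {
  const links = new Map<EditEvent, ChatEvent | null>();
  let lastPrompt: ChatEvent | null = null;
  for (const ev of timeline) {
    if (ev.kind === "prompt") lastPrompt = ev;
    else if (ev.kind === "edit") links.set(ev, lastPrompt);
  }
  return links;
}
```

A real implementation would also have to handle concurrent chat threads, multi-file edits, and clock skew between the two capture streams, which is exactly the alignment question the referee raises below.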

If this is right

  • Analyses can classify developer behaviors using the full prompt-to-edit context rather than incomplete chat or git records alone.
  • AI reliance can be quantified by tracking acceptance rates, modification frequency, and strategy evolution across an entire multi-week project (a metric sketch follows this list).
  • Researchers gain the ability to replay sessions interactively to inspect what developers tried and discarded at each step.
  • New analysis modules can be added to the extensible layer without altering the underlying capture or replay mechanisms.
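
As a sketch of the reliance point above: once each edit carries an accept/reject/revise label linked to a prompt, the headline metrics reduce to simple counts over the timeline. The event shape and metric definitions here are illustrative assumptions, not the paper's actual reliance module.

```typescript
// Hypothetical reliance metrics over AI-attributed edits; the
// definitions are illustrative, not taken from the paper.

type EditAction = "accept" | "reject" | "revise" | "manual";
type EditEvent = { ts: number; action: EditAction };

type RelianceMetrics = {
  acceptanceRate: number;   // accepted AI edits / all AI-attributed edits
  modificationRate: number; // revised AI edits / all AI-attributed edits
};

function relianceMetrics(edits: EditEvent[]): RelianceMetrics {
  const aiEdits = edits.filter((e) => e.action !== "manual");
  const n = aiEdits.length || 1; // guard against empty sessions
  const count = (a: EditAction) =>
    aiEdits.filter((e) => e.action === a).length;
  return {
    acceptanceRate: count("accept") / n,
    modificationRate: count("revise") / n,
  };
}

// Strategy evolution across a multi-week project: compute the same
// metrics per work session and inspect the trend.
const relianceTrend = (sessions: EditEvent[][]): RelianceMetrics[] =>
  sessions.map(relianceMetrics);
```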

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same linked timeline could let tool designers identify common points where developers abandon AI suggestions and test interface changes to reduce those drop-offs.
  • Longitudinal deployment across semesters or teams would make it possible to measure whether prompting skill improves with repeated use of the same AI assistant.
  • Extending the platform to additional editors would allow direct comparison of interaction patterns between different development environments.

Load-bearing premise

Passive recording captures developer behavior exactly as it would occur without the tool present, and the linked data contains enough context for analysis modules to extract undistorted patterns.

What would settle it

A side-by-side comparison in which the same developers are observed by a human researcher while RECAP runs, checking whether recorded prompts, edits, and sequences match the observed actions and stated intentions.
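
Scripted, a minimal version of that check could align the observer's hand-coded action log against RECAP's recorded stream and report what fraction of observed actions have a recorded counterpart. Everything below, the event shape, the label vocabulary, the two-second tolerance, and the greedy matching rule, is an assumed protocol rather than one proposed in the paper.

```typescript
// Toy completeness check: does every action the human observer logged
// have a recorded counterpart with the same label within a small time
// window? All shapes and parameters are hypothetical.

type LoggedAction = { ts: number; label: string }; // e.g. "prompt", "accept"

function captureCoverage(
  observed: LoggedAction[],
  recorded: LoggedAction[],
  toleranceMs = 2000,
): number {
  const unmatched = new Set(recorded.map((_, i) => i));
  let hits = 0;
  for (const obs of observed) {
    // Greedily claim the first unclaimed recorded event that matches.
    const idx = recorded.findIndex(
      (rec, i) =>
        unmatched.has(i) &&
        rec.label === obs.label &&
        Math.abs(rec.ts - obs.ts) <= toleranceMs,
    );
    if (idx >= 0) {
      unmatched.delete(idx);
      hits++;
    }
  }
  return observed.length ? hits / observed.length : 1; // fraction matched
}
```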

Figures

Figures reproduced from arXiv: 2605.01104 by Keyu He, Qianou Ma, Tongshuang Wu, Valerie Chen, and Wayne Chi.

Figure 1. RECAP system architecture. The VS Code extension captures chat sessions and code edits in parallel and uploads them to cloud storage. The replay platform merges them into a unified timeline for interactive replay. Extensible analyses are example modules built on the timeline; researchers can plug in their own without modifying the capture layer. […] the extension never holds cloud credentials; it requests short-lived p…

Figure 2. RECAP Replay Viewer. Left: changed files for the selected commit (toggleable to a file-tree view). Center: unified diff view showing a code change attributed to Copilot. Right: chat panel with the corresponding AI conversation. Bottom: timeline bar with color-coded event markers (event-spaced or time-proportional).

Figure 3. Multi-student overview panel. Each row is […]

Figure 4. Distribution of prompt behavior categories […]

Figure 5. AI edit share across successive work sessions.

Figure 6. t-SNE projection of 2,034 student prompts colored by behavior category. Clusters reveal distinct interaction […]

Figure 7. Error-pasting loop in the replay viewer. Over 11 minutes the student pastes the same […]
original abstract

Understanding how developers interact with AI coding assistants requires more than chat logs or git histories in isolation; it requires reconstructing the full context: which prompt led to which edit, what the developer tried and discarded, and how their strategy evolved over time. We present RECAP (Replay and Examine Captured AI Programming), an open-source platform that (1) passively records AI chat sessions and fine-grained code edits inside VS Code without disrupting the developer's workflow, (2) merges them into a unified timeline for interactive session replay, and (3) exposes an extensible analysis layer, with example modules for behavioral classification and AI reliance measurement. Deployed in a university software engineering course, RECAP captured 2,034 prompts and 8,239 code edits from 41 students across a multi-week project. We demonstrate how the platform's linked data and replay capabilities enable analyses of developer-AI interaction patterns that no single data source could support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents RECAP, an open-source platform that passively captures AI chat sessions and fine-grained code edits inside VS Code, merges them into a unified timeline for interactive replay, and exposes an extensible analysis layer with example modules for behavioral classification and AI reliance measurement. Deployed in a university software engineering course, it captured 2,034 prompts and 8,239 code edits from 41 students; the central claim is that the linked data and replay capabilities enable analyses of developer-AI interaction patterns unattainable from any single data source in isolation.

Significance. A validated platform with demonstrated linkage between prompts and edits could meaningfully advance empirical work on AI-assisted programming by supporting timeline-based analyses that integrate chat context with code changes, discards, and strategy shifts. The open-source release and real deployment data are practical strengths that could facilitate replication and extension by other researchers.

major comments (2)
  1. [Abstract and analysis layer] Abstract and § on analysis modules: the claim that the platform enables analyses 'that no single data source could support' is not supported by any concrete worked example. The manuscript describes example modules for behavioral classification and AI reliance but supplies no specific result, pattern detected, or comparison against timestamp-aligned separate chat logs and edit histories showing that the explicit prompt-edit linkage (including discards and strategy shifts) is required to identify the pattern.
  2. [Deployment and evaluation] Deployment section: the reported figures (2,034 prompts, 8,239 edits from 41 students) are presented without any validation that the passive capture is complete, that no events were lost, or that the recording process does not alter developer behavior; this directly affects the soundness of the claimed new analyses.
minor comments (2)
  1. [Platform architecture] The description of the merging process into a unified timeline would benefit from additional technical detail on how timestamps and session boundaries are aligned across chat and edit streams (a generic sketch of one such approach follows these comments).
  2. [Analysis layer] Clarify the exact scope and output format of the 'extensible analysis layer' so readers can assess how new modules would integrate with the replay timeline.
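
On the first minor comment, one generic technique a revision could document is gap-based sessionization after normalizing both streams to a shared clock: merge all events into one timestamp-sorted list, then start a new session whenever the idle gap exceeds a threshold. The sketch below illustrates that generic approach with an arbitrary 30-minute threshold; it is not RECAP's documented merge procedure.

```typescript
// Generic gap-based sessionization over any timestamped events.
// The 30-minute idle threshold is an arbitrary assumption.

type Stamped = { ts: number };

function sessionize<T extends Stamped>(
  events: T[],
  maxGapMs = 30 * 60 * 1000,
): T[][] {
  const sorted = [...events].sort((a, b) => a.ts - b.ts);
  const sessions: T[][] = [];
  let current: T[] = [];
  for (const ev of sorted) {
    const prev = current[current.length - 1];
    if (prev && ev.ts - prev.ts > maxGapMs) {
      sessions.push(current); // idle gap closes the session
      current = [];
    }
    current.push(ev);
  }
  if (current.length) sessions.push(current);
  return sessions;
}
```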

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major point below, indicating where revisions will be made to strengthen the paper.

point-by-point responses
  1. Referee: [Abstract and analysis layer] Abstract and § on analysis modules: the claim that the platform enables analyses 'that no single data source could support' is not supported by any concrete worked example. The manuscript describes example modules for behavioral classification and AI reliance but supplies no specific result, pattern detected, or comparison against timestamp-aligned separate chat logs and edit histories showing that the explicit prompt-edit linkage (including discards and strategy shifts) is required to identify the pattern.

    Authors: We agree that the manuscript would benefit from a concrete worked example to substantiate the claim. While the analysis layer is described with modules for behavioral classification and AI reliance measurement, and the abstract states that linked data enables new analyses, no specific pattern or comparison demonstrating the necessity of the unified timeline (e.g., detecting a discard followed by a strategy shift) is provided from the deployment data. In the revised manuscript, we will add a worked example in the analysis section, using anonymized data from the student sessions to illustrate a pattern identifiable only through explicit prompt-edit linkage and replay, with a direct comparison to what separate timestamp-aligned sources would miss (a toy sketch of such a pattern appears after these responses). revision: yes

  2. Referee: [Deployment and evaluation] Deployment section: the reported figures (2,034 prompts, 8,239 edits from 41 students) are presented without any validation that the passive capture is complete, that no events were lost, or that the recording process does not alter developer behavior; this directly affects the soundness of the claimed new analyses.

    Authors: The referee is correct that the deployment section reports the event counts without explicit validation of completeness, data loss, or behavioral impact. The capture mechanism relies on passive VS Code extension hooks, which are intended to be non-intrusive, but we performed no controlled measurements or cross-checks (beyond basic logging) to confirm zero loss or no alteration. We will revise the deployment and evaluation sections to add a dedicated limitations paragraph discussing these issues, including potential failure modes (e.g., extension crashes or network drops for chat) and any indirect consistency checks feasible from the existing data. Full empirical validation would require new experiments outside the scope of the current deployment; we will frame this as a limitation and suggest it for future work rather than asserting soundness without evidence (a toy consistency check of this kind appears after these responses). revision: partial
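
Two illustrative sketches in the spirit of the revisions promised above. Neither reflects RECAP's actual schema or code; every event shape, field name, and threshold is an assumption invented for illustration.

First, the kind of worked example promised in response 1: a discard-then-reformulate pattern (a rejected AI edit immediately followed by a fresh prompt) is only visible when edit events carry accept/reject labels and share a clock with the chat stream; neither a chat log nor a git history alone contains both signals.

```typescript
// Toy detector for a discard-then-reformulate pattern. Requires the
// linked timeline; event shapes are hypothetical.

type Ev =
  | { kind: "prompt"; ts: number; text: string }
  | { kind: "edit"; ts: number; action: "accept" | "reject" | "revise" };

function discardThenReformulate(timeline: Ev[], windowMs = 120_000): number {
  let count = 0;
  for (let i = 0; i < timeline.length - 1; i++) {
    const a = timeline[i];
    const b = timeline[i + 1];
    if (
      a.kind === "edit" && a.action === "reject" &&
      b.kind === "prompt" && b.ts - a.ts <= windowMs
    ) {
      count++; // rejected suggestion was immediately re-prompted
    }
  }
  return count;
}
```

Second, an indirect consistency check of the kind mentioned in response 2: within a session, every AI-attributed edit should be preceded by at least one prompt, so orphaned AI edits hint at lost chat events.

```typescript
// Toy referential-integrity check on the linked data; orphaned
// AI-attributed edits suggest dropped chat events. Shapes hypothetical.

type CapturedEvent =
  | { kind: "prompt"; ts: number }
  | { kind: "edit"; ts: number; aiAttributed: boolean };

function orphanedAiEdits(session: CapturedEvent[]): number {
  let seenPrompt = false;
  let orphans = 0;
  for (const ev of [...session].sort((a, b) => a.ts - b.ts)) {
    if (ev.kind === "prompt") seenPrompt = true;
    else if (ev.aiAttributed && !seenPrompt) orphans++;
  }
  return orphans;
}
```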

Circularity Check

0 steps flagged

No circularity: direct system description without derivations or self-referential claims

full rationale

The paper is an engineering systems description of the RECAP platform. It contains no equations, fitted parameters, predictions, uniqueness theorems, or ansatzes. The central claim—that linked chat/edit data enables analyses impossible from isolated sources—is presented as a direct consequence of the platform's design and deployment (2,034 prompts, 8,239 edits), not derived from any prior result or self-citation chain. No load-bearing step reduces to its own inputs by construction. This is the expected non-finding for a non-theoretical implementation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and tool-building paper with no mathematical derivations, fitted parameters, or postulated theoretical entities.

pith-pipeline@v0.9.0 · 5477 in / 991 out tokens · 22088 ms · 2026-05-09T18:35:55.741889+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 24 canonical work pages · 2 internal anchors

  [1] Hannah McLean Babe, Sydney Nguyen, Yangtian Zi, Arjun Guha, Molly Q Feldman, and Carolyn Jane Anderson. 2024. StudentEval: A benchmark of student-written prompts for large language models of code. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8452–8474.
  [2] Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7(OOPSLA1):85–111.
  [3] Manaal Basha, Aimeê M Ribeiro, Jeena Javahar, Cleidson R B de Souza, and Gema Rodríguez-Pérez. 2025. CodeWatcher: IDE telemetry data extraction tool for understanding coding interactions with LLMs. arXiv [cs.SE]. https://doi.org/10.48550/arXiv.2510.11536
  [4] Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, and Sanmi Koyejo. 2026. SWE-chat: Coding agent interactions from real users in the wild. arXiv [cs.AI]. https://doi.org/10.48550/arXiv.2604.20779
  [5] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2023. Taking flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue, 20(6):35–57. https://doi.org/10.1145/3582083
  [6] Neil Christopher Charles Brown, Michael Kölling, Davin McCall, and Ian Utting. 2014. Blackbox: A large scale repository of novice programmers' activity. In Proceedings of the 45th ACM Technical Symposium on Computer Science Education, New York, NY, USA. ACM. https://doi.org/10.1145/2538862.2538924
  [7] Valerie Chen, Ameet Talwalkar, Robert Brennan, and Graham Neubig. 2025. Code with me or for me? How increasing AI automation transforms developer workflows. arXiv [cs.SE]. https://arxiv.org/abs/2507.08149
  [8] Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, and Ameet Talwalkar. 2025. Copilot Arena: A platform for code LLM evaluation in the wild. Preprint, arXiv:2502.09328. https://arxiv.org/abs/2502.09328
  [9] Cursor. 2026. agent-trace: A standard format for tracing AI-generated code. https://github.com/cursor/agent-trace
  [10] Thomas Dohmke. 2023. GitHub Copilot X: The AI-powered developer experience. https://github.blog/2023-03-22-github-copilot-x-the-ai-powered-developer-experience/. Accessed: 2023-09-05.
  [11] Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. 2026. Speed at the cost of quality: How Cursor AI increases short-term velocity and long-term complexity in open-source projects. arXiv [cs.SE]. https://doi.org/10.48550/arXiv.2511.04427
  [12] Daniil Karol, Elizaveta Artser, Ilya Vlasov, Yaroslav Golubev, Hieke Keuning, and Anastasiia Birillo. 2025. KOALA: A configurable tool for collecting IDE data when solving programming tasks. arXiv [cs.SE]. https://arxiv.org/abs/2506.21266
  [13] Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, Austin Zachary Henley, Paul Denny, Michelle Craig, and Tovi Grossman. 2024. CodeAid: Evaluating a classroom deployment of an LLM-based programming assistant that balances student and educator needs. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–20.
  [14] Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, number Article 388 in CHI '22, pages 1–19, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3491102.3502030
  [15] Qianou Ma, Kenneth R Koedinger, and Tongshuang Wu. 2026. Not everyone wins with LLMs: Behavioral patterns and pedagogical implications for AI literacy in programmatic data science. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, CHI '26, pages 1–22, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3772318.3791283
  [16] Qianou Ma, Tongshuang Wu, and Kenneth Koedinger. 2023. Is AI the better programming partner? Human-human pair programming vs. human-AI pAIr programming. arXiv [cs.HC], pages 64–77. https://doi.org/10.48550/arXiv.2306.05153
  [17] Laurens van der Maaten and Geoffrey E Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
  [18] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2022. Reading between the lines: Modeling user behavior and costs in AI-assisted programming. arXiv. https://doi.org/10.48550/ARXIV.2210.14306
  [19] Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, and David Sontag. 2024. The RealHumanEval: Evaluating large language models' abilities to support programmers. arXiv, abs/2404.02806. https://doi.org/10.48550/arXiv.2404.02806
  [20] Sheshera Mysore, Debarati Das, Hancheng Cao, and Bahareh Sarrafzadeh. 2025. Prototypical human-AI collaboration behaviors from LLM-assisted writing in the wild. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16830–16857, Stroudsburg, PA, USA. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-main.852
  [21] Hyunchan Park, Youngpil Kim, Kyungwoon Lee, Soonheon Jin, Jinseok Kim, Yan Heo, Gyuho Kim, and Eunhye Kim. 2025. CodeDive: A web-based IDE with real-time code activity monitoring for programming education. Applied Sciences (Basel), 15(19):10403. https://doi.org/10.3390/app151910403
  [22] Vo Thien Tri Pham and Caitlin Kelleher. 2025. Code histories: Documenting development by recording code influences and changes in code. Journal of Computer Languages, 82(101313):101313. https://doi.org/10.1016/j.cola.2024.101313
  [23] Thomas W Price, David Hovemeyer, Kelly Rivers, Ge Gao, Austin Cory Bart, Ayaan M Kazerouni, Brett A Becker, Andrew Petersen, Luke Gusukuma, Stephen H Edwards, and David Babcock. 2020. ProgSnap2: A flexible format for programming process data. In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education. https://doi.org/10.1145/3341525.3387373
  [24] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084
  [25] Agnia Sergeyuk, Eric Huang, Dariia Karaeva, Anastasiia Serova, Yaroslav Golubev, and Iftekhar Ahmed. 2026. Evolving with AI: A longitudinal analysis of developer logs. arXiv [cs.SE]. https://doi.org/10.48550/arXiv.2601.10258
  [26] Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, Michael Stern, Brian Clarke, Landon Goldberg, Theodore R Sumers, Jared Mueller, William McEachen, Wes Mitchell, Shan Carter, Jack Clark, and 2 others. 2024. Clio: Privacy-preserving insights into real-world AI use. https://arxiv.org/abs/2412.13678
  [27] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models.
  [28] Tao Xiao, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto. 2024. DevGPT: Studying developer-ChatGPT conversations. In Proceedings of the 21st International Conference on Mining Software Repositories, pages 227–230, New York, NY, USA. ACM. https://doi.org/10.1145/3643991.3648400
  [29] Benjamin Xie, Jared Ordona Lim, Paul K D Pham, Min Li, and Amy J Ko. 2023. Developing novice programmers' self-regulation skills with code replays. In Proceedings of the 2023 ACM Conference on International Computing Education Research V.1, pages 298–313, New York, NY, USA. ACM. https://doi.org/10.1145/3568813.3600127
  [30] Lisa Yan, Annie Hu, and Chris Piech. 2019. Pensieve: Feedback on coding process for novices. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 253–259, New York, NY, USA. ACM. https://doi.org/10.1145/3287324.3287483
  [31] Ashley Ge Zhang, Yan-Ru Jhou, Yinuo Yang, Shamita Rao, Maryam Arab, Yan Chen, and Steve Oney. 2026. Editrail: Understanding AI usage by visualizing student-AI interaction in code. arXiv [cs.HC]. https://doi.org/10.48550/arXiv.2601.20085
  [32] Ziqian Zhong, Shashwat Saxena, and Aditi Raghunathan. 2026. Hodoscope: Unsupervised behavior discovery in AI agents. https://hodoscope.dev/blog/announcement.html