Recognition: unknown
An Empirical Study of Proactive Coding Assistants in Real-World Software Development
Pith reviewed 2026-05-08 09:18 UTC · model grok-4.3
The pith
Real IDE traces from 1,246 developers differ from LLM-simulated ones in behavioral diversity, temporal structure, and exploratory patterns, revealing that simulation-based evaluation overestimates proactive coding assistant performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Paired comparison of real IDE traces collected from 1,246 industry developers over three days against LLM-generated counterparts shows that simulated traces exhibit lower behavioral diversity, flatter temporal structure, and reduced exploratory patterns. When representative LLMs, retrieval-augmented systems, and agent baselines are evaluated on the real traces through the new ProCodeBench, their reliability drops substantially below levels observed in simulation. A training study further establishes that pre-training on simulated data followed by fine-tuning on real data yields better results than either source used alone.
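To make the comparison concrete, here is a minimal sketch of one plausible behavioral-diversity measure, Shannon entropy over IDE event types, computed for a real trace and a simulated counterpart. The event names and the entropy-based metric are illustrative assumptions, not the paper's published metric definitions.

```python
# Hypothetical sketch: comparing behavioral diversity of real vs. simulated
# IDE traces via Shannon entropy over event types. The event vocabulary and
# the entropy-based metric are illustrative assumptions, not the paper's
# exact definitions.
from collections import Counter
from math import log2

def event_entropy(trace):
    """Shannon entropy (bits) of the event-type distribution in one trace."""
    counts = Counter(event["type"] for event in trace)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

real_trace = [
    {"type": "file_open"}, {"type": "edit"}, {"type": "navigate"},
    {"type": "edit"}, {"type": "file_open"}, {"type": "context_switch"},
]
simulated_trace = [
    {"type": "edit"}, {"type": "edit"}, {"type": "edit"}, {"type": "edit"},
]

print(f"real diversity:      {event_entropy(real_trace):.2f} bits")
print(f"simulated diversity: {event_entropy(simulated_trace):.2f} bits")
```

A lower entropy for the simulated trace would correspond to the "lower behavioral diversity" the claim describes; the paper's actual metrics may differ.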
What carries the argument
The side-by-side real-versus-simulated IDE trace datasets that expose metric gaps and serve as the basis for the ProCodeBench evaluation.
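For intuition about what one paired record might contain, a minimal sketch of a hypothetical schema follows; the field names are assumptions for illustration and do not reflect the released ProCodeBench format.

```python
# Hypothetical schema for a paired real/simulated trace record; field names
# are illustrative assumptions, not the released ProCodeBench format.
from dataclasses import dataclass, field

@dataclass
class IDEEvent:
    timestamp: float   # seconds since session start
    event_type: str    # e.g. "file_open", "edit", "navigate", "context_switch"
    file_path: str     # repository-relative path of the affected file

@dataclass
class PairedTrace:
    developer_id: str  # anonymized participant identifier
    repository: str    # repository the session took place in
    real_events: list[IDEEvent] = field(default_factory=list)
    simulated_events: list[IDEEvent] = field(default_factory=list)  # LLM-generated counterpart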
Load-bearing premise
The three-day traces gathered from 1,246 developers through the custom extension capture representative behavior without systematic omissions or selection bias.
What would settle it
A replication that measures the same diversity, temporal, and exploratory metrics on the collected traces and finds no substantial differences between the real and simulated versions, or one that demonstrates models reaching high accuracy on the real traces without any real-data fine-tuning.
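One way a replication could operationalize "no substantial differences" is a paired permutation test on a per-trace metric; the sketch below uses made-up values and is not the paper's statistical procedure.

```python
# Hypothetical paired permutation test: does a per-trace metric (e.g. event
# entropy) differ between real traces and their simulated counterparts?
# The metric values below are made up for illustration.
import random

def paired_permutation_test(real_scores, sim_scores, n_permutations=10_000, seed=0):
    rng = random.Random(seed)
    diffs = [r - s for r, s in zip(real_scores, sim_scores)]
    observed = sum(diffs) / len(diffs)
    count = 0
    for _ in range(n_permutations):
        # Randomly flip the sign of each paired difference under the null.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= abs(observed):
            count += 1
    return observed, count / n_permutations

real = [1.9, 2.1, 1.7, 2.3, 2.0]  # e.g. entropy of real traces (made up)
sim = [1.2, 1.4, 1.1, 1.5, 1.3]   # entropy of paired simulated traces (made up)
mean_diff, p_value = paired_permutation_test(real, sim)
print(f"mean paired difference = {mean_diff:.2f}, p = {p_value:.3f}")
```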
Original abstract
Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance. However, research in this direction is limited by the scarcity of large-scale real-world developer behavior data. Existing studies therefore often rely on LLM-simulated IDE traces, whose fidelity to real development behavior remains unclear. In this paper, we investigate this simulation-to-reality gap through a large-scale empirical study. We collect real IDE interaction traces from 1{,}246 experienced industry developers over three consecutive days using a custom Visual Studio Code extension, and construct paired LLM-simulated traces for controlled comparison. Our analysis shows that simulated traces differ substantially from real traces in behavioral diversity, temporal structure, and exploratory patterns. Based on the collected data, we introduce \textbf{ProCodeBench}, a real-world benchmark for proactive intent prediction. Experiments with representative LLMs, retrieval-augmented methods, and agentic baselines show that current approaches remain far from reliable under real IDE traces, suggesting that simulation-based evaluation can overestimate real-world performance. Finally, our training study shows that simulated data cannot replace real data, but can complement it when used before real-world fine-tuning. These findings highlight the importance of real developer behavior data for evaluating and training proactive coding assistants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM-simulated IDE traces differ substantially, in behavioral diversity, temporal structure, and exploratory patterns, from real traces collected from 1,246 industry developers over three days. It introduces ProCodeBench as a real-world benchmark for proactive intent prediction, shows that representative LLMs, RAG methods, and agentic baselines perform poorly on real traces (with simulation overestimating performance), and demonstrates via training experiments that simulated data cannot replace real data but can complement it when used prior to real-world fine-tuning.
Significance. If the findings hold, the work is significant for software engineering and AI-assisted development research. It supplies the first large-scale empirical evidence of a simulation-to-reality gap for proactive coding assistants, releases a real-trace benchmark (ProCodeBench), and provides actionable guidance on mixing simulated and real data for training. The scale of the data collection and the controlled paired-trace comparison are strengths that could shift evaluation practices away from pure simulation.
major comments (1)
- [Section 3] Data collection procedure (Section 3 and ProCodeBench construction): The central claims—that simulated traces differ in key dimensions and that simulation overestimates real-world performance—rest on the collected traces being a faithful, unbiased proxy for typical developer IDE behavior. The three-day window with a volunteer sample of 1,246 developers using a custom VS Code extension risks capturing atypical patterns (e.g., onboarding effects) and may omit low-level events such as keystroke timing or focus shifts. No validation against established developer behavior metrics or longer-term collection is described, leaving open the possibility that observed differences are artifacts of incomplete or non-representative logging rather than intrinsic simulation gaps.
minor comments (1)
- [Abstract] Abstract: the notation '1{,}246' is a formatting artifact; ensure consistent comma or thin-space separators for large numbers throughout the manuscript and tables.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our data collection approach. We address the concerns point by point below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [Section 3] Data collection procedure (Section 3 and ProCodeBench construction): The central claims—that simulated traces differ in key dimensions and that simulation overestimates real-world performance—rest on the collected traces being a faithful, unbiased proxy for typical developer IDE behavior. The three-day window with a volunteer sample of 1,246 developers using a custom VS Code extension risks capturing atypical patterns (e.g., onboarding effects) and may omit low-level events such as keystroke timing or focus shifts. No validation against established developer behavior metrics or longer-term collection is described, leaving open the possibility that observed differences are artifacts of incomplete or non-representative logging rather than intrinsic simulation gaps.
Authors: We agree that the representativeness of the collected traces is foundational to our claims and appreciate the opportunity to clarify our design choices. The three-day collection window was determined through pilot studies to maximize data volume per participant while maintaining high compliance rates; longer durations were found to increase dropout in preliminary tests. Our sample of 1,246 experienced industry developers is, to our knowledge, the largest of its kind for IDE trace collection, providing statistical robustness against idiosyncratic behaviors.

The custom VS Code extension was engineered to capture high-level events (file opens/closes, edits, navigations, and context switches) that directly support proactive intent prediction, the core task of ProCodeBench, while respecting privacy and avoiding excessive runtime overhead. Low-level signals such as per-keystroke timing or micro focus shifts fall outside the scope of the benchmark, as they are not required for the intent inference models we evaluate.

We acknowledge that the original manuscript does not include explicit validation against established developer behavior metrics from prior literature (e.g., cross-referencing with studies on edit patterns or navigation frequency). We will add a dedicated paragraph in the revised Section 3 that maps our logged events to those used in related developer activity research and discusses alignment. Onboarding effects are mitigated by the consecutive-day design and instructions to participants to follow their normal workflows; we will also report per-day statistics in the revision to allow readers to assess stabilization. Longer-term collection was not performed due to practical constraints on participant retention and study resources. Importantly, the paired-trace design—
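For readers who want to see what the promised per-day stabilization report might look like, a minimal sketch follows; the event fields and aggregation choices are assumptions, not the revision's actual analysis.

```python
# Hypothetical per-day stabilization check: do activity levels and event mix
# settle after day 1 (i.e. little evidence of onboarding effects)?
# The event structure and grouping are illustrative assumptions.
from collections import Counter, defaultdict

def per_day_statistics(events):
    """events: iterable of dicts with 'day' (1..3) and 'type' keys."""
    by_day = defaultdict(list)
    for event in events:
        by_day[event["day"]].append(event["type"])
    stats = {}
    for day, types in sorted(by_day.items()):
        counts = Counter(types)
        stats[day] = {
            "n_events": len(types),
            "top_event": counts.most_common(1)[0][0],
            "n_distinct_types": len(counts),
        }
    return stats

events = [
    {"day": 1, "type": "file_open"}, {"day": 1, "type": "edit"},
    {"day": 2, "type": "edit"}, {"day": 2, "type": "navigate"},
    {"day": 3, "type": "edit"}, {"day": 3, "type": "context_switch"},
]
for day, day_stats in per_day_statistics(events).items():
    print(day, day_stats)
```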
Circularity Check
No circularity: purely empirical study without derivations or self-referential fitting
full rationale
The paper is a large-scale empirical investigation that collects real IDE traces from 1,246 developers via a custom VS Code extension, generates paired simulated traces, performs comparative analysis on behavioral metrics, introduces ProCodeBench as a benchmark, runs model evaluations, and conducts a training study. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All central claims rest on direct data collection and external benchmarks rather than reducing to the study's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The custom VS Code extension accurately records all relevant developer IDE interactions and intent signals without bias or omission.
- domain assumption: The 1,246 developers and three-day collection period produce traces representative of general industry developer behavior.
Reference graph
Works this paper leans on
- [1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
- [2] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., "Competition-level code generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
- [3] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, "An empirical evaluation of using large language models for automated unit test generation," IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2023.
- [4] A. Sergeyuk, E. Huang, D. Karaeva, A. Serova, Y. Golubev, and I. Ahmed, "Evolving with AI: A longitudinal analysis of developer logs," arXiv preprint arXiv:2601.10258, 2026.
- [5] B. Puryear and G. Sprint, "GitHub Copilot in the classroom: Learning to code with AI assistance," Journal of Computing Sciences in Colleges, vol. 38, no. 1, pp. 37–47, 2022.
- [6] S. Barke, M. B. James, and N. Polikarpova, "Grounded Copilot: How programmers interact with code-generating models," Proceedings of the ACM on Programming Languages, vol. 7, no. OOPSLA1, pp. 85–111, 2023.
- [7] J. T. Liang, C. Yang, and B. A. Myers, "A large-scale survey on the usability of AI programming assistants: Successes and challenges," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1–13, 2024.
- [8] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, "Beyond code generation: An observational study of ChatGPT usage in software engineering practice," Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1819–1840, 2024.
- [9] M. Bruch, M. Monperrus, and M. Mezini, "Learning from examples to improve code completion systems," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 213–222, 2009.
- [10] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 419–428, 2014.
- [11] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. R. Narasimhan, and O. Press, "SWE-agent: Agent-computer interfaces enable automated software engineering," in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [12] V. Chen, A. Zhu, S. Zhao, H. Mozannar, D. Sontag, and A. Talwalkar, "Need help? Designing proactive AI assistants for programming," in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–18, 2025.
- [13] S. Zhao, A. Zhu, H. Mozannar, D. Sontag, A. Talwalkar, and V. Chen, "CodingGenie: A proactive LLM-powered programming assistant," in Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, pp. 1168–1172, 2025.
- [14] N. Tang, C. Chen, Z. Fang, G. Xu, M. Dhakal, Y. Shi, C. McMillan, Y. Huang, and T. J.-J. Li, "Programming by chat: A large-scale behavioral analysis of 11,579 real-world AI-assisted IDE sessions," arXiv preprint arXiv:2604.00436, 2026.
- [15] N. Kuo, A. Sergeyuk, V. Chen, and M. Izadi, "Developer interaction patterns with proactive AI: A five-day field study," arXiv preprint arXiv:2601.10253, 2026.
- [16] H. Mozannar, G. Bansal, A. Fourney, and E. Horvitz, "Reading between the lines: Modeling user behavior and costs in AI-assisted programming," in Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–16, 2024.
- [17] Y. Lu, S. Yang, C. Qian, G. Chen, Q. Luo, Y. Wu, H. Wang, X. Cong, Z. Zhang, Y. Lin, et al., "Proactive agent: Shifting LLM agents from reactive responses to active assistance," arXiv preprint arXiv:2410.12361, 2024.
- [18] J. Kim, J. Choi, W. Chay, D. Kyung, Y. Kwon, Y. Jo, and E. Choi, "Propersim: Developing proactive and personalized AI assistants through user-assistant simulation," arXiv preprint arXiv:2509.21730, 2025.
- [19] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J.-G. Lou, and W. Chen, "RepoCoder: Repository-level code completion through iterative retrieval and generation," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2471–2484, 2023.
- [20] S. Ouyang, W. Yu, K. Ma, Z. Xiao, Z. Zhang, M. Jia, J. Han, H. Zhang, and D. Yu, "RepoGraph: Enhancing AI software engineering with repository-level code graph," in 13th International Conference on Learning Representations, ICLR 2025, pp. 30361–30384, 2025.
- [21] M. Du, B. Xu, C. Zhu, S. Wang, P. Wang, X. Wang, and Z. Mao, "A-RAG: Scaling agentic retrieval-augmented generation via hierarchical retrieval interfaces," arXiv preprint arXiv:2602.03442, 2026.
- [22] J. Wang and Y. Chen, "A review on code generation with LLMs: Application and evaluation," in 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), pp. 284–289, IEEE, 2023.
- [23] C. S. Xia, Y. Wei, and L. Zhang, "Automated program repair in the era of large pre-trained language models," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 1482–1494, IEEE, 2023.
- [24] Y. Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, and B. Xiang, "CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion," in Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [25] Y. Deng, W. Lei, W. Lam, and T.-S. Chua, "A survey on proactive dialogue systems: Problems, methods, and prospects," arXiv preprint arXiv:2305.02750, 2023.
- [26]
- [27] Y. Tang, H. Tang, T. Cao, L. Nguyen, A. Zhang, X. Cao, C. Liu, W. Ding, and Y. Li, "ProAgentBench: Evaluating LLM agents for proactive assistance with real-world data," arXiv preprint arXiv:2602.04482, 2025.
- [28] Y. Chai, S. Tang, H. Xiao, R. Liu, and H. Li, "Pirabench: A transition from reactive GUI agents to GUI-based proactive intent recommendation agents," arXiv preprint arXiv:2603.08013, 2026.
- [29] J. Li, G. Li, Y. Zhao, Y. Li, H. Liu, H. Zhu, L. Wang, K. Liu, Z. Fang, L. Wang, et al., "DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories," in Findings of the Association for Computational Linguistics: ACL 2024, pp. 3603–3614, 2024.
- [30] J. Li, G. Li, X. Zhang, Y. Zhao, Y. Dong, Z. Jin, B. Li, F. Huang, and Y. Li, "EvoCodeBench: An evolving code generation benchmark with domain-specific evaluations," Advances in Neural Information Processing Systems, vol. 37, pp. 57619–57641, 2024.
- [31] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.
- [32] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?," in The Twelfth International Conference on Learning Representations, 2024.
- [33] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," arXiv preprint arXiv:1909.09436, 2019.
- [34] B. Li, W. Wu, Z. Tang, L. Shi, J. Yang, J. Li, S. Yao, C. Qian, B. Hui, Q. Zhang, et al., "DevBench: A comprehensive benchmark for software development," arXiv preprint arXiv:2403.08604, 2024.
- [35] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al., "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena," Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.
- [36] P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, L. Kong, Q. Liu, T. Liu, et al., "Large language models are not fair evaluators," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9440–9450, 2024.
- [37] GLM-5-Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, C. Zhu, C. Yin, C. Wang, G. Pan, H. Zeng, H. Zhang, H. Wang, H. Chen, J. Zhang, J. Jiao, J. Guo, J. Wang, J. Du, J. Wu, K. Wang, L. Li, L. Fan, L. Zhong, M. Liu, M. Zhao, P. Du, Q. Dong, R. Lu, Shuang-Li, S. Cao, S. Liu, T. Jiang, X. Chen, X. Zhang, X. Huang, X..., "GLM-5: From vibe coding to agentic engineering," 2026.
- [38] Qwen Team, "Qwen3.5: Towards native multimodal agents," February 2026.
- [39] S. Zhang, Y. Ding, S. Lian, S. Song, and H. Li, "CodeRAG: Finding relevant and necessary knowledge for retrieval-augmented repository-level code completion," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 23289–23299, 2025.
- [40] W. Liu, A. Yu, D. Zan, B. Shen, W. Zhang, H. Zhao, Z. Jin, and Q. Wang, "GraphCoder: Enhancing repository-level code completion via coarse-to-fine retrieval based on code context graph," in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 570–581, 2024.
- [41] Qwen Team, "Qwen3 technical report," 2025.
- [42] Team GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang, J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang, J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang, P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao, S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang, X. Gu, X. Lv, X. Liu, X. Liu, X. Yang..., "ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools," 2024.
- [43] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.