Recognition: unknown
An Empirical Study of Proactive Coding Assistants in Real-World Software Development
Pith reviewed 2026-05-08 09:18 UTC · model grok-4.3
The pith
Real IDE traces from 1,246 developers differ from LLM-simulated ones in behavioral diversity, temporal structure, and exploratory patterns, revealing that simulation-based evaluation overestimates proactive coding assistant performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Paired comparison of real IDE traces collected from 1,246 industry developers over three days against LLM-generated counterparts shows that simulated traces exhibit lower behavioral diversity, flatter temporal structure, and reduced exploratory patterns. When representative LLMs, retrieval-augmented systems, and agent baselines are evaluated on the real traces through the new ProCodeBench, their reliability drops substantially below levels observed in simulation. A training study further establishes that pre-training on simulated data followed by fine-tuning on real data yields better results than either source used alone.
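To make the comparison concrete, here is a minimal sketch of one plausible behavioral-diversity measure, Shannon entropy over IDE event types, computed for a real trace and a simulated counterpart. The event names and the entropy-based metric are illustrative assumptions, not the paper's published metric definitions.

```python
# Hypothetical sketch: comparing behavioral diversity of real vs. simulated
# IDE traces via Shannon entropy over event types. The event vocabulary and
# the entropy-based metric are illustrative assumptions, not the paper's
# exact definitions.
from collections import Counter
from math import log2

def event_entropy(trace):
    """Shannon entropy (bits) of the event-type distribution in one trace."""
    counts = Counter(event["type"] for event in trace)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

real_trace = [
    {"type": "file_open"}, {"type": "edit"}, {"type": "navigate"},
    {"type": "edit"}, {"type": "file_open"}, {"type": "context_switch"},
]
simulated_trace = [
    {"type": "edit"}, {"type": "edit"}, {"type": "edit"}, {"type": "edit"},
]

print(f"real diversity:      {event_entropy(real_trace):.2f} bits")
print(f"simulated diversity: {event_entropy(simulated_trace):.2f} bits")
```

A lower entropy for the simulated trace would correspond to the "lower behavioral diversity" the claim describes; the paper's actual metrics may differ.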
What carries the argument
The side-by-side real-versus-simulated IDE trace datasets that expose metric gaps and serve as the basis for the ProCodeBench evaluation.
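For intuition about what one paired record might contain, a minimal sketch of a hypothetical schema follows; the field names are assumptions for illustration and do not reflect the released ProCodeBench format.

```python
# Hypothetical schema for a paired real/simulated trace record; field names
# are illustrative assumptions, not the released ProCodeBench format.
from dataclasses import dataclass, field

@dataclass
class IDEEvent:
    timestamp: float   # seconds since session start
    event_type: str    # e.g. "file_open", "edit", "navigate", "context_switch"
    file_path: str     # repository-relative path of the affected file

@dataclass
class PairedTrace:
    developer_id: str  # anonymized participant identifier
    repository: str    # repository the session took place in
    real_events: list[IDEEvent] = field(default_factory=list)
    simulated_events: list[IDEEvent] = field(default_factory=list)  # LLM-generated counterpart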
Load-bearing premise
The three-day traces gathered from 1,246 developers through the custom extension capture representative behavior without systematic omissions or selection bias.
What would settle it
A replication that measures the same diversity, temporal, and exploratory metrics on the collected traces and finds no substantial differences between the real and simulated versions, or one that demonstrates models reaching high accuracy on the real traces without any real-data fine-tuning.
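One way a replication could operationalize "no substantial differences" is a paired permutation test on a per-trace metric; the sketch below uses made-up values and is not the paper's statistical procedure.

```python
# Hypothetical paired permutation test: does a per-trace metric (e.g. event
# entropy) differ between real traces and their simulated counterparts?
# The metric values below are made up for illustration.
import random

def paired_permutation_test(real_scores, sim_scores, n_permutations=10_000, seed=0):
    rng = random.Random(seed)
    diffs = [r - s for r, s in zip(real_scores, sim_scores)]
    observed = sum(diffs) / len(diffs)
    count = 0
    for _ in range(n_permutations):
        # Randomly flip the sign of each paired difference under the null.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= abs(observed):
            count += 1
    return observed, count / n_permutations

real = [1.9, 2.1, 1.7, 2.3, 2.0]  # e.g. entropy of real traces (made up)
sim = [1.2, 1.4, 1.1, 1.5, 1.3]   # entropy of paired simulated traces (made up)
mean_diff, p_value = paired_permutation_test(real, sim)
print(f"mean paired difference = {mean_diff:.2f}, p = {p_value:.3f}")
```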
Original abstract
Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance. However, research in this direction is limited by the scarcity of large-scale real-world developer behavior data. Existing studies therefore often rely on LLM-simulated IDE traces, whose fidelity to real development behavior remains unclear. In this paper, we investigate this simulation-to-reality gap through a large-scale empirical study. We collect real IDE interaction traces from 1{,}246 experienced industry developers over three consecutive days using a custom Visual Studio Code extension, and construct paired LLM-simulated traces for controlled comparison. Our analysis shows that simulated traces differ substantially from real traces in behavioral diversity, temporal structure, and exploratory patterns. Based on the collected data, we introduce \textbf{ProCodeBench}, a real-world benchmark for proactive intent prediction. Experiments with representative LLMs, retrieval-augmented methods, and agentic baselines show that current approaches remain far from reliable under real IDE traces, suggesting that simulation-based evaluation can overestimate real-world performance. Finally, our training study shows that simulated data cannot replace real data, but can complement it when used before real-world fine-tuning. These findings highlight the importance of real developer behavior data for evaluating and training proactive coding assistants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM-simulated IDE traces differ substantially, in behavioral diversity, temporal structure, and exploratory patterns, from real traces collected from 1,246 industry developers over three days. It introduces ProCodeBench as a real-world benchmark for proactive intent prediction, shows that representative LLMs, RAG methods, and agentic baselines perform poorly on real traces (with simulation overestimating performance), and demonstrates via training experiments that simulated data cannot replace real data but can complement it when used prior to real-world fine-tuning.
Significance. If the findings hold, the work is significant for software engineering and AI-assisted development research. It supplies the first large-scale empirical evidence of a simulation-to-reality gap for proactive coding assistants, releases a real-trace benchmark (ProCodeBench), and provides actionable guidance on mixing simulated and real data for training. The scale of the data collection and the controlled paired-trace comparison are strengths that could shift evaluation practices away from pure simulation.
major comments (1)
- [Section 3] Data collection procedure (Section 3 and ProCodeBench construction): The central claims—that simulated traces differ in key dimensions and that simulation overestimates real-world performance—rest on the collected traces being a faithful, unbiased proxy for typical developer IDE behavior. The three-day window with a volunteer sample of 1,246 developers using a custom VS Code extension risks capturing atypical patterns (e.g., onboarding effects) and may omit low-level events such as keystroke timing or focus shifts. No validation against established developer behavior metrics or longer-term collection is described, leaving open the possibility that observed differences are artifacts of incomplete or non-representative logging rather than intrinsic simulation gaps.
minor comments (1)
- [Abstract] Abstract: the notation '1{,}246' is a formatting artifact; ensure consistent comma or thin-space separators for large numbers throughout the manuscript and tables.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our data collection approach. We address the concerns point by point below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [Section 3] Data collection procedure (Section 3 and ProCodeBench construction): The central claims—that simulated traces differ in key dimensions and that simulation overestimates real-world performance—rest on the collected traces being a faithful, unbiased proxy for typical developer IDE behavior. The three-day window with a volunteer sample of 1,246 developers using a custom VS Code extension risks capturing atypical patterns (e.g., onboarding effects) and may omit low-level events such as keystroke timing or focus shifts. No validation against established developer behavior metrics or longer-term collection is described, leaving open the possibility that observed differences are artifacts of incomplete or non-representative logging rather than intrinsic simulation gaps.
Authors: We agree that the representativeness of the collected traces is foundational to our claims and appreciate the opportunity to clarify our design choices. The three-day collection window was determined through pilot studies to maximize data volume per participant while maintaining high compliance rates; longer durations were found to increase dropout in preliminary tests. Our sample of 1,246 experienced industry developers is, to our knowledge, the largest of its kind for IDE trace collection, providing statistical robustness against idiosyncratic behaviors.

The custom VS Code extension was engineered to capture high-level events (file opens/closes, edits, navigations, and context switches) that directly support proactive intent prediction, the core task of ProCodeBench, while respecting privacy and avoiding excessive runtime overhead. Low-level signals such as per-keystroke timing or micro focus shifts fall outside the scope of the benchmark, as they are not required for the intent inference models we evaluate.

We acknowledge that the original manuscript does not include explicit validation against established developer behavior metrics from prior literature (e.g., cross-referencing with studies on edit patterns or navigation frequency). We will add a dedicated paragraph in the revised Section 3 that maps our logged events to those used in related developer activity research and discusses alignment. Onboarding effects are mitigated by the consecutive-day design and instructions to participants to follow their normal workflows; we will also report per-day statistics in the revision to allow readers to assess stabilization. Longer-term collection was not performed due to practical constraints on participant retention and study resources. Importantly, the paired-trace design—
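For readers who want to see what the promised per-day stabilization report might look like, a minimal sketch follows; the event fields and aggregation choices are assumptions, not the revision's actual analysis.

```python
# Hypothetical per-day stabilization check: do activity levels and event mix
# settle after day 1 (i.e. little evidence of onboarding effects)?
# The event structure and grouping are illustrative assumptions.
from collections import Counter, defaultdict

def per_day_statistics(events):
    """events: iterable of dicts with 'day' (1..3) and 'type' keys."""
    by_day = defaultdict(list)
    for event in events:
        by_day[event["day"]].append(event["type"])
    stats = {}
    for day, types in sorted(by_day.items()):
        counts = Counter(types)
        stats[day] = {
            "n_events": len(types),
            "top_event": counts.most_common(1)[0][0],
            "n_distinct_types": len(counts),
        }
    return stats

events = [
    {"day": 1, "type": "file_open"}, {"day": 1, "type": "edit"},
    {"day": 2, "type": "edit"}, {"day": 2, "type": "navigate"},
    {"day": 3, "type": "edit"}, {"day": 3, "type": "context_switch"},
]
for day, day_stats in per_day_statistics(events).items():
    print(day, day_stats)
```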
Circularity Check
No circularity: purely empirical study without derivations or self-referential fitting
full rationale
The paper is a large-scale empirical investigation that collects real IDE traces from 1,246 developers via a custom VS Code extension, generates paired simulated traces, performs comparative analysis on behavioral metrics, introduces ProCodeBench as a benchmark, runs model evaluations, and conducts a training study. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All central claims rest on direct data collection and external benchmarks rather than reducing to the study's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The custom VS Code extension accurately records all relevant developer IDE interactions and intent signals without bias or omission.
- domain assumption: The 1,246 developers and three-day collection period produce traces representative of general industry developer behavior.
Reference graph
Works this paper leans on
- [1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
- [2] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., "Competition-level code generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
- [3] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, "An empirical evaluation of using large language models for automated unit test generation," IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2023.
- [4] A. Sergeyuk, E. Huang, D. Karaeva, A. Serova, Y. Golubev, and I. Ahmed, "Evolving with AI: A longitudinal analysis of developer logs," arXiv preprint arXiv:2601.10258, 2026.
- [5] B. Puryear and G. Sprint, "GitHub Copilot in the classroom: Learning to code with AI assistance," Journal of Computing Sciences in Colleges, vol. 38, no. 1, pp. 37–47, 2022.
- [6] S. Barke, M. B. James, and N. Polikarpova, "Grounded Copilot: How programmers interact with code-generating models," Proceedings of the ACM on Programming Languages, vol. 7, no. OOPSLA1, pp. 85–111, 2023.
- [7] J. T. Liang, C. Yang, and B. A. Myers, "A large-scale survey on the usability of AI programming assistants: Successes and challenges," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1–13, 2024.
- [8] R. Khojah, M. Mohamad, P. Leitner, and F. G. de Oliveira Neto, "Beyond code generation: An observational study of ChatGPT usage in software engineering practice," Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1819–1840, 2024.
- [9] M. Bruch, M. Monperrus, and M. Mezini, "Learning from examples to improve code completion systems," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 213–222, 2009.
- [10] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 419–428, 2014.
- [11] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. R. Narasimhan, and O. Press, "SWE-agent: Agent-computer interfaces enable automated software engineering," in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [12] V. Chen, A. Zhu, S. Zhao, H. Mozannar, D. Sontag, and A. Talwalkar, "Need help? Designing proactive AI assistants for programming," in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–18, 2025.
- [13] S. Zhao, A. Zhu, H. Mozannar, D. Sontag, A. Talwalkar, and V. Chen, "CodingGenie: A proactive LLM-powered programming assistant," in Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, pp. 1168–1172, 2025.
- [14] N. Tang, C. Chen, Z. Fang, G. Xu, M. Dhakal, Y. Shi, C. McMillan, Y. Huang, and T. J.-J. Li, "Programming by chat: A large-scale behavioral analysis of 11,579 real-world AI-assisted IDE sessions," arXiv preprint arXiv:2604.00436, 2026.
- [15] N. Kuo, A. Sergeyuk, V. Chen, and M. Izadi, "Developer interaction patterns with proactive AI: A five-day field study," arXiv preprint arXiv:2601.10253, 2026.
- [16] H. Mozannar, G. Bansal, A. Fourney, and E. Horvitz, "Reading between the lines: Modeling user behavior and costs in AI-assisted programming," in Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–16, 2024.
- [17] Y. Lu, S. Yang, C. Qian, G. Chen, Q. Luo, Y. Wu, H. Wang, X. Cong, Z. Zhang, Y. Lin, et al., "Proactive agent: Shifting LLM agents from reactive responses to active assistance," arXiv preprint arXiv:2410.12361, 2024.
- [18] J. Kim, J. Choi, W. Chay, D. Kyung, Y. Kwon, Y. Jo, and E. Choi, "Propersim: Developing proactive and personalized AI assistants through user-assistant simulation," arXiv preprint arXiv:2509.21730, 2025.
- [19] F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J.-G. Lou, and W. Chen, "RepoCoder: Repository-level code completion through iterative retrieval and generation," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2471–2484, 2023.
- [20] S. Ouyang, W. Yu, K. Ma, Z. Xiao, Z. Zhang, M. Jia, J. Han, H. Zhang, and D. Yu, "RepoGraph: Enhancing AI software engineering with repository-level code graph," in 13th International Conference on Learning Representations, ICLR 2025, pp. 30361–30384, 2025.
- [21] M. Du, B. Xu, C. Zhu, S. Wang, P. Wang, X. Wang, and Z. Mao, "A-RAG: Scaling agentic retrieval-augmented generation via hierarchical retrieval interfaces," arXiv preprint arXiv:2602.03442, 2026.
- [22] J. Wang and Y. Chen, "A review on code generation with LLMs: Application and evaluation," in 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), pp. 284–289, IEEE, 2023.
- [23] C. S. Xia, Y. Wei, and L. Zhang, "Automated program repair in the era of large pre-trained language models," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 1482–1494, IEEE, 2023.
- [24] Y. Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, and B. Xiang, "CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion," in Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [25] Y. Deng, W. Lei, W. Lam, and T.-S. Chua, "A survey on proactive dialogue systems: Problems, methods, and prospects," arXiv preprint arXiv:2305.02750, 2023.
- [26]
- [27] Y. Tang, H. Tang, T. Cao, L. Nguyen, A. Zhang, X. Cao, C. Liu, W. Ding, and Y. Li, "ProAgentBench: Evaluating LLM agents for proactive assistance with real-world data," arXiv preprint arXiv:2602.04482, 2025.
- [28] Y. Chai, S. Tang, H. Xiao, R. Liu, and H. Li, "Pirabench: A transition from reactive GUI agents to GUI-based proactive intent recommendation agents," arXiv preprint arXiv:2603.08013, 2026.
- [29] J. Li, G. Li, Y. Zhao, Y. Li, H. Liu, H. Zhu, L. Wang, K. Liu, Z. Fang, L. Wang, et al., "DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories," in Findings of the Association for Computational Linguistics: ACL 2024, pp. 3603–3614, 2024.
- [30] J. Li, G. Li, X. Zhang, Y. Zhao, Y. Dong, Z. Jin, B. Li, F. Huang, and Y. Li, "EvoCodeBench: An evolving code generation benchmark with domain-specific evaluations," Advances in Neural Information Processing Systems, vol. 37, pp. 57619–57641, 2024.
- [31] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.
- [32] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?," in The Twelfth International Conference on Learning Representations, 2024.
- [33] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," arXiv preprint arXiv:1909.09436, 2019.
- [34] B. Li, W. Wu, Z. Tang, L. Shi, J. Yang, J. Li, S. Yao, C. Qian, B. Hui, Q. Zhang, et al., "DevBench: A comprehensive benchmark for software development," arXiv preprint arXiv:2403.08604, 2024.
- [35] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al., "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena," Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.
- [36] P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, L. Kong, Q. Liu, T. Liu, et al., "Large language models are not fair evaluators," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9440–9450, 2024.
- [37] GLM-5-Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, C. Zhu, C. Yin, C. Wang, G. Pan, H. Zeng, H. Zhang, H. Wang, H. Chen, J. Zhang, J. Jiao, J. Guo, J. Wang, J. Du, J. Wu, K. Wang, L. Li, L. Fan, L. Zhong, M. Liu, M. Zhao, P. Du, Q. Dong, R. Lu, Shuang-Li, S. Cao, S. Liu, T. Jiang, X. Chen, X. Zhang, X. Huang, X..., "GLM-5: From vibe coding to agentic engineering," 2026.
- [38] Qwen Team, "Qwen3.5: Towards native multimodal agents," February 2026.
- [39] S. Zhang, Y. Ding, S. Lian, S. Song, and H. Li, "CodeRAG: Finding relevant and necessary knowledge for retrieval-augmented repository-level code completion," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 23289–23299, 2025.
- [40] W. Liu, A. Yu, D. Zan, B. Shen, W. Zhang, H. Zhao, Z. Jin, and Q. Wang, "GraphCoder: Enhancing repository-level code completion via coarse-to-fine retrieval based on code context graph," in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 570–581, 2024.
- [41] Qwen Team, "Qwen3 technical report," 2025.
- [42] Team GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang, J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang, J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang, P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao, S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang, X. Gu, X. Lv, X. Liu, X. Liu, X. Yang..., "ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools," 2024.
- [43] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.