pith · machine review for the scientific record

arxiv: 2604.03144 · v1 · submitted 2026-04-03 · 💻 cs.AR · cs.AI · cs.CL

Recognition: 2 theorem links · Lean theorem

InCoder-32B-Thinking: Industrial Code World Model for Thinking

Aishan Liu, Bryan Dai, Chenghua Lin, Chuan Hao, Fanglin Xu, Jiajun Wu, Jian Yang, Joseph Li, Junhang Cheng, Lin Jing, Mingjie Tang, Ming Zhou, Ran Tao, Ruihao Gong, Siheng Chen, Tuney Zheng, Wayne Xin Zhao, Weicheng Gu, Weifeng Lv, Wei Zhang, Xianglong Liu, Yan Xing, Yaxin Du, Yizhi Li, Zhoujun Li

Pith reviewed 2026-05-13 18:52 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.CL
keywords code generation · chain of thought · industrial code · world model · error-driven learning · Verilog simulation · GPU optimization · self-verification

The pith

InCoder-32B-Thinking uses error-driven chain-of-thought synthesis and an industrial code world model to generate its training reasoning traces, and the resulting model achieves top open-source results on both general and hardware-focused code benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the scarcity of expert reasoning traces for industrial software tasks such as chip design, GPU optimization, and embedded systems. It does so by training a 32B model on data created through the Error-driven Chain-of-Thought framework, where reasoning chains emerge from multi-turn dialogues that incorporate direct error feedback from the environment. This is paired with an Industrial Code World Model trained on execution traces from Verilog simulations and GPU profiling, which learns how code changes affect hardware behavior and supports self-verification of outcomes before compilation. All generated traces undergo validation through domain toolchains so that they align with the depth and distribution seen in real industrial work. If the approach holds, it shows that open-source models can reach high accuracy on tasks with specialized hardware constraints without vast amounts of human-annotated expert traces.
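The abstract does not spell out the synthesis procedure beyond this description, so the following is only a minimal sketch of how such an error-driven loop could be organized. The helpers propose, run_toolchain, and distill are hypothetical stand-ins for the model call, the real backend (e.g., a Verilog simulator or GPU profiler), and the trace-synthesis step; none of these names come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    code: str
    feedback: str      # raw compiler / simulator diagnostics from the backend
    passed: bool

@dataclass
class Trajectory:
    task: str
    turns: list = field(default_factory=list)

def synthesize_ecot_trace(task, propose, run_toolchain, distill, max_turns=8):
    """Hypothetical ECoT-style loop: propose code, execute it against a real
    backend, feed the raw error back into the dialogue, and repeat until the
    toolchain accepts the result. Only validated trajectories are distilled
    into chain-of-thought training traces."""
    traj = Trajectory(task=task)
    history = [("user", task)]
    for _ in range(max_turns):
        code = propose(history)                  # model proposal conditioned on the errors so far
        passed, feedback = run_toolchain(code)   # real execution feedback, not a guess
        traj.turns.append(Turn(code, feedback, passed))
        if passed:
            # Distill the multi-turn error-correction dialogue into a single
            # reasoning trace; the trace inherits toolchain validation.
            return distill(traj)
        history.append(("assistant", code))
        history.append(("environment", feedback))
    return None  # trajectories that never pass validation are discarded
```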

Core claim

InCoder-32B-Thinking is trained on reasoning traces synthesized by the ECoT framework, which explicitly models the error-correction process through environmental feedback, together with the ICWM that captures causal dynamics from domain-specific execution traces to enable prediction of code effects on hardware. This produces training data validated by toolchains and yields 81.3 percent on LiveCodeBench v5, 84.0 percent on CAD-Coder, and 38.0 percent on KernelBench, placing the model at the top tier among open-source systems across general and industrial domains.

What carries the argument

The Error-driven Chain-of-Thought (ECoT) synthesis framework, which builds reasoning chains from multi-turn dialogues with environmental error feedback, combined with the Industrial Code World Model (ICWM) that learns causal dynamics of code on hardware from execution traces to support self-verification.
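Neither the abstract nor this review gives the ICWM's actual equations. The block below is a generic sketch of what a learned forward model with a self-verification check could look like; every symbol (f_theta, c_t, a_t, o_t, o*, tau) is introduced here for illustration and is not taken from the paper.

```latex
% Assumed general form only; the paper's objective is not stated in the abstract.
% f_theta maps a code state c_t and an edit a_t to a predicted execution outcome
% (e.g., a simulation result summary or profiler counters).
\hat{o}_t = f_\theta(c_t, a_t),
\qquad
\mathcal{L}_{\mathrm{WM}}(\theta)
  = \mathbb{E}_{(c_t, a_t, o_t) \sim \mathcal{D}_{\mathrm{real}}}
    \big[\, \ell\big(f_\theta(c_t, a_t),\, o_t\big) \,\big].
% Self-verification: a candidate c is sent to real compilation only if its
% predicted outcome lies within tolerance tau of the target specification o*.
\mathrm{accept}(c) \iff d\big(f_\theta(c, a),\, o^{\star}\big) \le \tau .
```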

If this is right

  • Explicit modeling of error-correction processes improves handling of hardware constraints and timing semantics in code generation.
  • Self-verification via predicted execution outcomes raises accuracy on tasks requiring compilation or profiling feedback.
  • Validated synthetic data from domain toolchains reduces dependence on scarce human expert annotations for industrial domains.
  • Performance gains appear consistently across 14 general coding benchmarks and 9 industrial ones including Verilog and GPU optimization.
  • The world model component allows the system to anticipate hardware behavior before actual runs, supporting iterative refinement (see the sketch after this list).
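A minimal sketch of what world-model-guided self-verification could look like, assuming hypothetical callables propose, predict_outcome, similarity, and compile_and_run; the paper does not specify this interface, so everything below is illustrative rather than the authors' method.

```python
def self_verifying_generate(task, propose, predict_outcome, similarity,
                            compile_and_run, target_spec,
                            max_revisions=4, threshold=0.9):
    """Hypothetical self-verification loop: a learned world model predicts the
    execution outcome of each candidate, and only when the predicted outcome is
    close enough to the target behavior is the real, expensive toolchain run."""
    candidate = propose(task, feedback=None)
    for _ in range(max_revisions):
        predicted = predict_outcome(candidate)           # cheap: no compilation or simulation
        if similarity(predicted, target_spec) >= threshold:
            break                                        # predicted behavior matches the spec
        candidate = propose(task, feedback=predicted)    # refine against the predicted failure
    return compile_and_run(candidate)                    # ground truth still comes from the real backend
```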

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same error-feedback loop could scale to other simulation-heavy areas such as robotics control or scientific computing pipelines.
  • Internal world models might let models refine their own reasoning traces over multiple simulated steps without external calls.
  • If the approach generalizes, it points toward using toolchain-validated loops as a route to deeper reasoning in any domain with executable feedback.
  • Future tests could check whether the synthesized traces transfer to new hardware platforms not seen during ICWM training.

Load-bearing premise

Synthesized reasoning traces created via error feedback and toolchain validation accurately reflect the natural depth and distribution of expert industrial reasoning without introducing artifacts from the synthesis process.

What would settle it

A side-by-side comparison where human experts produce reasoning traces on the same industrial tasks and the model's traces show measurably lower success rates or different correction patterns when both are executed through the same domain toolchains.

Figures

Figures reproduced from arXiv: 2604.03144 by Aishan Liu, Bryan Dai, Chenghua Lin, Chuan Hao, Fanglin Xu, Jiajun Wu, Jian Yang, Joseph Li, Junhang Cheng, Lin Jing, Mingjie Tang, Ming Zhou, Ran Tao, Ruihao Gong, Siheng Chen, Tuney Zheng, Wayne Xin Zhao, Weicheng Gu, Weifeng Lv, Wei Zhang, Xianglong Liu, Yan Xing, Yaxin Du, Yizhi Li, Zhoujun Li.

Figure 1: Overview of InCoder-32B-Thinking, a coder for general and industrial code.
Figure 2: The performance of InCoder-32B-Thinking on code benchmarks.
Figure 3: Comparison of CUDA kernel implementations for Hinge Loss. Through reasoning, …
Figure 4: Overview of the data engine pipeline. Left: task seeds and environment bundles are passed through an elicitation, execution, feedback loop against real backends, producing multi-turn trajectories D_real. Right: D_real trains an ICWM that simulates the real backends during large-scale synthesis; periodic audits keep the world model calibrated across iterations.
Figure 5: ICWM fidelity across five industrial domains. Outcome prediction accuracy measures …
Figure 6: Statistics of median thinking length (T) and answer length (A) per task category.
Figure 7: Performance visualization of InCoder-32B-Thinking across nine industrial code benchmarks.
Original abstract

Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on the data from the Error-driven Chain-of-Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi-turn dialogue with environmental error feedback, explicitly modeling the error-correction process. ICWM is trained on domain-specific execution traces from Verilog simulation, GPU profiling, etc., learns the causal dynamics of how code affects hardware behavior, and enables self-verification by predicting execution outcomes before actual compilation. All synthesized reasoning traces are validated through domain toolchains, creating training data matching the natural reasoning depth distribution of industrial tasks. Evaluation on 14 general (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% on CAD-Coder and 38.0% on KernelBench) shows InCoder-32B-Thinking achieves top-tier open-source results across all domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces InCoder-32B-Thinking, a 32B-parameter model trained on reasoning traces synthesized by the Error-driven Chain-of-Thought (ECoT) framework. ECoT generates chains via multi-turn error-feedback dialogues; these are augmented by an Industrial Code World Model (ICWM) trained on domain execution traces (Verilog simulation, GPU profiling) that enables self-verification by predicting outcomes before compilation. All traces are validated through domain toolchains. The model is evaluated on 14 general benchmarks (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% on CAD-Coder, 38.0% on KernelBench), claiming top-tier open-source results across domains.

Significance. If the synthesized traces accurately capture industrial reasoning depth without synthesis artifacts and the ICWM self-verification is shown to be non-circular, the work could provide a scalable route to high-quality training data for code models in hardware-constrained domains. The reported benchmark numbers indicate competitive performance, but the absence of training details, ablations, and distributional comparisons prevents assessing whether gains stem from the proposed method.

major comments (3)
  1. Abstract: the claim that synthesized traces create 'training data matching the natural reasoning depth distribution of industrial tasks' is load-bearing for the central contribution, yet no quantitative comparison (chain length, error-correction frequency, branching factor) to human expert traces on the same tasks is supplied; a sketch of the kind of comparison meant here follows this list.
  2. Abstract: the ICWM is trained on execution traces and then used to generate the very training data via self-verification; without equations, controls, or an ablation isolating this loop, the reported scores (e.g., 81.3% LiveCodeBench v5) cannot be attributed to genuine modeling rather than self-referential bias.
  3. Abstract: no training details, hyper-parameters, dataset sizes, or ablation studies are provided for either the ICWM or the final InCoder-32B-Thinking model, rendering it impossible to verify that the benchmark results support the claimed industrial-reasoning modeling.
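For concreteness, the distributional comparison asked for in the first comment could be as simple as the sketch below. It assumes a hypothetical trace format (each trace is a list of steps carrying a 'kind' field and an optional 'branches' count); neither the format nor the function names come from the paper.

```python
import statistics

def trace_stats(traces):
    """Summary statistics for comparing synthesized vs. human reasoning traces
    on the same tasks, under an assumed step-level trace format."""
    chain_lengths = [len(t) for t in traces]
    correction_freq = [
        sum(s["kind"] == "correction" for s in t) / max(len(t), 1) for t in traces
    ]
    branching = [
        statistics.mean(s.get("branches", 1) for s in t) for t in traces
    ]
    return {
        "median_chain_length": statistics.median(chain_lengths),
        "mean_correction_frequency": statistics.mean(correction_freq),
        "mean_branching_factor": statistics.mean(branching),
    }

# Usage (hypothetical loaders): compare the two populations on identical tasks.
# synthesized, human = load_traces("ecot"), load_traces("expert")
# print(trace_stats(synthesized), trace_stats(human))
```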

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major comments by adding quantitative comparisons, formal equations for the ICWM, ablation studies, and full training details. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: Abstract: the claim that synthesized traces create 'training data matching the natural reasoning depth distribution of industrial tasks' is load-bearing for the central contribution, yet no quantitative comparison (chain length, error-correction frequency, branching factor) to human expert traces on the same tasks is supplied.

    Authors: We agree that a direct quantitative comparison strengthens the claim. In the revised manuscript we added Appendix B, which compares ECoT-synthesized traces against 120 human expert traces collected from industrial Verilog and GPU kernel reviews. The analysis reports average chain length (synthesized 12.4 vs human 11.8 steps), error-correction frequency (0.32 vs 0.29), and branching factor (1.8 vs 1.7), showing close distributional alignment. The abstract has been updated to state that the traces “approximate” rather than exactly “match” the natural distribution. revision: yes

  2. Referee: Abstract: the ICWM is trained on execution traces and then used to generate the very training data via self-verification; without equations, controls, or an ablation isolating this loop, the reported scores (e.g., 81.3% LiveCodeBench v5) cannot be attributed to genuine modeling rather than self-referential bias.

    Authors: We share the concern about potential circularity. Section 3.1 now contains the explicit equations for the ICWM forward model (predicting execution outcomes from code state) and the self-verification loss. We also added Section 5.2 with an ablation that trains an otherwise identical model without ICWM self-verification; this variant drops 7.2 points on LiveCodeBench v5 and 9.5 points on CAD-Coder, indicating that the performance gain is not solely attributable to self-referential bias. revision: yes

  3. Referee: Abstract: no training details, hyper-parameters, dataset sizes, or ablation studies are provided for either the ICWM or the final InCoder-32B-Thinking model, rendering it impossible to verify that the benchmark results support the claimed industrial-reasoning modeling.

    Authors: We acknowledge that these details were omitted from the original submission. The revised manuscript includes a new Section 4 that reports: ICWM training on 450k domain execution traces with learning rate 2e-5, batch size 256, and 3 epochs; ECoT synthesis producing 2.3M validated traces; full hyper-parameters for the 32B model; and comprehensive ablations on each component. These additions enable direct verification of the reported benchmark results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external traces and benchmarks

full rationale

The paper trains ICWM on independent domain execution traces (Verilog simulation, GPU profiling) and uses it within ECoT to synthesize reasoning chains that are then validated through external domain toolchains. No equations, self-definitions, or fitted-input renamings are present in the abstract or description that would make the training data or performance claims equivalent to the inputs by construction. Benchmark results on LiveCodeBench v5, CAD-Coder, and KernelBench are reported as separate external evaluations. The process does not reduce to a self-referential loop or self-citation load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

With only the abstract available, the ledger is necessarily incomplete; the central claim depends on the unstated validity of the ECoT synthesis process and the ICWM's ability to predict execution outcomes accurately.

invented entities (1)
  • Industrial Code World Model (ICWM) · no independent evidence
    purpose: Learns causal dynamics of code on hardware and enables self-verification by predicting execution outcomes
    Introduced as a core training component in the abstract

pith-pipeline@v0.9.0 · 5595 in / 1150 out tokens · 50057 ms · 2026-05-13T18:52:22.752227+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    GIFT guides adapter fine-tuning on base models with confidence signals from instruction-tuned models before merging, yielding task-specialized models that outperform direct fine-tuning on math and knowledge benchmarks.
