DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Pith reviewed 2026-05-16 00:55 UTC · model grok-4.3
The pith
An open-source code model matches or exceeds closed-source leaders on coding and math benchmarks after training on six trillion extra tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 on an additional six trillion tokens. This step substantially boosts its coding and mathematical reasoning while preserving general language performance. The model expands programming-language support from 86 to 338 languages and extends the context length from 16K to 128K tokens. On standard coding and math benchmarks it records higher scores than GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro.
What carries the argument
Continued pre-training on a large code and math corpus using a Mixture-of-Experts architecture that activates only a subset of parameters during inference.
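As a rough illustration of the sparse-activation idea (a minimal sketch, not the DeepSeek-V2 MoE design; the expert count, layer sizes, and top-k value below are placeholder assumptions), a top-k router sends each token through only a few experts:

```python
# Minimal sketch of top-k Mixture-of-Experts routing (illustrative only;
# d_model, d_ff, n_experts, and k are placeholder values, not the
# DeepSeek-Coder-V2 configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # routing probabilities per token
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)  # keep only k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                        # chosen expert for this slot
            weight = topk_probs[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask] * expert(x[mask])
        # Only k of n_experts run per token, so the active parameter count is a
        # small fraction of the total parameter count.
        return out

# Usage: route 4 token vectors through the sparse layer.
layer = TopKMoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```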
Load-bearing premise
The reported benchmark scores reflect genuine new capability rather than overlap between the training data and the test problems.
What would settle it
Running the model on a fresh set of coding problems created after the training data cutoff and comparing results against the published scores would test whether the gains hold on unseen tasks.
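A minimal sketch of that check, assuming a hypothetical problem list where each entry carries a creation date and a pass/fail outcome; the cutoff date, field names, and published score below are placeholders, not values from the paper:

```python
# Hedged sketch of a post-cutoff evaluation: compare the pass rate on problems
# created after an assumed training-data cutoff with the published score.
from datetime import date

TRAINING_CUTOFF = date(2023, 11, 1)   # assumed cutoff, not stated in the paper
PUBLISHED_PASS_RATE = 0.90            # placeholder for the reported benchmark score

def pass_rate(problems):
    return sum(p["passed"] for p in problems) / len(problems) if problems else 0.0

def post_cutoff_check(problems):
    fresh = [p for p in problems if p["created"] > TRAINING_CUTOFF]
    return {
        "fresh_problems": len(fresh),
        "fresh_pass_rate": pass_rate(fresh),
        "gap_vs_published": PUBLISHED_PASS_RATE - pass_rate(fresh),
    }

# Toy usage: a large positive gap on genuinely fresh problems would suggest
# the published gains partly reflect training/test overlap.
problems = [
    {"created": date(2024, 3, 5), "passed": True},
    {"created": date(2024, 4, 1), "passed": False},
    {"created": date(2023, 6, 1), "passed": True},   # pre-cutoff, excluded
]
print(post_cutoff_check(problems))
```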
read the original abstract
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DeepSeek-Coder-V2, an open-source Mixture-of-Experts code language model obtained via continued pre-training of an intermediate DeepSeek-V2 checkpoint on an additional 6 trillion tokens. It claims substantial gains in coding and mathematical reasoning over the prior DeepSeek-Coder-33B, expansion from 86 to 338 programming languages and from 16K to 128K context length, maintenance of general-language performance, and superior results relative to closed-source models (GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro) on coding and math benchmarks.
Significance. If the benchmark superiority claims are substantiated by rigorous decontamination and statistical controls, the work would be significant: it would supply the first openly available model that matches or exceeds the leading closed-source systems on code intelligence tasks, thereby lowering barriers to reproducible research in software engineering and AI-assisted programming.
major comments (2)
- [Abstract] The headline claim of superiority over GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks is presented without any description of the evaluation protocol, the precise benchmark suite (HumanEval, MBPP, GSM8K, MATH, etc.), decontamination steps, or statistical significance tests, leaving the central empirical result only moderately supported; the pass@k estimator sketched after this list is the kind of protocol detail that would need to be stated.
- [Abstract / Training section] The description of continued pre-training on 6 trillion tokens supplies no overlap statistics, membership-inference results, or an ablation that removes examples overlapping the test prompts; in the absence of such checks the observed margins cannot be confidently attributed to generalization rather than leakage.
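Background for the evaluation-protocol point above: code benchmarks such as HumanEval and MBPP conventionally report the unbiased pass@k estimator of Chen et al. (2021). The sketch below shows that estimator; it is standard background, not a protocol taken from this paper's abstract.

```python
# Unbiased pass@k estimator conventionally used for HumanEval/MBPP-style code
# benchmarks (Chen et al., 2021). The sample counts in the example are hypothetical.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples generated per problem, c of them correct; returns the
    probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with hypothetical counts: 200 samples per problem, 37 correct.
print(pass_at_k(200, 37, 1))   # 0.185, i.e. the single-sample accuracy
print(pass_at_k(200, 37, 10))  # higher, since any of 10 draws may pass
```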
minor comments (2)
- [Abstract] The abstract states performance is 'comparable' on general language tasks but does not quantify the degradation or improvement relative to the base DeepSeek-V2 checkpoint.
- [Abstract] No table or figure is referenced that would allow direct comparison of the new model's scores against the closed-source baselines on each individual benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and training details. We address each major comment below and have made revisions to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of superiority over GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks is presented without any description of the evaluation protocol, the precise benchmark suite (HumanEval, MBPP, GSM8K, MATH, etc.), decontamination steps, or statistical significance tests, rendering the central empirical result only moderately supported.
Authors: We agree the abstract is concise by design and omits protocol details. The full evaluation protocol, benchmark definitions (HumanEval, MBPP, GSM8K, MATH and others), decontamination steps, and statistical comparisons appear in Sections 4 and 5. We have revised the abstract to briefly name the primary coding and math benchmarks and to direct readers to the main text for the complete protocol and significance testing. revision: yes
-
Referee: [Abstract / Training section] The description of continued pre-training on 6 T tokens supplies no overlap statistics, membership-inference results, or ablation that removes any examples overlapping the test prompts; in the absence of such checks the observed margins cannot be confidently attributed to generalization rather than leakage.
Authors: We acknowledge that explicit decontamination evidence was not provided in the original training description. We have added a dedicated paragraph in the Training section that reports n-gram overlap statistics between the 6 trillion-token corpus and the test sets of the reported benchmarks, together with an ablation that measures performance after removing any overlapping examples. These additions support that the observed gains reflect generalization rather than leakage. revision: yes
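A minimal sketch of the kind of n-gram overlap check described in this response (illustrative only; the 10-gram window and whitespace tokenization are assumptions, not the authors' pipeline):

```python
# Illustrative n-gram overlap (decontamination) check between training documents
# and benchmark prompts.
def ngrams(text: str, n: int = 10):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(train_docs, test_prompts, n: int = 10):
    # Collect every n-gram that appears in any benchmark prompt.
    test_grams = set()
    for prompt in test_prompts:
        test_grams |= ngrams(prompt, n)
    # A training document sharing any such n-gram is flagged for removal/ablation.
    return [i for i, doc in enumerate(train_docs) if ngrams(doc, n) & test_grams]

# Usage: re-run the benchmarks after dropping the flagged documents and compare
# against the originally reported scores.
```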
Circularity Check
No circularity in derivation chain
full rationale
The paper presents an empirical model, DeepSeek-Coder-V2, trained by continued pre-training on 6 trillion tokens from DeepSeek-V2. Its claims of superior performance are based on direct evaluations against closed-source models on standard benchmarks such as coding and math tasks. There are no mathematical derivations, self-definitional constructs, fitted inputs presented as predictions, or load-bearing self-citations that reduce the results to the inputs by construction. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Transformer-based MoE architecture improves efficiency for large-scale language modeling
- domain assumption: Additional domain-specific pre-training enhances coding and math reasoning while preserving general language performance
Forward citations
Cited by 18 Pith papers
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
An Empirical Study of Speculative Decoding on Software Engineering Tasks
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification
FVRuleLearner introduces an Operator Reasoning Tree to learn operator-specific rules that improve natural-language to SystemVerilog assertion generation, raising syntax correctness by 3.95% and functional correctness ...
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
-
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
-
Adversarial SQL Injection Generation with LLM-Based Architectures
RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
-
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
-
SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
-
Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
Generative AI exhibits a paradox of simplicity where complex scene generation succeeds but deterministic tasks like pure color images fail, addressed via a new hierarchical obedience framework and Violin benchmark sho...
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
-
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
-
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...
Reference graph
Works this paper leans on
- [1]
-
[2]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
-
[3]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
-
[5]
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[6]
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
-
[7]
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
-
[8]
Measuring Massive Multitask Language Understanding
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
-
[9]
Measuring Mathematical Problem Solving With the MATH Dataset
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [10]
-
[11]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
-
[12]
FastText.zip: Compressing text classification models
A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
-
[13]
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.
-
[14]
J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
-
[15]
StarCoder 2 and The Stack v2: The Next Generation
A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
-
[16]
American Invitational Mathematics Examination - AIME
MAA. American Invitational Mathematics Examination - AIME, 2024.
-
[17]
Netmind.AI. Odyssey-math. https://github.com/protagolabs/odyssey-math/tree/main, 2024. Accessed: 2024-05-29.
-
[18]
B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
-
[19]
M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
-
[20]
Code Llama: Open Foundation Models for Code
B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
-
[21]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [22]
-
[23]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
-
[24]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
-
[25]
L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of COLING 2020. doi: 10.18653/v1/2020.coling-main.419.
-
[26]
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- [27]
-
[28]
arXiv:2304.06364. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.