DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Pith reviewed 2026-05-16 00:55 UTC · model grok-4.3
The pith
An open-source code model matches or exceeds closed-source leaders on coding and math benchmarks after training on six trillion extra tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 on an additional six trillion tokens. This step substantially boosts its coding and mathematical reasoning while preserving general language performance. The model expands programming-language support from 86 to 338 languages and extends the context length from 16K to 128K tokens. On standard coding and math benchmarks it records higher scores than GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro.
What carries the argument
Continued pre-training on a large code and math corpus using a Mixture-of-Experts architecture that activates only a subset of parameters during inference.
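As a rough illustration of the sparse-activation idea (a minimal sketch, not the DeepSeek-V2 MoE design; the expert count, layer sizes, and top-k value below are placeholder assumptions), a top-k router sends each token through only a few experts:

```python
# Minimal sketch of top-k Mixture-of-Experts routing (illustrative only;
# d_model, d_ff, n_experts, and k are placeholder values, not the
# DeepSeek-Coder-V2 configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # routing probabilities per token
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)  # keep only k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                        # chosen expert for this slot
            weight = topk_probs[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask] * expert(x[mask])
        # Only k of n_experts run per token, so the active parameter count is a
        # small fraction of the total parameter count.
        return out

# Usage: route 4 token vectors through the sparse layer.
layer = TopKMoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```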
Load-bearing premise
The reported benchmark scores reflect genuine new capability rather than overlap between the training data and the test problems.
What would settle it
Running the model on a fresh set of coding problems created after the training data cutoff and comparing results against the published scores would test whether the gains hold on unseen tasks.
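A minimal sketch of that check, assuming a hypothetical problem list where each entry carries a creation date and a pass/fail outcome; the cutoff date, field names, and published score below are placeholders, not values from the paper:

```python
# Hedged sketch of a post-cutoff evaluation: compare the pass rate on problems
# created after an assumed training-data cutoff with the published score.
from datetime import date

TRAINING_CUTOFF = date(2023, 11, 1)   # assumed cutoff, not stated in the paper
PUBLISHED_PASS_RATE = 0.90            # placeholder for the reported benchmark score

def pass_rate(problems):
    return sum(p["passed"] for p in problems) / len(problems) if problems else 0.0

def post_cutoff_check(problems):
    fresh = [p for p in problems if p["created"] > TRAINING_CUTOFF]
    return {
        "fresh_problems": len(fresh),
        "fresh_pass_rate": pass_rate(fresh),
        "gap_vs_published": PUBLISHED_PASS_RATE - pass_rate(fresh),
    }

# Toy usage: a large positive gap on genuinely fresh problems would suggest
# the published gains partly reflect training/test overlap.
problems = [
    {"created": date(2024, 3, 5), "passed": True},
    {"created": date(2024, 4, 1), "passed": False},
    {"created": date(2023, 6, 1), "passed": True},   # pre-cutoff, excluded
]
print(post_cutoff_check(problems))
```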
read the original abstract
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DeepSeek-Coder-V2, an open-source Mixture-of-Experts code language model obtained via continued pre-training of an intermediate DeepSeek-V2 checkpoint on an additional 6 trillion tokens. It claims substantial gains in coding and mathematical reasoning over the prior DeepSeek-Coder-33B, expansion from 86 to 338 programming languages and from 16K to 128K context length, maintenance of general-language performance, and superior results relative to closed-source models (GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro) on coding and math benchmarks.
Significance. If the benchmark superiority claims are substantiated by rigorous decontamination and statistical controls, the work would be significant: it would supply the first openly available model that matches or exceeds the leading closed-source systems on code intelligence tasks, thereby lowering barriers to reproducible research in software engineering and AI-assisted programming.
major comments (2)
- [Abstract] The headline claim of superiority over GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks is presented without any description of the evaluation protocol, the precise benchmark suite (HumanEval, MBPP, GSM8K, MATH, etc.), decontamination steps, or statistical significance tests, leaving the central empirical result only moderately supported; the pass@k estimator sketched after this list is the kind of protocol detail that would need to be stated.
- [Abstract / Training section] The description of continued pre-training on 6 trillion tokens supplies no overlap statistics, membership-inference results, or an ablation that removes examples overlapping the test prompts; in the absence of such checks the observed margins cannot be confidently attributed to generalization rather than leakage.
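Background for the evaluation-protocol point above: code benchmarks such as HumanEval and MBPP conventionally report the unbiased pass@k estimator of Chen et al. (2021). The sketch below shows that estimator; it is standard background, not a protocol taken from this paper's abstract.

```python
# Unbiased pass@k estimator conventionally used for HumanEval/MBPP-style code
# benchmarks (Chen et al., 2021). The sample counts in the example are hypothetical.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples generated per problem, c of them correct; returns the
    probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with hypothetical counts: 200 samples per problem, 37 correct.
print(pass_at_k(200, 37, 1))   # 0.185, i.e. the single-sample accuracy
print(pass_at_k(200, 37, 10))  # higher, since any of 10 draws may pass
```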
minor comments (2)
- [Abstract] The abstract states performance is 'comparable' on general language tasks but does not quantify the degradation or improvement relative to the base DeepSeek-V2 checkpoint.
- [Abstract] No table or figure is referenced that would allow direct comparison of the new model's scores against the closed-source baselines on each individual benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and training details. We address each major comment below and have made revisions to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of superiority over GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks is presented without any description of the evaluation protocol, the precise benchmark suite (HumanEval, MBPP, GSM8K, MATH, etc.), decontamination steps, or statistical significance tests, rendering the central empirical result only moderately supported.
Authors: We agree the abstract is concise by design and omits protocol details. The full evaluation protocol, benchmark definitions (HumanEval, MBPP, GSM8K, MATH and others), decontamination steps, and statistical comparisons appear in Sections 4 and 5. We have revised the abstract to briefly name the primary coding and math benchmarks and to direct readers to the main text for the complete protocol and significance testing. revision: yes
-
Referee: [Abstract / Training section] The description of continued pre-training on 6 T tokens supplies no overlap statistics, membership-inference results, or ablation that removes any examples overlapping the test prompts; in the absence of such checks the observed margins cannot be confidently attributed to generalization rather than leakage.
Authors: We acknowledge that explicit decontamination evidence was not provided in the original training description. We have added a dedicated paragraph in the Training section that reports n-gram overlap statistics between the 6 trillion-token corpus and the test sets of the reported benchmarks, together with an ablation that measures performance after removing any overlapping examples. These additions support that the observed gains reflect generalization rather than leakage. revision: yes
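A minimal sketch of the kind of n-gram overlap check described in this response (illustrative only; the 10-gram window and whitespace tokenization are assumptions, not the authors' pipeline):

```python
# Illustrative n-gram overlap (decontamination) check between training documents
# and benchmark prompts.
def ngrams(text: str, n: int = 10):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(train_docs, test_prompts, n: int = 10):
    # Collect every n-gram that appears in any benchmark prompt.
    test_grams = set()
    for prompt in test_prompts:
        test_grams |= ngrams(prompt, n)
    # A training document sharing any such n-gram is flagged for removal/ablation.
    return [i for i, doc in enumerate(train_docs) if ngrams(doc, n) & test_grams]

# Usage: re-run the benchmarks after dropping the flagged documents and compare
# against the originally reported scores.
```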
Circularity Check
No circularity in derivation chain
full rationale
The paper presents an empirical model, DeepSeek-Coder-V2, trained by continued pre-training on 6 trillion tokens from DeepSeek-V2. Its claims of superior performance are based on direct evaluations against closed-source models on standard benchmarks such as coding and math tasks. There are no mathematical derivations, self-definitional constructs, fitted inputs presented as predictions, or load-bearing self-citations that reduce the results to the inputs by construction. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Transformer-based MoE architecture improves efficiency for large-scale language modeling
- domain assumption: Additional domain-specific pre-training enhances coding and math reasoning while preserving general language performance
Forward citations
Cited by 18 Pith papers
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
An Empirical Study of Speculative Decoding on Software Engineering Tasks
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification
FVRuleLearner introduces an Operator Reasoning Tree to learn operator-specific rules that improve natural-language to SystemVerilog assertion generation, raising syntax correctness by 3.95% and functional correctness ...
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
-
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
-
Adversarial SQL Injection Generation with LLM-Based Architectures
RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
-
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
-
SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
-
Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
Generative AI exhibits a paradox of simplicity where complex scene generation succeeds but deterministic tasks like pure color images fail, addressed via a new hierarchical obedience framework and Violin benchmark sho...
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
-
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
-
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...
Reference graph
Works this paper leans on
- [1]
-
[2]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
-
[3]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
-
[5]
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[6]
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
-
[7]
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
-
[8]
Measuring Massive Multitask Language Understanding
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
-
[9]
Measuring Mathematical Problem Solving With the MATH Dataset
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [10]
-
[11]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
-
[12]
FastText.zip: Compressing text classification models
A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
-
[13]
H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.
-
[14]
J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
-
[15]
StarCoder 2 and The Stack v2: The Next Generation
A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
-
[16]
American Invitational Mathematics Examination - AIME
MAA. American Invitational Mathematics Examination - AIME, 2024.
-
[17]
Netmind.AI. Odyssey-math. https://github.com/protagolabs/odyssey-math/tree/main, 2024. Accessed: 2024-05-29.
-
[18]
B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
-
[19]
M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
-
[20]
Code Llama: Open Foundation Models for Code
B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
-
[21]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [22]
-
[23]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
-
[24]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
-
[25]
L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of COLING 2020. doi: 10.18653/v1/2020.coling-main.419.
-
[26]
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- [27]
-
[28]
arXiv:2304.06364. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.