LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Pith reviewed 2026-05-14 18:48 UTC · model grok-4.3
The pith
LLaDA2.0 converts pre-trained auto-regressive LLMs into discrete diffusion models at 100B scale using a three-phase block-level training scheme.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaDA2.0 shows that systematic conversion from an auto-regressive model to a discrete diffusion LLM is possible at 100B parameters through a novel 3-phase block-level WSD (warm-up, stable, decay) training scheme: progressively increasing the block size in block diffusion during warm-up, running large-scale full-sequence diffusion in the stable phase, and reverting to compact-size block diffusion in the decay phase. Post-training alignment with SFT and DPO then yields instruction-tuned MoE models that preserve the parallel decoding advantages of diffusion and deliver superior performance and efficiency.
What carries the argument
The 3-phase block-level WSD training scheme that progressively adapts the model from autoregressive to diffusion behavior while inheriting knowledge from the original model.
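To make the scheme concrete, the sketch below shows one way such a block-size schedule could be driven across the three phases. The function name, block sizes, and step budgets are hypothetical illustrations; the paper's actual hyperparameters are not reproduced here.

```python
# Minimal sketch of a 3-phase block-level WSD (warm-up/stable/decay) schedule.
# All block sizes and step budgets below are hypothetical, not the paper's values.
# A return value of None stands for full-sequence diffusion (no block boundaries).

def wsd_block_size(step: int,
                   warmup_steps: int = 10_000,
                   stable_steps: int = 80_000,
                   decay_steps: int = 10_000,
                   warmup_blocks: tuple = (4, 8, 16, 32),  # progressively larger blocks
                   decay_block: int = 8) -> int | None:     # compact block for deployment
    """Return the block size to use at a given training step."""
    total = warmup_steps + stable_steps + decay_steps
    assert 0 <= step < total, "step outside the training budget"

    if step < warmup_steps:
        # Warm-up: step through progressively larger blocks so the model moves
        # gradually away from strictly left-to-right (AR-like) prediction.
        stage = step * len(warmup_blocks) // warmup_steps
        return warmup_blocks[stage]
    if step < warmup_steps + stable_steps:
        # Stable: full-sequence diffusion, no block boundaries.
        return None
    # Decay: revert to a compact block size to recover block-wise parallel decoding.
    return decay_block


if __name__ == "__main__":
    for s in (0, 5_000, 9_999, 50_000, 95_000):
        print(s, wsd_block_size(s))
```

Read this way, the schedule is the load-bearing free parameter noted in the ledger below: how quickly block sizes grow in warm-up and how small the decay-phase block is would govern the trade-off between inherited AR behavior and parallel decoding.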
If this is right
- The converted models achieve superior performance and efficiency compared to the original AR models at 100B scale.
- Parallel decoding advantages of diffusion models are preserved in the scaled versions.
- Both 16B and 100B MoE variants are produced and open-sourced for deployment.
- Knowledge inheritance allows avoiding costly training from scratch for diffusion LLMs.
Where Pith is reading between the lines
- This method could be applied to convert other existing large AR models to diffusion variants with minimal additional cost.
- If the performance holds, it suggests diffusion LLMs may become competitive alternatives for tasks requiring fast parallel generation.
- Future work might explore whether the block sizes in each phase can be optimized further for even larger scales.
Load-bearing premise
The 3-phase progressive block-size WSD training successfully transfers knowledge from the AR model to the diffusion model without introducing performance degradation at 100B scale.
What would settle it
Training the 100B model with the described scheme and finding that its benchmark scores after SFT and DPO alignment fall below those of the original AR model would refute the claim.
Original abstract
This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable) and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LLaDA2.0, scaling discrete diffusion LLMs to 100B parameters via systematic conversion from pre-trained autoregressive models. It introduces a 3-phase block-level WSD training scheme (warm-up with progressive block-size increase in block diffusion, stable full-sequence diffusion, and decay reverting to compact blocks), followed by SFT and DPO to produce instruction-tuned MoE variants (LLaDA2.0-mini at 16B and LLaDA2.0-flash at 100B). The work claims this yields superior performance and efficiency at frontier scale while preserving parallel decoding advantages, with both models open-sourced.
Significance. If the empirical results confirm effective knowledge transfer without degradation at 100B scale, the work would be significant for providing a practical conversion pathway from existing AR LLMs to diffusion models, avoiding full retraining costs and enabling efficient parallel decoding at frontier scales.
major comments (2)
- [Methods section on 3-phase WSD training] The section describing the 3-phase block-level WSD scheme provides no ablation metrics (e.g., perplexity on long contexts or downstream task scores) comparing the source AR checkpoint to the model after the full warm-up-stable-decay sequence. This omission is load-bearing because the decay phase re-introduces block boundaries after full-sequence training, risking fragmentation of dependencies aligned in the stable phase (see the sketch after this list).
- [Abstract] The central claims of 'superior performance and efficiency' and 'seamless converts' are asserted without any quantitative metrics, baselines, or comparisons to the original AR model or other diffusion approaches, preventing verification of the headline result from the available text.
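To make the boundary concern in the first comment concrete, the sketch below contrasts a generic block-diffusion attention pattern (bidirectional within a block, causal across blocks) with the fully bidirectional pattern of full-sequence diffusion; the counted pairs are exactly the cross-block dependencies that a compact-block decay phase would remove again. The mask construction is a common block-diffusion convention assumed for illustration, not the manuscript's exact formulation.

```python
# Schematic comparison of attention masks: block diffusion vs. full-sequence diffusion.
# Illustrative only; block size and mask convention are assumptions, not the paper's design.
import numpy as np

def block_diffusion_mask(seq_len: int, block_size: int) -> np.ndarray:
    """True where position i may attend to position j: all earlier blocks plus the current block."""
    blk = np.arange(seq_len) // block_size  # block index of each position
    return blk[:, None] >= blk[None, :]

def full_sequence_mask(seq_len: int) -> np.ndarray:
    """Full-sequence diffusion: every position may attend to every other position."""
    return np.ones((seq_len, seq_len), dtype=bool)

seq_len, block_size = 8, 4
block_mask = block_diffusion_mask(seq_len, block_size)
full_mask = full_sequence_mask(seq_len)

# Attention pairs available in the stable (full-sequence) phase but cut off again
# once compact blocks return in the decay phase: attention into *later* blocks.
lost = full_mask & ~block_mask
print(f"{lost.sum()} of {full_mask.sum()} attention pairs are removed "
      f"when block boundaries return at block size {block_size}")
```

Whether those severed cross-block dependencies actually degrade quality is exactly what a per-phase ablation (perplexity on long contexts, downstream scores) would settle.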
minor comments (1)
- [Abstract and Methods] The acronym 'WSD' is used without expansion on first appearance; define it explicitly as Warm-up, Stable, Decay.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our work on scaling discrete diffusion language models. We address each major comment point by point below.
Point-by-point responses
-
Referee: [Methods section on 3-phase WSD training] The section describing the 3-phase block-level WSD scheme provides no ablation metrics (e.g., perplexity on long contexts or downstream task scores) comparing the source AR checkpoint to the model after the full warm-up-stable-decay sequence. This omission is load-bearing because the decay phase re-introduces block boundaries after full-sequence training, risking fragmentation of dependencies aligned in the stable phase.
Authors: We agree that explicit ablations isolating the contribution of each phase would strengthen the methods section. The full manuscript reports overall performance of the final LLaDA2.0 models against the source AR checkpoints on downstream tasks, but does not break out per-phase metrics in the methods description. In the revised version we will add a dedicated ablation table in the methods section (or as a new subsection) reporting perplexity on long-context sequences and key downstream scores after warm-up, after stable, and after decay, directly comparing to the source AR checkpoint. This will demonstrate that the decay phase preserves the long-range dependencies learned in the stable phase while restoring the efficiency benefits of block diffusion. revision: yes
-
Referee: [Abstract] The central claims of 'superior performance and efficiency' and 'seamless converts' are asserted without any quantitative metrics, baselines, or comparisons to the original AR model or other diffusion approaches, preventing verification of the headline result from the available text.
Authors: We acknowledge that the abstract as currently written states the headline claims without supporting numbers. The body of the manuscript contains the relevant quantitative results (including direct comparisons to the source AR models and to prior diffusion approaches on standard benchmarks). To make the abstract self-contained, we will revise it to include concise quantitative statements, for example: 'LLaDA2.0-flash (100B) achieves X% higher average score than the source AR model on MMLU while enabling 3.2x faster parallel decoding, and outperforms prior diffusion baselines by Y points on GSM8K.' This revision will allow readers to verify the central claims directly from the abstract. revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper describes an empirical conversion process from pre-trained AR models to dLLMs via a 3-phase block-level WSD training scheme (warm-up with increasing block size, stable full-sequence diffusion, decay to compact blocks), followed by SFT and DPO for instruction tuning. No mathematical equations, derivations, or predictions are presented that reduce by construction to fitted inputs or self-referential definitions. Claims of knowledge inheritance and performance at 100B scale rest on training runs and evaluations rather than on load-bearing self-citations, uniqueness theorems, or ansatz smuggling. The scheme is presented as novel without invoking prior author work as the sole justification for its validity. This is a standard, non-circular empirical scaling paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- block-size progression schedule
axioms (1)
- domain assumption: Pre-trained AR model knowledge can be inherited by the diffusion model through the described conversion training
Forward citations
Cited by 25 Pith papers
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
-
Infinite Mask Diffusion for Few-Step Distillation
Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
-
DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
-
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
-
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
-
Understanding and Accelerating the Training of Masked Diffusion Language Models
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
-
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
-
TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation
TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
Simple Self-Conditioning Adaptation for Masked Diffusion Models
SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
-
Differences in Text Generated by Diffusion and Autoregressive Language Models
DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...