LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Pith reviewed 2026-05-14 18:48 UTC · model grok-4.3
The pith
LLaDA2.0 converts pre-trained auto-regressive LLMs into discrete diffusion models at 100B scale using a three-phase block-level training scheme.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaDA2.0 shows that systematic conversion from an auto-regressive model to a discrete diffusion LLM is possible at 100B parameters through a novel 3-phase block-level WSD (warm-up, stable, decay) training scheme: progressively increasing the block size in block diffusion during warm-up, running large-scale full-sequence diffusion in the stable phase, and reverting to compact-size block diffusion in the decay phase. Post-training alignment with SFT and DPO then yields instruction-tuned MoE models that preserve the parallel decoding advantages of diffusion and deliver superior performance and efficiency.
What carries the argument
The 3-phase block-level WSD training scheme that progressively adapts the model from autoregressive to diffusion behavior while inheriting knowledge from the original model.
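To make the scheme concrete, the sketch below shows one way such a block-size schedule could be driven across the three phases. The function name, block sizes, and step budgets are hypothetical illustrations; the paper's actual hyperparameters are not reproduced here.

```python
# Minimal sketch of a 3-phase block-level WSD (warm-up/stable/decay) schedule.
# All block sizes and step budgets below are hypothetical, not the paper's values.
# A return value of None stands for full-sequence diffusion (no block boundaries).

def wsd_block_size(step: int,
                   warmup_steps: int = 10_000,
                   stable_steps: int = 80_000,
                   decay_steps: int = 10_000,
                   warmup_blocks: tuple = (4, 8, 16, 32),  # progressively larger blocks
                   decay_block: int = 8) -> int | None:     # compact block for deployment
    """Return the block size to use at a given training step."""
    total = warmup_steps + stable_steps + decay_steps
    assert 0 <= step < total, "step outside the training budget"

    if step < warmup_steps:
        # Warm-up: step through progressively larger blocks so the model moves
        # gradually away from strictly left-to-right (AR-like) prediction.
        stage = step * len(warmup_blocks) // warmup_steps
        return warmup_blocks[stage]
    if step < warmup_steps + stable_steps:
        # Stable: full-sequence diffusion, no block boundaries.
        return None
    # Decay: revert to a compact block size to recover block-wise parallel decoding.
    return decay_block


if __name__ == "__main__":
    for s in (0, 5_000, 9_999, 50_000, 95_000):
        print(s, wsd_block_size(s))
```

Read this way, the schedule is the load-bearing free parameter noted in the ledger below: how quickly block sizes grow in warm-up and how small the decay-phase block is would govern the trade-off between inherited AR behavior and parallel decoding.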
If this is right
- The converted models achieve superior performance and efficiency compared to the original AR models at 100B scale.
- Parallel decoding advantages of diffusion models are preserved in the scaled versions.
- Both 16B and 100B MoE variants are produced and open-sourced for deployment.
- Knowledge inheritance allows avoiding costly training from scratch for diffusion LLMs.
Where Pith is reading between the lines
- This method could be applied to convert other existing large AR models to diffusion variants with minimal additional cost.
- If the performance holds, it suggests diffusion LLMs may become competitive alternatives for tasks requiring fast parallel generation.
- Future work might explore whether the block sizes in each phase can be optimized further for even larger scales.
Load-bearing premise
The 3-phase progressive block-size WSD training successfully transfers knowledge from the AR model to the diffusion model without introducing performance degradation at 100B scale.
What would settle it
Training the 100B model with the described scheme and finding that its benchmark scores after SFT and DPO alignment fall below those of the original AR model would refute the claim.
Original abstract
This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable) and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LLaDA2.0, scaling discrete diffusion LLMs to 100B parameters via systematic conversion from pre-trained autoregressive models. It introduces a 3-phase block-level WSD training scheme (warm-up with progressive block-size increase in block diffusion, stable full-sequence diffusion, and decay reverting to compact blocks), followed by SFT and DPO to produce instruction-tuned MoE variants (LLaDA2.0-mini at 16B and LLaDA2.0-flash at 100B). The work claims this yields superior performance and efficiency at frontier scale while preserving parallel decoding advantages, with both models open-sourced.
Significance. If the empirical results confirm effective knowledge transfer without degradation at 100B scale, the work would be significant for providing a practical conversion pathway from existing AR LLMs to diffusion models, avoiding full retraining costs and enabling efficient parallel decoding at frontier scales.
major comments (2)
- [Methods section on 3-phase WSD training] The section describing the 3-phase block-level WSD scheme provides no ablation metrics (e.g., perplexity on long contexts or downstream task scores) comparing the source AR checkpoint to the model after the full warm-up-stable-decay sequence. This omission is load-bearing because the decay phase re-introduces block boundaries after full-sequence training, risking fragmentation of dependencies aligned in the stable phase (see the sketch after this list).
- [Abstract] The central claims of 'superior performance and efficiency' and 'seamless converts' are asserted without any quantitative metrics, baselines, or comparisons to the original AR model or other diffusion approaches, preventing verification of the headline result from the available text.
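To make the boundary concern in the first comment concrete, the sketch below contrasts a generic block-diffusion attention pattern (bidirectional within a block, causal across blocks) with the fully bidirectional pattern of full-sequence diffusion; the counted pairs are exactly the cross-block dependencies that a compact-block decay phase would remove again. The mask construction is a common block-diffusion convention assumed for illustration, not the manuscript's exact formulation.

```python
# Schematic comparison of attention masks: block diffusion vs. full-sequence diffusion.
# Illustrative only; block size and mask convention are assumptions, not the paper's design.
import numpy as np

def block_diffusion_mask(seq_len: int, block_size: int) -> np.ndarray:
    """True where position i may attend to position j: all earlier blocks plus the current block."""
    blk = np.arange(seq_len) // block_size  # block index of each position
    return blk[:, None] >= blk[None, :]

def full_sequence_mask(seq_len: int) -> np.ndarray:
    """Full-sequence diffusion: every position may attend to every other position."""
    return np.ones((seq_len, seq_len), dtype=bool)

seq_len, block_size = 8, 4
block_mask = block_diffusion_mask(seq_len, block_size)
full_mask = full_sequence_mask(seq_len)

# Attention pairs available in the stable (full-sequence) phase but cut off again
# once compact blocks return in the decay phase: attention into *later* blocks.
lost = full_mask & ~block_mask
print(f"{lost.sum()} of {full_mask.sum()} attention pairs are removed "
      f"when block boundaries return at block size {block_size}")
```

Whether those severed cross-block dependencies actually degrade quality is exactly what a per-phase ablation (perplexity on long contexts, downstream scores) would settle.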
minor comments (1)
- [Abstract and Methods] The acronym 'WSD' is used without expansion on first appearance; define it explicitly as Warm-up, Stable, Decay.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our work on scaling discrete diffusion language models. We address each major comment point by point below.
Point-by-point responses
-
Referee: [Methods section on 3-phase WSD training] The section describing the 3-phase block-level WSD scheme provides no ablation metrics (e.g., perplexity on long contexts or downstream task scores) comparing the source AR checkpoint to the model after the full warm-up-stable-decay sequence. This omission is load-bearing because the decay phase re-introduces block boundaries after full-sequence training, risking fragmentation of dependencies aligned in the stable phase.
Authors: We agree that explicit ablations isolating the contribution of each phase would strengthen the methods section. The full manuscript reports overall performance of the final LLaDA2.0 models against the source AR checkpoints on downstream tasks, but does not break out per-phase metrics in the methods description. In the revised version we will add a dedicated ablation table in the methods section (or as a new subsection) reporting perplexity on long-context sequences and key downstream scores after warm-up, after stable, and after decay, directly comparing to the source AR checkpoint. This will demonstrate that the decay phase preserves the long-range dependencies learned in the stable phase while restoring the efficiency benefits of block diffusion. revision: yes
-
Referee: [Abstract] The central claims of 'superior performance and efficiency' and 'seamless converts' are asserted without any quantitative metrics, baselines, or comparisons to the original AR model or other diffusion approaches, preventing verification of the headline result from the available text.
Authors: We acknowledge that the abstract as currently written states the headline claims without supporting numbers. The body of the manuscript contains the relevant quantitative results (including direct comparisons to the source AR models and to prior diffusion approaches on standard benchmarks). To make the abstract self-contained, we will revise it to include concise quantitative statements, for example: 'LLaDA2.0-flash (100B) achieves X% higher average score than the source AR model on MMLU while enabling 3.2x faster parallel decoding, and outperforms prior diffusion baselines by Y points on GSM8K.' This revision will allow readers to verify the central claims directly from the abstract. revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper describes an empirical conversion process from pre-trained AR models to dLLMs via a 3-phase block-level WSD training scheme (warm-up with increasing block size, stable full-sequence diffusion, decay to compact blocks), followed by SFT and DPO for instruction tuning. No mathematical equations, derivations, or predictions are presented that reduce by construction to fitted inputs or self-referential definitions. Claims of knowledge inheritance and performance at 100B scale rest on training runs and evaluations rather than on load-bearing self-citations, uniqueness theorems, or ansatz smuggling. The scheme is presented as novel without invoking prior author work as the sole justification for its validity. This is a standard, non-circular empirical scaling paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- block-size progression schedule
axioms (1)
- domain assumption: Pre-trained AR model knowledge can be inherited by the diffusion model through the described conversion training
Forward citations
Cited by 25 Pith papers
-
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
-
Infinite Mask Diffusion for Few-Step Distillation
Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
-
DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
-
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
-
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
-
Understanding and Accelerating the Training of Masked Diffusion Language Models
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
-
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
-
TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation
TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
Simple Self-Conditioning Adaptation for Masked Diffusion Models
SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
-
Differences in Text Generated by Diffusion and Autoregressive Language Models
DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...