arxiv: 2506.17298 · v1 · pith:SDI5XCOCnew · submitted 2025-06-17 · 💻 cs.CL · cs.AI · cs.LG

Mercury: Ultra-Fast Language Models Based on Diffusion

Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, Volodymyr Kuleshov


Pith reviewed 2026-05-17 02:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords diffusion language models · fast inference · code generation · parallel token prediction · transformer architecture · speed-quality trade-off


The pith

Diffusion LLMs generate code at over 1100 tokens per second while matching frontier quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mercury presents language models trained with diffusion to predict many tokens in parallel rather than one at a time. The authors apply this approach to coding tasks and release two sizes, Mini and Small. On NVIDIA H100 GPUs these models reach throughputs of 1109 and 737 tokens per second. Independent tests show they run up to ten times faster than speed-optimized frontier models while delivering comparable results on code benchmarks. Real-world developer rankings place them second in quality but first in speed.

Core claim

Mercury Coder models are Transformer-parameterized diffusion LLMs trained to predict multiple tokens in parallel. This design yields state-of-the-art throughputs of 1109 tokens/sec for the Mini variant and 737 tokens/sec for the Small variant on H100 GPUs. The models outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality across code benchmarks in multiple languages and real-world use on Copilot Arena, where they rank second in quality and first in speed overall.

What carries the argument

Diffusion process inside a Transformer architecture that enables parallel multi-token prediction instead of sequential autoregressive generation.
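
The contrast between parallel and sequential generation can be made concrete with a toy sketch. The reviewed text does not describe Mercury's sampling procedure, so everything below is an assumption: `toy_denoiser` is a hypothetical stand-in for the Transformer, and the confidence-based unmasking rule is a common choice in masked diffusion language models rather than the authors' stated method. The only point is that committing several tokens per model call reduces the number of calls.

```python
import random

MASK = None  # placeholder for a position that has not been generated yet


def toy_denoiser(seq, rng):
    """Hypothetical stand-in for a Transformer denoiser.

    Returns a (token, confidence) guess for every masked position.
    A real model would condition on the whole sequence; this toy does not.
    """
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(seq)
        if tok is MASK
    }


def diffusion_decode(length, tokens_per_step, seed=0):
    """Fill `length` masked positions, committing several tokens per model call."""
    rng = random.Random(seed)
    seq = [MASK] * length
    calls = 0
    while any(tok is MASK for tok in seq):
        guesses = toy_denoiser(seq, rng)
        calls += 1
        # Commit the most confident guesses; the rest stay masked for the next pass.
        ranked = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:tokens_per_step]:
            seq[pos] = tok
    return seq, calls


def autoregressive_decode(length):
    """Baseline: one model call per generated token, left to right."""
    return [f"tok{i}" for i in range(length)], length


if __name__ == "__main__":
    _, ar_calls = autoregressive_decode(64)
    _, diff_calls = diffusion_decode(64, tokens_per_step=8)
    print(ar_calls, diff_calls)  # 64 8
```

In this toy, 64 tokens take 64 calls autoregressively and 8 calls when 8 tokens are committed per step. Whether that reduction translates into wall-clock speed depends on the cost per call, which the sketch does not model.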

If this is right

Coding assistants can respond in real time at throughputs previously limited to much smaller models.
The same diffusion approach can be scaled to larger sizes without requiring new hardware-specific optimizations.
Public API access allows direct comparison of latency and output quality against existing commercial models.
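
To put the real-time claim in concrete terms, the reported throughputs can be converted into generation time for a single completion. This is a back-of-the-envelope sketch: the 500-token completion length is an assumed figure, and the baseline throughput is a hypothetical value obtained by dividing the Mini figure by the paper's "up to 10x" factor, not a measured number. It also ignores network and time-to-first-token latency.

```python
MINI_TPS = 1109.0   # tokens/sec reported for Mercury Coder Mini
SMALL_TPS = 737.0   # tokens/sec reported for Mercury Coder Small


def generation_seconds(num_tokens, tokens_per_sec):
    """Time to emit `num_tokens` at a steady throughput."""
    return num_tokens / tokens_per_sec


def implied_baseline_tps(fast_tps, speedup):
    """Throughput of a hypothetical model that is `speedup` times slower."""
    return fast_tps / speedup


if __name__ == "__main__":
    completion_tokens = 500  # assumed completion length, not from the paper
    baseline_tps = implied_baseline_tps(MINI_TPS, 10.0)  # hypothetical baseline
    for name, tps in [("Mini", MINI_TPS), ("Small", SMALL_TPS), ("baseline", baseline_tps)]:
        print(f"{name}: {generation_seconds(completion_tokens, tps):.2f} s")
```

Under these assumptions a 500-token completion takes roughly 0.45 s on Mini and 0.68 s on Small, against about 4.5 s for the hypothetical 10x-slower baseline.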

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the parallel-prediction trick generalizes cleanly, diffusion may replace autoregressive decoding as the default inference method for latency-sensitive applications.
The speed-quality frontier reported here could be tested on non-coding tasks such as math or dialogue to see whether the same gains appear outside code.
Future work could measure whether diffusion LLMs reduce the cost of serving many concurrent users compared with optimized autoregressive baselines.

Load-bearing premise

Independent benchmark rankings and arena evaluations accurately capture both speed and quality in ways that hold for typical developer workflows.

What would settle it

A controlled production deployment that measures end-to-end latency and quality on a fresh set of coding tasks and finds the reported 10x speed advantage disappears or quality drops below the frontier baseline.

Original abstract

We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at https://platform.inceptionlabs.ai/ and free playground at https://chat.inceptionlabs.ai

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Mercury uses diffusion for parallel token prediction in coding LLMs and reports high throughputs from external benchmarks, but the paper does not reproduce or detail those comparisons itself.


The main thing to know is that this paper presents Mercury Coder, a pair of diffusion-based models (Mini and Small) for code generation that predict multiple tokens in parallel rather than sequentially. They claim this yields state-of-the-art speeds of 1109 and 737 tokens per second on H100 GPUs while matching the quality of frontier models, backed by third-party rankings from Artificial Analysis and Copilot Arena, plus an API release for testing.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Mercury, a family of diffusion-based large language models parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. Focusing on the Mercury Coder Mini and Small variants for coding applications, it claims state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec on NVIDIA H100 GPUs per independent evaluations by Artificial Analysis, along with up to 10x average outperformance over speed-optimized frontier models while maintaining comparable quality. Additional results on code benchmarks across languages and use-cases are mentioned, as is real-world validation via Copilot Arena (second on quality, fastest overall), with a public API and playground released.

Significance. If the speed-quality claims hold under controlled conditions, this would represent a notable advance in efficient LLM inference by showing that diffusion models can deliver substantially higher token throughputs than standard autoregressive approaches without quality loss, with direct relevance to real-time coding tools and serving systems. The reliance on third-party rankings provides external grounding, though the absence of self-contained experimental protocols limits immediate verifiability and broader adoption of the diffusion parallel-prediction approach.

major comments (1)

Abstract: The central claims of SOTA throughputs (1109/737 tokens/sec) and up to 10x outperformance with comparable quality rest entirely on citations to Artificial Analysis and Copilot Arena without any description in the manuscript of the underlying benchmark protocols, including prompt distributions, output lengths, batch sizes, temperature settings, hardware utilization, or exact baseline configurations. This renders the speed-quality comparisons non-reproducible and non-falsifiable from the paper alone, as any mismatch in evaluation conditions could attribute gains to serving optimizations rather than the diffusion mechanism.
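
One way to make the missing protocol explicit is to record the listed conditions alongside every throughput number. The sketch below is illustrative only: `BenchmarkProtocol`, `measure_throughput`, and the `generate(prompt, max_tokens, temperature)` callable are hypothetical names, and none of this reflects how Artificial Analysis or Copilot Arena actually run their evaluations.

```python
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkProtocol:
    """Conditions the referee asks to see reported (field names are assumptions)."""
    prompt_set: str
    max_output_tokens: int
    batch_size: int  # recorded for the report; the loop below does not batch
    temperature: float
    hardware: str
    baseline_models: tuple


def measure_throughput(generate, prompts, protocol):
    """Time a hypothetical `generate` callable and report tokens per second.

    `generate(prompt, max_tokens, temperature)` must return a list of tokens.
    """
    start = time.perf_counter()
    total_tokens = 0
    for prompt in prompts:
        tokens = generate(prompt, protocol.max_output_tokens, protocol.temperature)
        total_tokens += len(tokens)
    elapsed = time.perf_counter() - start
    return {
        "protocol": asdict(protocol),
        "output_tokens": total_tokens,
        "seconds": elapsed,
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    protocol = BenchmarkProtocol(
        prompt_set="example-coding-prompts",  # placeholder name
        max_output_tokens=16,
        batch_size=1,
        temperature=0.0,
        hardware="NVIDIA H100",
        baseline_models=("baseline-a", "baseline-b"),  # placeholder names
    )
    # Stub generator standing in for a real model endpoint.
    stub = lambda prompt, max_tokens, temperature: ["x"] * max_tokens
    report = measure_throughput(stub, ["p1", "p2"], protocol)
    print(report["output_tokens"], report["protocol"]["hardware"])
```

Storing the protocol next to the measurement means two throughput figures can only be compared when their recorded conditions match, which is the gap the comment identifies.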

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the manuscript's clarity and verifiability. We address the major comment below and commit to revisions that improve transparency around the evaluation protocols while preserving the value of the independent third-party assessments.


Referee: Abstract: The central claims of SOTA throughputs (1109/737 tokens/sec) and up to 10x outperformance with comparable quality rest entirely on citations to Artificial Analysis and Copilot Arena without any description in the manuscript of the underlying benchmark protocols, including prompt distributions, output lengths, batch sizes, temperature settings, hardware utilization, or exact baseline configurations. This renders the speed-quality comparisons non-reproducible and non-falsifiable from the paper alone, as any mismatch in evaluation conditions could attribute gains to serving optimizations rather than the diffusion mechanism.

Authors: We agree that the manuscript would benefit from greater self-contained detail on the evaluation conditions to support reproducibility. In the revised version, we will add a dedicated subsection under Experiments that describes the benchmark protocols for Artificial Analysis and Copilot Arena to the extent the information is available. This will include prompt distributions, typical output lengths, batch sizes, temperature settings, hardware utilization on NVIDIA H100 GPUs, and the specific speed-optimized frontier models used as baselines. We will also explicitly note how the diffusion-based parallel token prediction contributes to throughput improvements independent of serving stack optimizations. While certain low-level implementation details from the third-party evaluator remain outside our direct control, the added description will allow readers to better assess the claims and distinguish the diffusion mechanism's role.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; performance claims rest on external third-party benchmarks without internal derivation reductions

full rationale

The manuscript presents Mercury as a diffusion-based Transformer LLM trained for parallel token prediction and reports throughput and quality metrics (1109/737 tokens/sec, up to 10x outperformance) exclusively via citations to independent external evaluations by Artificial Analysis and Copilot Arena. No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text that reduce by construction to the paper's own inputs or self-citations. The central claims are empirical performance statements whose validity depends on the external benchmark protocols rather than any self-referential derivation chain, satisfying the criteria for a self-contained result with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivation or theoretical framework presented; claims are empirical performance results.

pith-pipeline@v0.9.0 · 5547 in / 864 out tokens · 48469 ms · 2026-05-17T02:00:43.664501+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
Paper passage: Mercury models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel... generating tokens in parallel in a coarse-to-fine manner

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
Paper passage: achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec... outperform speed-optimized frontier models by up to 10x

IndisputableMonolith.Foundation.EightTick · phase_eighth_power_is_one · unclear
Paper passage: diffusion models... parallel generation, which can greatly improve speed

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
cs.CL · 2026-03 · conditional · novelty 8.0
Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.

Large Language Diffusion Models
cs.CL · 2025-02 · unverdicted · novelty 8.0
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG · 2026-05 · unverdicted · novelty 7.0
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG · 2026-05 · unverdicted · novelty 7.0
Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

Infinite Mask Diffusion for Few-Step Distillation
cs.CL · 2026-05 · unverdicted · novelty 7.0
Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
cs.AR · 2026-04 · unverdicted · novelty 7.0
ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.

Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
cs.CL · 2026-04 · unverdicted · novelty 7.0
Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
cs.LG · 2026-04 · unverdicted · novelty 7.0
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
cs.LG · 2026-04 · unverdicted · novelty 7.0
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.

Diffusion Language Models for Speech Recognition
cs.CL · 2026-04 · unverdicted · novelty 7.0
Diffusion language models and a CTC-USDM joint decoder improve ASR accuracy over standard approaches.

Attention-Based Sampler for Diffusion Language Models
cs.CL · 2026-03 · conditional · novelty 7.0
Attn-Sampler decodes diffusion language models by selecting tokens in descending order of attention column sums, yielding higher quality and more parallel generation than token-level greedy baselines.

Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL · 2026-02 · unverdicted · novelty 7.0
Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.

Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
cs.CL · 2026-05 · unverdicted · novelty 6.0
TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...

Simple Self-Conditioning Adaptation for Masked Diffusion Models
cs.LG · 2026-04 · unverdicted · novelty 6.0
SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...

Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
cs.LG · 2026-04 · unverdicted · novelty 6.0
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule
cs.LG · 2026-01 · unverdicted · novelty 6.0
ART reparameterizes diffusion sampling time and uses RL to learn optimal timestep schedules that reduce discretization error and improve generation quality across budgets and datasets.

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
cs.CL · 2025-12 · unverdicted · novelty 6.0
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.

Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
cs.AI · 2025-10 · unverdicted · novelty 6.0
Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
cs.CL · 2025-09 · conditional · novelty 6.0
FS-DFM enables 1024-token generation at perplexity parity with 1024-step baselines using only 8 steps via explicit step-budget training, reliable updates, and teacher guidance.

Diffusion Language Models Know the Answer Before Decoding
cs.CL · 2025-08 · conditional · novelty 6.0
DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.

On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
cs.LG · 2026-04 · unverdicted · novelty 4.0
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL · 2025-09 · accept · novelty 3.0
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
