Mercury: Ultra-Fast Language Models Based on Diffusion
Pith reviewed 2026-05-17 02:00 UTC · model grok-4.3
The pith
Diffusion LLMs generate code at over 1100 tokens per second while keeping quality comparable to speed-optimized frontier models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mercury Coder models are Transformer-parameterized diffusion LLMs trained to predict multiple tokens in parallel. In independent evaluations by Artificial Analysis on NVIDIA H100 GPUs, the Mini variant reaches 1109 tokens/sec and the Small variant 737 tokens/sec, which the paper reports as state-of-the-art throughput. The paper further reports that the models outperform speed-optimized frontier models on speed by up to 10x on average while maintaining comparable quality across code benchmarks in multiple languages. On Copilot Arena, Mercury Coder ranks second on quality and is the fastest model overall.
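To put the reported throughputs in latency terms, the arithmetic below converts tokens per second into decode time for one completion. The 500-token completion length and the baseline running at one tenth of the Mini throughput are illustrative assumptions, not figures from the paper, and the calculation ignores time to first token and network overhead.

```python
# Throughputs (tokens/sec on NVIDIA H100) as reported in the abstract.
REPORTED_THROUGHPUT = {
    "Mercury Coder Mini": 1109.0,
    "Mercury Coder Small": 737.0,
}


def decode_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit num_tokens at a steady throughput (ignores time to first token)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


# Assumption: a 500-token completion, chosen only for illustration.
COMPLETION_TOKENS = 500

# Assumption: a hypothetical baseline at one tenth of the Mini throughput,
# mirroring the "up to 10x" upper bound rather than any measured model.
HYPOTHETICAL_BASELINE_TPS = REPORTED_THROUGHPUT["Mercury Coder Mini"] / 10

for name, tps in REPORTED_THROUGHPUT.items():
    print(f"{name}: {decode_seconds(COMPLETION_TOKENS, tps):.2f} s")
print(
    "Hypothetical 10x-slower baseline: "
    f"{decode_seconds(COMPLETION_TOKENS, HYPOTHETICAL_BASELINE_TPS):.2f} s"
)
```

Under these assumptions the difference is between a completion that arrives in well under a second and one that takes several seconds, which is the gap the real-time claim depends on.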
What carries the argument
A diffusion process parameterized by a Transformer, which predicts multiple tokens in parallel instead of generating them one at a time as autoregressive models do.
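The contrast between sequential and parallel generation can be made concrete by counting model forward passes. The sketch below is a toy illustration only: `toy_model` is a hypothetical stand-in that returns a random (token, confidence) guess per position, and the confidence-based unmasking loop is a generic masked-diffusion-style decoder, not Mercury's actual sampler, which the source does not describe.

```python
import random
from typing import List, Optional, Tuple

MASK = None  # marks a position that has not been generated yet


def toy_model(seq: List[Optional[int]], rng: random.Random) -> List[Tuple[int, float]]:
    """Hypothetical stand-in for one forward pass: a (token, confidence) guess per position."""
    return [(rng.randrange(100), rng.random()) for _ in seq]


def autoregressive_decode(length: int, rng: random.Random) -> Tuple[List[Optional[int]], int]:
    """Left-to-right decoding: one forward pass per generated token."""
    seq: List[Optional[int]] = [MASK] * length
    passes = 0
    for i in range(length):
        preds = toy_model(seq, rng)
        passes += 1
        seq[i] = preds[i][0]
    return seq, passes


def parallel_unmask_decode(
    length: int, tokens_per_step: int, rng: random.Random
) -> Tuple[List[Optional[int]], int]:
    """Each forward pass commits the tokens_per_step most confident masked positions."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    seq: List[Optional[int]] = [MASK] * length
    passes = 0
    while any(t is MASK for t in seq):
        preds = toy_model(seq, rng)
        passes += 1
        masked = [i for i, t in enumerate(seq) if t is MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            seq[i] = preds[i][0]
    return seq, passes


rng = random.Random(0)
_, ar_passes = autoregressive_decode(64, rng)
_, par_passes = parallel_unmask_decode(64, 8, rng)
print(ar_passes, par_passes)  # 64 8
```

Fewer forward passes do not by themselves guarantee lower wall-clock time or equal quality: each pass may cost more, and committing several tokens at once depends on the model's predictions being jointly consistent.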
If this is right
- Coding assistants can respond in real time at throughputs previously limited to much smaller models.
- The same diffusion approach can be scaled to larger sizes without requiring new hardware-specific optimizations.
- Public API access allows direct comparison of latency and output quality against existing commercial models.
Where Pith is reading between the lines
- If the parallel-prediction trick generalizes cleanly, diffusion may replace autoregressive decoding as the default inference method for latency-sensitive applications.
- The speed-quality frontier reported here could be tested on non-coding tasks such as math or dialogue to see whether the same gains appear outside code.
- Future work could measure whether diffusion LLMs reduce the cost of serving many concurrent users compared with optimized autoregressive baselines.
Load-bearing premise
Independent benchmark rankings and arena evaluations accurately capture both speed and quality in ways that hold for typical developer workflows.
What would settle it
A controlled production deployment that measures end-to-end latency and quality on a fresh set of coding tasks and finds the reported 10x speed advantage disappears or quality drops below the frontier baseline.
Original abstract
We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at https://platform.inceptionlabs.ai/ and free playground at https://chat.inceptionlabs.ai
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mercury, a family of diffusion-based large language models parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. Focusing on the Mercury Coder Mini and Small variants for coding applications, it claims state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs per independent evaluations by Artificial Analysis, and a speed advantage of up to 10x on average over speed-optimized frontier models at comparable quality. The manuscript also mentions additional results on code benchmarks across languages and use cases, cites real-world validation on Copilot Arena (second on quality, fastest overall), and releases a public API and playground.
Significance. If the speed-quality claims hold under controlled conditions, this would represent a notable advance in efficient LLM inference by showing that diffusion models can deliver substantially higher token throughputs than standard autoregressive approaches without quality loss, with direct relevance to real-time coding tools and serving systems. The reliance on third-party rankings provides external grounding, though the absence of self-contained experimental protocols limits immediate verifiability and broader adoption of the diffusion parallel-prediction approach.
major comments (1)
- Abstract: The central claims of SOTA throughputs (1109/737 tokens/sec) and up to 10x outperformance with comparable quality rest entirely on citations to Artificial Analysis and Copilot Arena without any description in the manuscript of the underlying benchmark protocols, including prompt distributions, output lengths, batch sizes, temperature settings, hardware utilization, or exact baseline configurations. This renders the speed-quality comparisons non-reproducible and non-falsifiable from the paper alone, as any mismatch in evaluation conditions could attribute gains to serving optimizations rather than the diffusion mechanism.
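The protocol details the referee asks for could be recorded alongside every throughput number. The following is a minimal sketch of such a record plus a timing harness. The field names, the `generate` callable signature, the placeholder values, and `fake_generate` are all hypothetical; none of this reflects the Artificial Analysis or Copilot Arena methodology, which the manuscript does not describe.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class BenchmarkProtocol:
    """Hypothetical record of the conditions under which throughput was measured."""
    prompt_set: str          # description of the prompt distribution
    max_output_tokens: int   # cap on output length per request
    batch_size: int          # recorded for reporting; this harness runs sequentially
    temperature: float
    hardware: str
    baseline_config: str


def measure_throughput(
    generate: Callable[[str, int, float], List[int]],
    prompts: Sequence[str],
    protocol: BenchmarkProtocol,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return output tokens per second of wall-clock time over all prompts."""
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        tokens = generate(prompt, protocol.max_output_tokens, protocol.temperature)
        total_tokens += len(tokens)
    elapsed = clock() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time must be positive")
    return total_tokens / elapsed


def fake_generate(prompt: str, max_tokens: int, temperature: float) -> List[int]:
    """Hypothetical stub standing in for a real model call; always emits max_tokens tokens."""
    return [0] * max_tokens


# Placeholder values for illustration only.
protocol = BenchmarkProtocol(
    prompt_set="placeholder coding prompts",
    max_output_tokens=50,
    batch_size=1,
    temperature=0.0,
    hardware="placeholder GPU",
    baseline_config="placeholder baseline",
)

# A fake clock (0 s, then 2 s) keeps the demonstration deterministic.
ticks = iter([0.0, 2.0])
tps = measure_throughput(fake_generate, ["a", "b", "c", "d"], protocol, clock=lambda: next(ticks))
print(tps)  # 100.0
```

Publishing the protocol record with each number lets a reader check whether two throughput figures were measured under matching conditions before attributing the difference to the decoding method.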
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the manuscript's clarity and verifiability. We address the major comment below and commit to revisions that improve transparency around the evaluation protocols while preserving the value of the independent third-party assessments.
Point-by-point responses
- Referee: Abstract: The central claims of SOTA throughputs (1109/737 tokens/sec) and up to 10x outperformance with comparable quality rest entirely on citations to Artificial Analysis and Copilot Arena without any description in the manuscript of the underlying benchmark protocols, including prompt distributions, output lengths, batch sizes, temperature settings, hardware utilization, or exact baseline configurations. This renders the speed-quality comparisons non-reproducible and non-falsifiable from the paper alone, as any mismatch in evaluation conditions could attribute gains to serving optimizations rather than the diffusion mechanism.
  Authors: We agree that the manuscript would benefit from greater self-contained detail on the evaluation conditions to support reproducibility. In the revised version, we will add a dedicated subsection under Experiments that describes the benchmark protocols for Artificial Analysis and Copilot Arena to the extent the information is available. This will include prompt distributions, typical output lengths, batch sizes, temperature settings, hardware utilization on NVIDIA H100 GPUs, and the specific speed-optimized frontier models used as baselines. We will also explicitly note how the diffusion-based parallel token prediction contributes to throughput improvements independent of serving stack optimizations. While certain low-level implementation details from the third-party evaluator remain outside our direct control, the added description will allow readers to better assess the claims and distinguish the diffusion mechanism's role.
  Revision: yes
Circularity Check
No circularity detected: the performance claims rest on external third-party benchmarks, with no internal derivation that reduces to the paper's own inputs.
Full rationale
The manuscript presents Mercury as a diffusion-based Transformer LLM trained for parallel token prediction and reports throughput and quality metrics (1109/737 tokens/sec, up to 10x outperformance) exclusively via citations to independent external evaluations by Artificial Analysis and Copilot Arena. No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text that reduce by construction to the paper's own inputs or self-citations. The central claims are empirical performance statements whose validity depends on the external benchmark protocols rather than any self-referential derivation chain, satisfying the criteria for a self-contained result with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability (bilinear_family_forced)
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Mercury models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel... generating tokens in parallel in a coarse-to-fine manner
- IndisputableMonolith.Foundation.HierarchyEmergence (hierarchy_emergence_forces_phi)
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec... outperform speed-optimized frontier models by up to 10x
- IndisputableMonolith.Foundation.EightTick (phase_eighth_power_is_one)
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  diffusion models... parallel generation, which can greatly improve speed
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
  Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
  Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
- Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
  Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
- Infinite Mask Diffusion for Few-Step Distillation
  Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
- ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
  ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.
- Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
  Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.
- NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
  NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
- DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
  DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
- Diffusion Language Models for Speech Recognition
  Diffusion language models and a CTC-USDM joint decoder improve ASR accuracy over standard approaches.
- Attention-Based Sampler for Diffusion Language Models
  Attn-Sampler decodes diffusion language models by selecting tokens in descending order of attention column sums, yielding higher quality and more parallel generation than token-level greedy baselines.
- Flow Map Language Models: One-step Language Modeling via Continuous Denoising
  Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
- Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
  TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
- Simple Self-Conditioning Adaptation for Masked Diffusion Models
  SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...
- Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
  Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
- ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule
  ART reparameterizes diffusion sampling time and uses RL to learn optimal timestep schedules that reduce discretization error and improve generation quality across budgets and datasets.
- Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
  Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
  Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.
- FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
  FS-DFM enables 1024-token generation at perplexity parity with 1024-step baselines using only 8 steps via explicit step-budget training, reliable updates, and teacher guidance.
- Diffusion Language Models Know the Answer Before Decoding
  DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
- On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
  Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
- A Survey of Reinforcement Learning for Large Reasoning Models
  A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.