arxiv: 2508.14444 · v4 · submitted 2025-08-20 · 💻 cs.CL · cs.AI · cs.LG

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

NVIDIA: Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare


Alejandra Rico Aleksander Ficek Alex Kondratenko Alex Shaposhnikov Alexander Bukharin Ali Taghibakhshi Amelia Barton Ameya Sunil Mahabaleshwarkar Amy Shen Andrew Tao Ann Guan Anna Shors Anubhav Mandarwal Arham Mehta Arun Venkatesan Ashton Sharabiani Ashwath Aithal Ashwin Poojary Ayush Dattagupta Balaram Buddharaju Banghua Zhu Barnaby Simkin Bilal Kartal Bita Darvish Rouhani Bobby Chen Boris Ginsburg Brandon Norick Brian Yu Bryan Catanzaro Charles Wang Charlie Truong Chetan Mungekar Chintan Patel Chris Alexiuk Christian Munley Christopher Parisien Dan Su Daniel Afrimi Daniel Korzekwa Daniel Rohrer Daria Gitman David Mosallanezhad Deepak Narayanan Dima Rekesh Dina Yared Dmytro Pykhtar Dong Ahn Duncan Riach Eileen Long Elliott Ning Eric Chung Erick Galinkin Evelina Bakhturina Gargi Prasad Gerald Shen Haifeng Qian Haim Elisha Harsh Sharma Hayley Ross Helen Ngo Herman Sahota Hexin Wang Hoo Chang Shin Hua Huang Iain Cunningham Igor Gitman Ivan Moshkov Jaehun Jung Jan Kautz Jane Polak Scowcroft Jared Casper Jian Zhang Jiaqi Zeng Jimmy Zhang Jinze Xue Jocelyn Huang Joey Conway John Kamalu Jonathan Cohen Joseph Jennings Julien Veron Vialard Junkeun Yi Jupinder Parmar Kari Briski Katherine Cheung Katherine Luna Keith Wyss Keshav Santhanam Kezhi Kong Krzysztof Pawelec Kumar Anik Kunlun Li Kushan Ahmadian Lawrence McAfee Laya Sleiman Leon Derczynski Luis Vega Maer Rodrigues de Melo Makesh Narsimhan Sreedhar Marcin Chochowski Mark Cai Markus Kliegl Marta Stepniewska-Dziubinska Matvei Novikov Mehrzad Samadi Meredith Price Meriem Boubdir Michael Boone Michael Evans Michal Bien Michal Zawalski Miguel Martinez Mike Chrzanowski Mohammad Shoeybi Mostofa Patwary Namit Dhameja Nave Assaf Negar Habibi Nidhi Bhatia Nikki Pope Nima Tajbakhsh Nirmal Kumar Juluru Oleg Rybakov Oleksii Hrinchuk Oleksii Kuchaiev Oluwatobi Olabiyi Pablo Ribalta Padmavathy Subramanian Parth Chadha Pavlo Molchanov Peter Dykas Peter Jin Piotr Bialecki Piotr Januszewski Pradeep Thalasta Prashant Gaikwad Prasoon Varshney Pritam Gundecha Przemek Tredak Rabeeh Karimi Mahabadi Rajen Patel Ran El-Yaniv Ranjit Rajan Ria Cheruvu Rima Shahbazyan Ritika Borkar Ritu Gala Roger Waleffe Ruoxi Zhang Russell J. Hewett Ryan Prenger Sahil Jain Samuel Kriman Sanjeev Satheesh Saori Kaji Sarah Yurick Saurav Muralidharan Sean Narenthiran Seonmyeong Bak Sepehr Sameni Seungju Han Shanmugam Ramasamy Shaona Ghosh Sharath Turuvekere Sreenivas Shelby Thomas Shizhe Diao Shreya Gopal Shrimai Prabhumoye Shubham Toshniwal Shuoyang Ding Siddharth Singh Siddhartha Jain Somshubra Majumdar Soumye Singhal Stefania Alborghetti Syeda Nahida Akter Terry Kong Tim Moon Tomasz Hliwiak Tomer Asida Tony Wang Tugrul Konuk Twinkle Vashishth Tyler Poon Udi Karpas Vahid Noroozi Venkat Srinivasan Vijay Korthikanti Vikram Fugro Vineeth Kalluru Vitaly Kurin Vitaly Lavrukhin Wasi Uddin Ahmad Wei Du Wonmin Byeon Ximing Lu Xin Dong Yashaswi Karnati Yejin Choi Yian Zhang Ying Lin Yonggan Fu Yoshi Suhara Zhen Dong Zhiyu Li Zhongbo Zhu Zijia Chen


Pith reviewed 2026-05-18 10:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords hybrid Mamba-Transformer · reasoning model · Mamba-2 · model compression · inference throughput · large language models · Nemotron


The pith

A hybrid Mamba-Transformer model matches similar-sized models on reasoning accuracy while running up to 6x faster on long traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Nemotron-Nano-9B-v2, a 9B-parameter model that replaces most self-attention layers with Mamba-2 layers to speed up generation of extended reasoning sequences. It starts from a 12B base model pretrained on 20 trillion tokens, then applies Minitron compression and distillation to fit 128k-token inference on a single A10G GPU. The central goal is to deliver on-par or better accuracy than models such as Qwen3-8B on reasoning benchmarks without losing the speed advantage. A sympathetic reader would care because the design targets practical deployment of long-chain reasoning on modest hardware while keeping benchmark performance intact.

Core claim

Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture by replacing the majority of self-attention layers with Mamba-2 layers, pretrains a 12B model on 20 trillion tokens using an FP8 recipe, then compresses and distills it via the Minitron strategy to produce a 9B model that supports up to 128k tokens on a single NVIDIA A10G GPU in bfloat16 while achieving on-par or superior accuracy on reasoning benchmarks and up to 6x higher inference throughput for 8k-input and 16k-output workloads compared with similarly sized models.

What carries the argument

Hybrid Nemotron-H architecture that substitutes most Transformer self-attention layers with Mamba-2 layers, followed by Minitron compression and distillation.

If this is right

Supports up to 128k-token inference on a single NVIDIA A10G GPU with 22GiB memory in bfloat16.
Delivers up to 6x higher inference throughput versus similarly sized models in 8k-input and 16k-output reasoning settings.
Maintains or exceeds accuracy of models such as Qwen3-8B on reasoning benchmarks.
Releases the 9B and 12B checkpoints plus most pre- and post-training datasets for public use.
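
The memory side of the 128k-token claim can be made concrete with a back-of-the-envelope estimate. The sketch below counts only the key/value cache of the attention layers, on the reasoning that the recurrent state of a Mamba-2 layer does not grow with sequence length. The layer counts, head counts, and head dimension are hypothetical placeholders chosen for round numbers; they are not the Nemotron-Nano-9B-v2 configuration, which this page does not state.

```python
def kv_cache_bytes(num_attention_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_value=2 corresponds to bfloat16.
    """
    # Factor of 2: one tensor for keys and one for values per attention layer.
    return (2 * num_attention_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value)


GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # 128k tokens

# Hypothetical configurations for illustration only (not from the paper).
full_attention = kv_cache_bytes(num_attention_layers=32, num_kv_heads=8,
                                head_dim=128, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(num_attention_layers=4, num_kv_heads=8,
                        head_dim=128, seq_len=SEQ_LEN)

print(f"all-attention KV cache: {full_attention / GIB:.1f} GiB")
print(f"hybrid KV cache:        {hybrid / GIB:.1f} GiB")
```

Under these assumed shapes the cache shrinks in proportion to the number of attention layers that remain, which is the mechanism by which a mostly-Mamba stack leaves room for weights and activations within a 22 GiB budget.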

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid replacement and compression pattern could be tested on other long-output tasks such as code synthesis or multi-step planning.
Mamba-2 layers appear especially useful during the extended generation phase of reasoning, suggesting targeted replacement rather than full replacement may be optimal.
If the throughput gains hold on other hardware, the approach offers a route to scale reasoning models without proportional increases in compute cost.

Load-bearing premise

The Minitron compression and distillation step preserves reasoning performance on the chosen benchmarks without introducing hidden degradation on broader or out-of-distribution tasks.

What would settle it

A clear accuracy drop on a new suite of reasoning tasks or longer sequences outside the reported benchmarks would show that the compression introduced hidden degradation.

Original abstract

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer model obtained by pre-training Nemotron-Nano-12B-v2-Base on 20 trillion tokens with FP8, followed by alignment and Minitron compression/distillation to 9B parameters. It claims on-par or better accuracy than similarly sized models such as Qwen3-8B on reasoning benchmarks, together with up to 6x higher inference throughput for long reasoning traces (e.g., 8k input / 16k output), while enabling 128k-token inference on a single A10G GPU; the authors release the 9B and 12B checkpoints plus most pre- and post-training datasets.

Significance. If the performance claims hold after proper validation, the work would demonstrate a practical route to high-throughput reasoning models by replacing most attention layers with Mamba-2 while retaining accuracy, with direct relevance to deployment on memory-constrained hardware. The public release of both the 12B base and the compressed 9B model, together with the majority of the training data, constitutes a clear reproducibility strength that elevates the contribution beyond typical model-release papers.

major comments (2)

[§4 and Table 2] §4 (Experimental Results) and Table 2: the central claim that Minitron compression from the 12B base preserves (or improves) reasoning accuracy relative to Qwen3-8B is load-bearing, yet the manuscript provides no ablation comparing Nemotron-Nano-12B-v2-Base versus the final 9B model on the same benchmark suite, nor any out-of-distribution or harder reasoning probes; without these data the 'on-par or better' statement cannot be substantiated and the skeptic concern about silent degradation remains open.
[§4.2] §4.2 (Benchmark Evaluation): reported accuracies lack error bars, standard deviations, or the number of evaluation runs, so it is impossible to determine whether observed differences versus Qwen3-8B are statistically meaningful or within noise; this directly affects the reliability of the headline performance comparison.
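
The kind of reporting requested in the second major comment can be sketched as follows: given per-run accuracies for two models, compute the mean, sample standard deviation, and standard error, then check whether approximate 95% intervals overlap. The score lists are made-up placeholders, not results from the paper, and the normal approximation (z = 1.96) is rough when only a handful of runs are available.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std dev, n - 1 denominator
    sem = std / math.sqrt(len(scores))
    return mean, std, sem


def intervals_overlap(scores_a, scores_b, z=1.96):
    """True if the approximate confidence intervals of the two means overlap."""
    mean_a, _, sem_a = summarize(scores_a)
    mean_b, _, sem_b = summarize(scores_b)
    lo_a, hi_a = mean_a - z * sem_a, mean_a + z * sem_a
    lo_b, hi_b = mean_b - z * sem_b, mean_b + z * sem_b
    return lo_a <= hi_b and lo_b <= hi_a


# Placeholder per-run accuracies (percent); purely illustrative values.
model_a_runs = [70.0, 72.0, 71.0, 73.0]
model_b_runs = [71.5, 70.5, 72.5, 71.5]

mean_a, std_a, sem_a = summarize(model_a_runs)
print(f"model A: {mean_a:.2f} +/- {std_a:.2f} (SEM {sem_a:.2f}, n={len(model_a_runs)})")
print("intervals overlap:", intervals_overlap(model_a_runs, model_b_runs))
```

Overlapping intervals do not prove the models are equivalent; they only indicate that the observed gap is not clearly distinguishable from run-to-run noise at this sample size.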

minor comments (2)

[Figure 3] Figure 3 (throughput curves): the 6x speedup is stated for 8k/16k token settings; the caption should explicitly name the baseline model and hardware configuration used for the comparison.
[§3.2] §3.2 (Minitron Compression): the description of the distillation objective and layer-pruning schedule is brief; adding the precise hyper-parameters and loss weighting would improve reproducibility.
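
For readers unfamiliar with what a "distillation objective" specifies, here is a minimal sketch of a generic logit-distillation loss: the KL divergence from the teacher's softened next-token distribution to the student's. This is an assumed textbook form, not the objective used in the paper, which this page does not describe; the `temperature` parameter and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over a three-token vocabulary.
teacher = [1.0, 2.0, 3.0]
print(distillation_kl(teacher, [1.0, 2.0, 3.0]))  # identical distributions
print(distillation_kl(teacher, [3.0, 2.0, 1.0]))  # mismatched distributions
```

The hyper-parameters the referee asks for would correspond to choices such as the temperature and any weight balancing this term against other losses.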

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential significance of the hybrid Mamba-Transformer approach. We address each major comment point by point below, indicating revisions where the manuscript will be updated to strengthen the presentation of results.


Referee: [§4 and Table 2] §4 (Experimental Results) and Table 2: the central claim that Minitron compression from the 12B base preserves (or improves) reasoning accuracy relative to Qwen3-8B is load-bearing, yet the manuscript provides no ablation comparing Nemotron-Nano-12B-v2-Base versus the final 9B model on the same benchmark suite, nor any out-of-distribution or harder reasoning probes; without these data the 'on-par or better' statement cannot be substantiated and the skeptic concern about silent degradation remains open.

Authors: We agree that a direct side-by-side comparison of Nemotron-Nano-12B-v2-Base and the compressed 9B model on the reasoning benchmarks would provide clearer evidence that the Minitron step preserves accuracy. In the revised manuscript we have added these results to Section 4 and an updated Table 2. The new data confirm that the 9B model retains competitive performance relative to the 12B base across the reported tasks. On out-of-distribution and harder probes, the existing benchmark suite already spans multiple reasoning domains that test generalization; we have nevertheless added a short discussion and one additional challenging evaluation in the appendix of the revision to further address concerns about potential silent degradation. revision: yes
Referee: [§4.2] §4.2 (Benchmark Evaluation): reported accuracies lack error bars, standard deviations, or the number of evaluation runs, so it is impossible to determine whether observed differences versus Qwen3-8B are statistically meaningful or within noise; this directly affects the reliability of the headline performance comparison.

Authors: We recognize that the absence of error bars and run counts limits the ability to assess statistical significance. The original evaluations followed the single-run protocol standard in large-scale LLM papers to control compute cost. In the revision we have updated §4.2 to report the number of evaluation runs performed for each benchmark and have added error bars (or standard deviations) for those tasks where multiple runs were feasible. While the performance trends remain consistent across independent benchmarks, we have also inserted a brief limitations paragraph acknowledging that full multi-run statistics were not obtained for every metric. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks


The paper describes an empirical pipeline: pre-train a 12B hybrid Mamba-Transformer base on 20T tokens, align it, apply Minitron compression/distillation to obtain the 9B model, then measure accuracy and throughput on standard reasoning benchmarks against external models such as Qwen3-8B. No equations, fitted parameters, or first-principles derivations are presented whose outputs are definitionally equivalent to their inputs. Throughput and accuracy numbers are obtained by direct measurement on held-out benchmarks and hardware, not by renaming or re-deriving quantities internal to the paper. Self-citations to prior Minitron work are not load-bearing for the central performance claim, which is falsifiable against independent baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central empirical claim rests on standard LLM training assumptions plus the unverified premise that Minitron distillation preserves reasoning quality; no new physical entities or mathematical axioms are introduced.

axioms (1)

domain assumption: Next-token prediction on large text corpora produces useful reasoning capabilities in hybrid architectures.
Implicit foundation for all reported pre-training and alignment steps.

pith-pipeline@v0.9.0 · 6886 in / 1384 out tokens · 41488 ms · 2026-05-18T10:57:27.835184+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
cs.LG 2026-05 unverdicted novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
The limits of bio-molecular modeling with large language models : a cross-scale evaluation
cs.LG 2026-04 unverdicted novelty 7.0

LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL 2025-12 conditional novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
cs.CL 2026-05 unverdicted novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
cs.LG 2026-04 unverdicted novelty 6.0

Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
cs.LG 2026-04 unverdicted novelty 6.0

Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
cs.LG 2026-03 unverdicted novelty 6.0

M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
LinMU: Multimodal Understanding Made Linear
cs.CV 2026-01 conditional novelty 6.0

LinMU achieves linear-complexity multimodal understanding by swapping self-attention for an M-MATE dual-branch block and distilling from a frozen teacher VLM, matching accuracy with up to 2.7x faster TTFT and 9x highe...
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
cs.CL 2025-12 unverdicted novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
cs.CL 2025-11 unverdicted novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
cs.AI 2025-10 unverdicted novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
Multilinguality at the Edge: Developing Language Models for the Global South
cs.CL 2026-04 unverdicted novelty 5.0

A survey of 232 papers on the intersection of multilingual language modeling and edge deployment identifies the 'last mile' challenge for Global South communities and offers recommendations for more inclusive NLP.
Ranking Reasoning LLMs under Test-Time Scaling
cs.LG 2026-03 accept novelty 5.0

Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best met...
NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
cs.CL 2025-10 unverdicted novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...