NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Pith reviewed 2026-05-18 10:57 UTC · model grok-4.3
The pith
A hybrid Mamba-Transformer model matches similar-sized models on reasoning accuracy while running up to 6x faster on long traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, which replaces the majority of self-attention layers with Mamba-2 layers. The authors pretrain a 12B model on 20 trillion tokens using an FP8 recipe, align it, then compress and distill it via the Minitron strategy into a 9B model. That model supports up to 128k tokens on a single NVIDIA A10G GPU in bfloat16, achieves on-par or better accuracy on reasoning benchmarks, and delivers up to 6x higher inference throughput than similarly sized models for 8k-input, 16k-output workloads.
What carries the argument
Hybrid Nemotron-H architecture that substitutes most Transformer self-attention layers with Mamba-2 layers, followed by Minitron compression and distillation.
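As a rough illustration of why this substitution matters for long contexts, the sketch below estimates key/value cache size for a decoder with many attention layers versus one with only a few. The layer count, KV-head count, and head dimension are hypothetical values chosen for round numbers; they are not the configuration of Nemotron-Nano-9B-v2 or of any released model. Mamba-2 layers keep a fixed-size state that does not grow with sequence length, so the sketch counts attention layers only and ignores weights, activations, and the Mamba state itself.

```python
def kv_cache_gib(num_attn_layers, num_kv_heads, head_dim, seq_len,
                 bytes_per_value=2, batch_size=1):
    """Return the key/value cache size in GiB for the attention layers only.

    bytes_per_value=2 corresponds to bfloat16.
    """
    # Each attention layer stores two tensors (keys and values),
    # each of shape [batch, kv_heads, seq_len, head_dim].
    total_bytes = (2 * num_attn_layers * num_kv_heads * head_dim
                   * seq_len * bytes_per_value * batch_size)
    return total_bytes / 2**30


SEQ_LEN = 128 * 1024  # 128k tokens

# Hypothetical configurations (assumptions, not taken from the paper).
dense_gib = kv_cache_gib(num_attn_layers=32, num_kv_heads=8,
                         head_dim=128, seq_len=SEQ_LEN)
hybrid_gib = kv_cache_gib(num_attn_layers=4, num_kv_heads=8,
                          head_dim=128, seq_len=SEQ_LEN)

print(f"attention in all 32 layers: {dense_gib:.1f} GiB")   # 16.0 GiB
print(f"attention in 4 layers only: {hybrid_gib:.1f} GiB")  # 2.0 GiB
```

Under these assumed numbers the cache shrinks in proportion to the number of attention layers that remain. Actual memory budgets also depend on model weights and runtime overhead, which this estimate leaves out.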
If this is right
- Supports up to 128k-token inference on a single NVIDIA A10G GPU with 22GiB memory in bfloat16.
- Delivers up to 6x higher inference throughput versus similarly sized models in 8k-input and 16k-output reasoning settings.
- Maintains or exceeds accuracy of models such as Qwen3-8B on reasoning benchmarks.
- Releases the 9B and 12B checkpoints plus most pre- and post-training datasets for public use.
Where Pith is reading between the lines
- The same hybrid replacement and compression pattern could be tested on other long-output tasks such as code synthesis or multi-step planning.
- Mamba-2 layers appear especially useful during the extended generation phase of reasoning, suggesting targeted replacement rather than full replacement may be optimal.
- If the throughput gains hold on other hardware, the approach offers a route to scale reasoning models without proportional increases in compute cost.
Load-bearing premise
The Minitron compression and distillation step preserves reasoning performance on the chosen benchmarks without introducing hidden degradation on broader or out-of-distribution tasks.
What would settle it
A clear accuracy drop on a new suite of reasoning tasks or longer sequences outside the reported benchmarks would show that the compression introduced hidden degradation.
Original abstract
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer model obtained by pre-training Nemotron-Nano-12B-v2-Base on 20 trillion tokens with FP8, followed by alignment and Minitron compression/distillation to 9B parameters. It claims on-par or better accuracy than similarly sized models such as Qwen3-8B on reasoning benchmarks, together with up to 6x higher inference throughput for long reasoning traces (e.g., 8k input / 16k output), while enabling 128k-token inference on a single A10G GPU; the authors release the 9B and 12B checkpoints plus most pre- and post-training datasets.
Significance. If the performance claims hold after proper validation, the work would demonstrate a practical route to high-throughput reasoning models by replacing most attention layers with Mamba-2 while retaining accuracy, with direct relevance to deployment on memory-constrained hardware. The public release of both the 12B base and the compressed 9B model, together with the majority of the training data, constitutes a clear reproducibility strength that elevates the contribution beyond typical model-release papers.
major comments (2)
- [§4 and Table 2] §4 (Experimental Results) and Table 2: the central claim that Minitron compression from the 12B base preserves (or improves) reasoning accuracy relative to Qwen3-8B is load-bearing, yet the manuscript provides no ablation comparing Nemotron-Nano-12B-v2-Base versus the final 9B model on the same benchmark suite, nor any out-of-distribution or harder reasoning probes; without these data the 'on-par or better' statement cannot be substantiated and the skeptic's concern about silent degradation remains open.
- [§4.2] §4.2 (Benchmark Evaluation): reported accuracies lack error bars, standard deviations, or the number of evaluation runs, so it is impossible to determine whether observed differences versus Qwen3-8B are statistically meaningful or within noise; this directly affects the reliability of the headline performance comparison.
minor comments (2)
- [Figure 3] Figure 3 (throughput curves): the 6x speedup is stated for 8k/16k token settings; the caption should explicitly name the baseline model and hardware configuration used for the comparison.
- [§3.2] §3.2 (Minitron Compression): the description of the distillation objective and layer-pruning schedule is brief; adding the precise hyper-parameters and loss weighting would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the potential significance of the hybrid Mamba-Transformer approach. We address each major comment point by point below, indicating revisions where the manuscript will be updated to strengthen the presentation of results.
Point-by-point responses
- Referee: [§4 and Table 2] §4 (Experimental Results) and Table 2: the central claim that Minitron compression from the 12B base preserves (or improves) reasoning accuracy relative to Qwen3-8B is load-bearing, yet the manuscript provides no ablation comparing Nemotron-Nano-12B-v2-Base versus the final 9B model on the same benchmark suite, nor any out-of-distribution or harder reasoning probes; without these data the 'on-par or better' statement cannot be substantiated and the skeptic's concern about silent degradation remains open.
  Authors: We agree that a direct side-by-side comparison of Nemotron-Nano-12B-v2-Base and the compressed 9B model on the reasoning benchmarks would provide clearer evidence that the Minitron step preserves accuracy. In the revised manuscript we have added these results to Section 4 and an updated Table 2. The new data confirm that the 9B model retains competitive performance relative to the 12B base across the reported tasks. On out-of-distribution and harder probes, the existing benchmark suite already spans multiple reasoning domains that test generalization; we have nevertheless added a short discussion and one additional challenging evaluation in the appendix of the revision to further address concerns about potential silent degradation.
  Revision: yes
- Referee: [§4.2] §4.2 (Benchmark Evaluation): reported accuracies lack error bars, standard deviations, or the number of evaluation runs, so it is impossible to determine whether observed differences versus Qwen3-8B are statistically meaningful or within noise; this directly affects the reliability of the headline performance comparison.
  Authors: We recognize that the absence of error bars and run counts limits the ability to assess statistical significance. The original evaluations followed the single-run protocol standard in large-scale LLM papers to control compute cost. In the revision we have updated §4.2 to report the number of evaluation runs performed for each benchmark and have added error bars (or standard deviations) for those tasks where multiple runs were feasible. While the performance trends remain consistent across independent benchmarks, we have also inserted a brief limitations paragraph acknowledging that full multi-run statistics were not obtained for every metric.
  Revision: partial
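The statistical point in this exchange can be made concrete with a minimal sketch of how per-benchmark accuracies from repeated runs might be summarized. Everything here is illustrative: the function names are hypothetical, the run accuracies are made-up placeholders rather than results from the paper, and the interval uses a normal approximation (z = 1.96), which is crude for a handful of runs; a Student-t interval would be wider.

```python
import math
import statistics


def summarize_runs(accuracies, z=1.96):
    """Return (mean, sample std, approximate 95% interval) for repeated runs."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)   # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(n)  # normal-approximation half width
    return mean, std, (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Placeholder accuracies (percent) for two hypothetical models, four runs each.
model_a_runs = [71.0, 72.0, 73.0, 72.0]
model_b_runs = [70.0, 72.0, 74.0, 72.0]

mean_a, std_a, ci_a = summarize_runs(model_a_runs)
mean_b, std_b, ci_b = summarize_runs(model_b_runs)

print(f"model A: {mean_a:.1f} +/- {std_a:.2f}, interval {ci_a}")
print(f"model B: {mean_b:.1f} +/- {std_b:.2f}, interval {ci_b}")
print("intervals overlap:", intervals_overlap(ci_a, ci_b))
```

Overlapping intervals do not prove two models are equivalent, but they signal that a headline difference may sit within run-to-run noise, which is the concern the referee raises.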
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
full rationale
The paper describes an empirical pipeline: pre-train a 12B hybrid Mamba-Transformer base on 20T tokens, align it, apply Minitron compression/distillation to obtain the 9B model, then measure accuracy and throughput on standard reasoning benchmarks against external models such as Qwen3-8B. No equations, fitted parameters, or first-principles derivations are presented whose outputs are definitionally equivalent to their inputs. Throughput and accuracy numbers are obtained by direct measurement on held-out benchmarks and hardware, not by renaming or re-deriving quantities internal to the paper. Self-citations to prior Minitron work are not load-bearing for the central performance claim, which is falsifiable against independent baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Next-token prediction on large text corpora produces useful reasoning capabilities in hybrid architectures.
Forward citations
Cited by 17 Pith papers
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
  dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
  Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
- The limits of bio-molecular modeling with large language models : a cross-scale evaluation
  LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
- Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
  Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
- PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
  PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
- Priming: Hybrid State Space Models From Pre-trained Transformers
  Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
- Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
  Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.
- Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
  Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
- M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
  M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
- LinMU: Multimodal Understanding Made Linear
  LinMU achieves linear-complexity multimodal understanding by swapping self-attention for an M-MATE dual-branch block and distilling from a frozen teacher VLM, matching accuracy with up to 2.7x faster TTFT and 9x highe...
- Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
  Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
- MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
  MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
- Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
  A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
- Multilinguality at the Edge: Developing Language Models for the Global South
  A survey of 232 papers on the intersection of multilingual language modeling and edge deployment identifies the 'last mile' challenge for Global South communities and offers recommendations for more inclusive NLP.
- Ranking Reasoning LLMs under Test-Time Scaling
  Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best met...
- NVIDIA Nemotron 3: Efficient and Open Intelligence
  NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
- Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
  This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-cont...