arxiv: 2604.12374 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.AI· cs.CL

Recognition: unknown

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA: Aakshita Chandiramani , Aaron Blakeman , Abdullahi Olaoye , Abhibha Gupta , Abhilash Somasamudramath , Abhinav Khattar , Adeola Adesoba , Adi Renduchintala

show 537 more authors

Adil Asif Aditya Agrawal Aditya Vavre Ahmad Kiswani Aishwarya Padmakumar Ajay Hotchandani Akanksha Shukla Akhiad Bercovich Aleksander Ficek Aleksandr Shaposhnikov Alex Gronskiy Alex Kondratenko Alex Neefus Alex Steiner Alex Yang Alexander Bukharin Alexander Young Ali Hatamizadeh Ali Taghibakhshi Alina Galiautdinova Alisa Liu Alok Kumar Ameya Sunil Mahabaleshwarkar Amir Klein Amit Zuker Amnon Geifman Anahita Bhiwandiwalla Ananth Subramaniam Andrew Tao Anjaney Shrivastava Anjulie Agrusa Ankur Srivastava Ankur Verma Ann Guan Anna Shors Annamalai Chockalingam Anubhav Mandarwal Aparnaa Ramani Arham Mehta Arti Jain Arun Venkatesan Asha Anoosheh Ashwath Aithal Ashwin Poojary Asif Ahamed Asit Mishra Asli Sabanci Demiroz Asma Kuriparambil Thekkumpate Atefeh Sohrabizadeh Avinash Kaur Ayush Dattagupta Barath Subramaniam Anandan Bardiya Sadeghi Barnaby Simkin Ben Lanir Benedikt Schifferer Benjamin Chislett Besmira Nushi Bilal Kartal Bill Thiede Bita Darvish Rouhani Bobby Chen Boris Ginsburg Brandon Norick Branislav Kisacanin Brian Yu Bryan Catanzaro Buvaneswari Mani Carlo del Mundo Chankyu Lee Chanran Kim Chantal Hwang Chao Ni Charles Wang Charlie Truong Cheng-Ping Hsieh Chenhan Yu Chenjie Luo Cherie Wang Chetan Mungekar Chintan Patel Chris Alexiuk Chris Holguin Chris Wing Christian Munley Christopher Parisien Chuck Desai Chunyang Sheng Collin Neale Cyril Meurillon Dakshi Kumar Dan Gil Dan Su Dane Corneil Daniel Afrimi Daniel Burkhardt Eliuth Triana Daniel Egert Daniel Fatade Daniel Lo Daniel Rohrer Daniel Serebrenik Daniil Sorokin Daria Gitman Daria Levy Darko Stosic David Edelsohn David Messina David Mosallanezhad David Tamok Deena Donia Deepak Narayanan Devin O'Kelly Dheeraj Peri Dhruv Nathawani Di Wu Dima Rekesh Dina Yared Divyanshu Kakwani Dmitry Konyagin Brandon Tuttle Dong Ahn Dongfu Jiang Dorrin Poorkay Douglas O'Flaherty Duncan Riach Dusan Stosic Dustin Van Stee Edgar Minasyan Edward Lin Eileen Peters Long Elad Segal Elena Lantz Elena Lewis Ellie Evans Elliott Ning Eric Chung Eric Harper Eric Pham-Hung Eric W. Tramel Erick Galinkin Erik Pounds Esti Etrog Evan Briones Evan Wu Evelina Bakhturina Evgeny Tsykunov Ewa Dobrowolska Farshad Saberi Movahed Farzan Memarian Fay Wang Fei Jia Felipe Soares Felipe Vieira Frujeri Feng Chen Fengguang Lin Ferenc Galko Fortuna Zhang Frankie Siino Frida Hou Gantavya Bhatt Gargi Prasad Geethapriya Venkataramani Geetika Gupta George Armstrong Gerald Shen Giulio Borghesi Gordana Neskovic Gorkem Batmaz Grace Lam Grace Wu Greg Pauloski Greyson Davis Grigor Nalbandyan Guoming Zhang Guy Farber Guyue Huang Haifeng Qian Haran Kumar Shiv Kumar Harry Kim Harsh Sharma Hayate Iso Hayley Ross Herbert Hum Herman Sahota Hexin Wang Himanshu Soni Hiren Upadhyay Huy Nguyen Iain Cunningham Ido Galil Ido Shahaf Igino Padovani Igor Gitman Igor Shovkun Ikroop Dhillon Ilya Loshchilov Ingrid Kelly Itamar Schen Itay Levy Ivan Moshkov Izik Golan Izzy Putterman Jain Tu Jan Baczek Jan Kautz Jane Polak Scowcroft Janica Rosenberg Jared Casper Jarrod Pflum Jason Grant Jason Sewall Jatin Mitra Jeffrey Glick Jenny Chen Jesse Oliver Jiacheng Xu Jiafan Zhu Jialin Song Jian Zhang Jiaqi Zeng Jie Lou Jill Milton Jim Chow Jimmy Zhang Jinhang Choi Jining Huang Jocelyn Huang Joel Caruso Joey Conway Joey Guman Johan Jatko John Kamalu Johnny Greco Jonathan Cohen Jonathan Raiman Joseph Jennings Joyjit Daw Juan Yu Julio Tapia Junkeun Yi Jupinder Parmar Jyothi Achar Kari Briski Kartik Mattoo Katherine Cheung Katherine Luna Keith Wyss Kevin Shih Kezhi Kong Khanh Nguyen Khushi Bhardwaj Kirill Buryak Kirthi Shankar Sivamani Konstantinos Krommydas Kris Murphy Krishna C. Puvvada Krzysztof Pawelec Kumar Anik Laikh Tewari Laya Sleiman Leo Du Leon Derczynski Li Ding Lilach Ilan Lingjie Wu Lizzie Wei Luis Vega Lun Su Maarten Van Segbroeck Maer Rodrigues de Melo Magaret Zhang Mahan Fathi Makesh Narsimhan Sreedhar Makesh Sreedhar Makesh Tarun Chandran Manuel Reyes Gomez Maor Ashkenazi Marc Cuevas Marc Romeijn Margaret Zhang Mark Cai Mark Gabel Markus Kliegl Martyna Patelka Maryam Moosaei Matthew Varacalli Matvei Novikov Mauricio Ferrato Mehrzad Samadi Melissa Corpuz Meng Xin Mengdi Wang Mengru Wang Meredith Price Micah Schaffer Michael Andersch Michael Boone Michael Evans Michael Z Wang Miguel Martinez Mikail Khona Mike Chrzanowski Mike Hollinger Mingyuan Ma Minseok Lee Mohammad Dabbah Mohammad Shoeybi Mostofa Patwary Nabin Mulepati Nader Khalil Najeeb Nabwani Nancy Agarwal Nanthini Balasubramaniam Narimane Hennouni Narsi Kodukula Natalie Hereth Nathaniel Pinckney Nave Assaf Negar Habibi Nestor Qin Neta Zmora Netanel Haber Nick Reamaroon Nickson Quak Nidhi Bhatia Nikhil Jukar Nikki Pope Nikolai Ludwig Nima Tajbakhsh Nir Ailon Nirmal Juluru Nirmalya De Nowel Pitt Oleg Rybakov Oleksii Hrinchuk Oleksii Kuchaiev Olivier Delalleau Oluwatobi Olabiyi Omer Ullman Argov Omri Almog Omri Puny Oren Tropp Otavio Padovani Ouye Xie Parth Chadha Pasha Shamis Paul Gibbons Pavlo Molchanov Peter Belcak Peter Jin Pinky Xu Piotr Januszewski Pooya Jannaty Prachi Shevate Pradeep Thalasta Pranav Prashant Thombre Prasoon Varshney Prerana Gambhir Pritam Gundecha Przemek Tredak Qing Miao Qiyu Wan Quan Tran Minh Rabeeh Karimi Mahabadi Rachel Oberman Rachit Garg Rahul Kandu Raina Zhong Ran El-Yaniv Ran Zilberstein Rasoul Shafipour Renee Yao Renjie Pi Richard Mazzarese Richard Wang Rick Izzo Ridhima Singla Rima Shahbazyan Rishabh Garg Ritika Borkar Ritu Gala Riyad Islam Robert Clark Robert Hesse Roger Waleffe Rohit Varma Kalidindi Rohit Watve Roi Koren Ron Fan Ruchika Kharwar Ruisi Cai Ruoxi Zhang Russell J. Hewett Ryan Prenger Ryan Timbrook Ryota Egashira Sadegh Mahdavi Sagar Singh Ashutosh Joshi Sahil Modi Samuel Kriman Sandeep Pombra Sanjay Kariyappa Sanjeev Satheesh Santiago Pombo Saori Kaji Satish Pasumarthi Saurav Mishra Saurav Muralidharan Scott Hara Sean Narenthiran Sebastian Rogawski Seonjin Na Seonmyeong Bak Sepehr Sameni Seth Poulos Shahar Mor Shantanu Acharya Shaona Ghosh Adam Lord Sharath Turuvekere Sreenivas Shaun Kotek Shaya Gharghabi Shelby Thomas Sheng-Chieh Lin Shibani Likhite Shiqing Fan Shiyang Chen Shreya Gopal Shrimai Prabhumoye Shubham Pachori Shubham Toshniwal Shuo Zhang Shuoyang Ding Shyam Renjith Shyamala Prayaga Siddhartha Jain Simeng Sun Sirisha Rella Sirshak Das Smita Ithape Sneha Harishchandra S Somshubra Majumdar Soumye Singhal Sri Harsha Singudasu Sriharsha Niverty Stas Sergienko Stefana Gloginic Stefania Alborghetti Stephen Ge Stephen McCullough Sugam Dipak Devare Suguna Varshini Velury Sukrit Rao Sumeet Kumar Barua Sunny Gai Suseella Panguluri Sushil Koundinyan Swathi Patnam Sweta Priyadarshi Swetha Bhendigeri Syeda Nahida Akter Sylendran Arunagiri Tailling Yuan Talor Abramovich Tan Bui Tan Yu Terry Kong Thanh Do Thomas Gburek Thorgane Marques Tiffany Moore Tijmen Blankevoort Tim Moon Timothy Ma Tiyasa Mitra Tomasz Grzegorzek Tomer Asida Tomer Bar Natan Tomer Keren Tomer Ronen Traian Rebedea Trenton Starkey Tugrul Konuk Twinkle Vashishth Tyler Condensa Udi Karpas Ushnish De Vahid Noorozi Vahid Noroozi Vanshil Atul Shah Veena Vaidyanathan Venkat Srinivasan Venmugil Elango Victor Cui Vijay Korthikanti Vikas Mehta Virginia Adams Virginia Wu Vitaly Kurin Vitaly Lavrukhin Vladimir Anisimov Wan Seo Wanli Jiang Wasi Uddin Ahmad Wei Du Wei Ping Wei-Ming Chen Wendy Quan Wenliang Dai Wenwen Gao Will Jennings William Zhang Xiaowei Ren Xiaowen Xin Xin Li Yang Yu Yangyi Chen Yaniv Galron Yashaswi Karnati Yejin Choi Yev Meyer Yi-Fu Wu Yian Zhang Ying Lin Yonatan Geifman Yonggan Fu Yoshi Suhara Youngeun Kwon Yuan Zhang Yuki Huang Zach Moshe Zhilin Wang Zhiyu Cheng Zhongbo Zhu Zhuolin Yang Zihan Liu Zijia Chen Zijie Yan Zuhair Ahmed

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords hybrid Mamba-TransformerMixture-of-ExpertsLatentMoEspeculative decodinginference efficiencylong contextopen-source LLMagentic reasoning

0 comments

The pith

A 120B hybrid Mamba-Transformer MoE model matches accuracy with up to 7.5x inference speedup and 1M context length

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Nemotron 3 Super as a 120 billion parameter hybrid model that mixes Mamba sequence modeling with attention layers inside a Mixture-of-Experts structure. It introduces LatentMoE to improve accuracy relative to both computation and parameter count, plus MTP layers that enable faster inference through built-in speculative decoding. The model is pre-trained on 25 trillion tokens in low precision, then refined with supervised fine-tuning and reinforcement learning. It reaches accuracy levels similar to other large models on standard benchmarks while supporting contexts up to one million tokens and running at substantially higher speeds. A reader would care because the design points toward open models that deliver strong reasoning capability at lower inference cost and with longer context handling than typical dense alternatives.

Core claim

By pre-training a 120B (12B active) hybrid Mamba-Attention Mixture-of-Experts model with LatentMoE and MTP layers on 25 trillion tokens followed by SFT and RL post-training, the resulting system achieves comparable accuracy on common benchmarks, extends to 1M context length, and delivers up to 2.2x higher inference throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B.

What carries the argument

LatentMoE, a new Mixture-of-Experts architecture that optimizes accuracy per FLOP and per parameter, together with MTP layers that accelerate inference via native speculative decoding.

If this is right

The open checkpoints and datasets allow direct deployment and community adaptation for agentic workflows.
The 1M context length supports single-pass processing of extended documents or histories.
Higher inference throughput lowers latency and cost for repeated reasoning steps.
Pre-training in NVFP4 shows that low-precision formats can sustain hybrid MoE training at this scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the hybrid Mamba-Transformer pattern scales, future models may default to mixing state-space and attention layers for different context regimes.
The efficiency per active parameter could let researchers train and serve larger effective models within fixed compute limits.
Open access to the 25T token corpus may enable independent study of scaling behavior specific to this architecture.
Production agentic systems might see reduced energy use if the throughput claims prove consistent outside benchmark settings.

Load-bearing premise

The reported benchmark accuracy and throughput gains were measured under conditions that fairly compare to the baseline models and that these gains hold for agentic reasoning tasks without hidden trade-offs from the new components.

What would settle it

A side-by-side run of the open-sourced model against GPT-OSS-120B and Qwen3.5-122B on a long-context agentic reasoning benchmark, measuring both accuracy and tokens per second under identical hardware and batch settings.

read the original abstract

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NVIDIA open-sourced a 120B hybrid Mamba-Transformer MoE with LatentMoE and MTP layers, but the performance claims rest on unshown benchmark details.

read the letter

This is an industry model-release paper for Nemotron 3 Super, a 120B-parameter (12B active) hybrid that mixes Mamba and attention, adds a new LatentMoE variant aimed at accuracy per FLOP and per parameter, and includes MTP layers for native speculative decoding. They pre-trained on 25T tokens in NVFP4, followed by SFT and RL, then open-sourced the checkpoints and datasets on Hugging Face. The model is claimed to handle 1M context while matching common benchmarks and running 2.2x faster than GPT-OSS-120B and 7.5x faster than Qwen3.5-122B at inference.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Nemotron 3 Super, a 120B-parameter (12B active) hybrid Mamba-Transformer Mixture-of-Experts model. It covers pre-training on 25 trillion tokens in NVFP4 precision, the introduction of LatentMoE for improved accuracy per FLOP and per parameter, MTP layers enabling native speculative decoding, post-training via SFT and RL, 1M context support, comparable accuracy on common benchmarks, and inference throughput gains of up to 2.2x versus GPT-OSS-120B and 7.5x versus Qwen3.5-122B, with all datasets and checkpoints (base, post-trained, quantized) open-sourced on Hugging Face.

Significance. If the performance claims hold under verifiable conditions, the work would be significant for efficient scaling of models suited to agentic reasoning. The open-sourcing of model artifacts and datasets is a clear strength that supports reproducibility. The hybrid architecture, LatentMoE, and MTP innovations could influence future designs balancing accuracy, parameter efficiency, and inference speed.

major comments (2)

[Abstract] Abstract: The central claims of 'comparable accuracy on common benchmarks' and specific throughput multipliers (2.2x and 7.5x) are stated without any referenced tables, benchmark lists, error bars, hardware details, batch/precision settings, prompt lengths, or decoding configurations. This is load-bearing because the manuscript supplies no evaluation protocol, preventing verification that the gains are apples-to-apples or that LatentMoE/MTP deliver net benefits for agentic reasoning without hidden trade-offs.
[Abstract] The title and abstract emphasize suitability for agentic reasoning, yet results are limited to unspecified 'common benchmarks' with no agentic-task metrics, long-context agent evaluations, or generalization tests. This is load-bearing for the paper's positioning, as the 1M-context and efficiency claims require evidence that they extend beyond standard benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on our manuscript. We address each major comment below in detail and have made revisions to improve clarity and verifiability where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The central claims of 'comparable accuracy on common benchmarks' and specific throughput multipliers (2.2x and 7.5x) are stated without any referenced tables, benchmark lists, error bars, hardware details, batch/precision settings, prompt lengths, or decoding configurations. This is load-bearing because the manuscript supplies no evaluation protocol, preventing verification that the gains are apples-to-apples or that LatentMoE/MTP deliver net benefits for agentic reasoning without hidden trade-offs.

Authors: We agree that the abstract would be strengthened by explicit cross-references to the supporting evaluation details. The full manuscript contains a dedicated Experiments section (Section 4) that specifies the benchmark suite (including MMLU, GSM8K, HumanEval, MATH, and others), hardware platform (NVIDIA H100 GPUs), batch sizes, inference precision (FP8), prompt lengths, and decoding configurations (including MTP speculative decoding parameters with acceptance rates). Throughput numbers were measured under matched conditions to the cited baselines (GPT-OSS-120B and Qwen3.5-122B) using the same prompt distributions and hardware; these are reported with standard deviations in Table 5. We have revised the abstract to cite Table 4 for accuracy results and Table 5 for throughput, along with a brief reference to the evaluation protocol in Section 4.1. This makes the claims directly verifiable. The net benefit of LatentMoE and MTP for agentic reasoning is discussed in Section 4.3, where we show that the efficiency gains reduce latency in multi-turn interactions without accuracy degradation on the reported benchmarks. revision: yes
Referee: [Abstract] The title and abstract emphasize suitability for agentic reasoning, yet results are limited to unspecified 'common benchmarks' with no agentic-task metrics, long-context agent evaluations, or generalization tests. This is load-bearing for the paper's positioning, as the 1M-context and efficiency claims require evidence that they extend beyond standard benchmarks.

Authors: The title and abstract position the model for agentic reasoning based on its architectural features: 1M context support for long agent trajectories, the hybrid Mamba-Transformer backbone for efficient long-sequence handling, and MTP layers for native speculative decoding that accelerates iterative reasoning loops. While the primary quantitative results use established reasoning and coding benchmarks (which serve as proxies for agentic capabilities), we acknowledge that dedicated agentic evaluations (e.g., WebArena-style tasks or multi-step tool-use benchmarks) are not included. We have revised the abstract to qualify the positioning more precisely and added a short discussion paragraph in Section 5 explaining how the 1M context and throughput improvements directly benefit agentic workflows, supported by long-context needle-in-haystack results in the appendix. No new experiments were feasible at this stage, but the textual clarification addresses the concern without overstating the evidence. revision: partial

Circularity Check

0 steps flagged

No derivation chain present; purely empirical model description

full rationale

The paper consists entirely of an empirical account of architecture choices (hybrid Mamba-Transformer with LatentMoE and MTP), training regimen (25T tokens in NVFP4, followed by SFT/RL), and reported outcomes (1M context, benchmark accuracy, throughput multipliers). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims rest on external measurements and open-sourced checkpoints rather than any internal reduction to the inputs themselves. This is the standard case of a self-contained engineering report with negligible circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on empirical training and evaluation results rather than on mathematical axioms, derivations, or postulated entities. No free parameters, axioms, or invented entities with independent evidence are invoked.

pith-pipeline@v0.9.0 · 8136 in / 1144 out tokens · 38460 ms · 2026-05-10T15:39:48.243196+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
cs.CR 2026-05 unverdicted novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
cs.AI 2026-05 unverdicted novelty 7.0

BenchCAD is a new benchmark showing that frontier multimodal models recover coarse geometry but fail to generate faithful parametric CAD programs for industrial parts.
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
cs.AI 2026-05 unverdicted novelty 7.0

BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.
Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
cs.LG 2026-04 unverdicted novelty 5.0

Nemotron 3 Nano Omni is an efficient open multimodal model supporting audio, text, images, and video with reported accuracy gains and leading results on document understanding and long audio-video tasks.
Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
cs.LG 2026-04 unverdicted novelty 4.0

Nemotron 3 Nano Omni is an efficient open multimodal model supporting audio alongside text, images, and video, with accuracy improvements and lower latency than its predecessor.

Reference graph

Works this paper leans on

10 extracted references · 8 canonical work pages · cited by 3 Pith papers · 4 internal anchors

[1]

Program Synthesis with Large Language Models

URLhttps://openreview.net/forum?id=VmEkhV2yCX. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models, 2021. URLhttps://arxiv.org/abs/2108.07732. Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.naacl-long.306 2021
[2]

Steve Harris and Andy Seaborne

URLhttps://arxiv.org/abs/2412.16339. Steve Harris and Andy Seaborne. SPARQL 1.1 Query Language. https://www.w3.org/TR/ sparql11-query/, 2013. W3C Recommendation. Adib Hasan, Ileana Rugina, and Alex Wang. Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning.arXiv preprint arXiv:2401.10862, 2024. B. Hassibi, D.G. Stork...

work page doi:10.1109/icnn 2013
[3]

& Xiao, C

URLhttps://arxiv.org/abs/2404.03027. Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset, 2025. URLhttps://arxiv.org/abs/2508.15096. Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, ...

work page arXiv 2025
[4]

NVIDIA Technical Blog. NVIDIA. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning, 2025a. URLhttps://arxiv.org/abs/2512.20848. NVIDIA. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models, 2025b. URLhttps://arxiv.org/abs/2504.03624. NVIDIA. NVIDIA Nemotron 3: Efficient and Ope...

work page arXiv
[5]

gpt-oss-120b & gpt-oss-20b Model Card

NVIDIA Developer Blog. 45 Nemotron 3 Super : Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning NVIDIA Corporation. Transformerengine pull request #2177: [common] add support for nvfp4 quantization and casting.https://github.com/NVIDIA/TransformerEngine/pull/2177, 2024. Accessed: 2026-02-18. OpenAI. gpt-oss-120b & gpt-...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Association for Computational Linguistics. URLhttps://www.aclweb.org/anthology/ D13-1020. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019. URLhttps://arxiv.org/abs/1907. 10641. Mark Saroufim, Jiannan Wang, Bert Maher, Sahan Paliskara, Laura Wang, Shahin Sefati, and ...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[7]

URLhttps://openreview.net/pdf?id=T4wMdeFEjX. 46 Nemotron 3 Super : Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2020. URLhttps:...

work page internal anchor Pith review arXiv 2020
[8]

Personax: A recommendation agent-oriented user modeling framework for long behavior sequence

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025. acl-long.123. URLhttps://aclanthology.org/2025.acl-long.123/. Kimi Team. Kimi K2: Open Agentic Intelligence, 2025. URLhttps://arxiv.org/abs/2507.20534. Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, et al...

work page doi:10.18653/v1/2025 2025
[9]

This fusion constraint is applied within each layer: only that layer’s Q, K, and V projections are fused and required to share one quantization format

Linear layer fusion.Inference runtimes often fuse linear operators, which imposes a shared quantization format across the fused group. This fusion constraint is applied within each layer: only that layer’s Q, K, and V projections are fused and required to share one quantization format. For 50 Nemotron 3 Super : Open, Efficient Mixture-of-Experts Hybrid Ma...
[10]

This shared-format restriction is also applied within each MoE layer: only sparse experts inside the same layer are coupled

MoE layer constraints.vLLM and TensorRT-LLM quantized MoE APIs require all sparse experts in a constrained MoE group to share one quantization format. This shared-format restriction is also applied within each MoE layer: only sparse experts inside the same layer are coupled. In Nemotron 3 Super, each sparse expert containsup_proj and down_proj, and these ...