arxiv: 2512.20856 · v1 · pith:F2KYGL7Fnew · submitted 2025-12-24 · 💻 cs.CL · cs.AI · cs.LG

NVIDIA Nemotron 3: Efficient and Open Intelligence

NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla

Akhiad Bercovich Aleksander Ficek Aleksandr Shaposhnikov Alex Kondratenko Alexander Bukharin Alexandre Milesi Ali Taghibakhshi Alisa Liu Amelia Barton Ameya Sunil Mahabaleshwarkar Amir Klein Amit Zuker Amnon Geifman Amy Shen Anahita Bhiwandiwalla Andrew Tao Anjulie Agrusa Ankur Verma Ann Guan Anubhav Mandarwal Arham Mehta Ashwath Aithal Ashwin Poojary Asif Ahamed Asit Mishra Asma Kuriparambil Thekkumpate Ayush Dattagupta Banghua Zhu Bardiya Sadeghi Barnaby Simkin Ben Lanir Benedikt Schifferer Besmira Nushi Bilal Kartal Bita Darvish Rouhani Boris Ginsburg Brandon Norick Brandon Soubasis Branislav Kisacanin Brian Yu Bryan Catanzaro Carlo del Mundo Chantal Hwang Charles Wang Cheng-Ping Hsieh Chenghao Zhang Chenhan Yu Chetan Mungekar Chintan Patel Chris Alexiuk Christopher Parisien Collin Neale Cyril Meurillon Damon Mosk-Aoyama Dan Su Dane Corneil Daniel Afrimi Daniel Lo Daniel Rohrer Daniel Serebrenik Daria Gitman Daria Levy Darko Stosic David Mosallanezhad Deepak Narayanan Dhruv Nathawani Dima Rekesh Dina Yared Divyanshu Kakwani Dong Ahn Duncan Riach Dusan Stosic Edgar Minasyan Edward Lin Eileen Long Eileen Peters Long Elad Segal Elena Lantz Ellie Evans Elliott Ning Eric Chung Eric Harper Eric Tramel Erick Galinkin Erik Pounds Evan Briones Evelina Bakhturina Evgeny Tsykunov Faisal Ladhak Fay Wang Fei Jia Felipe Soares Feng Chen Ferenc Galko Frank Sun Frankie Siino Gal Hubara Agam Ganesh Ajjanagadde Gantavya Bhatt Gargi Prasad George Armstrong Gerald Shen Gorkem Batmaz Grigor Nalbandyan Haifeng Qian Harsh Sharma Hayley Ross Helen Ngo Herbert Hum Herman Sahota Hexin Wang Himanshu Soni Hiren Upadhyay Huizi Mao Huy C Nguyen Huy Q Nguyen Iain Cunningham Ido Galil Ido Shahaf Igor Gitman Ilya Loshchilov Itamar Schen Itay Levy Ivan Moshkov Izik Golan Izzy Putterman Jan Kautz Jane Polak Scowcroft Jared Casper Jatin Mitra Jeffrey Glick Jenny Chen Jesse Oliver Jian Zhang Jiaqi Zeng Jie Lou Jimmy Zhang Jinhang Choi Jining Huang Joey Conway Joey Guman John Kamalu Johnny Greco 
Jonathan Cohen Joseph Jennings Joyjit Daw Julien Veron Vialard Junkeun Yi Jupinder Parmar Kai Xu Kan Zhu Kari Briski Katherine Cheung Katherine Luna Keith Wyss Keshav Santhanam Kevin Shih Kezhi Kong Khushi Bhardwaj Kirthi Shankar Krishna C. Puvvada Krzysztof Pawelec Kumar Anik Lawrence McAfee Laya Sleiman Leon Derczynski Li Ding Lizzie Wei Lucas Liebenwein Luis Vega Maanu Grover Maarten Van Segbroeck Maer Rodrigues de Melo Mahdi Nazemi Makesh Narsimhan Sreedhar Manoj Kilaru Maor Ashkenazi Marc Romeijn Marcin Chochowski Mark Cai Markus Kliegl Maryam Moosaei Matt Kulka Matvei Novikov Mehrzad Samadi Melissa Corpuz Mengru Wang Meredith Price Michael Andersch Michael Boone Michael Evans Miguel Martinez Mikail Khona Mike Chrzanowski Minseok Lee Mohammad Dabbah Mohammad Shoeybi Mostofa Patwary Nabin Mulepati Najeeb Nabwani Natalie Hereth Nave Assaf Negar Habibi Neta Zmora Netanel Haber Nicola Sessions Nidhi Bhatia Nikhil Jukar Nikki Pope Nikolai Ludwig Nima Tajbakhsh Nir Ailon Nirmal Juluru Nishant Sharma Oleksii Hrinchuk Oleksii Kuchaiev Olivier Delalleau Oluwatobi Olabiyi Omer Ullman Argov Omri Puny Oren Tropp Ouye Xie Parth Chadha Pasha Shamis Paul Gibbons Pavlo Molchanov Pawel Morkisz Peter Dykas Peter Jin Pinky Xu Piotr Januszewski Pranav Prashant Thombre Prasoon Varshney Pritam Gundecha Przemek Tredak Qing Miao Qiyu Wan Rabeeh Karimi Mahabadi Rachit Garg Ran El-Yaniv Ran Zilberstein Rasoul Shafipour Rich Harang Rick Izzo Rima Shahbazyan Rishabh Garg Ritika Borkar Ritu Gala Riyad Islam Robert Hesse Roger Waleffe Rohit Watve Roi Koren Ruoxi Zhang Russell Hewett Russell J. 
Hewett Ryan Prenger Ryan Timbrook Sadegh Mahdavi Sahil Modi Samuel Kriman Sangkug Lim Sanjay Kariyappa Sanjeev Satheesh Saori Kaji Satish Pasumarthi Saurav Muralidharan Sean Narentharen Sean Narenthiran Seonmyeong Bak Sergey Kashirsky Seth Poulos Shahar Mor Shanmugam Ramasamy Shantanu Acharya Shaona Ghosh Sharath Turuvekere Sreenivas Shelby Thomas Shiqing Fan Shreya Gopal Shrimai Prabhumoye Shubham Pachori Shubham Toshniwal Shuoyang Ding Siddharth Singh Simeng Sun Smita Ithape Somshubra Majumdar Soumye Singhal Stas Sergienko Stefania Alborghetti Stephen Ge Sugam Dipak Devare Sumeet Kumar Barua Suseella Panguluri Suyog Gupta Sweta Priyadarshi Syeda Nahida Akter Tan Bui Teodor-Dumitru Ene Terry Kong Thanh Do Tijmen Blankevoort Tim Moon Tom Balough Tomer Asida Tomer Bar Natan Tomer Ronen Tugrul Konuk Twinkle Vashishth Udi Karpas Ushnish De Vahid Noorozi Vahid Noroozi Venkat Srinivasan Venmugil Elango Victor Cui Vijay Korthikanti Vinay Rao Vitaly Kurin Vitaly Lavrukhin Vladimir Anisimov Wanli Jiang Wasi Uddin Ahmad Wei Du Wei Ping Wenfei Zhou Will Jennings William Zhang Wojciech Prazuch Xiaowei Ren Yashaswi Karnati Yejin Choi Yev Meyer Yi-Fu Wu Yian Zhang Yigong Qin Ying Lin Yonatan Geifman Yonggan Fu Yoshi Subara Yoshi Suhara Yubo Gao Zach Moshe Zhen Dong Zhongbo Zhu Zihan Liu Zijia Chen Zijie Yan

Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords Nemotron 3 · Mamba-Transformer hybrid · Mixture-of-Experts · LatentMoE · long context · reinforcement learning post-training · model efficiency · open model weights

The pith

Nemotron 3 models use a hybrid Mamba-Transformer Mixture-of-Experts design to support 1M-token contexts with high throughput and RL-tuned reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Nemotron 3 family of models in Nano, Super, and Ultra sizes. These rely on a Mixture-of-Experts hybrid architecture that blends Mamba state-space layers with Transformer attention to deliver strong throughput and context lengths up to 1 million tokens. Larger models add LatentMoE for quality gains, NVFP4 training, and MTP layers for faster generation. All versions receive post-training via multi-environment reinforcement learning to enable reasoning, multi-step tool use, and adjustable reasoning budgets. Nano is described as more accurate than similar models at low inference cost, and the full family is released openly with weights, software, and data.
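The summary mentions NVFP4 training without saying what a 4-bit float format involves. As a rough illustration only, the sketch below rounds a block of values to the FP4 E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) using one shared scale per block. The scaling rule (largest magnitude maps to 6), the function names, and keeping the scale as an ordinary Python float are assumptions made for illustration; the abstract does not describe the actual training recipe.

```python
# Illustrative block-scaled 4-bit float quantization (not the paper's recipe).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 E2M1 magnitudes


def quantize_block(block):
    """Return (scale, codes); codes are signed values from E2M1_GRID."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # assumption: largest magnitude maps to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and codes."""
    return [scale * c for c in codes]


# Hypothetical 4-element block; 0.4 is rounded up to the grid value 0.5.
print(quantize_block([6.0, -3.0, 1.0, 0.4]))  # (1.0, [6.0, -3.0, 1.0, 0.5])
```

The coarse spacing near the top of the grid (4 to 6) is what makes the choice of block scale matter for quantization error.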

Core claim

The central claim is that a Mixture-of-Experts hybrid Mamba-Transformer architecture, augmented by LatentMoE, NVFP4 quantization, MTP layers, and multi-environment reinforcement learning post-training, yields models with best-in-class throughput, million-token contexts, and effective agentic and reasoning performance across the Nano, Super, and Ultra variants.

What carries the argument

The Mixture-of-Experts hybrid Mamba-Transformer architecture integrates selective state-space modeling with attention under expert routing to maintain efficiency while handling extended sequences and supporting quality improvements through LatentMoE.
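To make "expert routing" concrete, here is a minimal sketch of generic top-k Mixture-of-Experts routing for a single token. It is not the Nemotron 3 router or LatentMoE, neither of which is specified in the abstract; the function names, the softmax-then-renormalize rule, and the toy experts are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for one token vector."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts (hypothetical): doubling, shifting, and zeroing the input.
def double(v):
    return [2.0 * a for a in v]


def shift(v):
    return [a + 1.0 for a in v]


def zero(v):
    return [0.0 for _ in v]


print(moe_forward([1.0, 2.0], [double, shift, zero], [0.0, 0.0, -100.0]))
# [2.0, 3.5]
```

Because only k experts run per token, compute per token stays roughly constant as the total number of experts grows, which is the efficiency argument usually made for this design.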

If this is right

Applications can maintain practical speeds while reasoning over contexts as long as 1 million tokens, such as full-document analysis or long multi-turn interactions.
Adjustable reasoning budgets let the same model switch between quick responses and deeper multi-step tool use depending on the task.
Open release of weights, training recipes, and redistribution-permitted data allows direct replication and extension by external developers.
The Super variant targets high-volume workloads like IT automation through built-in support for collaborative agents.
The Ultra variant targets top accuracy on complex reasoning benchmarks while retaining the efficiency features of the family.
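The long-context point can be grounded with back-of-the-envelope memory arithmetic. An attention layer's key/value cache grows linearly with sequence length, whereas a state-space layer carries a fixed-size recurrent state. The sketch below compares the two; every dimension in it is hypothetical and chosen only for illustration, not taken from any Nemotron 3 configuration.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Recurrent-state size for one sequence; independent of sequence length."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


SEQ_LEN = 1_000_000  # the 1M-token context mentioned in the text

# Hypothetical all-attention stack: 48 layers, 8 KV heads, head_dim 128, 16-bit.
full_attn = kv_cache_bytes(48, 8, 128, SEQ_LEN)

# Hypothetical hybrid stack: 6 attention layers plus 42 state-space layers.
hybrid_attn = kv_cache_bytes(6, 8, 128, SEQ_LEN)
hybrid_ssm = ssm_state_bytes(42, 128 * 64 * 128)

print(full_attn / 1e9)                   # 196.608 (GB)
print((hybrid_attn + hybrid_ssm) / 1e9)  # about 24.66 (GB)
```

Under these assumed shapes, replacing most attention layers with fixed-state layers cuts per-sequence cache memory by roughly a factor of eight at 1M tokens.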

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hybrid approach may cut hardware and energy costs for long-context deployments in production agent systems.
Multi-environment RL training could extend to create agents that adapt across more varied real-world tool sets than those shown.
Full public access to recipes and data might accelerate similar efficiency gains in other model families.
Built-in tool-use support could simplify integration into larger multi-agent workflows.

Load-bearing premise

The described hybrid architecture, LatentMoE, NVFP4, MTP layers, and multi-environment RL post-training together produce the stated gains in accuracy, throughput, and reasoning without post-hoc benchmark selection or undisclosed data filtering.

What would settle it

Independent runs of the released Nano model on standard public benchmarks, measuring both accuracy and real-world inference throughput against comparable open models on the same hardware, would confirm or refute the performance claims.
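A throughput comparison of this kind could start from a harness like the sketch below. It times any `generate` callable and reports generated tokens per second. The function name, the warm-up convention, and the injectable clock are assumptions for illustration; a real comparison would also need to fix hardware, batch size, precision, and prompt and output lengths.

```python
import time


def measure_throughput(generate, prompts, warmup=1, clock=time.perf_counter):
    """Return generated tokens per second across all prompts.

    `generate` is any callable that maps a prompt to a list of output tokens.
    """
    for prompt in prompts[:warmup]:
        generate(prompt)  # untimed warm-up calls
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


# Stand-in "model" (hypothetical): echoes whitespace-separated tokens.
def echo_model(prompt):
    return prompt.split()


rate = measure_throughput(echo_model, ["a b c", "d e"] * 1000)
print(f"{rate:.0f} tokens/s")
```

Running the same harness against each candidate model on identical hardware and prompts gives the like-for-like figure the paragraph above asks for.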

Original abstract

We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning enabling reasoning, multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Nemotron 3 is an open NVIDIA model family using hybrid Mamba-Transformer MoE plus LatentMoE and multi-environment RL, but the strong performance claims rest on assertions without visible numbers or comparisons.

NVIDIA has released the Nemotron 3 family (Nano, Super, and Ultra), built on a Mixture-of-Experts hybrid of Mamba and Transformer layers. The two larger models add LatentMoE, NVFP4 training, and MTP layers, and all three are post-trained with multi-environment reinforcement learning to support reasoning, tool use, and adjustable reasoning budgets. Nano is released now alongside its technical report; Super and Ultra are to follow in the coming months. The authors commit to releasing weights, code, recipes, and the data they hold redistribution rights for. The models target context lengths of up to 1M tokens and practical agent workloads such as IT ticket automation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Nemotron 3 family of models (Nano, Super, and Ultra). It claims that a Mixture-of-Experts hybrid Mamba-Transformer architecture delivers best-in-class throughput and context lengths up to 1M tokens. Super and Ultra models are trained with NVFP4, incorporate a novel LatentMoE approach to improve quality, and include MTP layers for faster generation. All models are post-trained via multi-environment reinforcement learning to enable reasoning, multi-step tool use, and granular reasoning budget control. Nano is stated to outperform comparable models in accuracy while being cost-efficient; the paper announces open release of weights, pre- and post-training software, recipes, and data for Nano, with Super and Ultra to follow.

Significance. If the hybrid architecture, LatentMoE, NVFP4, MTP layers, and multi-environment RL post-training deliver measurable gains in throughput, context handling, and reasoning without selective benchmarking, the work could advance efficient open models for agentic and long-context tasks. The explicit commitment to releasing weights, software, and data is a positive aspect that supports reproducibility.

major comments (2)

[Abstract] Abstract: the claims of 'best-in-class throughput', 'state-of-the-art accuracy and reasoning performance', and 'strong agentic, reasoning, and conversational capabilities' are presented without any quantitative benchmarks, baseline comparisons, ablation results, error bars, or scaling curves. These assertions are load-bearing for the central contribution yet rest on unverified statements.
[Abstract] Abstract: the hybrid Mamba-Transformer MoE architecture, LatentMoE, NVFP4 quantization, MTP layers, and multi-environment RL post-training are described at a high level with no implementation details, equations, throughput measurements, or context-length scaling data, leaving the causal connection between the techniques and the claimed gains untested.

minor comments (2)

The manuscript distinguishes this white paper from a separate technical report for Nano; explicitly stating which quantitative results and ablations appear in each document would improve clarity.
No model sizes, parameter counts, or training data details are provided, which would help readers contextualize the efficiency and performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to better support the abstract claims with evidence from the full paper. We address each point below and will incorporate revisions to improve verifiability while preserving the manuscript's focus as a technical announcement accompanying the open release.

Referee: [Abstract] Abstract: the claims of 'best-in-class throughput', 'state-of-the-art accuracy and reasoning performance', and 'strong agentic, reasoning, and conversational capabilities' are presented without any quantitative benchmarks, baseline comparisons, ablation results, error bars, or scaling curves. These assertions are load-bearing for the central contribution yet rest on unverified statements.

Authors: We agree that the abstract would benefit from concrete quantitative anchors to allow readers to immediately assess the claims. The full manuscript contains detailed benchmark tables, baseline comparisons (e.g., against Llama-3 and Mistral variants), throughput measurements on H100 hardware, and scaling results for context length. To directly address this, we will revise the abstract to incorporate a small number of key supported figures, such as relative throughput gains and accuracy deltas on standard reasoning and agentic benchmarks, drawn from the evaluation sections. This keeps the abstract concise while making the central claims verifiable.

revision: yes
Referee: [Abstract] Abstract: the hybrid Mamba-Transformer MoE architecture, LatentMoE, NVFP4 quantization, MTP layers, and multi-environment RL post-training are described at a high level with no implementation details, equations, throughput measurements, or context-length scaling data, leaving the causal connection between the techniques and the claimed gains untested.

Authors: The manuscript body provides additional architectural diagrams, training hyper-parameters, and high-level pseudocode for components such as LatentMoE and the multi-environment RL setup, along with measured throughput and context-length results. We acknowledge that the abstract itself does not explicitly link these elements to the gains. We will therefore revise the abstract to include brief, high-level implementation notes and direct references to the specific quantitative results (e.g., generation speed from MTP and scaling behavior) that appear later in the paper. Complete equations, code, and full recipes will be released with the Nano weights and technical report.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical technical report describing the Nemotron 3 model family, its hybrid Mamba-Transformer MoE architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training. No mathematical derivations, equations, or fitted parameters are presented that are then repurposed as predictions. All performance claims are statements about trained models that can be evaluated against external benchmarks. There are no self-citation chains, uniqueness theorems, or ansatzes that reduce the central claims to inputs by construction. The content is therefore self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 1 invented entity

Review conducted from abstract only; full training details, hyperparameter counts, data mixtures, and benchmark protocols are not available, preventing exhaustive enumeration of free parameters or background assumptions.

free parameters (2)

Model scale and mixture-of-experts routing hyperparameters
Standard but unspecified training choices that determine final quality and throughput.
Reinforcement learning environment and reward parameters
Multi-environment RL setup requires many tuned values not detailed in the abstract.

invented entities (1)

LatentMoE (no independent evidence)
purpose: Novel approach claimed to improve model quality
Introduced without independent evidence or comparison to prior MoE variants in the abstract.

pith-pipeline@v0.9.0 · 7455 in / 1433 out tokens · 37109 ms · 2026-05-18T01:36:28.707674+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG 2026-05 unverdicted novelty 7.0
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL 2025-12 conditional novelty 7.0
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior
cs.CR 2026-05 unverdicted novelty 6.0
PrivacySIM shows that conditioning LLMs on user personas like demographics and attitudes improves simulation of privacy choices but reaches only 40.4% accuracy against real responses from 1,000 users.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation
cs.CL 2026-05 unverdicted novelty 6.0
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG 2026-05 unverdicted novelty 6.0
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

Hypothesis generation and updating in large language models
cs.LG 2026-05 unverdicted novelty 6.0
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

Evaluation Awareness in Language Models Has Limited Effect on Behaviour
cs.CL 2026-05 conditional novelty 6.0
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0
LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

AVISE: Framework for Evaluating the Security of AI Systems
cs.CR 2026-04 unverdicted novelty 6.0
AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.

Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
cs.LG 2026-04 unverdicted novelty 6.0
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

How Transformers Learn to Plan via Multi-Token Prediction
cs.LG 2026-04 conditional novelty 6.0
Multi-token prediction induces a two-stage reverse reasoning process in Transformers via gradient decoupling, improving planning on synthetic and realistic tasks.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
cs.AI 2026-04 unverdicted novelty 6.0
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque ...

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
cs.DC 2026-02 unverdicted novelty 6.0
SPEED-Bench is a new standardized benchmark for speculative decoding that supplies semantically diverse qualitative data and throughput-oriented splits across concurrency levels, integrated with vLLM and TensorRT-LLM.

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
cs.DC 2026-05 unverdicted novelty 5.0
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

Reference graph

Works this paper leans on

201 extracted references · 201 canonical work pages · cited by 16 Pith papers · 48 internal anchors


work page internal anchor Pith review Pith/arXiv arXiv
[54]

Ghosh, Shaona and Varshney, Prasoon and Sreedhar, Makesh Narsimhan and Padmakumar, Aishwarya and Rebedea, Traian and Varghese, Jibin Rajan and Parisien, Christopher , journal=

work page
[55]

Training Verifiers to Solve Math Word Problems

Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman , year=. 2110.14168 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

2309.14402 , archivePrefix=

Zeyuan Allen-Zhu and Yuanzhi Li , year=. 2309.14402 , archivePrefix=

work page arXiv
[57]

2310.06786 , archivePrefix=

Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba , year=. 2310.06786 , archivePrefix=

work page arXiv
[59]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Za...

work page internal anchor Pith review Pith/arXiv arXiv
[60]

arXiv preprint arXiv:2505.02881 , year=

Rewriting pre-training data boosts llm performance in math and code , author=. arXiv preprint arXiv:2505.02881 , year=

work page arXiv
[61]

Attention Is All You Need

Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , year=. 1706.03762 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

Dai, Wenliang and Lee, Nayeon and Wang, Boxin and Yang, Zhuolin and Liu, Zihan and Barker, Jon and Rintamaki, Tuomas and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei , journal=

work page
[63]

Li, Zhiqi and Chen, Guo and Liu, Shilong and Wang, Shihao and VS, Vibashan and Ji, Yishen and Lan, Shiyi and Zhang, Hao and Zhao, Yilin and Radhakrishnan, Subhashree and others , journal=

work page
[64]

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , journal=

work page
[65]

Alayrac, Jean-Baptiste and Donahue, Jeff and Luc, Pauline and Miech, Antoine and Barr, Iain and Hasson, Yana and Lenc, Karel and Mensch, Arthur and Millican, Katherine and Reynolds, Malcolm and others , journal=

work page
[66]

Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others , journal=

work page
[67]

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár , year=. 1405.0312 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[68]

Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu , booktitle=

work page
[69]

Ordonez, Vicente and Kulkarni, Girish and Berg, Tamara , journal=

work page
[70]

2022 , organization=

Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven , booktitle=. 2022 , organization=

work page 2022
[71]

Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi , booktitle=

work page
[72]

2017 , publisher=

Krishna, Ranjay and Zhu, Yuke and Groth, Oliver and Johnson, Justin and Hata, Kenji and Kravitz, Joshua and Chen, Stephanie and Kalantidis, Yannis and Li, Li-Jia and Shamma, David A and others , journal=. 2017 , publisher=

work page 2017
[73]

Kafle, Kushal and Price, Brian and Cohen, Scott and Kanan, Christopher , booktitle=

work page
[74]

Marafioti, Andres and Laurencon, Hugo , year =

work page
[75]

2019 , organization=

Mishra, Anand and Shekhar, Shashank and Singh, Ajeet Kumar and Chakraborty, Anirban , booktitle=. 2019 , organization=

work page 2019
[76]

COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

Andreas Veit and Tomas Matera and Lukas Neumann and Jiri Matas and Serge Belongie , year=. 1601.07140 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[77]

arXiv preprint arXiv:2208.05358 , year=

Lindstr. arXiv preprint arXiv:2208.05358 , year=

work page arXiv
[78]

Marino, Kenneth and Rastegari, Mohammad and Farhadi, Ali and Mottaghi, Roozbeh , booktitle=

work page
[79]

Hudson, Drew A and Manning, Christopher D , booktitle=

work page
[80]

Lu, Pan and Mishra, Swaroop and Xia, Tanglin and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin , journal=

work page
[81]

Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen , booktitle=

work page
[82]

International Conference on Learning Representations (ICLR) , year =

Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng , title =. International Conference on Learning Representations (ICLR) , year =

work page

Showing first 80 references.