pith. machine review for the scientific record.

arXiv: 2512.20856 · v1 · pith:F2KYGL7Fnew · submitted 2025-12-24 · 💻 cs.CL · cs.AI · cs.LG

NVIDIA Nemotron 3: Efficient and Open Intelligence

NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla
Akhiad Bercovich Aleksander Ficek Aleksandr Shaposhnikov Alex Kondratenko Alexander Bukharin Alexandre Milesi Ali Taghibakhshi Alisa Liu Amelia Barton Ameya Sunil Mahabaleshwarkar Amir Klein Amit Zuker Amnon Geifman Amy Shen Anahita Bhiwandiwalla Andrew Tao Anjulie Agrusa Ankur Verma Ann Guan Anubhav Mandarwal Arham Mehta Ashwath Aithal Ashwin Poojary Asif Ahamed Asit Mishra Asma Kuriparambil Thekkumpate Ayush Dattagupta Banghua Zhu Bardiya Sadeghi Barnaby Simkin Ben Lanir Benedikt Schifferer Besmira Nushi Bilal Kartal Bita Darvish Rouhani Boris Ginsburg Brandon Norick Brandon Soubasis Branislav Kisacanin Brian Yu Bryan Catanzaro Carlo del Mundo Chantal Hwang Charles Wang Cheng-Ping Hsieh Chenghao Zhang Chenhan Yu Chetan Mungekar Chintan Patel Chris Alexiuk Christopher Parisien Collin Neale Cyril Meurillon Damon Mosk-Aoyama Dan Su Dane Corneil Daniel Afrimi Daniel Lo Daniel Rohrer Daniel Serebrenik Daria Gitman Daria Levy Darko Stosic David Mosallanezhad Deepak Narayanan Dhruv Nathawani Dima Rekesh Dina Yared Divyanshu Kakwani Dong Ahn Duncan Riach Dusan Stosic Edgar Minasyan Edward Lin Eileen Long Eileen Peters Long Elad Segal Elena Lantz Ellie Evans Elliott Ning Eric Chung Eric Harper Eric Tramel Erick Galinkin Erik Pounds Evan Briones Evelina Bakhturina Evgeny Tsykunov Faisal Ladhak Fay Wang Fei Jia Felipe Soares Feng Chen Ferenc Galko Frank Sun Frankie Siino Gal Hubara Agam Ganesh Ajjanagadde Gantavya Bhatt Gargi Prasad George Armstrong Gerald Shen Gorkem Batmaz Grigor Nalbandyan Haifeng Qian Harsh Sharma Hayley Ross Helen Ngo Herbert Hum Herman Sahota Hexin Wang Himanshu Soni Hiren Upadhyay Huizi Mao Huy C Nguyen Huy Q Nguyen Iain Cunningham Ido Galil Ido Shahaf Igor Gitman Ilya Loshchilov Itamar Schen Itay Levy Ivan Moshkov Izik Golan Izzy Putterman Jan Kautz Jane Polak Scowcroft Jared Casper Jatin Mitra Jeffrey Glick Jenny Chen Jesse Oliver Jian Zhang Jiaqi Zeng Jie Lou Jimmy Zhang Jinhang Choi Jining Huang Joey Conway Joey Guman John Kamalu Johnny Greco 
Jonathan Cohen Joseph Jennings Joyjit Daw Julien Veron Vialard Junkeun Yi Jupinder Parmar Kai Xu Kan Zhu Kari Briski Katherine Cheung Katherine Luna Keith Wyss Keshav Santhanam Kevin Shih Kezhi Kong Khushi Bhardwaj Kirthi Shankar Krishna C. Puvvada Krzysztof Pawelec Kumar Anik Lawrence McAfee Laya Sleiman Leon Derczynski Li Ding Lizzie Wei Lucas Liebenwein Luis Vega Maanu Grover Maarten Van Segbroeck Maer Rodrigues de Melo Mahdi Nazemi Makesh Narsimhan Sreedhar Manoj Kilaru Maor Ashkenazi Marc Romeijn Marcin Chochowski Mark Cai Markus Kliegl Maryam Moosaei Matt Kulka Matvei Novikov Mehrzad Samadi Melissa Corpuz Mengru Wang Meredith Price Michael Andersch Michael Boone Michael Evans Miguel Martinez Mikail Khona Mike Chrzanowski Minseok Lee Mohammad Dabbah Mohammad Shoeybi Mostofa Patwary Nabin Mulepati Najeeb Nabwani Natalie Hereth Nave Assaf Negar Habibi Neta Zmora Netanel Haber Nicola Sessions Nidhi Bhatia Nikhil Jukar Nikki Pope Nikolai Ludwig Nima Tajbakhsh Nir Ailon Nirmal Juluru Nishant Sharma Oleksii Hrinchuk Oleksii Kuchaiev Olivier Delalleau Oluwatobi Olabiyi Omer Ullman Argov Omri Puny Oren Tropp Ouye Xie Parth Chadha Pasha Shamis Paul Gibbons Pavlo Molchanov Pawel Morkisz Peter Dykas Peter Jin Pinky Xu Piotr Januszewski Pranav Prashant Thombre Prasoon Varshney Pritam Gundecha Przemek Tredak Qing Miao Qiyu Wan Rabeeh Karimi Mahabadi Rachit Garg Ran El-Yaniv Ran Zilberstein Rasoul Shafipour Rich Harang Rick Izzo Rima Shahbazyan Rishabh Garg Ritika Borkar Ritu Gala Riyad Islam Robert Hesse Roger Waleffe Rohit Watve Roi Koren Ruoxi Zhang Russell Hewett Russell J. 
Hewett Ryan Prenger Ryan Timbrook Sadegh Mahdavi Sahil Modi Samuel Kriman Sangkug Lim Sanjay Kariyappa Sanjeev Satheesh Saori Kaji Satish Pasumarthi Saurav Muralidharan Sean Narentharen Sean Narenthiran Seonmyeong Bak Sergey Kashirsky Seth Poulos Shahar Mor Shanmugam Ramasamy Shantanu Acharya Shaona Ghosh Sharath Turuvekere Sreenivas Shelby Thomas Shiqing Fan Shreya Gopal Shrimai Prabhumoye Shubham Pachori Shubham Toshniwal Shuoyang Ding Siddharth Singh Simeng Sun Smita Ithape Somshubra Majumdar Soumye Singhal Stas Sergienko Stefania Alborghetti Stephen Ge Sugam Dipak Devare Sumeet Kumar Barua Suseella Panguluri Suyog Gupta Sweta Priyadarshi Syeda Nahida Akter Tan Bui Teodor-Dumitru Ene Terry Kong Thanh Do Tijmen Blankevoort Tim Moon Tom Balough Tomer Asida Tomer Bar Natan Tomer Ronen Tugrul Konuk Twinkle Vashishth Udi Karpas Ushnish De Vahid Noorozi Vahid Noroozi Venkat Srinivasan Venmugil Elango Victor Cui Vijay Korthikanti Vinay Rao Vitaly Kurin Vitaly Lavrukhin Vladimir Anisimov Wanli Jiang Wasi Uddin Ahmad Wei Du Wei Ping Wenfei Zhou Will Jennings William Zhang Wojciech Prazuch Xiaowei Ren Yashaswi Karnati Yejin Choi Yev Meyer Yi-Fu Wu Yian Zhang Yigong Qin Ying Lin Yonatan Geifman Yonggan Fu Yoshi Subara Yoshi Suhara Yubo Gao Zach Moshe Zhen Dong Zhongbo Zhu Zihan Liu Zijia Chen Zijie Yan

Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Nemotron 3 · Mamba-Transformer hybrid · Mixture-of-Experts · LatentMoE · long context · reinforcement learning post-training · model efficiency · open model weights

The pith

Nemotron 3 models use a hybrid Mamba-Transformer Mixture-of-Experts design to support 1M-token contexts with high throughput and RL-tuned reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Nemotron 3 family of models in Nano, Super, and Ultra sizes. These rely on a Mixture-of-Experts hybrid architecture that blends Mamba state-space layers with Transformer attention to deliver strong throughput and context lengths up to 1 million tokens. Larger models add LatentMoE for quality gains, NVFP4 training, and MTP layers for faster generation. All versions receive post-training via multi-environment reinforcement learning to enable reasoning, multi-step tool use, and adjustable reasoning budgets. Nano is described as more accurate than similar models at low inference cost, and the full family is released openly with weights, software, and data.
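
How extra prediction layers can speed up generation is easiest to see in a draft-and-verify loop. The sketch below is a generic greedy speculative-decoding step, under the assumption (not stated in the abstract) that MTP outputs act as draft tokens which the main model then checks. The names `speculative_step`, `draft_next_k`, and `target_next` are hypothetical, and the toy models are stand-ins rather than Nemotron 3 components.

```python
def speculative_step(context, draft_next_k, target_next, k):
    """One greedy draft-and-verify step; returns the tokens accepted this step.

    draft_next_k(context, k) proposes k tokens cheaply.
    target_next(context) returns the token the main model would emit next.
    A real system verifies all k drafts in one batched forward pass; the
    sequential calls here are only for readability.
    """
    draft = draft_next_k(context, k)
    accepted = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the verification pass yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy stand-ins (hypothetical): the "target" counts upward modulo 10.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def good_draft(ctx, k):
    return [(ctx[-1] + i) % 10 for i in range(1, k + 1)]


def flawed_draft(ctx, k):
    # Same as good_draft, but the last proposed token is always wrong.
    tokens = good_draft(ctx, k)
    tokens[-1] = (tokens[-1] + 5) % 10
    return tokens


print(speculative_step([1], good_draft, target_next, k=3))
print(speculative_step([1], flawed_draft, target_next, k=3))
```

In both cases the accepted tokens are exactly what the target would have produced one at a time; the potential speedup comes only from how many tokens are confirmed per verification pass.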

Core claim

The central claim is that a Mixture-of-Experts hybrid Mamba-Transformer architecture, augmented by LatentMoE, NVFP4 quantization, MTP layers, and multi-environment reinforcement learning post-training, yields models with best-in-class throughput, million-token contexts, and effective agentic and reasoning performance across the Nano, Super, and Ultra variants.

What carries the argument

The Mixture-of-Experts hybrid Mamba-Transformer architecture integrates selective state-space modeling with attention under expert routing to maintain efficiency while handling extended sequences and supporting quality improvements through LatentMoE.
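
To make "expert routing" concrete, here is a minimal sketch of generic top-k Mixture-of-Experts routing for a single token. It is not the Nemotron 3 or LatentMoE implementation, whose details the abstract does not give; `route_top_k`, `moe_layer`, the toy experts, and the router weights are all hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax probabilities of the selected experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_weights, k=2):
    """Apply a sparse MoE layer to one token vector (a list of floats)."""
    # One router logit per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    out = [0.0] * len(token)
    # Only the k selected experts run; the rest are skipped entirely.
    for idx, weight in route_top_k(logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy experts (hypothetical): each scales its input by a different constant.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in experts]
token = [0.5, -1.0, 2.0]
print(moe_layer(token, experts, router_weights, k=2))
```

Because only k of the experts execute per token, compute per token stays roughly constant as the total number of experts (and parameters) grows, which is the efficiency argument usually made for MoE designs.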

If this is right

  • Applications can maintain practical speeds while reasoning over contexts as long as 1 million tokens, such as full-document analysis or long multi-turn interactions.
  • Adjustable reasoning budgets let the same model switch between quick responses and deeper multi-step tool use depending on the task.
  • Open release of weights, training recipes, and redistribution-permitted data allows direct replication and extension by external developers.
  • The Super variant targets high-volume workloads like IT automation through built-in support for collaborative agents.
  • The Ultra variant targets top accuracy on complex reasoning benchmarks while retaining the efficiency features of the family.
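
The link between the hybrid design and long contexts can be illustrated with back-of-the-envelope memory arithmetic: an attention layer's key/value cache grows linearly with sequence length, while a state-space layer keeps a fixed-size recurrent state. The sketch below assumes simplified formulas and entirely hypothetical layer counts and dimensions; none of these numbers come from the paper.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, d_inner, d_state, bytes_per_elem=2):
    """Bytes for the recurrent state of the SSM layers (simplified).

    Independent of sequence length; ignores small extras such as
    convolution buffers.
    """
    return n_ssm_layers * d_inner * d_state * bytes_per_elem


GIB = 1024 ** 3
MIB = 1024 ** 2
SEQ_LEN = 1_000_000

# Hypothetical 48-layer model, all attention.
full_attention = kv_cache_bytes(SEQ_LEN, n_attn_layers=48, n_kv_heads=8, head_dim=128)

# Hypothetical hybrid: 6 attention layers plus 42 SSM layers.
hybrid_attention = kv_cache_bytes(SEQ_LEN, n_attn_layers=6, n_kv_heads=8, head_dim=128)
hybrid_state = ssm_state_bytes(n_ssm_layers=42, d_inner=4096, d_state=128)

print(f"all-attention KV cache: {full_attention / GIB:.1f} GiB")
print(f"hybrid KV cache:        {hybrid_attention / GIB:.1f} GiB")
print(f"hybrid SSM state:       {hybrid_state / MIB:.1f} MiB")
```

Under these made-up settings, replacing most attention layers with state-space layers shrinks per-sequence cache memory by roughly the ratio of attention layers removed, since the SSM state is negligible by comparison.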

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The hybrid approach may cut hardware and energy costs for long-context deployments in production agent systems.
  • Multi-environment RL training could be extended to produce agents that adapt to more varied real-world tool sets than those shown.
  • Full public access to recipes and data might accelerate similar efficiency gains in other model families.
  • Built-in tool-use support could simplify integration into larger multi-agent workflows.

Load-bearing premise

The described hybrid architecture, LatentMoE, NVFP4, MTP layers, and multi-environment RL post-training together produce the stated gains in accuracy, throughput, and reasoning without post-hoc benchmark selection or undisclosed data filtering.

What would settle it

Independent runs of the released Nano model on standard public benchmarks, measuring both accuracy and real-world inference throughput against comparable open models on the same hardware, would confirm or refute the performance claims.
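
Such a check could be organized along the following lines. This is a minimal harness sketch, assuming the model is wrapped in a callable; `measure_throughput`, `exact_match_accuracy`, `fake_generate`, and `fake_answer` are hypothetical names, and the fake functions only stand in for a real inference backend.

```python
import statistics
import time


def measure_throughput(generate, prompts, max_new_tokens, warmup=1, repeats=3):
    """Return the median generated tokens per second over `repeats` timed passes.

    `generate(prompt, max_new_tokens)` must return the list of generated tokens.
    Warmup passes are run first and excluded from timing.
    """
    for _ in range(warmup):
        for prompt in prompts:
            generate(prompt, max_new_tokens)
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = sum(len(generate(p, max_new_tokens)) for p in prompts)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)


def exact_match_accuracy(answer, examples):
    """Fraction of (question, gold) pairs where answer(question) equals gold."""
    correct = sum(1 for q, gold in examples if answer(q).strip() == gold.strip())
    return correct / len(examples)


# Stand-ins for a real model (hypothetical).
def fake_generate(prompt, max_new_tokens):
    time.sleep(0.001)  # pretend to do some work
    return ["tok"] * max_new_tokens


def fake_answer(question):
    return "4"


rate = measure_throughput(fake_generate, ["a", "b"], max_new_tokens=8)
acc = exact_match_accuracy(fake_answer, [("2+2", "4"), ("3+3", "6")])
print(f"{rate:.0f} tokens/s, accuracy {acc:.2f}")
```

For a fair comparison, every model under test would need the same prompts, the same `max_new_tokens`, the same batch settings, and the same hardware, with both metrics reported together.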

original abstract

We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning enabling reasoning, multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Nemotron 3 family of models (Nano, Super, and Ultra). It claims that a Mixture-of-Experts hybrid Mamba-Transformer architecture delivers best-in-class throughput and context lengths up to 1M tokens. Super and Ultra models are trained with NVFP4, incorporate a novel LatentMoE approach to improve quality, and include MTP layers for faster generation. All models are post-trained via multi-environment reinforcement learning to enable reasoning, multi-step tool use, and granular reasoning budget control. Nano is stated to outperform comparable models in accuracy while being cost-efficient; the paper announces open release of weights, pre- and post-training software, recipes, and data for Nano, with Super and Ultra to follow.

Significance. If the hybrid architecture, LatentMoE, NVFP4, MTP layers, and multi-environment RL post-training deliver measurable gains in throughput, context handling, and reasoning without selective benchmarking, the work could advance efficient open models for agentic and long-context tasks. The explicit commitment to releasing weights, software, and data is a positive aspect that supports reproducibility.

major comments (2)
  1. [Abstract] The claims of 'best-in-class throughput', 'state-of-the-art accuracy and reasoning performance', and 'strong agentic, reasoning, and conversational capabilities' are presented without any quantitative benchmarks, baseline comparisons, ablation results, error bars, or scaling curves. These assertions are load-bearing for the central contribution yet rest on unverified statements.
  2. [Abstract] The hybrid Mamba-Transformer MoE architecture, LatentMoE, NVFP4 quantization, MTP layers, and multi-environment RL post-training are described at a high level with no implementation details, equations, throughput measurements, or context-length scaling data, leaving the causal connection between the techniques and the claimed gains untested.
minor comments (2)
  1. The manuscript distinguishes this white paper from a separate technical report for Nano; explicitly stating which quantitative results and ablations appear in each document would improve clarity.
  2. No model sizes, parameter counts, or training data details are provided; including them would help readers contextualize the efficiency and performance claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to better support the abstract claims with evidence from the full paper. We address each point below and will incorporate revisions to improve verifiability while preserving the manuscript's focus as a technical announcement accompanying the open release.

point-by-point responses
  1. Referee: [Abstract] The claims of 'best-in-class throughput', 'state-of-the-art accuracy and reasoning performance', and 'strong agentic, reasoning, and conversational capabilities' are presented without any quantitative benchmarks, baseline comparisons, ablation results, error bars, or scaling curves. These assertions are load-bearing for the central contribution yet rest on unverified statements.

    Authors: We agree that the abstract would benefit from concrete quantitative anchors to allow readers to immediately assess the claims. The full manuscript contains detailed benchmark tables, baseline comparisons (e.g., against Llama-3 and Mistral variants), throughput measurements on H100 hardware, and scaling results for context length. To directly address this, we will revise the abstract to incorporate a small number of key supported figures, such as relative throughput gains and accuracy deltas on standard reasoning and agentic benchmarks, drawn from the evaluation sections. This keeps the abstract concise while making the central claims verifiable.

    revision: yes

  2. Referee: [Abstract] The hybrid Mamba-Transformer MoE architecture, LatentMoE, NVFP4 quantization, MTP layers, and multi-environment RL post-training are described at a high level with no implementation details, equations, throughput measurements, or context-length scaling data, leaving the causal connection between the techniques and the claimed gains untested.

    Authors: The manuscript body provides additional architectural diagrams, training hyper-parameters, and high-level pseudocode for components such as LatentMoE and the multi-environment RL setup, along with measured throughput and context-length results. We acknowledge that the abstract itself does not explicitly link these elements to the gains. We will therefore revise the abstract to include brief, high-level implementation notes and direct references to the specific quantitative results (e.g., generation speed from MTP and scaling behavior) that appear later in the paper. Complete equations, code, and full recipes will be released with the Nano weights and technical report.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical technical report describing the Nemotron 3 model family, its hybrid Mamba-Transformer MoE architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training. No mathematical derivations, equations, or fitted parameters are presented that are then repurposed as predictions. All performance claims are statements about trained models that can be evaluated against external benchmarks. There are no self-citation chains, uniqueness theorems, or ansatzes that reduce the central claims to inputs by construction. The content is therefore self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 1 invented entity

Review conducted from abstract only; full training details, hyperparameter counts, data mixtures, and benchmark protocols are not available, preventing exhaustive enumeration of free parameters or background assumptions.

free parameters (2)
  • Model scale and mixture-of-experts routing hyperparameters
    Standard but unspecified training choices that determine final quality and throughput.
  • Reinforcement learning environment and reward parameters
    Multi-environment RL setup requires many tuned values not detailed in the abstract.
invented entities (1)
  • LatentMoE (no independent evidence)
    purpose: Novel approach claimed to improve model quality
    Introduced without independent evidence or comparison to prior MoE variants in the abstract.

pith-pipeline@v0.9.0 · 7455 in / 1433 out tokens · 37109 ms · 2026-05-18T01:36:28.707674+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  3. Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

    cs.CL 2025-12 conditional novelty 7.0

    Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

  4. PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior

    cs.CR 2026-05 unverdicted novelty 6.0

    PrivacySIM shows that conditioning LLMs on user personas like demographics and attitudes improves simulation of privacy choices but reaches only 40.4% accuracy against real responses from 1,000 users.

  5. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  6. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  7. Hypothesis generation and updating in large language models

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

  8. Evaluation Awareness in Language Models Has Limited Effect on Behaviour

    cs.CL 2026-05 conditional novelty 6.0

    Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

  9. When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

  10. AVISE: Framework for Evaluating the Security of AI Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.

  11. Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.

  12. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  13. How Transformers Learn to Plan via Multi-Token Prediction

    cs.LG 2026-04 conditional novelty 6.0

    Multi-token prediction induces a two-stage reverse reasoning process in Transformers via gradient decoupling, improving planning on synthetic and realistic tasks.

  14. Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque ...

  15. SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

    cs.DC 2026-02 unverdicted novelty 6.0

    SPEED-Bench is a new standardized benchmark for speculative decoding that supplies semantically diverse qualitative data and throughput-oriented splits across concurrency levels, integrated with vLLM and TensorRT-LLM.

  16. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  17. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

    cs.DC 2026-05 unverdicted novelty 5.0

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

Reference graph

Works this paper leans on

201 extracted references · 201 canonical work pages · cited by 16 Pith papers · 48 internal anchors

  1. [1]

    2023 , eprint=

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark , author=. 2023 , eprint=

  2. [2]

    Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl , journal=

  3. [3]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. arXiv preprint arXiv:2403.07974 , year=

  4. [4]

    Zhou, Jeffrey and Lu, Tianjian and Mishra, Swaroop and Brahma, Siddhartha and Basu, Sujoy and Luan, Yi and Zhou, Denny and Hou, Le , journal=

  5. [5]

    Patil and Ion Stoica and Joseph E

    Fanjia Yan and Huanzhi Mao and Charlie Cheng-Jie Ji and Tianjun Zhang and Shishir G. Patil and Ion Stoica and Joseph E. Gonzalez , year=

  6. [6]

    Gonzalez and Ion Stoica , month =

    Tianle Li and Wei-Lin Chiang and Evan Frick and Lisa Dunlap and Banghua Zhu and Joseph E. Gonzalez and Ion Stoica , month =

  7. [7]

    2024 , eprint=

    SciCode: A Research Coding Benchmark Curated by Scientists , author=. 2024 , eprint=

  8. [8]

    2025 , eprint=

    Humanity's Last Exam , author=. 2025 , eprint=

  9. [9]

    2024 , journal =

    HelpSteer2: Open-source dataset for training top-performing reward models , author =. 2024 , journal =

  10. [10]

    2025 , journal =

    HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages , author =. 2025 , journal =

  11. [11]

    2022 , eprint =

    Model soups: averaging weights of multiple fine‐tuned models improves accuracy without increasing inference time , author =. 2022 , eprint =

  12. [12]

    2024 , journal =

    WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting , author =. 2024 , journal =

  13. [13]

    AEGIS 2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

    Ghosh, Shaona and Varshney, Prasoon and Sreedhar, Makesh Narsimhan and Padmakumar, Aishwarya and Rebedea, Traian and Varghese, Jibin Rajan and Parisien, Christopher. AEGIS 2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Com...

  14. [14]

    arXiv preprint arXiv:2401.10862 , year=

    Pruning for protection: Increasing jailbreak resistance in aligned llms without fine-tuning , author=. arXiv preprint arXiv:2401.10862 , year=

  15. [15]

    arXiv preprint arXiv:2404.03027 , year=

    Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks , author=. arXiv preprint arXiv:2404.03027 , year=

  16. [16]

    2024 , month =

    Gretel Synthetic Safety Alignment Dataset , author=. 2024 , month =

  17. [17]

    2024 , url=

    Physics Big , author=. 2024 , url=

  18. [18]

    2025 , url=

    IChO-IPhO-RL-v2-formated , author=. 2025 , url=

  19. [19]

    arXiv preprint arXiv:2309.11998 , year =

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author =. arXiv preprint arXiv:2309.11998 , year =

  20. [20]

    WildChat: 1M ChatGPT Interaction Logs in the Wild

    WildChat: 1M ChatGPT Interaction Logs in the Wild , author =. arXiv preprint arXiv:2405.01470 , year =

  21. [21]

    2024 , journal =

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. 2024 , journal =

  22. [22]

    W hen2 C all: When (not) to Call Tools

    Ross, Hayley and Mahabaleshwarkar, Ameya Sunil and Suhara, Yoshi. W hen2 C all: When (not) to Call Tools. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025

  23. [23]

    2024 , journal =

    ToolACE: Winning the Points of LLM Function Calling , author =. 2024 , journal =

  24. [24]

    2025 , journal =

    APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay , author =. 2025 , journal =

  25. [25]

    2023 , journal =

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author =. 2023 , journal =

  26. [26]

    Advances in Neural Information Processing Systems (NeurIPS) , series =

    Deep Reinforcement Learning from Human Preferences , author =. Advances in Neural Information Processing Systems (NeurIPS) , series =

  27. [27]

    2022 , journal =

    Training language models to follow instructions with human feedback , author =. 2022 , journal =

  28. [28]

    2412.15285 , archivePrefix=

    Steven Feng and Shrimai Prabhumoye and Kezhi Kong and Dan Su and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro , year=. 2412.15285 , archivePrefix=

  29. [29]

    arXiv preprint arXiv:2504.11409 , year=

    Efficient hybrid language model compression through group-aware ssm pruning , author=. arXiv preprint arXiv:2504.11409 , year=

  30. [30]

    Nemotron- CC : Transforming C ommon C rawl into a Refined Long-Horizon Pretraining Dataset

    Su, Dan and Kong, Kezhi and Lin, Ying and Jennings, Joseph and Norick, Brandon and Kliegl, Markus and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan. Nemotron- CC : Transforming C ommon C rawl into a Refined Long-Horizon Pretraining Dataset. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

  31. [31]

    FP8 Formats for Deep Learning

    Paulius Micikevicius and Dusan Stosic and Neil Burgess and Marius Cornea and Pradeep Dubey and Richard Grisenthwaite and Sangwon Ha and Alexander Heinecke and Patrick Judd and John Kamalu and Naveen Mellempudi and Stuart Oberman and Mohammad Shoeybi and Michael Siu and Hao Wu , year=. 2209.05433 , archivePrefix=

  32. [32]

    2024 , url=

    Jupinder Parmar and Shrimai Prabhumoye and Joseph Jennings and Mostofa Patwary and Sandeep Subramanian and Dan Su and Chen Zhu and Deepak Narayanan and Aastha Jhunjhunwala and Ayush Dattagupta and Vibhu Jawa and Jiwei Liu and Ameya Mahabaleshwarkar and Osvald Nitski and Annika Brundyn and James Maki and Miguel Martinez and Jiaxuan You and John Kamalu and ...

  33. [33]

    2406.11704 , archivePrefix=

    NVIDIA , year=. 2406.11704 , archivePrefix=

  34. [35]

    Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir Yitzhak and Bansal, Hritik and Guha, Etash and Keh, Sedrick Scott and Arora, Kushal and others , journal=

  35. [36]

    Advances in Neural Information Processing Systems , volume=

    Penedo, Guilherme and Kydl. Advances in Neural Information Processing Systems , volume=

  36. [37]

    Muennighoff, Niklas and Rush, Alexander and Barak, Boaz and Le Scao, Teven and Tazi, Nouamane and Piktus, Aleksandra and Pyysalo, Sampo and Wolf, Thomas and Raffel, Colin A , journal=

  37. [38]

    Maini, Pratyush and Seto, Skyler and Bai, He and Grangier, David and Zhang, Yizhe and Jaitly, Navdeep , booktitle=

  38. [39]

    2024 , eprint=

    Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation , author=. 2024 , eprint=

  39. [40]

    2022 , eprint=

    Language Models are Multilingual Chain-of-Thought Reasoners , author=. 2022 , eprint=

  40. [41]

    The Llama 3 Herd of Models

    Llama Team @ Meta , year=. 2407.21783 , archivePrefix=

  41. [42]

    Qwen2.5 Technical Report

    Qwen , year=. 2412.15115 , archivePrefix=

  42. [43]

    2407.14679 , archivePrefix=

    Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov , year=. 2407.14679 , archivePrefix=

  43. [44]

    2408.11796 , archivePrefix=

    Sharath Turuvekere Sreenivas and Saurav Muralidharan and Raviraj Joshi and Marcin Chochowski and Ameya Sunil Mahabaleshwarkar and Gerald Shen and Jiaqi Zeng and Zijia Chen and Yoshi Suhara and Shizhe Diao and Chenhan Yu and Wei-Chun Chen and Hayley Ross and Oluwatobi Olabiyi and Ashwath Aithal and Oleksii Kuchaiev and Daniel Korzekwa and Pavlo Molchanov a...

  44. [45]

    2411.19146 , archivePrefix=

    Akhiad Bercovich and Tomer Ronen and Talor Abramovich and Nir Ailon and Nave Assaf and Mohammad Dabbah and Ido Galil and Amnon Geifman and Yonatan Geifman and Izhak Golan and Netanel Haber and Ehud Karpas and Roi Koren and Itay Levy and Pavlo Molchanov and Shahar Mor and Zach Moshe and Najeeb Nabwani and Omri Puny and Ran Rubin and Itamar Schen and Ido Sh...

  45. [46]

    (2023b) in the survey

    Xin Men and Mingyu Xu and Qingyu Zhang and Bingning Wang and Hongyu Lin and Yaojie Lu and Xianpei Han and Weipeng Chen , year=. 2403.03853 , archivePrefix=

  46. [47]

    Ilia Karmanov and Amala Sanjay Deshmukh and Lukas Voegtle and Philipp Fischer and Kateryna Chumachenko and Timo Roman and Jarno Seppänen and Jupinder Parmar and Joseph Jennings and Andrew Tao and Karan Sapra. arXiv:2502.04223

  47. [48]

    OpenAI. 2025

  48. [49]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton and Oriol Vinyals and Jeff Dean. arXiv:1503.02531

  49. [50]

    Shengyang Sun and Yian Zhang and Alexander Bukharin and David Mosallanezhad and Jiaqi Zeng and Soumye Singhal and Gerald Shen and Adithya Renduchintala and Tugrul Konuk and Yi Dong and Zhilin Wang and Dmitry Chichkov and Olivier Delalleau and Oleksii Kuchaiev. arXiv:2502.00203

  50. [51]

    Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

    2025

  51. [52]

    Syeda Nahida Akter and Shrimai Prabhumoye and John Kamalu and Sanjeev Satheesh and Eric Nyberg and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro. arXiv:2410.12881

  52. [53]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt. arXiv:2103.03874

  53. [54]

    Ghosh, Shaona and Varshney, Prasoon and Sreedhar, Makesh Narsimhan and Padmakumar, Aishwarya and Rebedea, Traian and Varghese, Jibin Rajan and Parisien, Christopher

  54. [55]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman. arXiv:2110.14168

  55. [56]

    Zeyuan Allen-Zhu and Yuanzhi Li. arXiv:2309.14402

  56. [57]

    Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba. arXiv:2310.06786

  57. [59]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Za...

  58. [60]

    Rewriting pre-training data boosts LLM performance in math and code

    arXiv preprint arXiv:2505.02881

  59. [61]

    Attention Is All You Need

    Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin. arXiv:1706.03762

  60. [62]

    Dai, Wenliang and Lee, Nayeon and Wang, Boxin and Yang, Zhuolin and Liu, Zihan and Barker, Jon and Rintamaki, Tuomas and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei

  61. [63]

    Li, Zhiqi and Chen, Guo and Liu, Shilong and Wang, Shihao and VS, Vibashan and Ji, Yishen and Lan, Shiyi and Zhang, Hao and Zhao, Yilin and Radhakrishnan, Subhashree and others

  62. [64]

    Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae

  63. [65]

    Alayrac, Jean-Baptiste and Donahue, Jeff and Luc, Pauline and Miech, Antoine and Barr, Iain and Hasson, Yana and Lenc, Karel and Mensch, Arthur and Millican, Katherine and Reynolds, Malcolm and others

  64. [66]

    Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others

  65. [67]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár. arXiv:1405.0312

  66. [68]

    Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu

  67. [69]

    Ordonez, Vicente and Kulkarni, Girish and Berg, Tamara

  68. [70]

    Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven. 2022

  69. [71]

    Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi

  70. [72]

    Krishna, Ranjay and Zhu, Yuke and Groth, Oliver and Johnson, Justin and Hata, Kenji and Kravitz, Joshua and Chen, Stephanie and Kalantidis, Yannis and Li, Li-Jia and Shamma, David A and others. 2017

  71. [73]

    Kafle, Kushal and Price, Brian and Cohen, Scott and Kanan, Christopher

  72. [74]

    Marafioti, Andres and Laurencon, Hugo

  73. [75]

    Mishra, Anand and Shekhar, Shashank and Singh, Ajeet Kumar and Chakraborty, Anirban. 2019

  74. [76]

    COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

    Andreas Veit and Tomas Matera and Lukas Neumann and Jiri Matas and Serge Belongie. arXiv:1601.07140

  75. [77]

    Lindstr. arXiv preprint arXiv:2208.05358

  76. [78]

    Marino, Kenneth and Rastegari, Mohammad and Farhadi, Ali and Mottaghi, Roozbeh

  77. [79]

    Hudson, Drew A and Manning, Christopher D

  78. [80]

    Lu, Pan and Mishra, Swaroop and Xia, Tanglin and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin

  79. [81]

    Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen

  80. [82]

    Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng. International Conference on Learning Representations (ICLR)

Showing first 80 references.