Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning

Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. · 2025 · arXiv 2512.20848

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · dataset: 1 · method: 1

citation-polarity summary

background: 1 · unclear: 1 · use dataset: 1 · use method: 1

representative citing papers

TW-LegalBench: Measuring Taiwanese Legal Understanding

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.

A Verifiable Search Is Not a Learnable Chain-of-Thought

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Verifiable search procedures cannot be learned as forward chain-of-thought by language models; they instead learn memorization, verification, or require precomputed catalogs.

Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

Can an MLP Absorb Its Own Skip Connection?

cs.LG · 2026-04-26 · accept · novelty 7.0

Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

cs.AI · 2026-04-26 · accept · novelty 7.0

MetaGAI is a new large-scale benchmark for automated model and data card generation, constructed via semantic triangulation and multi-agent agents with human-in-the-loop verification.

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

The log-probability ratio from RL post-training recovers the optimal advantage function, providing an effective free signal for test-time scaling, uncertainty estimation, and failure attribution in LLM agents.

A-Evolve-Training: Autonomous Post-Training of a 30B Model

cs.AI · 2026-06-09 · unverdicted · novelty 6.0

An autonomous post-training system for a 30B model achieves near-top human performance on a reasoning leaderboard and revises its search policy after detecting that its dev metric had become misleading.

End-to-End Context Compression at Scale

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

cs.SE · 2026-05-28 · unverdicted · novelty 6.0

RePoT recovers from PoT failures via deterministic verified replay and checkpoint repair, yielding +3 to +11pp gains on planning benchmarks and showing checkpoint state as the key recovery signal over error-only feedback.

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

DODOCO measurements show MoE routing imbalance is intrinsic to architecture and real text, not correctable by EP scaling or represented by mock tokens, forming two persistent Gini bands.

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

Normalized Architectures are Natively 4-Bit

cs.LG · 2026-05-07 · conditional · novelty 6.0

nGPT's hypersphere constraint makes dot-product signal accumulate constructively under 4-bit quantization while noise averages out, enabling native low-precision training.

EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling private use.

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

cs.CL · 2026-06-04 · unverdicted · novelty 5.0

VSRAQ is a MoE-specific quantization objective that combines value and structure alignment to preserve expert-selection behavior and reduce quality loss without inference overhead.

SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

cs.LG · 2026-04-24 · unverdicted · novelty 5.0

SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

Nemotron 3 Super is an open 120B hybrid Mamba-Attention MoE model with new LatentMoE architecture and MTP layers that matches accuracy of similar models while delivering up to 7.5x higher inference throughput.

A Model Context Protocol Server for Quantum Execution in Hybrid Quantum-HPC Environments

quant-ph · 2026-04-09 · unverdicted · novelty 5.0

An MCP server framework lets LLM agents run quantum primitives like sampling and expectation value computation on hybrid platforms by interpreting prompts and invoking tools for OpenQASM and CUDA-Q.

On Subquadratic Architectures: From Applications to Principles

cs.LG · 2026-06-10 · unverdicted · novelty 4.0

xLSTM outperforms Mamba-2 and Gated DeltaNet on tasks with complex dependencies because its gating scheme enables more flexible and stable state tracking and memory accumulation.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Can an MLP Absorb Its Own Skip Connection?

cs.LG · 2026-04-26 · accept · none · ref 7

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

cs.AI · 2026-04-26 · accept · none · ref 1