Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
gpt-oss-120b & gpt-oss-20b Model Card
Mixed citation behavior. Most common role is background (41%).
abstract
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
representative citing papers
TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
Introduces GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas, applied to X post-2024 election showing amplification of toxic and right-leaning content varying by ideology.
SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performance
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
citing papers explorer
-
Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
-
Exploration and Exploitation Errors Are Measurable for Language Model Agents
A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.
-
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.
-
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
-
Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models
PPT-Bench measures how LLMs change answers under epistemic, value, authority, and identity pressures at baseline, single-turn, and multi-turn levels, finding separable inconsistency patterns across five models.
-
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
-
InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
-
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
-
Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system
Contextual language embeddings exhibit a robust 5/3 power-law spectrum in token-sequence fluctuations, analogous to Kolmogorov turbulence.
-
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
-
LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
LiveFact is a new time-aware benchmark that evaluates LLMs on reasoning with dynamic and incomplete information for fake news detection, identifying a significant reasoning gap in model behavior.
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
The limits of bio-molecular modeling with large language models: a cross-scale evaluation
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning
Learning in Blocks uses heterogeneous multi-agent debate to score CEFR-aligned conversational competence, enforce 70% mastery progression, and deliver spaced review, yielding better outcomes than feedback alone in an 8-week study of 180 A2 learners.
-
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.
-
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
-
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
-
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
-
Learning to Discover at Test Time
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
-
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Terminal-Bench 2.0 is a new benchmark of 89 realistic terminal tasks on which frontier AI agents score below 65%.
-
Exposía: Teaching and Assessment of Academic Writing Skills for Research Project Proposals and Peer Feedback
Exposía is the first public dataset linking student proposals, peer feedback, and pedagogically-grounded assessments, used to benchmark LLMs on scoring academic writing and reviews.
-
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
A new benchmark of 40 scenarios finds state-of-the-art LLMs exhibit outcome-driven constraint violations in 0-62.8% of cases under KPI pressure, with no consistent safety gains across model generations.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins
StructBioReasoner is a scalable multi-agent system that designs IDP-targeting biologics, with over 50% of 787 candidates for Der f 21 showing better binding free energy than human-designed references.
-
Cognitive Alpha Mining via LLM-Driven Code-Based Evolution
CogAlpha combines LLM reasoning with code-level evolutionary search to discover financial alphas that show higher predictive accuracy and generalization than prior methods on five stock datasets.
-
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
-
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
-
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
-
Efficient numeracy in language models through single-token number embeddings
BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.
-
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
-
Program-as-Weights: A Programming Paradigm for Fuzzy Functions
A 4B compiler model generates LoRA adapters from natural-language specs, enabling a frozen 0.6B interpreter to match Qwen3-32B performance on fuzzy text tasks at 50× less memory.
-
Addressing Over-Refusal in LLMs with Competing Rewards
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
-
Benchmarking Large Language Models on Floating-Point Error Classification
Introduces InterFLOPBench benchmark and evaluates 14 LLMs on multi-label classification of six floating-point error categories in C code, with top models exceeding 0.88 overall F1 but lower scores on subtle errors like underflow.
-
Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG
The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.
-
PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception
PerceptionRubrics introduces a gated rubric evaluation framework for multimodal models that exposes reliability gaps and shows an 8% perception deficit between open and closed models.
-
ToxiREX: A Dataset on Toxic REasoning in ConteXt
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
-
EconSimulacra: A Digital Twin Platform of Socio-Economic Systems Powered by LLM Agents
EconSimulacra is a multi-agent LLM simulator that couples economy, mobility, and social networks through shared internal states to reproduce nonlinear relationships between online attention and offline popularity.
-
RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning
RADIANT-PET improves PET/CT lesion segmentation accuracy by layering LLM-based adjudication and RL optimization on top of a high-sensitivity voxel segmentation stage, with largest gains when radiology reports are available.
-
Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts
Creates TF-RefusalBench to quantify over-alignment in LLMs on criminal-law tasks across four languages and shows abliteration mitigates refusals with little performance loss.
-
Toward Open-Set Speaker Attribute Prediction with Keyword-Appended LLM Embeddings
Proposes keyword-appended LLM embeddings plus top-k negative loss for open-set speaker attribute prediction that outperforms closed-set baselines on LibriTTS-P and generalizes to unseen synonyms.
-
Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers
LLMs fall for deceptive traps at higher rates than humans, lack the human attention-diversion effect, and exploit traps 73.4% of the time even after recognizing them in reasoning.
-
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Safety-aligned LLMs treat benign and harmful compliance demonstrations differently in in-context learning, with preference optimization preventing benign examples from increasing harmful compliance and strong recency bias in ordering.
-
FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs
FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.
-
When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval
An LLM agent self-evolves a set of query-rewriting rules that raise BM25 performance on the LeCaRD-v2 legal retrieval benchmark above human-designed and greedy baselines.
-
From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent
ProReviewer is an MDP-formulated proactive peer review agent trained with SFT and RL on an 8B model that outperforms larger frontier LLMs on review quality metrics.
-
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
ReSET mitigates accuracy degradation in NVFP4-quantized reasoning models via step-aware entropy-based temperature scaling and provides a small-M CUDA kernel for up to 2.5x kernel speedup and 2x end-to-end speedup.