Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
gpt-oss-120b & gpt-oss-20b Model Card
Mixed citation behavior. Most common role is background (41%).
abstract
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
representative citing papers
TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
Introduces a GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas; applied to X after the 2024 election, it shows amplification of toxic and right-leaning content that varies by ideology.
SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performance.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
citing papers explorer
-
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
ReSET mitigates accuracy degradation in NVFP4-quantized reasoning models via step-aware entropy-based temperature scaling and provides a small-M CUDA kernel for up to 2.5x kernel speedup and 2x end-to-end speedup.
-
Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
STG generates deterministic testbenches 720x faster than iterative LLM flows with higher coverage and fewer false passes, while serving as an 11x faster data curation engine with 127x less energy.
-
Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
Grammar-constrained decoding enables a new jailbreak (CodeSpear) on LLMs for malicious code, countered by CodeShield which trains models to output harmless honeypot code under GCD while preserving refusals.
-
Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
-
Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals
DAC decomposes agentic search into cooperative searcher and generator agents with cross-agent signals (abstention reward and hard-positive augmentation), achieving strong QA benchmark performance via LoRA on a shared backbone.
-
Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval
ParaEval reduces false performance gaps in MCQA benchmarks from over 2 points to below 1 point by scoring models on multiple paraphrases per answer option instead of single surface forms.
-
End-to-End Context Compression at Scale
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.
-
Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries
MO-PQUCB hybrid algorithm integrates proactive conversational queries with bandit feedback via shift-invariant regularization to achieve improved regret bounds in personalized multi-objective bandits.
-
TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation
TLRD distills tri-level rationales (instance features, dataset distributions, neighbor comparisons) from a teacher into student LLMs to close the accuracy gap with tree ensembles on tabular data while generating grounded explanations.
-
Sparsely gated tiny linear experts
Sgatlin replaces transformer FF layers with sparse single linear neurons, improving perplexity across compute budgets and enabling direct interpretation of semantically clustered circuits for factual recall.
-
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models
MLingualFC benchmark finds flowchart jailbreaks succeed at high rates for Latin-script languages but much lower rates for Punjabi in multilingual VLMs, pointing to language-dependent safety gaps.
-
RECAP: Regression Evaluation for Continual Adaptation of Prompts
RECAP benchmark finds that six prompt optimization methods show no significant performance gains under proactive continual adaptation to evolving constraints across four LLMs.
-
The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized Finetuning (TReFT) mitigates it by regularizing prefix token representations.
-
How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures
LLM reasoning failures split into committed (early lock-in) and persistent-uncertainty modes with distinct token-level signatures that hold across 23 model-dataset pairs in 20 of 23 falsifiable tests.
-
Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents
Vortex provides a programmable frontend and backend for sparse attention in LLM serving, delivering up to 3.46x throughput over full attention while preserving accuracy.
-
Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation
Post-hoc model-based compression of reasoning traces cuts training tokens to 12-30% and speeds training 2-7.6x while retaining up to 96% of raw-trace accuracy, though raw traces remain superior at every scale.
-
Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting
Recall-based prompting (Self-Recall and Question-Recall) outperforms direct-answer and chain-of-thought methods on knowledge cutoff benchmarks, including a new multi-cutoff historical events benchmark.
-
SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation
SHIELDS deploys multi-agent LLMs for iterative, feedback-driven OS hardening and reports up to 73% remediation of scan findings, with success tied more to tool use than model size.
-
Noisy memory encoding explains negative polarity illusions
Noisy memory encoding of determiners explains negative polarity illusions, with new acceptability experiments showing stronger illusions for similar determiner pairs.
-
Expert-Aware Refusal Steering
Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.
-
Consistency Training Can Entrench Misalignment
Consistency training suppresses reward hacking and emergent misalignment but amplifies sycophancy in controlled model organisms, driven by labeling-induced distribution shifts rather than selection operators.
-
LiveBand: Live Accompaniment Generation in the Audio Domain
LiveBand generates high-fidelity music accompaniments to live audio in real time via a causal transformer in audio latent space trained with adversarial sequence-level supervision.
-
GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations
GLINT introduces sparsely gated alignment and dense feature regularization on top of DINOv3 and V-JEPA encoders to enable query-specific zero-shot grounding and segmentation in 2D CXR and 3D CT.
-
KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators
KForge uses dual LLM agents for cross-platform kernel generation, reporting 2.12% throughput gain on NVIDIA B200 vs TensorRT-LLM and 5.13x geometric mean speedup on Intel Arc B580 vs PyTorch on 37 workloads.
-
The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models
Epi-LLM integrates LLMs as agents in ABM epidemic simulations, finding reduced peak infections, 58-65% quarantine compliance, and perceived severity as top predictor with pseudo-R² 0.055 comparable to human data.
-
Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection
Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, including in never-smokers.
-
POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
POIROT protocol repurposes agents in LLM multi-agent systems as an internal diagnostic layer for failure detection, outperforming single-LLM evaluators with gains that increase with complexity, agent count, and fault types.
-
DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding
DFlare replaces DFlash's shared fused representation with per-draft-layer attention to distinct target-layer combinations, enabling deeper drafts and 2.4M training samples for 5-11% higher speedups than DFlash on Qwen3 and GPT-OSS models.
-
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
-
DeSQ: Decomposition-based SPARQL Query Generation
DeSQ decomposes questions into atomic constraints, maps them to SPARQL fragments with placeholders, grounds the placeholders, and assembles complete queries, outperforming prior methods on four of five benchmarks.
-
GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.
-
Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm
CYKNN encodes the CYK algorithm in a recurrent neural network and outperforms large LLMs on parsing a very simple context-free grammar.
-
Eigenvectors of Experts are Training-free Non-collapsing Routers
SSMoE uses eigenvectors of expert weights via SVD to build training-free non-collapsing routers for SMoE models in language and vision tasks.
-
Automatically Attacking Software Reverse Engineering AI Agents
Genetic algorithm prompt generation enables prompt injection into binaries via string assignments to fool LLM-powered decompilers and disassemblers.
-
EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs
EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.
-
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
Harness-updating capability is flat across base model capabilities while harness-benefit is non-monotonic, peaking at mid-tier models in self-evolving LLM agents.
-
REPOT: Recoverable Program-of-Thought via Checkpoint Repair
RePoT recovers from PoT failures via deterministic verified replay and checkpoint repair, yielding +3 to +11pp gains on planning benchmarks and showing checkpoint state as the key recovery signal over error-only feedback.
-
EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation
EvoRubric is a single-policy RL method that co-evolves a reasoner and a rubric generator with multi-level verification to produce dynamic rewards for open-ended LLM alignment.
-
ReasonOps: Operator Segmentation for LLM Reasoning Traces
Unsupervised clustering on sentence-initial 3-token pivots extracts 7 universal reasoning operators from 44k traces across 12 LLMs that enable model fingerprinting and answer-correctness prediction.
-
HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains
HardMTBench is a difficulty-aware benchmark of 20,000 directional test items across 12 domains that widens GEMBA score ranges by a factor of two and reveals domain-specific weaknesses in 22 MT systems.
-
HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment
HELEA creates hard-negative benchmarks (DW-HN29K, DY-HN27K) where name-overlap baselines fail and reports F1 0.967 on the new sets while preserving strong standard-benchmark scores via encoder retrieval plus untrained LLM reranking.
-
Pruning and Distilling Mixture-of-Experts into Dense Language Models
A systematic MoE-to-dense conversion via expert scoring, grouping, and distillation yields +6.3 pp average accuracy over dense-to-dense pruning at matched parameter count on tested models.
-
Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts
Aggressive expert pruning in MoE LLMs extracts compact translation specialists that retain near-baseline quality after removing up to 75% of experts (or 90% with short SFT).
-
Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.
-
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.
-
An Efficient and Privacy-Preserving Architecture for Cross-Institutional Collaborative RAG
FedRAG uses a Scrambled Distributed Attention protocol with feature scrambling and token permutation to enable high-throughput, privacy-preserving federated RAG without special hardware or retraining.
-
Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.
-
Inference Time Optimization with Confidence Dynamics
Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.
-
AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models
AstroMind is a new physics-grounded benchmark for LLM reasoning on spacecraft behavior across intent inference, maneuver estimation, and threat assessment, evaluated on several open-weight models.
-
PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets
PrivFusion deploys agents to cluster semantically similar features and iteratively recommend transformations for harmonizing heterogeneous structured datasets in a privacy-preserving manner, evaluated on four COVID-19 datasets.