super hub Mixed citations

gpt-oss-120b & gpt-oss-20b Model Card

Andy Applebaum, Edwin Arbus, Jason Ai, Lama Ahmad, OpenAI: Sandhini Agarwal, Sam Altman · 2025 · cs.CL · arXiv 2508.10925

Mixed citation behavior. Most common role is background (41%).

410 Pith papers citing it

Background 41% of classified citations

open full Pith review browse 410 citing papers more from Andy Applebaum arXiv PDF

abstract

We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 33 baseline 16 method 16 other 6 dataset 5

citation-polarity summary

background 31 baseline 16 use method 16 unclear 7 use dataset 5 support 1

claims ledger

abstract We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics,

authors

Andy Applebaum Edwin Arbus Jason Ai Lama Ahmad OpenAI: Sandhini Agarwal Sam Altman

co-cited works

representative citing papers

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

MathAtlas: A Benchmark for Autoformalization in the Wild

cs.AI · 2026-05-13 · accept · novelty 8.0

MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.

Measuring the Gap Between Human and LLM Research Ideas

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Introduces GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas, applied to X post-2024 election showing amplification of toxic and right-leaning content varying by ideology.

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

cs.IR · 2026-06-29 · unverdicted · novelty 7.0

SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performanc

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.

Do Thinking Tokens Help with Safety?

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

LEDGER provides a corpus of 4,999 annual reports with 31 labeled KPIs and three benchmarks for page-level retrieval, needle-in-haystack lookup, and full KPI extraction from long documents.

CodeAlchemy: Synthetic Code Rewriting at Scale

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

CodeAlchemy generates 850B+ tokens of synthetic code data across 15 languages via five strategies and enables 3B models to reach 83.5% HumanEval while beating 10x larger frontier models on new DevEval and TraceEval benchmarks.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation cs.DC · 2026-04-11 · unverdicted · none · ref 47 · internal anchor
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 18 · 2 links · internal anchor
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs cs.CL · 2026-05-08 · unverdicted · none · ref 14 · internal anchor
CMR-EXTR extracts structured data from CMR reports at 99.65% variable-level accuracy using teacher-student LLM distillation and three-principle uncertainty estimation for quality control.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism cs.LG · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge cs.AI · 2026-04-04 · unverdicted · none · ref 12 · internal anchor
Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.

gpt-oss-120b & gpt-oss-20b Model Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer