super hub Mixed citations

gpt-oss-120b & gpt-oss-20b Model Card

Andy Applebaum, Edwin Arbus, Jason Ai, Lama Ahmad, OpenAI: Sandhini Agarwal, Sam Altman · 2025 · cs.CL · arXiv 2508.10925

Mixed citation behavior. Most common role is background (41%).

433 Pith papers citing it

Background 41% of classified citations

open full Pith review browse 433 citing papers more from Andy Applebaum arXiv PDF

abstract

We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 33 baseline 16 method 16 other 6 dataset 5

citation-polarity summary

background 31 baseline 16 use method 16 unclear 7 use dataset 5 support 1

claims ledger

abstract We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics,

authors

Andy Applebaum Edwin Arbus Jason Ai Lama Ahmad OpenAI: Sandhini Agarwal Sam Altman

co-cited works

representative citing papers

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

TW-LegalBench: Measuring Taiwanese Legal Understanding

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

MathAtlas: A Benchmark for Autoformalization in the Wild

cs.AI · 2026-05-13 · accept · novelty 8.0

MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.

Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.

Measuring the Gap Between Human and LLM Research Ideas

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Introduces GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas, applied to X post-2024 election showing amplification of toxic and right-leaning content varying by ideology.

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

cs.IR · 2026-06-29 · unverdicted · novelty 7.0

SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performanc

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

citing papers explorer

Showing 50 of 433 citing papers.

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching cs.AI · 2026-05-25 · unverdicted · none · ref 12 · internal anchor
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.
Inference Time Optimization with Confidence Dynamics cs.CL · 2026-05-24 · unverdicted · none · ref 1 · internal anchor
Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.
AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models cs.CL · 2026-05-23 · unverdicted · none · ref 42 · internal anchor
AstroMind is a new physics-grounded benchmark for LLM reasoning on spacecraft behavior across intent inference, maneuver estimation, and threat assessment, evaluated on several open-weight models.
PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets cs.LG · 2026-05-22 · unverdicted · none · ref 12 · internal anchor
PrivFusion deploys agents to cluster semantically similar features and iteratively recommend transformations for harmonizing heterogeneous structured datasets in a privacy-preserving manner, evaluated on four COVID-19 datasets.
SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate cs.MA · 2026-05-21 · unverdicted · none · ref 29 · internal anchor
SVR-MAD treats pre-debate signals as priors and debate results as evidence to build a sparser communication graph, cutting token use by up to 61% while preserving or raising accuracy over prior MAD methods.
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild cs.CV · 2026-05-21 · unverdicted · none · ref 51 · 2 links · internal anchor
AnyMo pre-trains a graph encoder on physics-simulated multi-placement IMU data and aligns full-body motion tokens with LLMs to enable zero-shot activity recognition, retrieval, and captioning across unseen datasets and setups.
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering? cs.SE · 2026-05-21 · unverdicted · none · ref 90 · internal anchor
SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.
Manifold-Guided Attention Steering cs.LG · 2026-05-20 · unverdicted · none · ref 13 · internal anchor
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.
LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation cs.CL · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
LP-Eval is a new expert-co-designed rubric and annotated dataset showing that LLMs mostly produce well-formed legal propositions from EU court decisions, with higher expert-rated quality for established cases and improved LLM-as-judge alignment when using the rubric.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond cs.LG · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
Retrieval-Augmented Linguistic Calibration cs.CL · 2026-05-19 · unverdicted · none · ref 37 · 2 links · internal anchor
Presents a distributional model of linguistic confidence, Faithfulness Divergence metric, and RALC pipeline that boosts faithfulness and calibration on QA benchmarks across LLM families.
Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection cs.CL · 2026-05-19 · unverdicted · none · ref 26 · internal anchor
LONSREX introduces a metric-based pipeline to identify necessary and sufficient rationales when creating training data for fine-tuning LLMs on explainable misinformation detection, addressing limitations of naive label-based filtering.
SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 1 · internal anchor
SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.
From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG cs.CL · 2026-05-18 · unverdicted · none · ref 4 · 2 links · internal anchor
EPIC constructs a preference-aligned index for on-device RAG that reduces memory 2404x, cuts retrieval latency 32x, and raises preference-following accuracy 18.79pp across four benchmarks while fitting under 1MB on real devices.
TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction cs.AI · 2026-05-18 · unverdicted · none · ref 32 · internal anchor
TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unverdicted · none · ref 120 · 2 links · internal anchor
Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.
Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models cs.CL · 2026-05-17 · unverdicted · none · ref 89 · internal anchor
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback cs.GR · 2026-05-17 · unverdicted · none · ref 1 · 2 links · internal anchor
CAD generation agents are augmented with FEA feedback plus text blueprint and 21-view image signals, raising Box-IoU on S2O and Fusion360 while showing that base models produce no strict-passing FEA artifacts.
BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering cs.CL · 2026-05-17 · unverdicted · none · ref 46 · internal anchor
BELIEF improves closed-set biomedical QA by converting documents to structured evidence objects and fusing D-S symbolic belief estimation with LLM inference through reliability-aware arbitration.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 18 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
Capturing LLM Capabilities via Evidence-Calibrated Query Clustering cs.AI · 2026-05-16 · unverdicted · none · ref 1 · 2 links · internal anchor
ECC calibrates semantic embeddings with model comparisons via Bradley-Terry profiles and mixture weights to cluster queries by latent LLM capabilities, claiming 17-18 point gains in ranking quality over baselines.
JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR cs.CL · 2026-05-16 · unverdicted · none · ref 14 · internal anchor
JSPG jointly combines semantic, pinyin, and glyph retrieval with an extended Smith-Waterman algorithm to dynamically filter keyword dictionaries and improve accuracy in Chinese contextual ASR.
Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates cs.LG · 2026-05-15 · unverdicted · none · ref 12 · internal anchor
A MEMIT-style knowledge editing framework for MoE LLMs that formulates per-expert updates via tensor structure and applies Woodbury identity for low-rank inversions, achieving up to 6x speedup with comparable editing quality.
RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification cs.IR · 2026-05-15 · unverdicted · none · ref 18 · internal anchor
RAPT improves multi-label label-set selection by retrieving similar documents and locally aggregating their thresholding outcomes to adapt per-instance cutoffs.
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
Asking Back: Interaction-Layer Antidistillation Watermarks cs.CR · 2026-05-15 · unverdicted · none · ref 24 · internal anchor
Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.
An Interpretable Latency Model for Speculative Decoding in LLM Serving cs.LG · 2026-05-14 · unverdicted · none · ref 1 · internal anchor
The paper presents an interpretable latency model for speculative decoding that infers effective batch size via Little's Law and decomposes demand to predict and explain performance across serving loads, validated on vLLM measurements.
Probing Privacy Leaks in LLM-based Code Generation via Test Generation cs.SE · 2026-05-14 · unverdicted · none · ref 48 · internal anchor
A test-driven pipeline with an auto-constructed privacy feature library detects 2.56 times more confirmed privacy leaks in LLM-based code generation than existing baselines.
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution cs.CL · 2026-05-13 · unverdicted · none · ref 32 · internal anchor
CANTANTE uses contrastive rollouts to attribute system rewards to individual agents, enabling better prompt optimization than prior methods on programming, math, and QA benchmarks.
IfcLLM: Natural Language Querying of IFC Models through Complementary Relational and Graph Representations cs.CL · 2026-05-13 · unverdicted · none · ref 66 · 2 links · internal anchor
IfcLLM combines relational and graph representations of IFC models with an LLM agent to achieve 93.3-100% first-attempt accuracy on natural language queries across three models and 30 scenarios.
ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery cs.LG · 2026-05-12 · unverdicted · none · ref 32 · 2 links · internal anchor
ToolMol integrates evolutionary algorithms with agentic LLMs and precise RDKit tools to optimize multi-objective drug properties, yielding ligands with over 10% better predicted binding affinity and 35% gains in absolute binding free energy on three protein targets.
Task-Adaptive Embedding Refinement via Test-time LLM Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 34 · internal anchor
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 29 · internal anchor
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models cs.RO · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
EvoNav automates the design of reward functions for RL robot navigation by evolving LLM proposals through a three-stage cheap-to-expensive evaluation process and claims better policies than hand-crafted or prior automated rewards.
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation cs.LG · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks cs.LG · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding cs.AI · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only baselines.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism cs.LG · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
Test-Time Speculation cs.CL · 2026-05-10 · unverdicted · none · ref 16 · 2 links · internal anchor
TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 1 · 2 links · internal anchor
OTora is a two-stage framework that generates insertion-aware adversarial triggers and ICL-guided genetic payloads to induce reasoning-level denial-of-service in tool-augmented LLM agents across multiple backbones while preserving task correctness.
What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook cs.SE · 2026-05-08 · unverdicted · none · ref 11 · 3 links · internal anchor
Empirical analysis of 4707 MoltBook posts shows AI-only technical discourse focuses on security, trust, and abstract topics while lacking concrete runtime and project details found in human GitHub discussions.
Sanity Checks for Long-Form Hallucination Detection cs.CL · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.
LARAG: Link-Aware Retrieval Strategy for RAG Systems in Hyperlinked Technical Documentation cs.IR · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
LARAG improves RAG answer quality on hyperlinked technical documentation by using author-defined links for retrieval, achieving higher BERTScore while using fewer chunks and tokens than standard embedding-based RAG.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts cs.CL · 2026-05-08 · conditional · none · ref 30 · internal anchor
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
XtraMAC: An Efficient MAC Architecture for Mixed-Precision LLM Inference on FPGA cs.AR · 2026-05-07 · conditional · none · ref 30 · internal anchor
XtraMAC unifies mixed-precision MAC on FPGA via shared integer mantissa products, delivering 1.4-2.0x higher compute density and up to 1.9x better energy efficiency.
Evaluation Awareness in Language Models Has Limited Effect on Behaviour cs.CL · 2026-05-07 · conditional · none · ref 21 · internal anchor
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs cs.CL · 2026-05-07 · unverdicted · none · ref 60 · internal anchor
Four axioms (Causality, Minimality, Separability, Stability) are formalized for latent thought representations; audits of open LLMs on 23 tasks show none satisfy all four and representations add little beyond input embeddings.
Reproducing Complex Set-Compositional Information Retrieval cs.CL · 2026-05-05 · unverdicted · none · ref 14 · internal anchor
Neural retrievers that double BM25 performance on QUEST collapse below 0.02 Recall@100 on the new LIMIT+ benchmark while lexical methods reach 0.96, with all methods degrading as compositional depth increases.

gpt-oss-120b & gpt-oss-20b Model Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer