gpt-oss-120b & gpt-oss-20b Model Card

OpenAI: Andy Applebaum, Edwin Arbus, Jason Ai, Lama Ahmad, Sandhini Agarwal, Sam Altman · 2025 · cs.CL · arXiv 2508.10925

Mixed citation behavior. Most common role is background (41%).

428 Pith papers citing it

abstract

We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

citation-role summary

background 33 · baseline 16 · method 16 · other 6 · dataset 5

citation-polarity summary

background 31 · baseline 16 · use method 16 · unclear 7 · use dataset 5 · support 1

representative citing papers

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

TW-LegalBench: Measuring Taiwanese Legal Understanding

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

MathAtlas: A Benchmark for Autoformalization in the Wild

cs.AI · 2026-05-13 · accept · novelty 8.0

MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.

Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.

Measuring the Gap Between Human and LLM Research Ideas

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Introduces GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas, applied to X post-2024 election showing amplification of toxic and right-leaning content varying by ideology.

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

cs.IR · 2026-06-29 · unverdicted · novelty 7.0

SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performance.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

citing papers explorer

Showing 50 of 428 citing papers.

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

cs.LG · 2026-06-11 · unverdicted · none · ref 25 · internal anchor

ReSET mitigates accuracy degradation in NVFP4-quantized reasoning models via step-aware entropy-based temperature scaling and provides a small-M CUDA kernel for up to 2.5x kernel speedup and 2x end-to-end speedup.

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

cs.AI · 2026-06-11 · unverdicted · none · ref 13 · internal anchor

STG generates deterministic testbenches 720x faster than iterative LLM flows with higher coverage and fewer false passes, while serving as an 11x faster data curation engine with 127x less energy.

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

cs.CR · 2026-06-10 · unverdicted · none · ref 49 · internal anchor

Grammar-constrained decoding enables a new jailbreak (CodeSpear) on LLMs for malicious code, countered by CodeShield which trains models to output harmless honeypot code under GCD while preserving refusals.

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

cs.AI · 2026-06-10 · unverdicted · none · ref 37 · internal anchor

Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

cs.LG · 2026-06-09 · unverdicted · none · ref 9 · internal anchor

DAC decomposes agentic search into cooperative searcher and generator agents with cross-agent signals (abstention reward and hard-positive augmentation), achieving strong QA benchmark performance via LoRA on a shared backbone.

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

cs.CL · 2026-06-09 · unverdicted · none · ref 11 · internal anchor

ParaEval reduces false performance gaps in MCQA benchmarks from over 2 points to below 1 point by scoring models on multiple paraphrases per answer option instead of single surface forms.

End-to-End Context Compression at Scale

cs.CL · 2026-06-08 · unverdicted · none · ref 2 · internal anchor

LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.

Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

cs.LG · 2026-06-07 · unverdicted · none · ref 84 · internal anchor

MO-PQUCB hybrid algorithm integrates proactive conversational queries with bandit feedback via shift-invariant regularization to achieve improved regret bounds in personalized multi-objective bandits.

TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation

cs.CL · 2026-06-06 · unverdicted · none · ref 36 · internal anchor

TLRD distills tri-level rationales (instance features, dataset distributions, neighbor comparisons) from a teacher into student LLMs to close the accuracy gap with tree ensembles on tabular data while generating grounded explanations.

Sparsely gated tiny linear experts

cs.LG · 2026-06-05 · unverdicted · none · ref 6 · internal anchor

Sgatlin replaces transformer FF layers with sparse single linear neurons, improving perplexity across compute budgets and enabling direct interpretation of semantically clustered circuits for factual recall.

MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models

cs.CR · 2026-06-05 · unverdicted · none · ref 27 · internal anchor

MLingualFC benchmark finds flowchart jailbreaks succeed at high rates for Latin-script languages but much lower rates for Punjabi in multilingual VLMs, pointing to language-dependent safety gaps.

RECAP: Regression Evaluation for Continual Adaptation of Prompts

cs.LG · 2026-06-04 · unverdicted · none · ref 37 · internal anchor

RECAP benchmark finds that six prompt optimization methods show no significant performance gains under proactive continual adaptation to evolving constraints across four LLMs.

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

cs.CL · 2026-06-04 · unverdicted · none · ref 1 · internal anchor

The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized Finetuning (TReFT) mitigates it by regularizing prefix token representations.

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

cs.CL · 2026-06-04 · unverdicted · none · ref 30 · internal anchor

LLM reasoning failures split into committed (early lock-in) and persistent-uncertainty modes with distinct token-level signatures that hold across 23 model-dataset pairs in 20 of 23 falsifiable tests.

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

cs.AI · 2026-06-04 · unverdicted · none · ref 2 · internal anchor

Vortex provides a programmable frontend and backend for sparse attention in LLM serving, delivering up to 3.46x throughput over full attention while preserving accuracy.

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

cs.LG · 2026-06-04 · unverdicted · none · ref 26 · internal anchor

Post-hoc model-based compression of reasoning traces cuts training tokens to 12-30% and speeds training 2-7.6x while retaining up to 96% of raw-trace accuracy, though raw traces remain superior at every scale.

Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting

cs.CL · 2026-06-04 · unverdicted · none · ref 1 · internal anchor

Recall-based prompting (Self-Recall and Question-Recall) outperforms direct-answer and chain-of-thought methods on knowledge cutoff benchmarks, including a new multi-cutoff historical events benchmark.

SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation

cs.CR · 2026-06-03 · unverdicted · none · ref 23 · internal anchor

SHIELDS deploys multi-agent LLMs for iterative, feedback-driven OS hardening and reports up to 73% remediation of scan findings, with success tied more to tool use than model size.

Noisy memory encoding explains negative polarity illusions

cs.CL · 2026-06-03 · unverdicted · none · ref 74 · internal anchor

Noisy memory encoding of determiners explains negative polarity illusions, with new acceptability experiments showing stronger illusions for similar determiner pairs.

Expert-Aware Refusal Steering

cs.CL · 2026-06-02 · unverdicted · none · ref 7 · internal anchor

Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.

Consistency Training Can Entrench Misalignment

cs.CL · 2026-06-02 · unverdicted · none · ref 3 · internal anchor

Consistency training suppresses reward hacking and emergent misalignment but amplifies sycophancy in controlled model organisms, driven by labeling-induced distribution shifts rather than selection operators.

LiveBand: Live Accompaniment Generation in the Audio Domain

cs.SD · 2026-06-02 · unverdicted · none · ref 42 · internal anchor

LiveBand generates high-fidelity music accompaniments to live audio in real time via a causal transformer in audio latent space trained with adversarial sequence-level supervision.

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

cs.CV · 2026-06-02 · unverdicted · none · ref 2 · internal anchor

GLINT introduces sparsely gated alignment and dense feature regularization on top of DINOv3 and V-JEPA encoders to enable query-specific zero-shot grounding and segmentation in 2D CXR and 3D CT.

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

cs.LG · 2026-06-01 · unverdicted · none · ref 18 · internal anchor

KForge uses dual LLM agents for cross-platform kernel generation, reporting 2.12% throughput gain on NVIDIA B200 vs TensorRT-LLM and 5.13x geometric mean speedup on Intel Arc B580 vs PyTorch on 37 workloads.

The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models

cs.MA · 2026-06-01 · unverdicted · none · ref 33 · internal anchor

Epi-LLM integrates LLMs as agents in ABM epidemic simulations, finding reduced peak infections, 58-65% quarantine compliance, and perceived severity as top predictor with pseudo-R² 0.055 comparable to human data.

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

cs.AI · 2026-06-01 · unverdicted · none · ref 66 · internal anchor

Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, including in never-smokers.

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

cs.AI · 2026-06-01 · unverdicted · none · ref 34 · internal anchor

POIROT protocol repurposes agents in LLM multi-agent systems as an internal diagnostic layer for failure detection, outperforming single-LLM evaluators with gains that increase with complexity, agent count, and fault types.

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

cs.CL · 2026-06-01 · unverdicted · none · ref 48 · internal anchor

DFlare replaces DFlash's shared fused representation with per-draft-layer attention to distinct target-layer combinations, enabling deeper drafts and 2.4M training samples for 5-11% higher speedups than DFlash on Qwen3 and GPT-OSS models.

Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization

cs.CL · 2026-05-31 · unverdicted · none · ref 4 · internal anchor

Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.

DeSQ: Decomposition-based SPARQL Query Generation

cs.CL · 2026-05-29 · unverdicted · none · ref 2 · internal anchor

DeSQ decomposes questions into atomic constraints, maps them to SPARQL fragments with placeholders, grounds the placeholders, and assembles complete queries, outperforming prior methods on four of five benchmarks.

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

cs.LG · 2026-05-29 · unverdicted · none · ref 26 · internal anchor

LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.

Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm

cs.CL · 2026-05-29 · unverdicted · none · ref 2 · internal anchor

CYKNN encodes the CYK algorithm in a recurrent neural network and outperforms large LLMs on parsing a very simple context-free grammar.

Eigenvectors of Experts are Training-free Non-collapsing Routers

cs.LG · 2026-05-29 · unverdicted · none · ref 9 · internal anchor

SSMoE uses eigenvectors of expert weights via SVD to build training-free non-collapsing routers for SMoE models in language and vision tasks.

Automatically Attacking Software Reverse Engineering AI Agents

cs.CR · 2026-05-28 · unverdicted · none · ref 6 · internal anchor

Genetic algorithm prompt generation enables prompt injection into binaries via string assignments to fool LLM-powered decompilers and disassemblers.

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

cs.AI · 2026-05-28 · unverdicted · none · ref 3 · internal anchor

EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

cs.AI · 2026-05-28 · unverdicted · none · ref 1 · internal anchor

Harness-updating capability is flat across base model capabilities while harness-benefit is non-monotonic, peaking at mid-tier models in self-evolving LLM agents.

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

cs.SE · 2026-05-28 · unverdicted · none · ref 14 · internal anchor

RePoT recovers from PoT failures via deterministic verified replay and checkpoint repair, yielding +3 to +11pp gains on planning benchmarks and showing checkpoint state as the key recovery signal over error-only feedback.

EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation

cs.CL · 2026-05-28 · unverdicted · none · ref 35 · internal anchor

EvoRubric is a single-policy RL method that co-evolves a reasoner and a rubric generator with multi-level verification to produce dynamic rewards for open-ended LLM alignment.

ReasonOps: Operator Segmentation for LLM Reasoning Traces

cs.AI · 2026-05-28 · unverdicted · none · ref 4 · internal anchor

Unsupervised clustering on sentence-initial 3-token pivots extracts 7 universal reasoning operators from 44k traces across 12 LLMs that enable model fingerprinting and answer-correctness prediction.

HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

cs.CL · 2026-05-27 · unverdicted · none · ref 8 · internal anchor

HardMTBench is a difficulty-aware benchmark of 20,000 directional test items across 12 domains that widens GEMBA score ranges by a factor of two and reveals domain-specific weaknesses in 22 MT systems.

HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

cs.CL · 2026-05-27 · unverdicted · none · ref 5 · internal anchor

HELEA creates hard-negative benchmarks (DW-HN29K, DY-HN27K) where name-overlap baselines fail and reports F1 0.967 on the new sets while preserving strong standard-benchmark scores via encoder retrieval plus untrained LLM reranking.

Pruning and Distilling Mixture-of-Experts into Dense Language Models

cs.CL · 2026-05-27 · unverdicted · none · ref 1 · internal anchor

A systematic MoE-to-dense conversion via expert scoring, grouping, and distillation yields +6.3 pp average accuracy over dense-to-dense pruning at matched parameter count on tested models.

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

cs.CL · 2026-05-27 · unverdicted · none · ref 5 · internal anchor

Aggressive expert pruning in MoE LLMs extracts compact translation specialists that retain near-baseline quality after removing up to 75% of experts (or 90% with short SFT).

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

cs.AI · 2026-05-26 · unverdicted · none · ref 33 · internal anchor

DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

cs.CL · 2026-05-26 · unverdicted · none · ref 29 · internal anchor

JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.

An Efficient and Privacy-Preserving Architecture for Cross-Institutional Collaborative RAG

cs.CR · 2026-05-25 · unverdicted · none · ref 1 · internal anchor

FedRAG uses a Scrambled Distributed Attention protocol with feature scrambling and token permutation to enable high-throughput, privacy-preserving federated RAG without special hardware or retraining.

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

cs.AI · 2026-05-25 · unverdicted · none · ref 12 · internal anchor

DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.

Inference Time Optimization with Confidence Dynamics

cs.CL · 2026-05-24 · unverdicted · none · ref 1 · internal anchor

Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.

AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models

cs.CL · 2026-05-23 · unverdicted · none · ref 42 · internal anchor

AstroMind is a new physics-grounded benchmark for LLM reasoning on spacecraft behavior across intent inference, maneuver estimation, and threat assessment, evaluated on several open-weight models.

PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

cs.LG · 2026-05-22 · unverdicted · none · ref 12 · internal anchor

PrivFusion deploys agents to cluster semantically similar features and iteratively recommend transformations for harmonizing heterogeneous structured datasets in a privacy-preserving manner, evaluated on four COVID-19 datasets.
