super hub Mixed citations

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Aixin Liu, Aoxue Mei, Bangcai Lin, Bingxuan Wang, Bing Xue, DeepSeek-AI · 2025 · cs.CL · arXiv 2512.02556

Mixed citation behavior. Most common role is background (54%).

244 Pith papers citing it

Background 54% of classified citations

open full Pith review browse 244 citing papers more from Aixin Liu arXiv PDF

abstract

We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 29 baseline 17 method 5 dataset 2 other 1

citation-polarity summary

background 29 baseline 18 use method 5 unclear 1 use dataset 1

claims ledger

abstract We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-com

authors

Aixin Liu Aoxue Mei Bangcai Lin Bingxuan Wang Bing Xue DeepSeek-AI

co-cited works

representative citing papers

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

cs.CR · 2026-05-11 · conditional · novelty 8.0

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

cs.CL · 2026-03-09 · unverdicted · novelty 8.0

AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

Dockerless: Environment-Free Program Verifier for Coding Agents

cs.SE · 2026-06-26 · unverdicted · novelty 7.0

Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while matching environment-based results.

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.

The Invitation Trap: Proactive Availability Backdoor in LLMs via Conversational Induction

cs.CR · 2026-05-30 · unverdicted · novelty 7.0

The paper presents Proactive Availability Backdoor (PAB) attacks on LLMs that achieve 73.1% effective success rate by proactively inducing users via suggestions in a Five-Factor Model simulation.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

cs.AR · 2026-05-26 · unverdicted · novelty 7.0

AssertLLM2 introduces a benchmark of 83 designs supporting bug-prevention and bug-hunting assertion generation tasks with evaluation across syntactic, formal, coverage, and mutation-based metrics.

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

FTibSuite provides human-verified multimodal corpora, Tibetan-adapted benchmarks with quality controls, and a baseline VLM showing gains on tasks like MMBench while preserving Chinese capabilities.

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

cs.SE · 2026-05-25 · unverdicted · novelty 7.0

RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

FetUSAgents uses tool-augmented multi-agent collaboration and Dual-Path Evidence Arbitration to exceed prior MLLMs by over 25% on a new fetal ultrasound VQA benchmark.

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

cs.AI · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

MOSS performs source-level self-rewriting in agent systems using failure-anchored pipelines and container-based verification, raising OpenClaw mean score from 0.25 to 0.61 in one cycle.

Unlocking Proactivity in Task-Oriented Dialogue

cs.AI · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

citing papers explorer

Showing 50 of 244 citing papers.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments cs.CR · 2026-05-11 · conditional · none · ref 9 · internal anchor
LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values cs.AI · 2026-05-11 · unverdicted · none · ref 71 · internal anchor
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning cs.LG · 2026-05-09 · conditional · none · ref 20 · internal anchor
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? cs.CR · 2026-04-16 · unverdicted · none · ref 16 · internal anchor
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks cs.AI · 2026-04-16 · unverdicted · none · ref 10 · internal anchor
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation cs.CL · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment cs.CL · 2026-03-09 · unverdicted · none · ref 6 · internal anchor
AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning cs.CV · 2026-06-29 · unverdicted · none · ref 20 · internal anchor
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph cs.CL · 2026-06-29 · unverdicted · none · ref 49 · internal anchor
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows cs.SE · 2026-06-29 · unverdicted · none · ref 28 · internal anchor
SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.
Dockerless: Environment-Free Program Verifier for Coding Agents cs.SE · 2026-06-26 · unverdicted · none · ref 12 · internal anchor
Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while matching environment-based results.
BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks cs.CL · 2026-06-23 · unverdicted · none · ref 42 · internal anchor
BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.
The Invitation Trap: Proactive Availability Backdoor in LLMs via Conversational Induction cs.CR · 2026-05-30 · unverdicted · none · ref 45 · internal anchor
The paper presents Proactive Availability Backdoor (PAB) attacks on LLMs that achieve 73.1% effective success rate by proactively inducing users via suggestions in a Five-Factor Model simulation.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 83 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 30 · internal anchor
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions cs.AI · 2026-05-26 · unverdicted · none · ref 6 · internal anchor
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications cs.AR · 2026-05-26 · unverdicted · none · ref 32 · internal anchor
AssertLLM2 introduces a benchmark of 83 designs supporting bug-prevention and bug-hunting assertion generation tasks with evaluation across syntactic, formal, coverage, and mutation-based metrics.
FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling cs.CV · 2026-05-26 · unverdicted · none · ref 9 · internal anchor
FTibSuite provides human-verified multimodal corpora, Tibetan-adapted benchmarks with quality controls, and a baseline VLM showing gains on tasks like MMBench while preserving Chinese capabilities.
RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations cs.SE · 2026-05-25 · unverdicted · none · ref 24 · internal anchor
RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.
Not only where, But when: Temporal Scheduling for RLVR cs.LG · 2026-05-25 · unverdicted · none · ref 15 · internal anchor
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration cs.CV · 2026-05-25 · unverdicted · none · ref 2 · internal anchor
FetUSAgents uses tool-augmented multi-agent collaboration and Dual-Path Evidence Arbitration to exceed prior MLLMs by over 25% on a new fetal ultrasound VQA benchmark.
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems cs.AI · 2026-05-21 · unverdicted · none · ref 10 · 2 links · internal anchor
MOSS performs source-level self-rewriting in agent systems using failure-anchored pipelines and container-based verification, raising OpenClaw mean score from 0.25 to 0.61 in one cycle.
Unlocking Proactivity in Task-Oriented Dialogue cs.AI · 2026-05-21 · unverdicted · none · ref 26 · 2 links · internal anchor
Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents cs.SE · 2026-05-20 · unverdicted · none · ref 10 · internal anchor
SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.
Terminal-World: Scaling Terminal-Agent Environments via Agent Skills cs.CL · 2026-05-20 · unverdicted · none · ref 38 · internal anchor
Terminal-World is a skill-based synthesis pipeline that generates 5,723 training environments and produces Terminal-World-32B which outperforms baselines on Terminal-Bench 2.0 using only 1.2% of the data.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 20 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.
GRASP: Deterministic argument ranking in interaction graphs cs.LG · 2026-05-18 · unverdicted · none · ref 39 · internal anchor
GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting cs.CL · 2026-05-18 · unverdicted · none · ref 6 · 2 links · internal anchor
Introduces BacktestBench benchmark with 18k QA pairs across four backtesting tasks and evaluates 23 LLMs via the AutoBacktest multi-agent system.
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering cs.SE · 2026-05-17 · unverdicted · none · ref 31 · internal anchor
SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.
BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks cs.CE · 2026-05-15 · unverdicted · none · ref 21 · internal anchor
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
From I/O to Code with Discovery Agent cs.LG · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
DIO-Agent frames IO2Code as LLM-driven evolutionary search over programs with a Transformation Priority Premise to favor simple hypotheses, outperforming baselines on a new IO2CodeBench.
GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding cs.LG · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
GQLA exposes dual MQA-absorb and GQA decoding paths from identical parameters to enable hardware-adaptive LLM inference while preserving cache compression on one path and GQA-level traffic on the other.
$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows cs.AI · 2026-05-14 · unverdicted · none · ref 18 · 2 links · internal anchor
π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction cs.CY · 2026-05-14 · unverdicted · none · ref 36 · internal anchor
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench cs.AR · 2026-05-13 · unverdicted · none · ref 24 · internal anchor
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
AIS: Adaptive Importance Sampling for Quantized RL stat.ML · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 32 · internal anchor
CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.
RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems cs.IR · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
RecRM-Bench is a new large-scale benchmark dataset and framework for multi-dimensional reward modeling in agentic recommender systems, spanning instruction following, factual consistency, query-item relevance, and user behavior prediction.
Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations cs.AI · 2026-05-12 · unverdicted · none · ref 53 · internal anchor
DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data, with evaluations of 13 frontier models revealing tool-use and composition failures
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents cs.LG · 2026-05-11 · unverdicted · none · ref 84 · internal anchor
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
Budget-Efficient Automatic Algorithm Design via Code Graph cs.AI · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
A code-graph and correction-based LLM search framework outperforms full-algorithm generation at equal token budgets on three combinatorial optimization problems.
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs cs.CY · 2026-05-11 · accept · none · ref 32 · 2 links · internal anchor
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness cs.CL · 2026-05-11 · unverdicted · none · ref 59 · 2 links · internal anchor
ProofRank benchmark shows substantial differences in LLM proof quality not captured by correctness, with trade-offs between quality metrics and accuracy.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 9 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 5 · 2 links · internal anchor
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design q-bio.QM · 2026-05-09 · unverdicted · none · ref 51 · 2 links · internal anchor
VibeProteinBench is a new benchmark evaluating LLMs on open-ended language-interfaced protein design across recognition, engineering, and generation, with no model showing strong performance in all areas.
FactoryBench: Evaluating Industrial Machine Understanding cs.AI · 2026-05-08 · unverdicted · none · ref 48 · internal anchor
FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference cs.LG · 2026-05-08 · conditional · none · ref 7 · internal anchor
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution cs.SE · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer