
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Aixin Liu, Aoxue Mei, Bangcai Lin, Bingxuan Wang, Bing Xue, DeepSeek-AI · 2025 · cs.CL · arXiv 2512.02556

Mixed citation behavior: the most common citation role is background, at 54% of classified citations.

295 Pith papers cite it.


abstract

We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.


citation-role summary

background 29 · baseline 17 · method 5 · dataset 2 · other 1
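As a rough check on the headline figure, the sketch below recomputes each role's share from the counts listed above. It assumes, without confirmation from the page, that the 54% background figure is simply the background count divided by the total number of classified citations; the names `role_counts` and `role_shares` are illustrative only.

```python
# Counts copied from the citation-role summary above.
role_counts = {
    "background": 29,
    "baseline": 17,
    "method": 5,
    "dataset": 2,
    "other": 1,
}


def role_shares(counts):
    """Return each role's share of classified citations, rounded to a whole percent."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


total_classified = sum(role_counts.values())
shares = role_shares(role_counts)

print(total_classified)       # 54 classified citations
print(shares["background"])   # 54 (percent), matching the headline figure
```

Under this assumption the classified total is 54 citations, far fewer than the 295 citing papers, so the percentages describe only the classified subset.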

citation-polarity summary

background 29 · baseline 18 · use method 5 · unclear 1 · use dataset 1



representative citing papers

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

cs.CR · 2026-05-11 · conditional · novelty 8.0

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

cs.CL · 2026-03-09 · unverdicted · novelty 8.0

AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.

Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference

cs.DC · 2026-07-02 · unverdicted · novelty 7.0

Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.

Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Static SFT and RL training for tool-use agents leads to performance drops under open-world distributional shifts across perception, interaction, reasoning and internalization; perturbation-augmented fine-tuning is proposed as mitigation.

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering

cs.DC · 2026-06-30 · unverdicted · novelty 7.0

SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

Dockerless: Environment-Free Program Verifier for Coding Agents

cs.SE · 2026-06-26 · unverdicted · novelty 7.0

Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while matching environment-based results.

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.

Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems

cs.CR · 2026-06-07 · unverdicted · novelty 7.0

The paper introduces a layered vulnerability framework and attack taxonomy for LLM-driven data agents and demonstrates attacks on four open-source and two production systems.

AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning

cs.SD · 2026-06-07 · unverdicted · novelty 7.0

AudioProcessBench is a new benchmark with segmented and annotated reasoning traces from six audio and omni-language models for step correctness identification and error-type detection in audio-grounded reasoning.

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

UnpredictaBench creates 448 distributional sampling tasks and the KS@N metric to measure LLM approximation of target distributions, finding no model exceeds 40% success at N=100.

citing papers explorer

Showing 16 of 16 citing papers after filters.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

cs.CR · 2026-05-11 · conditional · none · ref 9 · internal anchor

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · none · ref 3 · internal anchor

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

cs.CE · 2026-05-15 · unverdicted · none · ref 21 · internal anchor

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

cs.AI · 2026-05-12 · unverdicted · none · ref 53 · internal anchor

DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data, with evaluations of 13 frontier models revealing tool-use and composition failures

VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design

q-bio.QM · 2026-05-09 · unverdicted · none · ref 51 · 2 links · internal anchor

VibeProteinBench is a new benchmark evaluating LLMs on open-ended language-interfaced protein design across recognition, engineering, and generation, with no model showing strong performance in all areas.

FactoryBench: Evaluating Industrial Machine Understanding

cs.AI · 2026-05-08 · unverdicted · none · ref 48 · internal anchor

FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.

SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses

cs.SE · 2026-04-03 · unverdicted · none · ref 34 · internal anchor

SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while cutting token use by up to 40% and delivering up to 3.2x speedup.

FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution

cs.LG · 2026-05-08 · unverdicted · none · ref 17 · internal anchor

FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.

When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

cs.CR · 2026-05-07 · unverdicted · none · ref 8 · internal anchor

Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the StateGuard defense.

CL-bench Life: Can Language Models Learn from Real-Life Context?

cs.CL · 2026-04-29 · unverdicted · none · ref 13 · internal anchor

CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

cs.AI · 2026-04-20 · unverdicted · none · ref 58 · internal anchor

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

REAgent: Requirement-Driven LLM Agents for Software Issue Resolution

cs.SE · 2026-04-08 · unverdicted · none · ref 40 · internal anchor

REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.

Watch Before You Answer: Learning from Visually Grounded Post-Training

cs.CV · 2026-04-06 · unverdicted · none · ref 33 · internal anchor

Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

InCoder-32B-Thinking: Industrial Code World Model for Thinking

cs.AR · 2026-04-03 · unverdicted · none · ref 22 · internal anchor

InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

cs.CL · 2026-04-22 · unreviewed · ref 12 · internal anchor

AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly

cs.RO · 2026-04-10 · unreviewed · ref 25 · internal anchor
