
arxiv: 2412.19437 · v2 · submitted 2024-12-27 · 💻 cs.CL · cs.AI

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao
Chengqi Deng Chenyu Zhang Chong Ruan Damai Dai Daya Guo Dejian Yang Deli Chen Dongjie Ji Erhang Li Fangyun Lin Fucong Dai Fuli Luo Guangbo Hao Guanting Chen Guowei Li H. Zhang Han Bao Hanwei Xu Haocheng Wang Haowei Zhang Honghui Ding Huajian Xin Huazuo Gao Hui Li Hui Qu J.L. Cai Jian Liang Jianzhong Guo Jiaqi Ni Jiashi Li Jiawei Wang Jin Chen Jingchang Chen Jingyang Yuan Junjie Qiu Junlong Li Junxiao Song Kai Dong Kai Hu Kaige Gao Kang Guan Kexin Huang Kuai Yu Lean Wang Lecong Zhang Lei Xu Leyi Xia Liang Zhao Litong Wang Liyue Zhang Meng Li Miaojun Wang Mingchuan Zhang Minghua Zhang Minghui Tang Mingming Li Ning Tian Panpan Huang Peiyi Wang Peng Zhang Qiancheng Wang Qihao Zhu Qinyu Chen Qiushi Du R.J. Chen R.L. Jin Ruiqi Ge Ruisong Zhang Ruizhe Pan Runji Wang Runxin Xu Ruoyu Zhang Ruyi Chen S.S. Li Shanghao Lu Shangyan Zhou Shanhuang Chen Shaoqing Wu Shengfeng Ye Shirong Ma Shiyu Wang Shuang Zhou Shuiping Yu Shunfeng Zhou Shuting Pan T. Wang Tao Yun Tian Pei Tianyu Sun W.L. Xiao Wangding Zeng Wanjia Zhao Wei An Wen Liu Wenfeng Liang Wenjun Gao Wenqin Yu Wentao Zhang X.Q. Li Xiangyue Jin Xianzu Wang Xiao Bi XiaoDong Liu Xiaohan Wang Xiaojin Shen Xiaokang Chen Xiaokang Zhang Xiaosha Chen Xiaotao Nie Xiaowen Sun Xiaoxiang Wang Xin Cheng Xin Liu Xin Xie Xingchao Liu Xingkai Yu Xinnan Song Xinxia Shan Xinyi Zhou Xinyu Yang Xinyuan Li Xuecheng Su Xuheng Lin Y.K. Li Y.Q. Wang Y.X. Wei Y.X. Zhu Yang Zhang Yanhong Xu Yanping Huang Yao Li Yao Zhao Yaofeng Sun Yaohui Li Yaohui Wang Yi Yu Yi Zheng Yichao Zhang Yifan Shi Yiliang Xiong Ying He Ying Tang Yishi Piao Yisong Wang Yixuan Tan Yiyang Ma Yiyuan Liu Yongqiang Guo Yu Wu Yuan Ou Yuchen Zhu Yuduan Wang Yue Gong Yuheng Zou Yujia He Yukun Zha Yunfan Xiong Yunxian Ma Yuting Yan Yuxiang Luo Yuxiang You Yuxuan Liu Yuyang Zhou Z.F. Wu Z.Z. 
Ren Zehui Ren Zhangli Sha Zhe Fu Zhean Xu Zhen Huang Zhen Zhang Zhenda Xie Zhengyan Zhang Zhewen Hao Zhibin Gou Zhicheng Ma Zhigang Yan Zhihong Shao Zhipeng Xu Zhiyu Wu Zhongyu Zhang Zhuoshu Li Zihui Gu Zijia Zhu Zijun Liu Zilin Li Ziwei Xie Ziyang Song Ziyi Gao Zizheng Pan

Pith reviewed 2026-05-23 06:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords DeepSeek-V3 · Mixture-of-Experts · large language model · multi-token prediction · load balancing · efficient training · inference efficiency

The pith

DeepSeek-V3, a 671B-parameter Mixture-of-Experts model, matches leading closed-source performance after training on 14.8 trillion tokens with 2.788 million H800 GPU hours.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeepSeek-V3 as a large language model that combines scale with efficiency through a Mixture-of-Experts design. It activates only 37 billion parameters per token while using Multi-head Latent Attention and DeepSeekMoE routing carried from earlier models, plus a new auxiliary-loss-free method for balancing expert load and a multi-token prediction objective. After pre-training on 14.8 trillion tokens and follow-on supervised fine-tuning plus reinforcement learning, the model exceeds other open models and reaches parity with top closed models. The entire training completed stably without loss spikes or rollbacks. This outcome shows that targeted architectural and training choices can deliver high capability at lower total compute cost than dense alternatives.

Core claim

DeepSeek-V3 is a Mixture-of-Experts language model with 671B total parameters and 37B activated per token. It adopts Multi-head Latent Attention and DeepSeekMoE architectures, introduces an auxiliary-loss-free load balancing strategy, and trains with a multi-token prediction objective. Pre-trained on 14.8 trillion tokens and refined through supervised fine-tuning and reinforcement learning, it outperforms other open-source models and reaches performance comparable to leading closed-source models, completing full training in 2.788M H800 GPU hours with no irrecoverable loss spikes.
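
The figures in the core claim can be put in proportion with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the function names are illustrative. The tokens-per-GPU-hour value is a crude proxy, because the 2.788M GPU-hour figure covers the full training run while the 14.8T tokens refer to pre-training only, so it understates pre-training throughput.

```python
# Figures quoted in the summary above.
TOTAL_PARAMS = 671e9    # total parameters
ACTIVE_PARAMS = 37e9    # parameters activated per token
TRAIN_TOKENS = 14.8e12  # pre-training tokens
GPU_HOURS = 2.788e6     # H800 GPU hours reported for the full training run


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def tokens_per_gpu_hour(tokens: float, gpu_hours: float) -> float:
    """Rough throughput proxy: tokens processed per GPU hour.

    Assumption: all reported GPU hours are charged against the
    pre-training tokens, which understates pre-training throughput.
    """
    return tokens / gpu_hours


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    rate = tokens_per_gpu_hour(TRAIN_TOKENS, GPU_HOURS)
    print(f"active fraction: {frac:.1%}")
    print(f"tokens per GPU hour (lower bound): {rate:,.0f}")
```

Roughly one parameter in eighteen is active for a given token, which is the basis for the efficiency argument made throughout this page.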

What carries the argument

Multi-head Latent Attention (MLA) and DeepSeekMoE architectures combined with auxiliary-loss-free load balancing and multi-token prediction, which together reduce active parameters, stabilize training, and improve capability without extra loss terms.
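
This page does not spell out how load can be balanced without an auxiliary loss. One way to picture it is a per-expert bias that is added to the routing scores for expert selection only and is nudged after each batch according to observed load. The sketch below is a toy simulation under that assumption; the update rule, the step size, the synthetic scores, and all function names are illustrative and should not be read as the paper's implementation.

```python
import random
from typing import List, Tuple


def route_top_k(scores: List[float], bias: List[float], k: int) -> List[int]:
    """Select k experts by biased score. The bias steers selection only;
    any gating weights would still be computed from the raw scores."""
    biased = [s + b for s, b in zip(scores, bias)]
    ranked = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)
    return sorted(ranked[:k])


def update_bias(bias: List[float], load: List[int], step: float) -> List[float]:
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    updated = []
    for b, n in zip(bias, load):
        if n > mean_load:
            updated.append(b - step)
        elif n < mean_load:
            updated.append(b + step)
        else:
            updated.append(b)
    return updated


def make_batches(num_batches: int, tokens_per_batch: int, base: List[float],
                 noise: float, seed: int) -> List[List[List[float]]]:
    """Synthetic routing scores: a fixed per-expert base plus uniform noise."""
    rng = random.Random(seed)
    return [[[b + rng.uniform(0.0, noise) for b in base]
             for _ in range(tokens_per_batch)]
            for _ in range(num_batches)]


def simulate(batches: List[List[List[float]]], num_experts: int, k: int,
             step: float) -> Tuple[List[List[int]], List[float]]:
    """Route every batch, record per-expert load, then adjust the bias."""
    bias = [0.0] * num_experts
    history = []
    for batch in batches:
        load = [0] * num_experts
        for scores in batch:
            for expert in route_top_k(scores, bias, k):
                load[expert] += 1
        history.append(load)
        bias = update_bias(bias, load, step)
    return history, bias


if __name__ == "__main__":
    # Expert 0 starts with an advantage, so it is overloaded at first.
    batches = make_batches(30, 256, [0.6, 0.5, 0.5, 0.5], 0.2, seed=0)
    history, bias = simulate(batches, num_experts=4, k=1, step=0.01)
    print("first batch load:", history[0])
    print("last batch load: ", history[-1])
    print("final bias:", [round(b, 2) for b in bias])
```

Because the correction acts on routing rather than on the loss, the training objective itself carries no balancing term, which is the property the summary above highlights.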

If this is right

  • High-performing models can be deployed with per-token inference compute governed by the 37B active parameters rather than the full 671B.
  • Mixture-of-Experts load balancing remains effective without auxiliary loss terms, simplifying the training objective.
  • Multi-token prediction during pre-training produces stronger results after standard fine-tuning stages.
  • Very large models can complete training without loss spikes or checkpoint rollbacks when the optimization setup is sufficiently robust.
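
To make the multi-token prediction objective in the list above concrete, the sketch below shows only how training targets could be laid out: each position is paired with the next few tokens instead of just the next one. The helper name and the `depth` parameter are illustrative assumptions; how the model produces the extra predictions is not described on this page and is not modeled here.

```python
from typing import List, Tuple


def mtp_targets(tokens: List[int], depth: int) -> List[Tuple[int, List[int]]]:
    """Pair each context position t with the next `depth` tokens.

    With depth=1 this reduces to ordinary next-token prediction.
    Positions without a full window of future tokens are skipped.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    pairs = []
    for t in range(len(tokens) - depth):
        pairs.append((t, tokens[t + 1 : t + 1 + depth]))
    return pairs


if __name__ == "__main__":
    sequence = [10, 11, 12, 13, 14]
    for position, targets in mtp_targets(sequence, depth=2):
        print(f"position {position} (token {sequence[position]}) -> targets {targets}")
```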

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Releasing both the model and the detailed training recipe allows external groups to replicate or extend the efficiency gains on their own hardware.
  • The reported training stability may generalize to other large-scale runs if the same auxiliary-loss-free and multi-token techniques are applied.
  • If the performance parity holds under scrutiny, future scaling discussions could shift emphasis from total parameter count toward active-parameter efficiency.

Load-bearing premise

The reported benchmark scores reflect the model's true capability under standard evaluation conditions without contamination or undisclosed protocol advantages.

What would settle it

Independent runs of the released model checkpoints on the exact same benchmarks and prompting methods that produce scores materially below the reported figures would falsify the performance claim.

Original abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents DeepSeek-V3, a 671B-parameter Mixture-of-Experts model (37B active parameters per token) that builds on MLA and DeepSeekMoE from prior work. It introduces an auxiliary-loss-free load-balancing strategy and multi-token prediction, pre-trains on 14.8T tokens, applies SFT and RL, and reports strong benchmark results while using only 2.788M H800 GPU hours with no irrecoverable loss spikes. Public checkpoints are released, and the model is claimed to surpass other open-source models while matching leading closed-source ones.

Significance. If the performance claims are substantiated, the work is significant for demonstrating practical, efficient scaling of large MoE models and for releasing public checkpoints that enable community verification and extension. The reported training stability and low compute cost, together with the auxiliary-loss-free balancing technique, provide concrete, reproducible contributions to the field.

major comments (1)
  1. [Evaluation] Evaluation section: The headline claim that DeepSeek-V3 'outperforms other open-source models and achieves performance comparable to leading closed-source models' is load-bearing for the paper's contribution, yet the manuscript supplies no description of decontamination steps applied to the 14.8T-token corpus, no membership-inference or n-gram overlap checks, and no confirmation that prompting formats, few-shot counts, temperature, or post-processing exactly replicate the protocols used for the closed-model baselines. Without these details, direct comparability cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract and main text would benefit from an explicit statement of the primary benchmarks used (e.g., MMLU, GSM8K, HumanEval) to allow readers to gauge the scope of the 'comprehensive evaluations' claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the evaluation section. The concern about substantiating comparability is well-taken, and we address it directly below.

Point-by-point responses
  1. Referee: The headline claim that DeepSeek-V3 'outperforms other open-source models and achieves performance comparable to leading closed-source models' is load-bearing for the paper's contribution, yet the manuscript supplies no description of decontamination steps applied to the 14.8T-token corpus, no membership-inference or n-gram overlap checks, and no confirmation that prompting formats, few-shot counts, temperature, or post-processing exactly replicate the protocols used for the closed-model baselines. Without these details, direct comparability cannot be assessed.

    Authors: We agree that explicit documentation of decontamination and evaluation protocols is necessary for rigorous comparability. The initial manuscript omitted these details for brevity. In the revised version we will add a dedicated subsection (likely in Section 4 or an appendix) that: (1) describes the decontamination pipeline applied to the 14.8T-token corpus, including n-gram overlap filtering against common benchmarks; (2) reports any membership-inference or contamination checks performed; and (3) tabulates the exact prompting templates, few-shot counts, temperature values, and post-processing steps used for each reported benchmark so that they can be verified against the original closed-model evaluation protocols. revision: yes
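
The n-gram overlap check discussed in this exchange can be illustrated with a small sketch. It flags benchmark items whose word n-grams also appear in a training document. The whitespace tokenization, the n-gram length, and the threshold are assumptions chosen for illustration; nothing here reflects the pipeline actually applied to the 14.8T-token corpus.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, document: str, n: int) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(document, n)) / len(item_grams)


def flag_contaminated(items: List[str], documents: List[str],
                      n: int = 8, threshold: float = 0.5) -> List[int]:
    """Indices of benchmark items whose overlap with any document
    reaches the threshold."""
    flagged = []
    for idx, item in enumerate(items):
        if any(overlap_ratio(item, doc, n) >= threshold for doc in documents):
            flagged.append(idx)
    return flagged


if __name__ == "__main__":
    items = ["what is the capital of france", "solve two plus two equals what"]
    docs = ["trivia: what is the capital of france ? paris"]
    print(flag_contaminated(items, docs, n=3, threshold=0.5))
```

A check of this kind catches only verbatim or near-verbatim leakage; paraphrased benchmark content would pass it, which is why the referee also asks about membership-inference checks.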

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

Full rationale

This is a technical report on model training and evaluation with no mathematical derivation chain. The central claims are measured performance numbers on standard public benchmarks compared to other models. No equations, fitted parameters, or predictions reduce by construction to quantities defined inside the paper. Self-citations to DeepSeek-V2 describe prior architectural choices but are not load-bearing for the reported scores. The evaluation protocol is presented as standard, with no internal redefinition of metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The performance claims rest on the empirical effectiveness of reused architectures from V2, the new balancing and prediction objectives, and the quality of the 14.8T token dataset; these are validated by training runs rather than derived from first principles.

free parameters (2)
  • total parameters = 671B
    Set by the choice of layer count, hidden size, and number of experts, reaching 671B in total while keeping the active count at 37B
  • active parameters per token = 37B
    Selected via expert routing to achieve inference efficiency target
axioms (2)
  • domain assumption Multi-head Latent Attention and DeepSeekMoE from V2 transfer to V3 without major modification
    Invoked to justify reuse of the architectures validated in the prior model
  • ad hoc to paper Auxiliary-loss-free load balancing maintains expert utilization without degrading final performance
    New strategy introduced in this work and claimed to be effective

pith-pipeline@v0.9.0 · 6537 in / 1514 out tokens · 50130 ms · 2026-05-23T06:46:17.726349+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 8.0

    Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

  2. HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 8.0

    HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...

  3. EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

    econ.EM 2026-05 accept novelty 8.0

    EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

  4. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  5. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  6. Narrow Secret Loyalty Dodges Black-Box Audits

    cs.CR 2026-05 unverdicted novelty 8.0

    Narrow secret loyalties implanted via fine-tuning in LLMs at multiple scales evade black-box audits unless the auditor knows the target principal.

  7. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  8. MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

    cs.CE 2026-04 unverdicted novelty 8.0

    MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.

  9. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  10. CHASM: Unveiling Covert Advertisements on Chinese Social Media

    cs.LG 2026-04 unverdicted novelty 8.0

    CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

  11. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  12. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  13. Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

    cs.CL 2026-04 conditional novelty 8.0

    Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

  14. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  15. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  16. PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

    q-fin.CP 2026-04 conditional novelty 8.0

    Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

  17. AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

    cs.AI 2026-04 unverdicted novelty 8.0

    AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction parado...

  18. Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy

    cs.AR 2025-11 accept novelty 8.0

    The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.

  19. SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

    cs.CL 2025-10 unverdicted novelty 8.0

    SimBench unifies 20 datasets into the first large-scale benchmark, finding top LLMs reach only modest human simulation fidelity of 40.8/100 with log-linear scaling by size and an alignment tradeoff on diverse questions.

  20. MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

    cs.CL 2025-07 accept novelty 8.0

    MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.

  21. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  22. HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.

  23. CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

    cs.CR 2026-05 unverdicted novelty 7.0

    CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus pri...

  24. Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization

    cs.LG 2026-05 unverdicted novelty 7.0

    UPMs apply periodic time-varying random invertible transforms to sharded model components in decentralized setups to render cross-time assemblies incoherent while preserving network function and incurring minimal overhead.

  25. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

    cs.AI 2026-05 unverdicted novelty 7.0

    More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.

  26. Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

    cs.DC 2026-05 unverdicted novelty 7.0

    Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error...

  27. SG-LegalCite: A Principle-Augmented Benchmark for Legal Citation Retrieval in Singapore Law

    cs.IR 2026-05 unverdicted novelty 7.0

    SG-LegalCite supplies 100,890 case-principle pairs from 8,523 Singapore Supreme Court judgments to enable retrieval models that rank precedents using both facts and governing legal principles.

  28. CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...

  29. SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities

    cs.CR 2026-05 unverdicted novelty 7.0

    SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 1...

  30. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    math.OC 2026-05 conditional novelty 7.0

    Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

  31. Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

    cs.CL 2026-05 unverdicted novelty 7.0

    Evaluating LLMLingua-2 at 2x compression on LLaDA shows non-uniform transfer to diffusion LLMs, with mathematical reasoning degrading substantially despite high BERTScore while summarization remains more robust.

  32. Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

    cs.LG 2026-05 unverdicted novelty 7.0

    Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...

  33. UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

    cs.CV 2026-05 conditional novelty 7.0

    The paper presents UniPPTBench and UniPPTEval, a unified benchmark and scenario-aware evaluation framework for presentation generation from vague prompts, long documents, multimodal documents, and multi-source inputs.

  34. HalluScore: Large Language Model Hallucination Question Answering Benchmark

    cs.CL 2026-05 unverdicted novelty 7.0

    HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.

  35. LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.

  36. Dynamic Chunking for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

  37. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

  38. Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    Mistletoe is a stealthy attack that collapses the speedup of speculative decoding by reducing average accepted length τ without changing output semantics or perplexity.

  39. What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

    cs.CL 2026-05 accept novelty 7.0

    Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.

  40. FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    FinVQA is a new multilingual benchmark for Indic financial VQA with three difficulty levels and four formats, paired with the FIND framework for faithful numerical reasoning via fine-tuning and constrained decoding.

  41. Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

    stat.ML 2026-05 unverdicted novelty 7.0

    MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

  42. Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

  43. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

  44. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  45. Multi-Token Residual Prediction

    cs.LG 2026-05 unverdicted novelty 7.0

    MRP predicts logit residuals from hidden states to support dependency-aware multi-token denoising in a single forward pass for diffusion language models, yielding up to 1.42× lossless speedup on SDAR models.

  46. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effec...

  47. FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.

  48. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  49. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.

  50. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML 2026-05 unverdicted novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  51. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  52. Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...

  53. Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    Hilbert-Geo introduces a unified formal language framework with CDL predicates and theorem bank for solid geometry, using a Parse2Reason pipeline to achieve SOTA accuracy on new solid and plane geometry datasets.

  54. DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.

  55. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  56. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.

  57. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...

  58. Mixture of Layers with Hybrid Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...

  59. A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  60. A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.