DeepSeek-V3 Technical Report
Pith reviewed 2026-05-23 06:46 UTC · model grok-4.3
The pith
DeepSeek-V3, a 671B-parameter Mixture-of-Experts model, matches leading closed-source performance after training on 14.8 trillion tokens with 2.788 million H800 GPU hours.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-V3 is a Mixture-of-Experts language model with 671B total parameters and 37B activated per token. It adopts Multi-head Latent Attention and DeepSeekMoE architectures, introduces an auxiliary-loss-free load balancing strategy, and trains with a multi-token prediction objective. Pre-trained on 14.8 trillion tokens and refined through supervised fine-tuning and reinforcement learning, it outperforms other open-source models and reaches performance comparable to leading closed-source models, completing full training in 2.788M H800 GPU hours with no irrecoverable loss spikes.
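A rough back-of-envelope reading of these headline figures, using only the numbers quoted above. The 6 × N × D estimate of training FLOPs is a common rule of thumb and an assumption here, not something the report is quoted as using. The tokens-per-GPU-hour figure is only a coarse average, since the GPU-hour total covers full training while the token count covers pre-training.

```python
# Figures quoted in the core claim above.
TOTAL_PARAMS = 671e9      # total parameters
ACTIVE_PARAMS = 37e9      # parameters activated per token
TRAIN_TOKENS = 14.8e12    # pre-training tokens
GPU_HOURS = 2.788e6       # H800 GPU hours for full training

# Share of the model that participates in each token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumption: the rule-of-thumb estimate FLOPs ~= 6 * N * D,
# with N taken as the active (not total) parameter count.
approx_train_flops = 6 * ACTIVE_PARAMS * TRAIN_TOKENS

# Coarse average only: GPU hours include stages beyond pre-training.
tokens_per_gpu_hour = TRAIN_TOKENS / GPU_HOURS

print(f"active fraction:       {active_fraction:.1%}")
print(f"approx training FLOPs: {approx_train_flops:.2e}")
print(f"tokens per GPU hour:   {tokens_per_gpu_hour:.2e}")
```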
What carries the argument
Multi-head Latent Attention (MLA) and DeepSeekMoE architectures combined with auxiliary-loss-free load balancing and multi-token prediction, which together keep the active parameter count low, stabilize training, and improve capability without an auxiliary load-balancing loss.
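One way load balancing can work without an auxiliary loss is to add a per-expert bias to the routing scores, use it only when selecting experts, and nudge it after each step according to observed load. The sketch below assumes that mechanism. The function names, the synthetic score skew, and the step size are all hypothetical, and this is a toy simulation rather than the report's implementation.

```python
import random

def route_top_k(scores, bias, k):
    """Select k experts by biased score; gate weights use the unbiased scores."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}

def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            b -= step_size
        elif n < mean_load:
            b += step_size
        new_bias.append(b)
    return new_bias

def simulate(num_experts=8, k=2, tokens_per_step=1024, steps=100, step_size=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical skew: lower-indexed experts receive systematically higher scores.
    skew = [0.3 * (num_experts - e) / num_experts for e in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() + skew[e] for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, step_size)
    return history

def spread(load):
    """Gap between the busiest and the idlest expert."""
    return max(load) - min(load)

history = simulate()
early_spread = spread(history[0])
late_spread = sum(spread(load) for load in history[-10:]) / 10
print("load spread at first step:", early_spread)
print("mean load spread over last 10 steps:", late_spread)
```

Because the bias only affects which experts are picked, no gradient term is added to the training objective; the balancing pressure comes entirely from the update rule.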
If this is right
- High-performing models can be deployed with per-token inference compute set by the 37B active parameters rather than the full 671B.
- Mixture-of-Experts load balancing remains effective without auxiliary loss terms, simplifying the training objective.
- Multi-token prediction during pre-training produces stronger results after standard fine-tuning stages.
- Very large models can complete training without loss spikes or checkpoint rollbacks when the optimization setup is sufficiently robust.
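The multi-token prediction objective named in the list above can be illustrated in its simplest generic form: alongside the usual next-token loss, the model is also scored on tokens further ahead. The sketch below is a minimal illustration under that assumption. The function names, the `extra_weight` parameter, and the dictionary-based probability tables are hypothetical, and it does not attempt to reproduce the report's actual prediction modules.

```python
import math

def shifted_targets(tokens, depth):
    """Pairs (position, target) where position i must predict tokens[i + depth]."""
    return [(i, tokens[i + depth]) for i in range(len(tokens) - depth)]

def cross_entropy(probs_per_position, targets):
    """Mean negative log-likelihood of the target tokens."""
    losses = [-math.log(probs_per_position[i][tok]) for i, tok in targets]
    return sum(losses) / len(losses)

def multi_token_loss(tokens, probs_by_depth, extra_weight):
    """Next-token loss plus a weighted average of the further-ahead losses.

    probs_by_depth[d - 1][i] maps token -> probability predicted at
    position i for the token d steps ahead.
    """
    per_depth = [
        cross_entropy(probs, shifted_targets(tokens, depth))
        for depth, probs in enumerate(probs_by_depth, start=1)
    ]
    main, extra = per_depth[0], per_depth[1:]
    if not extra:
        return main
    return main + extra_weight * sum(extra) / len(extra)

# Toy check with a uniform predictor over a 4-token vocabulary:
# every depth then has loss log(4).
tokens = [0, 1, 2, 3, 0, 1]
uniform = {t: 0.25 for t in range(4)}
probs_by_depth = [[uniform] * len(tokens) for _ in range(2)]
loss = multi_token_loss(tokens, probs_by_depth, extra_weight=0.3)
print(loss)
```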
Where Pith is reading between the lines
- Releasing both the model and the detailed training recipe allows external groups to replicate or extend the efficiency gains on their own hardware.
- The reported training stability may generalize to other large-scale runs if the same auxiliary-loss-free and multi-token techniques are applied.
- If the performance parity holds under scrutiny, future scaling discussions could shift emphasis from total parameter count toward active-parameter efficiency.
Load-bearing premise
The reported benchmark scores reflect the model's true capability under standard evaluation conditions without contamination or undisclosed protocol advantages.
What would settle it
The performance claim would be falsified if independent runs of the released model checkpoints, using the exact same benchmarks and prompting methods, produced scores materially below the reported figures.
Original abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DeepSeek-V3, a 671B-parameter Mixture-of-Experts model (37B active parameters per token) that builds on MLA and DeepSeekMoE from prior work. It introduces an auxiliary-loss-free load-balancing strategy and multi-token prediction, pre-trains on 14.8T tokens, applies SFT and RL, and reports strong benchmark results while using only 2.788M H800 GPU hours with no irrecoverable loss spikes. Public checkpoints are released, and the model is claimed to surpass other open-source models while matching leading closed-source ones.
Significance. If the performance claims are substantiated, the work is significant for demonstrating practical, efficient scaling of large MoE models and for releasing public checkpoints that enable community verification and extension. The reported training stability and low compute cost, together with the auxiliary-loss-free balancing technique, provide concrete, reproducible contributions to the field.
major comments (1)
- [Evaluation] Evaluation section: The headline claim that DeepSeek-V3 'outperforms other open-source models and achieves performance comparable to leading closed-source models' is load-bearing for the paper's contribution, yet the manuscript supplies no description of decontamination steps applied to the 14.8T-token corpus, no membership-inference or n-gram overlap checks, and no confirmation that prompting formats, few-shot counts, temperature, or post-processing exactly replicate the protocols used for the closed-model baselines. Without these details, direct comparability cannot be assessed.
minor comments (1)
- [Abstract] The abstract and main text would benefit from an explicit statement of the primary benchmarks used (e.g., MMLU, GSM8K, HumanEval) to allow readers to gauge the scope of the 'comprehensive evaluations' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the evaluation section. The concern about substantiating comparability is well-taken, and we address it directly below.
Point-by-point responses
- Referee: The headline claim that DeepSeek-V3 'outperforms other open-source models and achieves performance comparable to leading closed-source models' is load-bearing for the paper's contribution, yet the manuscript supplies no description of decontamination steps applied to the 14.8T-token corpus, no membership-inference or n-gram overlap checks, and no confirmation that prompting formats, few-shot counts, temperature, or post-processing exactly replicate the protocols used for the closed-model baselines. Without these details, direct comparability cannot be assessed.
  Authors: We agree that explicit documentation of decontamination and evaluation protocols is necessary for rigorous comparability. The initial manuscript omitted these details for brevity. In the revised version we will add a dedicated subsection (likely in Section 4 or an appendix) that: (1) describes the decontamination pipeline applied to the 14.8T-token corpus, including n-gram overlap filtering against common benchmarks; (2) reports any membership-inference or contamination checks performed; and (3) tabulates the exact prompting templates, few-shot counts, temperature values, and post-processing steps used for each reported benchmark so that they can be verified against the original closed-model evaluation protocols.
  Revision: yes
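The n-gram overlap filtering mentioned in the response can be sketched generically as follows. This is a minimal illustration, assuming whitespace tokenization, an 8-gram window, and a 0.5 overlap threshold; those choices, the function names, and the example strings are hypothetical and say nothing about the pipeline actually applied to the 14.8T-token corpus.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in the text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(document, benchmark_item, n):
    """Fraction of the benchmark item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(document, n)) / len(item_grams)

def filter_corpus(documents, benchmark_items, n=8, threshold=0.5):
    """Keep documents whose overlap with every benchmark item stays below threshold."""
    kept = []
    for doc in documents:
        worst = max((overlap_ratio(doc, item, n) for item in benchmark_items), default=0.0)
        if worst < threshold:
            kept.append(doc)
    return kept

# Hypothetical example: one document quotes a benchmark question verbatim.
benchmark = "what is the sum of two plus three and then multiplied by four"
leaked = "forum post: " + benchmark + " answer twenty"
clean = "the quick brown fox jumps over the lazy dog near the quiet river bank"
kept = filter_corpus([leaked, clean], [benchmark])
print(kept)
```

A production check would typically normalize punctuation and use a proper tokenizer; the sketch only shows the shape of the comparison a reader would need documented.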
Circularity Check
No circularity: empirical results on external benchmarks
This is a technical report on model training and evaluation with no mathematical derivation chain. The central claims are measured performance numbers on standard public benchmarks compared to other models. No equations, fitted parameters, or predictions reduce by construction to quantities defined inside the paper. Self-citations to DeepSeek-V2 describe prior architectural choices but are not load-bearing for the reported scores. The evaluation protocol is presented as standard, with no internal redefinition of metrics.
Axiom & Free-Parameter Ledger
free parameters (2)
- total parameters = 671B
- active parameters per token = 37B
axioms (2)
- domain assumption: Multi-head Latent Attention and DeepSeekMoE from V2 transfer to V3 without major modification
- ad hoc to paper: Auxiliary-loss-free load balancing maintains expert utilization without degrading final performance
Forward citations
Cited by 60 Pith papers
- Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
  Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.
- HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
  HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
- EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
  EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
- Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
  Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
- ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
  ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
- Narrow Secret Loyalty Dodges Black-Box Audits
  Narrow secret loyalties implanted via fine-tuning in LLMs at multiple scales evade black-box audits unless the auditor knows the target principal.
- From Context to Skills: Can Language Models Learn from Context Skillfully?
  Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
- MappingEvolve: LLM-Driven Code Evolution for Technology Mapping
  MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.
- Revisable by Design: A Theory of Streaming LLM Agent Execution
  LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
- CHASM: Unveiling Covert Advertisements on Chinese Social Media
  CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
- VoxSafeBench: Not Just What Is Said, but Who, How, and Where
  VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
  Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
- Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
  Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
  OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
- PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
  Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
- AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
  AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction parado...
- Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy
  The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.
- SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
  SimBench unifies 20 datasets into the first large-scale benchmark, finding top LLMs reach only modest human simulation fidelity of 40.8/100 with log-linear scaling by size and an alignment tradeoff on diverse questions.
- MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
  MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs
  HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.
- CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference
  CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus pri...
- Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization
  UPMs apply periodic time-varying random invertible transforms to sharded model components in decentralized setups to render cross-time assemblies incoherent while preserving network function and incurring minimal overhead.
- Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
  More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
- Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
  Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error...
- SG-LegalCite: A Principle-Augmented Benchmark for Legal Citation Retrieval in Singapore Law
  SG-LegalCite supplies 100,890 case-principle pairs from 8,523 Singapore Supreme Court judgments to enable retrieval models that rank precedents using both facts and governing legal principles.
- CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
  CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
- SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities
  SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 1...
- Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
  Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.
- Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA
  Evaluating LLMLingua-2 at 2x compression on LLaDA shows non-uniform transfer to diffusion LLMs, with mathematical reasoning degrading substantially despite high BERTScore while summarization remains more robust.
- Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
  Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
- UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings
  The paper presents UniPPTBench and UniPPTEval, a unified benchmark and scenario-aware evaluation framework for presentation generation from vague prompts, long documents, multimodal documents, and multi-source inputs.
- HalluScore: Large Language Model Hallucination Question Answering Benchmark
  HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.
- LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
  LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
- Dynamic Chunking for Diffusion Language Models
  DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
- Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
  A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
- Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
  Mistletoe is a stealthy attack that collapses the speedup of speculative decoding by reducing average accepted length τ without changing output semantics or perplexity.
- What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation
  Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.
- FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages
  FinVQA is a new multilingual benchmark for Indic financial VQA with three difficulty levels and four formats, paired with the FIND framework for faithful numerical reasoning via fine-tuning and constrained decoding.
- Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
  MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...
- Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
  Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
- Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
  LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
- Learning Agentic Policy from Action Guidance
  ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- Multi-Token Residual Prediction
  MRP predicts logit residuals from hidden states to support dependency-aware multi-token denoising in a single forward pass for diffusion language models, yielding up to 1.42× lossless speedup on SDAR models.
- Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
  TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effec...
- FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
  FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
- Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
  GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
- Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
  GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
- Uniform Scaling Limits in AdamW-Trained Transformers
  AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
  ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
  EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...
- Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
  Hilbert-Geo introduces a unified formal language framework with CDL predicates and theorem bank for solid geometry, using a Parse2Reason pipeline to achieve SOTA accuracy on new solid and plane geometry datasets.
- DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
  DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
- SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
  SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
- GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
  GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.
- GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
  GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
- Mixture of Layers with Hybrid Attention
  Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...
- A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
  Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.
- A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
  Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.