VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench scores.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
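As a rough illustration of why caching one latent vector per token shrinks the KV cache, the sketch below counts cached elements for standard multi-head attention and for a latent cache. It is only a back-of-the-envelope model. The function names and every dimension are hypothetical and are not taken from DeepSeek-V2's configuration.

```python
# Illustrative sketch only: all names and dimensions are hypothetical,
# not DeepSeek-V2's actual configuration.

def mha_cache_elements(n_layers: int, n_heads: int, head_dim: int, seq_len: int) -> int:
    """Elements cached by standard multi-head attention:
    one key vector and one value vector per head, per layer, per token."""
    return 2 * n_layers * n_heads * head_dim * seq_len


def latent_cache_elements(n_layers: int, latent_dim: int, seq_len: int) -> int:
    """Elements cached when each token stores a single compressed latent
    vector per layer (assumed here to stand in for both keys and values)."""
    return n_layers * latent_dim * seq_len


def cache_reduction(full: int, compressed: int) -> float:
    """Fraction of cache elements saved relative to the full cache."""
    return 1.0 - compressed / full


if __name__ == "__main__":
    full = mha_cache_elements(n_layers=4, n_heads=8, head_dim=64, seq_len=1024)
    small = latent_cache_elements(n_layers=4, latent_dim=128, seq_len=1024)
    print(full, small, f"{cache_reduction(full, small):.1%}")
```

With these made-up sizes the latent cache holds one eighth as many elements as the full cache. The sketch does not attempt to reproduce the 93.3% figure in the abstract, which is measured against DeepSeek 67B.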
representative citing papers
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.
Distance-Adaptive Representation (DAR) keeps full KV dimensionality inside a local window and reduces it to 1/4 outside, matching full-dimensional baselines on pretraining (70M-410M) and 1B-scale fine-tuning while uniform reduction performs worse.
KV caches function as notebooks of prefilled conclusions, enabling field-level edits that recover decisions (especially with CoT) and position-portable skill composition with near-identical outputs at O(L) cost.
STAR-KV applies differentiable soft thresholding for per-head and per-block adaptive low-rank KV cache compression, combined with hybrid decomposition and low-rank-aware quantization, achieving up to 75% compression and 3.1x throughput gains.
Depth-Attention mixes values from earlier layers into the current attention value by having the query attend to previous-layer keys at the same position, yielding lower perplexity and up to 2.3 points higher average accuracy than vanilla transformers on Qwen3-style models with negligible extra FLOPs.
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.
Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
Latent Cache Flow uses a small joint-translation-and-compression adapter to let LLMs with different contexts exchange KV cache summaries, outperforming both larger C2C adapters and text in early experiments.
Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.
LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.
GQLA exposes dual MQA-absorb and GQA decoding paths from identical parameters to enable hardware-adaptive LLM inference while preserving cache compression on one path and GQA-level traffic on the other.
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.
Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization remains unsolved.
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
citing papers explorer
-
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
-
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
EngiBench shows that LLM accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
-
mHC: Manifold-Constrained Hyper-Connections
mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.
-
Janus: Disaggregating Attention and Experts for Scalable MoE Inference
JANUS disaggregates attention and MoE layers onto separate GPU pools with an expert-balancing scheduler and SLO-aware scaling, delivering up to 4.7x higher per-GPU throughput than prior MoE systems under token-level latency constraints.
-
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
-
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.
-
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
-
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
-
GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.
-
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
EMoE trains MoE models so they maintain performance when the number of activated experts changes at inference, expanding the usable range to 2-3 times the training k with higher peak results.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
Siamese Foundation Models for Crystal Structure Prediction
DAO pretrains Siamese diffusion-based models on stable/unstable crystal data to achieve 100% experimental match on Cr6Os2 and 2000x speedup over DFT on real superconductors.
-
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
-
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
-
Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
-
TTT3R: 3D Reconstruction as Test-Time Training
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
-
UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and a 72B-scale model.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
Finite-Precision Conjugate Gradient Method for Massive MIMO Detection
Introduces FP-CG and FP-BJ-CG detectors for massive MIMO with accuracy, convergence, and complexity analyses plus simulations.
-
Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
G-TRACE provides region-aware estimates of GenAI carbon emissions including 4309 MWh and 2068 tCO2 for a 2024-2025 image generation trend, paired with a seven-level AI Sustainability Pyramid for policy guidance.
-
LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers
PreScope combines a layer-aware activation predictor, cross-layer prefetch scheduling, and asynchronous I/O to deliver 141% higher throughput and 74.6% lower latency for MoE inference on legacy hardware.
-
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
Prism optimizes expert placement and uses runtime migration for distributed MoE inference on heterogeneous edge GPUs, achieving up to 30.6% lower latency than baselines.
-
Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project
Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.
-
Harnessing Multiple Large Language Models: A Survey on LLM Ensemble
A systematic survey of LLM ensemble methods organized into a taxonomy of ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference stages, with review of benchmarks, applications, and future directions.