DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
64 Pith papers cite this work. Polarity classification is still indexing.
abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
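The KV-cache saving comes from caching a small latent per token instead of full per-head keys and values. The sketch below is a minimal PyTorch-style reading of that idea with illustrative dimensions; it omits DeepSeek-V2's decoupled rotary-position keys and matrix-absorption optimizations, so it is a simplified illustration rather than the paper's actual module.

```python
# Minimal sketch of latent KV-cache compression in the spirit of MLA.
# Dimensions are illustrative; the real design also decouples RoPE keys and
# absorbs the up-projections into neighboring matrices, omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # only this output is cached
        self.k_up = nn.Linear(d_latent, d_model)      # rebuild K from the latent
        self.v_up = nn.Linear(d_latent, d_model)      # rebuild V from the latent
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        new_latent = self.kv_down(x)                  # (b, t, d_latent)
        latent = new_latent if latent_cache is None else torch.cat(
            [latent_cache, new_latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        # Per token, the cache holds d_latent floats rather than 2 * d_model,
        # which is where the large KV-cache reduction comes from.
        return self.out(o.transpose(1, 2).reshape(b, t, -1)), latent
```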
citing papers explorer
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
-
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization remains unsolved.
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.
-
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
-
DPC: A Distributed Page Cache over CXL
DPC maintains exactly one DRAM copy of each file page in a CXL-connected cluster and delivers up to 12.4X speedup (5.6X geometric mean) over replicated caches on data-sharing workloads.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
-
The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks
In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.
-
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
-
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressive quantization.
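For readers unfamiliar with the geometry, the sketch below shows a generic one-buffer update of this kind: keep an exponential moving average of the gradient and step along the steepest-descent direction under an lp-norm constraint. The exponent handling, normalization, and hyperparameters are illustrative assumptions, not PowerStep's published rule.

```python
# Generic sketch of an lp-norm steepest-descent step on a single momentum
# buffer (p > 1). This is an illustrative assumption about the update form,
# not PowerStep's actual algorithm.
import torch

@torch.no_grad()
def lp_momentum_step(param, grad, momentum, lr=1e-3, beta=0.95, p=3.0, eps=1e-12):
    momentum.mul_(beta).add_(grad, alpha=1 - beta)   # the only optimizer state
    q = p / (p - 1.0)                                # dual exponent of p
    # Steepest-descent direction under ||d||_p <= 1:
    #   d_i proportional to sign(m_i) * |m_i|^(q-1), scaled by 1 / ||m||_q^(q-1).
    abs_m = momentum.abs().clamp_min(eps)
    direction = momentum.sign() * abs_m.pow(q - 1.0)
    direction = direction / (abs_m.pow(q).sum().pow((q - 1.0) / q) + eps)
    param.add_(direction, alpha=-lr)
```

As p grows this approaches a pure sign update; keeping one buffer instead of Adam's two is where the memory halving mentioned in the blurb would come from.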
-
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay
NSER uses zero-shot LLMs to induce behavioral rules from RL trajectories, grounds them in differentiable first-order logic, and applies the symbolic structures to dynamically reweight experience replay for better sample efficiency.
-
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces
LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
-
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating standard LLM performance.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
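A minimal sketch of the arrangement described, with every layer routing into one shared expert pool while keeping its own router, is below; expert sizes and top-1 routing are illustrative assumptions rather than UniPool's actual configuration.

```python
# Sketch of a globally shared expert pool: all layers route into the same pool,
# so expert parameters need not grow linearly with depth. Sizes and top-1
# routing are illustrative assumptions.
import torch
import torch.nn as nn

class GlobalExpertPool(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])

class PooledMoELayer(nn.Module):
    def __init__(self, pool, d_model=512):
        super().__init__()
        self.pool = pool                                      # shared, not copied
        self.router = nn.Linear(d_model, len(pool.experts))   # per-layer router

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        top = scores.argmax(dim=-1)                           # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.pool.experts):
            mask = top == e
            if mask.any():
                out[mask] = scores[mask, e].unsqueeze(-1) * expert(x[mask])
        return out

pool = GlobalExpertPool()
layers = nn.ModuleList([PooledMoELayer(pool) for _ in range(8)])  # 8 layers, one pool
```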
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language modeling.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
Nitsum dynamically adapts tensor parallelism and GPU splits in LLM serving to raise SLO-compliant goodput by up to 5.3 times over prior systems.
-
The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
-
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
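The flavor of such a condition is that offered load must stay below both the compute-limited and the KV-memory-limited service capacity; the toy check below illustrates that shape only and is not the paper's derived condition.

```python
# Toy stability check for an LLM serving queue: arrival rate must not exceed
# either the scheduler's batch capacity or the KV-cache memory capacity.
# All formulas and numbers here are generic illustrations, not the paper's model.
def is_stable(arrival_rate_rps, mean_decode_tokens, tokens_per_sec_per_seq,
              max_batch_slots, kv_bytes_per_token, gpu_kv_bytes):
    # Mean residence time of one request if it ran alone (seconds).
    service_time = mean_decode_tokens / tokens_per_sec_per_seq
    # Concurrency ceiling from KV memory: how many requests' caches fit at once.
    kv_slots = gpu_kv_bytes // (kv_bytes_per_token * mean_decode_tokens)
    slots = min(max_batch_slots, kv_slots)
    # Little's-law-style condition: offered concurrency must stay below capacity.
    return arrival_rate_rps * service_time < slots

# Example: 2 req/s, 400-token responses, 25 tok/s per stream,
# 64 scheduler slots, 160 KB of KV per token, 40 GB reserved for KV.
print(is_stable(2.0, 400, 25, 64, 160_000, 40_000_000_000))
```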
-
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% while saving hundreds of thousands of GPU hours monthly.
-
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
Mixture of Heterogeneous Grouped Experts for Language Modeling
MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
-
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
-
Multi-LLM Token Filtering and Routing for Sequential Recommendation
MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.
-
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
-
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.
-
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
-
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
-
Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision
Local Platt scaling on three fine-grained confidence scores reduces calibration error for LLM-based automated code revision across tasks and models compared to global scaling alone.
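Platt scaling itself is standard logistic calibration of a raw score; a minimal sketch is below. The per-bucket "local" calibrators are an assumption about what locality means here, and the bucketing key is hypothetical.

```python
# Minimal Platt-scaling sketch: fit a logistic map from a raw confidence score
# to a calibrated probability, optionally one calibrator per local bucket
# (e.g. per revision task type). The bucketing is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Fit p(correct | score) = sigmoid(a * score + b)."""
    model = LogisticRegression()
    model.fit(np.asarray(scores).reshape(-1, 1), np.asarray(labels))
    return model

def fit_local_platt(scores, labels, buckets):
    """One calibrator per bucket key; each bucket needs both outcome classes."""
    calibrators = {}
    for key in set(buckets):
        idx = [i for i, b in enumerate(buckets) if b == key]
        calibrators[key] = fit_platt([scores[i] for i in idx],
                                     [labels[i] for i in idx])
    return calibrators

# Usage: calibrated = calibrators[task].predict_proba([[raw_score]])[0, 1]
```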
-
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
-
WIO: Upload-Enabled Computational Storage on CXL SSDs
WIO enables reversible computational storage on CXL SSDs via WebAssembly actors and zero-copy migration, achieving up to 2x throughput and 3.75x lower write latency.
-
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
-
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
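The coverage claim is easy to illustrate: if one sample solves a problem with probability p, at least one of k independent samples solves it with probability 1 - (1 - p)^k. The per-problem rates below are invented for illustration; only the 15.9% and 56% figures come from the blurb.

```python
# Coverage under repeated sampling: probability that at least one of k
# independent samples solves each problem, averaged over a problem set.
# The per-problem solve rates are hypothetical; the formula is the point.
def coverage(per_problem_solve_rates, k):
    return sum(1 - (1 - p) ** k for p in per_problem_solve_rates) / len(per_problem_solve_rates)

rates = [0.6, 0.2, 0.05, 0.01, 0.0]          # hypothetical problems
for k in (1, 10, 100, 250):
    print(k, round(coverage(rates, k), 3))
```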
-
Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
-
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
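Chunked streaming top-k is a standard pattern; a generic bounded-heap sketch is below. It shows the memory-bounded idea only and is not the CSA indexer's actual kernel.

```python
# Generic chunked streaming top-k: scan scores chunk by chunk and keep only a
# k-sized heap, so peak memory is O(chunk + k) instead of O(sequence length).
import heapq

def streaming_topk(score_chunks, k):
    """score_chunks yields iterables of (index, score); returns top-k by score."""
    heap = []                                   # min-heap of (score, index)
    for chunk in score_chunks:
        for idx, score in chunk:
            if len(heap) < k:
                heapq.heappush(heap, (score, idx))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, idx))
    return sorted(heap, reverse=True)           # highest scores first

# Example over three chunks of (position, attention score) pairs.
chunks = [[(0, 0.1), (1, 0.9)], [(2, 0.4), (3, 0.8)], [(4, 0.95)]]
print(streaming_topk(chunks, k=2))              # [(0.95, 4), (0.9, 1)]
```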
-
Mesh Based Simulations with Spatial and Temporal awareness
A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention correction, and adding 3D rotary positional embeddings.
-
FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection
FedSLoP reduces communication and memory costs in federated learning through stochastic low-rank gradient projections, with a nonconvex convergence rate of O(1/sqrt(NT)) and competitive accuracy on heterogeneous MNIST data.
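Low-rank gradient projection for communication reduction has a simple generic form: each client projects its gradient onto a shared random low-rank basis and sends only the projected coefficients, which the server averages and lifts back. The sketch below is that generic idea under assumed names and shapes, not FedSLoP's exact procedure.

```python
# Generic sketch of low-rank gradient projection for federated communication:
# clients send r x d projections of a d x d gradient instead of the full matrix.
# Illustrative only; not FedSLoP's published algorithm.
import numpy as np

def make_projector(d, r, seed):
    rng = np.random.default_rng(seed)            # shared seed -> shared basis
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q                                     # d x r orthonormal basis

def client_message(grad, proj):
    return proj.T @ grad                         # r x d, much smaller than d x d

def server_update(messages, proj, lr=0.1):
    avg = sum(messages) / len(messages)          # average in the projected space
    return -lr * (proj @ avg)                    # lift back to a d x d update

d, r = 64, 4
proj = make_projector(d, r, seed=0)
grads = [np.random.default_rng(i).standard_normal((d, d)) for i in (1, 2, 3)]
update = server_update([client_message(g, proj) for g in grads], proj)
print(update.shape)                              # (64, 64)
```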