DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bingxuan Wang, Bin Wang, Bo Liu, DeepSeek-AI · 2024 · cs.CL · arXiv 2405.04434

Mixed citation behavior. Most common role is background (70%).

144 Pith papers citing it

Background 70% of classified citations

abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
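
As a rough worked check of the figures quoted in the abstract, the short Python sketch below derives a few ratios from them: the share of parameters activated per token, and what remains of the KV cache and training cost relative to DeepSeek 67B. It uses only the numbers stated above. The constant and function names are illustrative assumptions, not anything taken from the paper or its code.

```python
# Figures quoted in the abstract (reductions are relative to DeepSeek 67B).
TOTAL_PARAMS_B = 236.0        # total parameters, in billions
ACTIVE_PARAMS_B = 21.0        # parameters activated per token, in billions
KV_CACHE_REDUCTION = 0.933    # "reduces the KV cache by 93.3%"
TRAINING_COST_SAVING = 0.425  # "saves 42.5% of training costs"


def summarize():
    """Derive simple ratios from the abstract's headline numbers.

    The name `summarize` and the dictionary keys are hypothetical,
    chosen only for this illustration.
    """
    kv_remaining = 1.0 - KV_CACHE_REDUCTION
    return {
        # Fraction of the model's parameters used for each token.
        "active_fraction": ACTIVE_PARAMS_B / TOTAL_PARAMS_B,
        # Fraction of the baseline KV cache that is still needed.
        "kv_cache_remaining": kv_remaining,
        # Equivalent shrink factor of the KV cache.
        "kv_cache_shrink_factor": 1.0 / kv_remaining,
        # Fraction of the baseline training cost that is still paid.
        "training_cost_remaining": 1.0 - TRAINING_COST_SAVING,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

Read this way, roughly 9% of the parameters are active per token, and a 93.3% KV cache reduction corresponds to a cache about 15 times smaller than the baseline's.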

citation-role summary

background: 24 · method: 5 · dataset: 3 · baseline: 1

citation-polarity summary

background: 23 · use method: 5 · use dataset: 3 · baseline: 1 · support: 1

representative citing papers

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench scores.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

cs.DC · 2026-06-23 · unverdicted · novelty 7.0 · 2 refs

CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

Training-Free Looped Transformers

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.

Latent Cache Flow: Model-to-Model Communication Without Text

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Latent Cache Flow uses a small joint-translation-and-compression adapter to let LLMs with different contexts exchange KV cache summaries, outperforming both larger C2C adapters and text in early experiments.

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

GQLA exposes dual MQA-absorb and GQA decoding paths from identical parameters to enable hardware-adaptive LLM inference while preserving cache compression on one path and GQA-level traffic on the other.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Simply Stabilizing the Loop via Fully Looped Transformer

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

cs.LG · 2026-05-06 · conditional · novelty 7.0 · 2 refs

KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization remains unsolved.

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

cs.PF · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.

DPC: A Distributed Page Cache over CXL

cs.DC · 2026-04-21 · conditional · novelty 7.0

DPC maintains exactly one DRAM copy of each file page in a CXL-connected cluster and delivers up to 12.4X speedup (5.6X geometric mean) over replicated caches on data-sharing workloads.

Using large language models for embodied planning introduces systematic safety risks

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.

The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.

citing papers explorer

Showing 50 of 144 citing papers.

Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 138 · internal anchor
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent cs.LG · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressive quantization.
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay cs.AI · 2026-05-10 · unverdicted · none · ref 42 · internal anchor
NSER uses zero-shot LLMs to induce behavioral rules from RL trajectories, grounds them in differentiable first-order logic, and applies the symbolic structures to dynamically reweight experience replay for better sample efficiency.
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces cs.LG · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 6 · 2 links · internal anchor
OTora is a two-stage framework that generates insertion-aware adversarial triggers and ICL-guided genetic payloads to induce reasoning-level denial-of-service in tool-augmented LLM agents across multiple backbones while preserving task correctness.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 19 · 2 links · internal anchor
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Continuous Latent Diffusion Language Model cs.CL · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems cs.AR · 2026-05-07 · unverdicted · none · ref 11 · internal anchor
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism cs.DC · 2026-05-06 · unverdicted · none · ref 23 · internal anchor
Nitsum dynamically adapts tensor parallelism and GPU splits in LLM serving to raise SLO-compliant goodput by up to 5.3 times over prior systems.
The Impossibility Triangle of Long-Context Modeling cs.CL · 2026-05-06 · unverdicted · none · ref 20 · internal anchor
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints cs.LG · 2026-05-06 · unverdicted · none · ref 122 · internal anchor
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs cs.PL · 2026-05-02 · unverdicted · none · ref 6 · internal anchor
DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% while saving hundreds of thousands of GPU hours monthly.
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling cs.AI · 2026-04-29 · unverdicted · none · ref 10 · internal anchor
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling cs.CL · 2026-04-27 · unverdicted · none · ref 31 · internal anchor
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
Mixture of Heterogeneous Grouped Experts for Language Modeling cs.CL · 2026-04-25 · unverdicted · none · ref 16 · internal anchor
MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs cs.LG · 2026-04-20 · unverdicted · none · ref 32 · internal anchor
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
Multi-LLM Token Filtering and Routing for Sequential Recommendation cs.IR · 2026-04-20 · unverdicted · none · ref 18 · internal anchor
MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter cs.DC · 2026-04-16 · unverdicted · none · ref 12 · internal anchor
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving cs.LG · 2026-04-16 · unverdicted · none · ref 11 · internal anchor
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention cs.CL · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision cs.SE · 2026-04-08 · unverdicted · none · ref 82 · internal anchor
Local Platt scaling on three fine-grained confidence scores reduces calibration error for LLM-based automated code revision across tasks and models compared to global scaling alone.
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs cs.CV · 2026-04-04 · unverdicted · none · ref 28 · internal anchor
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
WIO: Upload-Enabled Computational Storage on CXL SSDs cs.OS · 2026-04-02 · unverdicted · none · ref 15 · internal anchor
WIO enables reversible computational storage on CXL SSDs via WebAssembly actors and zero-copy migration, achieving up to 2x throughput and 3.75x lower write latency.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 4 · internal anchor
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction cs.CL · 2026-03-24 · unverdicted · none · ref 12 · internal anchor
EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand full-cache inference.
Why Attend to Everything? Focus is the Key cs.CL · 2026-03-12 · conditional · none · ref 9 · internal anchor
Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
mHC: Manifold-Constrained Hyper-Connections cs.CL · 2025-12-31 · unverdicted · none · ref 10 · internal anchor
mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.
Janus: Disaggregating Attention and Experts for Scalable MoE Inference cs.DC · 2025-12-15 · unverdicted · none · ref 10 · internal anchor
JANUS disaggregates attention and MoE layers onto separate GPU pools with an expert-balancing scheduler and SLO-aware scaling, delivering up to 4.7x higher per-GPU throughput than prior MoE systems under token-level latency constraints.
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding cs.CL · 2025-12-12 · unverdicted · none · ref 14 · internal anchor
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning cs.DC · 2025-11-18 · unverdicted · none · ref 23 · internal anchor
Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.
DeepSeek-OCR: Contexts Optical Compression cs.CV · 2025-10-21 · unverdicted · none · ref 19 · internal anchor
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill cs.LG · 2025-10-09 · unverdicted · none · ref 9 · internal anchor
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference cs.DC · 2025-09-29 · unverdicted · none · ref 8 · internal anchor
GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts cs.CL · 2025-09-26 · unverdicted · none · ref 8 · internal anchor
EMoE trains MoE models so they maintain performance when the number of activated experts changes at inference, expanding the usable range to 2-3 times the training k with higher peak results.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent cs.CL · 2025-07-03 · unverdicted · none · ref 10 · internal anchor
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
Siamese Foundation Models for Crystal Structure Prediction cond-mat.mtrl-sci · 2025-03-13 · unverdicted · none · ref 47 · internal anchor
DAO pretrains Siamese diffusion-based models on stable/unstable crystal data to achieve 100% experimental match on Cr6Os2 and 2000x speedup over DFT on real superconductors.
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts cs.LG · 2025-03-07 · conditional · none · ref 8 · internal anchor
Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cs.CL · 2025-02-16 · unverdicted · none · ref 10 · internal anchor
NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
Optimization Hyper-parameter Laws for Large Language Models cs.LG · 2024-09-07 · unverdicted · none · ref 5 · internal anchor
Opt-Laws predicts LLM final training loss from LR schedules via SDE-derived convergence and escape features, with 94% Top-2 hit rate on held-out schedules and F1=0.92 for divergence detection.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling cs.LG · 2024-07-31 · unverdicted · none · ref 21 · internal anchor
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Beyond Uniform Experts: Cost-Aware Expert Execution for Efficient Multi-Device MoE Inference cs.DC · 2026-06-29 · unverdicted · none · ref 14 · internal anchor
CAEE reduces MoE inference latency 8-18% on 671B DeepSeek-R1 by cost-aware expert pruning and low-overhead compensation while keeping accuracy drop under 1%.
Conservation Laws for Modern Neural Architectures cs.LG · 2026-06-16 · unverdicted · none · ref 10 · internal anchor
Unified framework characterizes conservation laws for gradient flow in feedforward networks with GELU/SiLU/SwiGLU, multihead attention with positional encodings, and MoE models under various gating.
MiniPIC: Flexible Position-Independent Caching in <100LOC cs.LG · 2026-06-11 · unverdicted · none · ref 15 · internal anchor
MiniPIC enables multiple position-independent caching methods inside vLLM via unrotated KV storage, per-request RoPE application, and three primitives, delivering 49% prefill throughput gains and up to 100x lower cached-span TTFT on 2WikiMultihopQA.
Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing cs.LG · 2026-05-30 · unverdicted · none · ref 18 · internal anchor
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
MESA: Improving MoE Safety Alignment via Decentralized Expertise cs.LG · 2026-05-30 · unverdicted · none · ref 3 · internal anchor
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
Wall-OSS-0.5 Technical Report cs.RO · 2026-05-29 · unverdicted · none · ref 100 · internal anchor
Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π_0.5 after fine-tuning.
How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving cs.LG · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
Operator-level attention-FFN disaggregation enables ~4k tokens/s throughput for DeepSeek-V3.2 under tight TTFT/TPOT SLOs where chunked-prefill and prefill-decode baselines cannot.
NITP: Next Implicit Token Prediction for LLM Pre-training cs.CL · 2026-05-24 · unverdicted · none · ref 26 · internal anchor
NITP adds dense supervision from shallow model layers to predict implicit next-token semantics, yielding consistent downstream gains on 0.5B-9B models with ~2% extra training FLOPs.
Instant GPU Efficiency Visibility at Fleet Scale cs.DC · 2026-05-20 · unverdicted · none · ref 15 · 2 links · internal anchor
OFU is a hardware-counter metric that approximates application MFU to within 2 percentage points after tile correction and shows r=0.78 correlation on 608 production jobs.
