super hub Mixed citations

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bingxuan Wang, Bin Wang, Bo Liu, DeepSeek-AI · 2024 · cs.CL · arXiv 2405.04434

Mixed citation behavior. Most common role is background (70%).

148 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 148 citing papers more from Aixin Liu arXiv PDF

abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 24 method 5 dataset 3 baseline 1

citation-polarity summary

background 23 use method 5 use dataset 3 baseline 1 support 1

claims ledger

abstract We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSe

authors

Aixin Liu Bei Feng Bingxuan Wang Bin Wang Bo Liu DeepSeek-AI

co-cited works

representative citing papers

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench scores.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

cs.DC · 2026-06-23 · unverdicted · novelty 7.0 · 2 refs

CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.

Training-Free Looped Transformers

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.

Latent Cache Flow: Model-to-Model Communication Without Text

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Latent Cache Flow uses a small joint-translation-and-compression adapter to let LLMs with different contexts exchange KV cache summaries, outperforming both larger C2C adapters and text in early experiments.

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

GQLA exposes dual MQA-absorb and GQA decoding paths from identical parameters to enable hardware-adaptive LLM inference while preserving cache compression on one path and GQA-level traffic on the other.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Simply Stabilizing the Loop via Fully Looped Transformer

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

cs.LG · 2026-05-06 · conditional · novelty 7.0 · 2 refs

KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization remains unsolved.

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

cs.PF · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.

DPC: A Distributed Page Cache over CXL

cs.DC · 2026-04-21 · conditional · novelty 7.0

DPC maintains exactly one DRAM copy of each file page in a CXL-connected cluster and delivers up to 12.4X speedup (5.6X geometric mean) over replicated caches on data-sharing workloads.

Using large language models for embodied planning introduces systematic safety risks

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

Awakening Dormant Experts:Counterfactual Routing to Mitigate MoE Hallucinations

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.

The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.

citing papers explorer

Showing 50 of 56 citing papers after filters.

Training-Free Looped Transformers cs.LG · 2026-05-22 · unverdicted · none · ref 25 · internal anchor
Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
Latent Cache Flow: Model-to-Model Communication Without Text cs.LG · 2026-05-19 · unverdicted · none · ref 7 · internal anchor
Latent Cache Flow uses a small joint-translation-and-compression adapter to let LLMs with different contexts exchange KV cache summaries, outperforming both larger C2C adapters and text in early experiments.
Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation cs.LG · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 10 · internal anchor
LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
$\phi$-Balancing for Mixture-of-Experts Training cs.LG · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.
GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding cs.LG · 2026-05-14 · unverdicted · none · ref 3 · internal anchor
GQLA exposes dual MQA-absorb and GQA decoding paths from identical parameters to enable hardware-adaptive LLM inference while preserving cache compression on one path and GQA-level traffic on the other.
Simply Stabilizing the Loop via Fully Looped Transformer cs.LG · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 31 · internal anchor
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels cs.LG · 2026-05-06 · conditional · none · ref 6 · 2 links · internal anchor
KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization remains unsolved.
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity cs.LG · 2026-04-27 · unverdicted · none · ref 4 · internal anchor
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
Awakening Dormant Experts:Counterfactual Routing to Mitigate MoE Hallucinations cs.LG · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks cs.LG · 2026-04-11 · unverdicted · none · ref 15 · internal anchor
In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 9 · internal anchor
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts cs.LG · 2024-08-28 · conditional · none · ref 7 · internal anchor
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision cs.LG · 2024-07-11 · accept · none · ref 19 · internal anchor
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity cs.LG · 2026-06-16 · unverdicted · none · ref 3 · internal anchor
Rotary positional encodings reduce the symmetry group of functional equivalence in attention compared to sinusoidal encodings, increasing expressivity and altering linear mode connectivity patterns.
Do Transformers Need Three Projections? Systematic Study of QKV Variants cs.LG · 2026-06-01 · conditional · none · ref 98 · internal anchor
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
Blurry Window Attention cs.LG · 2026-05-31 · unverdicted · none · ref 25 · internal anchor
Blurry Window Attention stores a frequency window and reconstructs blurry KV history via Dirichlet kernel interpolation, achieving 8x better state efficiency than sliding window attention on the MQAR synthetic task.
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training cs.LG · 2026-05-26 · unverdicted · none · ref 29 · internal anchor
MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization cs.LG · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
BitsMoE uses SVD decomposition and activation-aware ILP bit allocation to quantize MoE LLMs at ultra-low bits with reduced accuracy degradation compared to GPTQ.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 17 · internal anchor
GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models cs.LG · 2026-05-15 · unverdicted · none · ref 47 · internal anchor
UB-SMoE balances expert utilization in heterogeneous federated SMoE fine-tuning via Dynamic Modulated Routing and Universal Pseudo-Gradient, delivering up to 45% compute reduction and 8.7x performance gains for low-resource clients over prior LoRA-rank methods.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility cs.LG · 2026-05-13 · unverdicted · none · ref 61 · internal anchor
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
EMO: Frustratingly Easy Progressive Training of Extendable MoE cs.LG · 2026-05-13 · unverdicted · none · ref 7 · 2 links · internal anchor
EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.
Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 138 · internal anchor
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent cs.LG · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressive quantization.
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces cs.LG · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 6 · 2 links · internal anchor
OTora is a two-stage framework that generates insertion-aware adversarial triggers and ICL-guided genetic payloads to induce reasoning-level denial-of-service in tool-augmented LLM agents across multiple backbones while preserving task correctness.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints cs.LG · 2026-05-06 · unverdicted · none · ref 122 · internal anchor
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs cs.LG · 2026-04-20 · unverdicted · none · ref 32 · internal anchor
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving cs.LG · 2026-04-16 · unverdicted · none · ref 11 · internal anchor
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 4 · internal anchor
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill cs.LG · 2025-10-09 · unverdicted · none · ref 9 · internal anchor
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts cs.LG · 2025-03-07 · conditional · none · ref 8 · internal anchor
Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
Optimization Hyper-parameter Laws for Large Language Models cs.LG · 2024-09-07 · unverdicted · none · ref 5 · internal anchor
Opt-Laws predicts LLM final training loss from LR schedules via SDE-derived convergence and escape features, with 94% Top-2 hit rate on held-out schedules and F1=0.92 for divergence detection.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling cs.LG · 2024-07-31 · unverdicted · none · ref 21 · internal anchor
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Conservation Laws for Modern Neural Architectures cs.LG · 2026-06-16 · unverdicted · none · ref 10 · internal anchor
Unified framework characterizes conservation laws for gradient flow in feedforward networks with GELU/SiLU/SwiGLU, multihead attention with positional encodings, and MoE models under various gating.
MiniPIC: Flexible Position-Independent Caching in <100LOC cs.LG · 2026-06-11 · unverdicted · none · ref 15 · internal anchor
MiniPIC enables multiple position-independent caching methods inside vLLM via unrotated KV storage, per-request RoPE application, and three primitives, delivering 49% prefill throughput gains and up to 100x lower cached-span TTFT on 2WikiMultihopQA.
Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing cs.LG · 2026-05-30 · unverdicted · none · ref 18 · internal anchor
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
MESA: Improving MoE Safety Alignment via Decentralized Expertise cs.LG · 2026-05-30 · unverdicted · none · ref 3 · internal anchor
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving cs.LG · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
Operator-level attention-FFN disaggregation enables ~4k tokens/s throughput for DeepSeek-V3.2 under tight TTFT/TPOT SLOs where chunked-prefill and prefill-decode baselines cannot.
CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution cs.LG · 2026-05-18 · unverdicted · none · ref 16 · internal anchor
CoX-MoE achieves up to 7.1x higher throughput than FlexGen for MoE inference via coalesced expert execution and AMX-enabled CPU-GPU orchestration with static expert stratification.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 8 · internal anchor
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
Mesh Based Simulations with Spatial and Temporal awareness cs.LG · 2026-05-02 · unverdicted · none · ref 159 · internal anchor
A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention correction, and adding 3D rotary positional embeddings.
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control cs.LG · 2026-04-21 · unverdicted · none · ref 18 · internal anchor
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation cs.LG · 2026-04-14 · unverdicted · none · ref 13 · internal anchor
VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.
GLM-5: from Vibe Coding to Agentic Engineering cs.LG · 2026-02-17 · unverdicted · none · ref 24 · internal anchor
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining cs.LG · 2026-02-11 · conditional · none · ref 24 · internal anchor
SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 48 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer