UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
Mixed citations
Title resolution pending
Mixed citation behavior. Most common role is background (57%).
citation-role summary
citation-polarity summary
representative citing papers
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.
MCompassRAG adds topic metadata to chunk representations and uses LLM distillation to train a lightweight topic-aware retriever, reporting 8.24% average information efficiency gain and over 5x lower latency than strong baselines across six benchmarks.
ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.
EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.
Still is an amortized per-layer Perceiver that synthesizes compact KV caches in one forward pass, outperforming selection and per-context baselines on RULER, HELMET, and LongBench at 8-200x compression.
CCQ adds a curvature-based query contraction to linear attention backbones, improving perplexity, retrieval, and long-context performance on GLA and Gated DeltaNet at low extra cost.
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
LKV learns task-optimized global budgets and intrinsic KV token importance without attention matrices, delivering near-lossless performance at 15% cache retention on LongBench.
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.
SIFT precomputes selective attention indices via local and cross-attention invariance to speed RAG prefill 1.71x while keeping accuracy within 1% of full recompute, storing only bit vectors 24,000x smaller than KV tensors.
WaveFilter applies wavelet decomposition to filter critical tokens for sparse KV caching, improving long-context performance of diffusion LLMs as a plug-and-play addition to existing methods.
citing papers explorer
-
UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
-
LegalWorld: A Life-Cycle Interactive Environment for Legal Agents
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
-
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
-
NARRA-Gym for Evaluating Interactive Narrative Agents
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
-
HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval
HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.
-
MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval
MCompassRAG adds topic metadata to chunk representations and uses LLM distillation to train a lightweight topic-aware retriever, reporting 8.24% average information efficiency gain and over 5x lower latency than strong baselines across six benchmarks.
-
ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation
ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.
-
End-to-End Context Compression at Scale
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.
-
From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs
EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.
-
Still: Amortized KV Cache Compaction in a Single Forward Pass
Still is an amortized per-layer Perceiver that synthesizes compact KV caches in one forward pass, outperforming selection and per-context baselines on RULER, HELMET, and LongBench at 8-200x compression.
-
Don't Read Everything: A Curvature-Conditioned Query for Linear Attention
CCQ adds a curvature-based query contraction to linear attention backbones, improving perplexity, retrieval, and long-context performance on GLA and Gated DeltaNet at low extra cost.
-
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
-
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
LKV learns task-optimized global budgets and intrinsic KV token importance without attention matrices, delivering near-lossless performance at 15% cache retention on LongBench.
-
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
-
StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.
-
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
-
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
-
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.
-
SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance
SIFT precomputes selective attention indices via local and cross-attention invariance to speed RAG prefill 1.71x while keeping accuracy within 1% of full recompute, storing only bit vectors 24,000x smaller than KV tensors.
-
WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering
WaveFilter applies wavelet decomposition to filter critical tokens for sparse KV caching, improving long-context performance of diffusion LLMs as a plug-and-play addition to existing methods.
-
GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs
GRKV applies global ridge regression to KV cache merging for span-based retention in long-context LLMs, claiming to be the only method that improves benchmark performance with minimal overhead.
-
How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
Introduces Parametric Memory Law as power law for LoRA memory capacity and MemFT threshold-guided optimization for better memory fidelity.
-
Language models fail at extended rule following
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
-
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
-
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.
-
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEval-S benchmarks.
- Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
- Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
- A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation
- Structured Recurrent Mixers for Massively Parallelized Sequence Generation