LongBench

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li · 2024 · DOI 10.18653/v1/2024.acl-long.172

Mixed citation behavior. Most common role is background (57%).

40 Pith papers citing it

citation-role summary

background: 5 · dataset: 2

citation-polarity summary

background: 4 · use dataset: 2 · unclear: 1

representative citing papers

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Introduces Indi-RomCoM benchmark for evaluating LLMs on Romanized code-mixed Indic-English instructions across seven tasks, four languages, and three mixing levels.

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

NARRA-Gym for Evaluating Interactive Narrative Agents

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

MosaicKV achieves up to 16x attention speedup, 4.8x lower decode latency, 7.3x higher throughput, and 3x memory reduction with 1.76% accuracy loss via dynamic two-D KV cache compression and management on H800 GPUs.

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

SeKV introduces resolution-adaptive semantic KV caching with GPU-CPU hierarchy and selective zoom-in reconstruction, achieving 5.9% average improvement over semantic baselines and 53.3% GPU memory reduction at 128K context.

HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval

cs.LG · 2026-06-19 · unverdicted · novelty 6.0

HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

MCompassRAG adds topic metadata to chunk representations and uses LLM distillation to train a lightweight topic-aware retriever, reporting 8.24% average information efficiency gain and over 5x lower latency than strong baselines across six benchmarks.

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.

End-to-End Context Compression at Scale

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

cs.AI · 2026-06-08 · unverdicted · novelty 6.0

EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.

Still: Amortized KV Cache Compaction in a Single Forward Pass

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

Still is an amortized per-layer Perceiver that synthesizes compact KV caches in one forward pass, outperforming selection and per-context baselines on RULER, HELMET, and LongBench at 8-200x compression.

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

CCQ adds a curvature-based query contraction to linear attention backbones, improving perplexity, retrieval, and long-context performance on GLA and Gated DeltaNet at low extra cost.

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

cs.CL · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

RTPurbo converts full-attention LLMs to sparse attention by retaining full KV for retrieval heads and using a low-dimensional dynamic indexer, achieving near-lossless accuracy after minimal adaptation.

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

cs.CL · 2026-05-15 · unverdicted · novelty 6.0 · 4 refs

A new 30k-instance semantic segmentation dataset plus block distillation with sink tokens, dropout, and weighted loss lets block-attention models reach near full-attention performance on long texts.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

cs.CL · 2026-05-09 · unverdicted · novelty 6.0 · 3 refs

Structured Recurrent Mixers provide a dual parallel-recurrent representation for sequence models, claiming superior training efficiency, information capacity, and inference throughput over linear complexity alternatives.

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

cs.DC · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

cs.LG · 2026-04-22 · conditional · novelty 6.0

LKV learns task-optimized global budgets and intrinsic KV token importance without attention matrices, delivering near-lossless performance at 15% cache retention on LongBench.

SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

cs.LG · 2026-04-18 · unverdicted · novelty 6.0

SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

citing papers explorer

Showing 24 of 24 citing papers after filters.

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions
cs.CL · 2026-06-29 · unverdicted · none · ref 68
Introduces Indi-RomCoM benchmark for evaluating LLMs on Romanized code-mixed Indic-English instructions across seven tasks, four languages, and three mixing levels.

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents
cs.CL · 2026-06-17 · unverdicted · none · ref 51
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
cs.CL · 2026-06-03 · unverdicted · none · ref 38
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

NARRA-Gym for Evaluating Interactive Narrative Agents
cs.CL · 2026-05-08 · unverdicted · none · ref 7
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
cs.CL · 2026-04-05 · unverdicted · none · ref 2
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference
cs.CL · 2026-06-30 · unverdicted · none · ref 42
SeKV introduces resolution-adaptive semantic KV caching with GPU-CPU hierarchy and selective zoom-in reconstruction, achieving 5.9% average improvement over semantic baselines and 53.3% GPU memory reduction at 128K context.

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval
cs.CL · 2026-06-16 · unverdicted · none · ref 34
MCompassRAG adds topic metadata to chunk representations and uses LLM distillation to train a lightweight topic-aware retriever, reporting 8.24% average information efficiency gain and over 5x lower latency than strong baselines across six benchmarks.

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation
cs.CL · 2026-06-16 · unverdicted · none · ref 33
ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.

End-to-End Context Compression at Scale
cs.CL · 2026-06-08 · unverdicted · none · ref 6
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention
cs.CL · 2026-05-31 · unverdicted · none · ref 20
CCQ adds a curvature-based query contraction to linear attention backbones, improving perplexity, retrieval, and long-context performance on GLA and Gated DeltaNet at low extra cost.

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
cs.CL · 2026-05-19 · unverdicted · none · ref 28
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
cs.CL · 2026-05-16 · unverdicted · none · ref 1 · 2 links
RTPurbo converts full-attention LLMs to sparse attention by retaining full KV for retrieval heads and using a low-dimensional dynamic indexer, achieving near-lossless accuracy after minimal adaptation.

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
cs.CL · 2026-05-15 · unverdicted · none · ref 1 · 4 links
A new 30k-instance semantic segmentation dataset plus block distillation with sink tokens, dropout, and weighted loss lets block-attention models reach near full-attention performance on long texts.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation
cs.CL · 2026-05-09 · unverdicted · none · ref 28 · 3 links
Structured Recurrent Mixers provide a dual parallel-recurrent representation for sequence models, claiming superior training efficiency, information capacity, and inference throughput over linear complexity alternatives.

StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
cs.CL · 2026-04-08 · unverdicted · none · ref 1
StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.

Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
cs.CL · 2026-04-02 · unverdicted · none · ref 5
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.

Latent Bridges for Multi-Table Question Answering
cs.CL · 2026-06-27 · unverdicted · none · ref 35
GRAB improves multi-table QA performance by encoding relational data as graphs and bridging structural signals to frozen LLMs through latent tokens.

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering
cs.CL · 2026-05-30 · unverdicted · none · ref 16
WaveFilter applies wavelet decomposition to filter critical tokens for sparse KV caching, improving long-context performance of diffusion LLMs as a plug-and-play addition to existing methods.

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs
cs.CL · 2026-05-29 · unverdicted · none · ref 3
GRKV applies global ridge regression to KV cache merging for span-based retention in long-context LLMs, claiming to be the only method that improves benchmark performance with minimal overhead.

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
cs.CL · 2026-05-28 · unverdicted · none · ref 3
Introduces Parametric Memory Law as power law for LoRA memory capacity and MemFT threshold-guided optimization for better memory fidelity.

ATLAS: All-round Testing of Long-context Abilities across Scales
cs.CL · 2026-05-27 · unverdicted · none · ref 6
ATLAS is a length-dependent benchmarking framework that evaluates 26 models on 8 capability dimensions and shows substantial rank changes when moving from 128K to 1M token ranges.

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation
cs.CL · 2026-05-12 · unverdicted · none · ref 61
Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.

Language models fail at extended rule following
cs.CL · 2026-05-03 · unverdicted · none · ref 47
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.

MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers
cs.CL · 2026-06-29 · unverdicted · none · ref 75
MATCH augments sparsified attention with an efficient in-context retrieval system to boost performance on long-range recall tasks in transformers.
