{"total":39,"items":[{"citing_arxiv_id":"2606.31145","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference","primary_cat":"cs.CL","submitted_at":"2026-06-30T05:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeKV introduces resolution-adaptive semantic KV caching with GPU-CPU hierarchy and selective zoom-in reconstruction, achieving 5.9% average improvement over semantic baselines and 53.3% GPU memory reduction at 128K context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30790","ref_index":68,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions","primary_cat":"cs.CL","submitted_at":"2026-06-29T18:19:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Indi-RomCoM benchmark for evaluating LLMs on Romanized code-mixed Indic-English instructions across seven tasks, four languages, and three mixing levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29844","ref_index":75,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers","primary_cat":"cs.CL","submitted_at":"2026-06-29T06:33:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MATCH augments sparsified attention with an efficient in-context retrieval system to boost performance on long-range recall tasks in 
transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28916","ref_index":35,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Latent Bridges for Multi-Table Question Answering","primary_cat":"cs.CL","submitted_at":"2026-06-27T13:48:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GRAB improves multi-table QA performance by encoding relational data as graphs and bridging structural signals to frozen LLMs through latent tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21633","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval","primary_cat":"cs.LG","submitted_at":"2026-06-19T17:36:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18728","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LegalWorld: A Life-Cycle Interactive Environment for Legal Agents","primary_cat":"cs.CL","submitted_at":"2026-06-17T06:11:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent 
evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18508","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval","primary_cat":"cs.CL","submitted_at":"2026-06-16T21:50:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MCompassRAG adds topic metadata to chunk representations and uses LLM distillation to train a lightweight topic-aware retriever, reporting 8.24% average information efficiency gain and over 5x lower latency than strong baselines across six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18056","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation","primary_cat":"cs.CL","submitted_at":"2026-06-16T15:33:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09659","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"End-to-End Context Compression at Scale","primary_cat":"cs.CL","submitted_at":"2026-06-08T15:43:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and 
peak memory in long-context language model inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09508","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs","primary_cat":"cs.AI","submitted_at":"2026-06-08T14:02:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09441","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance","primary_cat":"cs.AI","submitted_at":"2026-06-08T12:50:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SIFT precomputes selective attention indices via local and cross-attention invariance to speed RAG prefill 1.71x while keeping accuracy within 1% of full recompute, storing only bit vectors 24,000x smaller than KV tensors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07878","ref_index":60,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Still: Amortized KV Cache Compaction in a Single Forward Pass","primary_cat":"cs.LG","submitted_at":"2026-06-05T22:21:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Still is an amortized per-layer Perceiver that synthesizes compact 
KV caches in one forward pass, outperforming selection and per-context baselines on RULER, HELMET, and LongBench at 8-200x compression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04302","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding","primary_cat":"cs.CL","submitted_at":"2026-06-03T00:12:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04101","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing","primary_cat":"cs.DC","submitted_at":"2026-06-02T18:07:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01294","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Don't Read Everything: A Curvature-Conditioned Query for Linear 
Attention","primary_cat":"cs.CL","submitted_at":"2026-05-31T15:25:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CCQ adds a curvature-based query contraction to linear attention backbones, improving perplexity, retrieval, and long-context performance on GLA and Gated DeltaNet at low extra cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00724","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering","primary_cat":"cs.CL","submitted_at":"2026-05-30T13:32:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WaveFilter applies wavelet decomposition to filter critical tokens for sparse KV caching, improving long-context performance of diffusion LLMs as a plug-and-play addition to existing methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31105","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-29T10:16:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GRKV applies global ridge regression to KV cache merging for span-based retention in long-context LLMs, claiming to be the only method that improves benchmark performance with minimal overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30260","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"How LoRA Remembers? 
A Parametric Memory Law for LLM Finetuning","primary_cat":"cs.CL","submitted_at":"2026-05-28T17:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces Parametric Memory Law as power law for LoRA memory capacity and MemFT threshold-guided optimization for better memory fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28079","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ATLAS: All-round Testing of Long-context Abilities across Scales","primary_cat":"cs.CL","submitted_at":"2026-05-27T07:33:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ATLAS is a length-dependent benchmarking framework that evaluates 26 models on 8 capability dimensions and shows substantial rank changes when moving from 128K to 1M token ranges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20128","ref_index":28,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:15:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16928","ref_index":1,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training 
Steps","primary_cat":"cs.CL","submitted_at":"2026-05-16T10:51:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTPurbo converts full-attention LLMs to sparse attention by retaining full KV for retrieval heads and using a low-dimensional dynamic indexer, achieving near-lossless accuracy after minimal adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15913","ref_index":1,"ref_count":4,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-15T12:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new 30k-instance semantic segmentation dataset plus block distillation with sink tokens, dropout, and weighted loss lets block-attention models reach near full-attention performance on long texts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12227","ref_index":61,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:04:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08696","ref_index":28,"ref_count":3,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Structured Recurrent Mixers for 
Massively Parallelized Sequence Generation","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:07:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structured Recurrent Mixers provide a dual parallel-recurrent representation for sequence models, claiming superior training efficiency, information capacity, and inference throughput over linear complexity alternatives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08503","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"NARRA-Gym for Evaluating Interactive Narrative Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T21:36:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07313","ref_index":52,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory","primary_cat":"cs.AI","submitted_at":"2026-05-08T06:22:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and 
scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08151","ref_index":16,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference","primary_cat":"cs.DC","submitted_at":"2026-05-04T01:27:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5B (TP1) → DeepSeek-R1-Distill-Qwen-32B (TP1) [10]. The draft and target models are deployed on separate H200 GPUs [11] and communicate remotely through our serving framework. Additional implementation details are provided in Appendix A. Datasets. We evaluate SPECTRE on six datasets: GSM8K [12], MATH500 [13], Minerva Math, LongBench [14, 15], MRCR [16], and ShareGPT. These datasets cover grade-school math, advanced mathematical reasoning, STEM-oriented problem solving, long-context question answering and retrieval, and conversational data. Baselines. 
We compare SPECTRE against autoregressive decoding (AR), Standalone [17], EAGLE-3 [18], PEARL [6], and MineDraft [19], under the same target model, decoding configuration, batch size, and"},{"citing_arxiv_id":"2605.02028","ref_index":47,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Language models fail at extended rule following","primary_cat":"cs.CL","submitted_at":"2026-05-03T19:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"only a single parameter count are marked as Dense; unavailable cases are left as \"-\". Model Name Developer/Family Parameter Count Nominal Context Window Architecture Measured CC claude-4-opus (43) Anthropic - - - 2176 claude-4.1-opus (44) Anthropic - - - 1920 claude-4.5-opus (45) Anthropic - - - 1536 claude-4.5-sonnet (46) Anthropic - - - 896 claude-4.6-opus (47) Anthropic - - - 768 gpt-5.4 (48) OpenAI - - - 608 claude-4-sonnet (49) Anthropic - - - 544 claude-4.6-sonnet (50) Anthropic - - - 512 claude-3.7-sonnet (51) Anthropic - 200,000 - 352 claude-3.7-sonnet (thinking) (52) Anthropic - 200,000 - 352 gemini-3-pro-preview (53) Google - - - 352 gemini-3.1-pro-preview (54) Google - - - 352 gemini-3-flash-preview (55) Google - - - 304"},{"citing_arxiv_id":"2604.26837","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving","primary_cat":"cs.LG","submitted_at":"2026-04-29T16:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPIN co-designs sparse attention with 
hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, Spin delivers 1.66-5.66× higher end-to-end throughput and 7-9× lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations. 1 Introduction To support increasingly sophisticated workloads such as long-context reasoning [6, 57, 61], document understanding [24, 39, 60], and code generation [8, 51, 73], modern LLMs have expanded their context windows to hundreds of thousands or even millions of tokens [5, 20, 42, 43]. (∗Work performed during the internship while at Microsoft Research.) This trend makes long-context serving fundamentally costly because the key-value (KV) cache [45], intermediate states in atten-"},{"citing_arxiv_id":"2605.06683","ref_index":75,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T20:37:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22575","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform 
Inference","primary_cat":"cs.LG","submitted_at":"2026-04-24T14:07:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"sparsity, which helps mitigate attention distraction caused by noisy long sequences (Pan et al., 2025a). Replaced Layer Index (from shallow to deep) Identifying MoBA Layers via Performance Sensitivity LongBench Score MMLU Score Selected as MoBA (e.g. Layer #24) SSE-SWA Block MoBA Block Full Attention Block Resulting Layer Assignment [35] (last layer) [0,1,2,3,6,12,17,21,24] (selected by sensitivity) [4,5,7,8,...,33,34] (all remaining 26 layers) Located at which layer(s)? Which kind of block? Figure 2: Layerwise performance sensitivity and resulting layer assignment for SpB2.0-5B. Left: each point denotes the performance of a candidate model obtained by replacing a single FA layer with SSE. 
Dashed lines indicate the Qwen3 baseline performance on MMLU and LongBench."},{"citing_arxiv_id":"2605.06676","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction","primary_cat":"cs.LG","submitted_at":"2026-04-22T06:35:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LKV learns task-optimized global budgets and intrinsic KV token importance without attention matrices, delivering near-lossless performance at 15% cache retention on LongBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16883","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models","primary_cat":"cs.LG","submitted_at":"2026-04-18T07:23:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08216","ref_index":3,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought","primary_cat":"cs.MA","submitted_at":"2026-04-09T13:13:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, 
achieving SOTA on LoCoMo and LongMemEval-S benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"treated as a single-step matching process [12] between the current query and stored history. The agent must decide what to retrieve before knowing what information will become relevant during reasoning. Consequently, retrieval either becomes overly broad, introducing excessive irrelevant context that dilutes reasoning, or overly narrow, omitting crucial information and causing semantic fragmentation [3]. In essence, existing AI memory technologies encounter two fundamental bottlenecks in long-context multi-hop reasoning: (1) Search Modeling Dilemma in Long Contexts: Extended texts are replete with dense entities and high-frequency noise. It is difficult to predict complex user intents a priori, causing static knowledge bases to suffer from severe semantic dilution during rela-"},{"citing_arxiv_id":"2604.06746","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference","primary_cat":"cs.CL","submitted_at":"2026-04-08T07:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20197","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept 
Extraction","primary_cat":"cs.CL","submitted_at":"2026-04-05T14:11:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02985","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference","primary_cat":"cs.IR","submitted_at":"2026-04-03T11:41:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"1 Compression Overhead (Microbenchmark) To isolate the overhead introduced by the compression process itself, we benchmarked the runtime of the compression algorithm, separating total execution time from model inference time on three hardware setups (see Table 1). Compression was performed with a batch size of one using a single prompt from the LongBench dataset [1], truncated to the target input length by randomly sampling from its ∼51,000-token context. We evaluated prompt lengths ranging from 50 to 48,000 tokens and applied compression ratios of 1.5×, 2×, 3×, and 5×. Unless otherwise specified, results refer to 2× compression (i.e., 50% compression). 
Our analysis examines key latency factors: prompt length, compression ratio, and hardware"},{"citing_arxiv_id":"2604.01707","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework","primary_cat":"cs.CL","submitted_at":"2026-04-02T07:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Memory is incorporated into LLM-based agents to compensate for the bounded context window [26, 52, 57]. Since the model conditions only on a limited number of tokens, information that comes from earlier turns and lies outside the current prompt can be easily lost, which degrades performance in long-horizon dialogue and multi-session tasks [77, 91, 94]. An explicit memory module [5, 21, 66, 67, 83] allows the system to persist interaction-derived information (such as user preferences, salient events, intermediate decisions, and task constraints) and to reintroduce it when relevant, thereby improving consistency and enabling reasoning that depends on long-term context [21, 67]. 
The typical workflow of memory-augmented systems begins by"},{"citing_arxiv_id":"2511.00868","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management","primary_cat":"cs.LG","submitted_at":"2025-11-02T09:33:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}