Lost in the Middle: How Language Models Use Long Contexts

Sara Papi · 2019 · cs.CL · DOI 10.1162/tacl · arXiv 2307.03172

Canonical reference. 78% of citing Pith papers cite this work as background.

162 Pith papers citing it

abstract

While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
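The key-value retrieval setup described in the abstract can be sketched as a small harness that places one target pair at a chosen position among distractors and measures accuracy per position. This is a minimal sketch, not the paper's implementation: the prompt wording, the use of random UUIDs, the substring-match scoring, and the names `make_kv_pairs`, `format_prompt`, `accuracy_by_position`, and `model_fn` are all assumptions made for illustration.

```python
import json
import random
import uuid


def make_kv_pairs(num_pairs, gold_index, seed=0):
    """Build num_pairs random key-value pairs with the target pair at gold_index."""
    if not 0 <= gold_index < num_pairs:
        raise ValueError("gold_index must be in [0, num_pairs)")
    rng = random.Random(seed)

    def rand_uuid():
        # Deterministic UUID-shaped string drawn from the seeded generator.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = [(rand_uuid(), rand_uuid()) for _ in range(num_pairs)]
    gold, distractors = pairs[0], pairs[1:]
    # Insert the target pair at the requested position among the distractors.
    ordered = distractors[:gold_index] + [gold] + distractors[gold_index:]
    return ordered, gold[0], gold[1]


def format_prompt(ordered, gold_key):
    """Serialize the pairs as JSON and ask for the value of gold_key (hypothetical wording)."""
    context = json.dumps(dict(ordered), indent=1)
    return (
        "Extract the value for the specified key from the JSON object below.\n\n"
        f"{context}\n\nKey: \"{gold_key}\"\nCorresponding value:"
    )


def accuracy_by_position(model_fn, num_pairs, positions, trials=5):
    """Return {position: accuracy}; model_fn is any callable mapping prompt -> text."""
    results = {}
    for pos in positions:
        correct = 0
        for trial in range(trials):
            ordered, key, value = make_kv_pairs(num_pairs, pos, seed=trial)
            # Count a trial as correct if the target value appears in the output.
            if value in model_fn(format_prompt(ordered, key)):
                correct += 1
        results[pos] = correct / trials
    return results
```

With a real model behind `model_fn`, comparing the accuracy at the first, middle, and last positions is one way to probe for the position sensitivity the abstract reports.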

citation-role summary

background 32 · dataset 2 · method 2

citation-polarity summary

background 28 · unclear 3 · use dataset 2 · use method 2 · support 1

authors

Ondřej Bojar, Dominik Macháček, Peter Polák, Sara Papi

representative citing papers

Submodular Ground-Set Pruning: Monotone Tightness and a Non-Monotone Separation

cs.DS · 2026-05-06 · unverdicted · novelty 8.0

For monotone submodular maximization, containment pruning has a tight 1-1/e factor; for non-monotone objectives, 1/2-ε algorithms exist that exceed known optimization hardness bounds.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

cs.CL · 2023-10-10 · unverdicted · novelty 8.0

SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

cs.CL · 2026-05-22 · conditional · novelty 7.0

Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.

Brain-LLM Alignment Tracks Training Data, Not Typology

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Training-language dominance, not inherent properties of English, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.

On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Low-precision softmax transformers with chain-of-thought simulate Turing machines at logarithmic depth and width; summarized CoT improves this to logarithmic space scaling.

GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering

cs.MA · 2026-05-15 · unverdicted · novelty 7.0

GRASP introduces a hierarchical graph-based agentic retrieval method that achieves top accuracy on MuSiQue, 2WikiMultihopQA, and HotpotQA while using 30-50% fewer tokens than strong baselines.

Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

Agentic interpretation uses lattices to track LLM judgments on decomposed program claims during analysis.

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.

AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.

Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Spiking attention is a universal approximator of permutation-equivariant functions with ε-approximation requiring Ω(L_f² nd / ε²) spikes, but low effective dimensions (47-89) allow T=4 timesteps in practice.

IE as Cache: Information Extraction Enhanced Agentic Reasoning

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.

In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.

citing papers explorer

Showing 26 of 26 citing papers after filters.

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity cs.AI · 2026-05-12 · unverdicted · none · ref 199 · internal anchor
MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory cs.AI · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments cs.AI · 2026-03-24 · unverdicted · none · ref 31 · internal anchor
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions cs.AI · 2026-02-05 · unverdicted · none · ref 1 · internal anchor
GenLoRA replaces explicit low-rank basis storage with RBF-generated vectors from latent codes, yielding higher effective ranks and stronger fine-tuning performance at lower parameter cost.
E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory cs.AI · 2026-01-29 · unverdicted · none · ref 2 · internal anchor
E-mem uses a heterogeneous multi-agent setup for episodic context reconstruction in LLM agents, reaching over 54% F1 on LoCoMo while cutting token cost by over 70% compared to prior methods like GAM.
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning cs.AI · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management cs.AI · 2026-05-12 · unverdicted · none · ref 42 · 2 links · internal anchor
LIDSA applies LLMs as primary decision-makers for signal-free intersection management, achieving up to 89% lower control delay and 93% lower waiting time versus fixed-cycle and other baselines in simulation.
Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention cs.AI · 2026-05-12 · unverdicted · none · ref 198 · internal anchor
SPeCTrA-Sum uses hierarchical cross-modal fusion via DVP and DPP-distilled image selection via VRP to generate more accurate and visually grounded multimodal summaries.
Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation cs.AI · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
Primacy, anchoring, and order-dependence are architecturally necessary in autoregressive models due to causal masking constraints, with supporting evidence from theorems, LLM fits, and human experiments.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction cs.AI · 2026-04-30 · unverdicted · none · ref 8 · internal anchor
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection cs.AI · 2026-04-12 · unverdicted · none · ref 13 · internal anchor
TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts while matching supervised ML on lung cancer and outperforming single-agent baselines.
Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction cs.AI · 2025-10-12 · unverdicted · none · ref 11 · internal anchor
Traj-CoA is a multi-agent LLM framework that sequentially processes noisy five-year EHR data via worker agents into EHRMem for manager-agent lung cancer risk prediction and outperforms four categories of baselines in zero-shot evaluation.
SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction cs.AI · 2025-09-29 · unverdicted · none · ref 17 · internal anchor
SynthPert fine-tunes LLMs using synthetic reasoning traces to reach state-of-the-art on the PerturbQA benchmark for cellular perturbation prediction, surpassing the generating frontier model while generalizing to unseen cell types with only 2% of filtered data.
Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction cs.AI · 2025-09-15 · unverdicted · none · ref 20 · internal anchor
A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 225 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
MemGPT: Towards LLMs as Operating Systems cs.AI · 2023-10-12 · unverdicted · none · ref 13 · internal anchor
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents cs.AI · 2026-05-17 · unverdicted · none · ref 1 · 2 links · internal anchor
NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.
Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning cs.AI · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
What Deserves Memory: Adaptive Memory Distillation for LLM Agents cs.AI · 2025-08-05 · unverdicted · none · ref 2 · internal anchor
NEMORI is an adaptive memory distillation framework for LLM agents that transforms raw interactions into narratives and extracts insights via prediction error to decide what deserves retention.
BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs cs.AI · 2023-06-06 · unverdicted · none · ref 33 · internal anchor
BioBLP is a modular embedding framework for multimodal biomedical KGs supporting heterogeneous attributes and missing data, with a pretraining strategy that improves results on drug-protein interaction prediction especially for low-degree entities.
SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model cs.AI · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
SOM uses a Structural Causal Model to create an explicit graph of opponent observation-to-action links, allowing LLMs to reason along those paths for more accurate and stable predictions in multi-agent settings.
A Survey on the Memory Mechanism of Large Language Model based Agents cs.AI · 2024-04-21 · accept · none · ref 120 · internal anchor
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
AMEL: Accumulated Message Effects on LLM Judgments cs.AI · 2026-05-21 · unreviewed · ref 17 · internal anchor
MMSkills: Towards Multimodal Skills for General Visual Agents cs.AI · 2026-05-13 · unreviewed · ref 17 · 2 links · internal anchor
Don't Make the LLM Read the Graph: Make the Graph Think cs.AI · 2026-04-24 · unreviewed · ref 5 · internal anchor
