Lost in the Middle: How Language Models Use Long Contexts
Pith reviewed 2026-05-08 22:47 UTC · model claude-opus-4-7 · 3 linked Lean theorems
The pith
Language models reliably use information at the start and end of their input context but lose track of material placed in the middle, producing a U-shaped accuracy curve even in models built for long contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across multi-document question answering and a synthetic key-value lookup task, the authors show that current language models — including ones explicitly marketed as long-context — do not treat their input window uniformly. Accuracy is highest when the relevant passage sits at the very start or very end of the context and drops sharply, sometimes by more than 20 points, when the same passage is buried in the middle. In the worst case, GPT-3.5-Turbo with 20 or 30 retrieved documents performs worse than with no documents at all. The effect persists for extended-context variants, base (non-instruction-tuned) models, and most encoder-decoder models once sequences exceed their training length, suggesting it is not confined to any single architecture or training recipe.
What carries the argument
A controlled position-sweep experiment: hold the question and the gold document fixed, vary only where the gold document is placed among k distractors, and plot accuracy as a function of that position. The same protocol is run on a semantics-free key-value retrieval task built from random UUIDs, isolating retrieval from comprehension. The shape of the resulting curve — flat, monotone, or U-shaped — becomes the diagnostic for whether a model uses its context uniformly.
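The protocol is simple enough to express directly. Below is a minimal sketch of a position-sweep harness, not the paper's released code; `query_model` and `is_correct` are assumed callables standing in for the model API and the paper's answer-matching metric, and the prompt format is illustrative.

```python
def build_prompt(question, gold_doc, distractors, gold_position):
    """Place the gold document at a fixed slot among the distractors,
    holding the question and document contents constant."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    context = "\n\n".join(f"Document [{i+1}]: {d}" for i, d in enumerate(docs))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

def position_sweep(examples, k, query_model, is_correct):
    """Accuracy as a function of gold-document position among k distractors.
    A flat curve suggests uniform context use; a U-shape does not."""
    accuracy = {}
    for pos in range(k + 1):  # k distractors leave k+1 insertion slots
        correct = 0
        for ex in examples:
            prompt = build_prompt(ex["question"], ex["gold"],
                                  ex["distractors"][:k], pos)
            correct += is_correct(query_model(prompt), ex["answers"])
        accuracy[pos] = correct / len(examples)
    return accuracy
```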
If this is right
- Headline context-window numbers (4K, 16K, 100K) overstate usable capacity; the effective window is the region where position-conditioned accuracy is roughly flat.
Where Pith is reading between the lines
- The U-shape echoes the serial-position effect from human memory research; if the underlying cause is similar (rehearsal-like reinforcement of edges), it predicts the dip should worsen as the middle region grows, which is consistent with the encoder-decoder result that the curve only emerges past training-time sequence length.
Load-bearing premise
The diagnostic assumes that accuracy on these two tasks faithfully reflects how the model uses context in general; if real workloads have different prompt structure or distractor statistics, the U-shape might be milder or sharper than reported.
What would settle it
Run the same position-sweep on a model and observe flat accuracy across all positions of the gold document, with best-minus-worst gap under a few percent, on contexts well inside its advertised window. Claude-1.3 already does this on the synthetic key-value task, showing the curve is not inevitable; a model that did the same on multi-document QA at 20 and 30 documents would refute the generality of the lost-in-the-middle effect.
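That refutation criterion reduces to a one-line statistic over the sweep's output. A hypothetical flatness check, with the "few percent" threshold left as an illustrative parameter rather than anything the paper specifies:

```python
def position_gap(accuracy_by_position, flat_threshold=0.03):
    """Best-minus-worst accuracy across gold-document positions.
    A gap under the threshold, on contexts well inside the advertised
    window, would count as 'flat' in the sense described above."""
    vals = list(accuracy_by_position.values())
    gap = max(vals) - min(vals)
    return gap, gap <= flat_threshold
```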
read the original abstract
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates how decoder-only and encoder-decoder language models use information located at varying positions within their input contexts. Using two controlled tasks — multi-document question answering built from NaturalQuestions-Open with Contriever-retrieved distractors, and a synthetic JSON key-value retrieval task with random UUIDs — the authors vary (i) the position of the gold document/key and (ii) total context length, while holding the desired output fixed. The central empirical finding is a U-shaped accuracy curve: across GPT-3.5-Turbo, Claude-1.3, MPT-30B-Instruct, and LongChat-13B (16K), performance is highest when the relevant item is at the start or end of the context and degrades in the middle, sometimes below closed-book accuracy. The paper further (a) shows extended-context variants do not outperform their base counterparts on inputs both can fit, (b) compares decoder-only vs. encoder-decoder models (Flan-T5-XXL, Flan-UL2) and finds encoder-decoders are flat within their training-time length and develop a U-shape beyond it, (c) shows query-aware contextualization nearly solves KV retrieval but barely changes multi-doc QA, (d) shows base MPT-30B already exhibits the U-shape, and (e) presents an open-domain QA case study where reader accuracy saturates well before retriever recall.
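For concreteness, the key-value task can be reconstructed from its description alone. A sketch under the stated design (random UUID keys and values, JSON serialization, one queried key at a controlled position); the exact formatting in the authors' released code may differ.

```python
import json
import uuid

def make_kv_prompt(n_pairs, query_index):
    """Synthetic key-value retrieval: semantics-free UUID pairs in a JSON
    object, querying the key at a controlled position."""
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(n_pairs)]
    data = dict(pairs)
    key, value = pairs[query_index]
    prompt = ("JSON data:\n" + json.dumps(data, indent=1) +
              f'\n\nKey: "{key}"\nCorresponding value:')
    return prompt, value
```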
Significance. The U-shaped positional sensitivity is a clean, reproducible empirical observation across both open and closed frontier-tier models at the time of writing, established with controlled interventions (position swap, length sweep) on two qualitatively different tasks. The paper's design explicitly preempts the most salient confounds — Contriever ordering bias (Appendix C), retrieved-vs-random distractors (Appendix B), and NaturalQuestions ambiguity (Appendix A) — which materially strengthens the claim. The accompanying ablations (encoder-decoder vs. decoder-only in §4.1, query-aware contextualization in §4.2, instruction-tuning in §4.3, Llama-2 scaling in Appendix E) are unusually thorough for an empirical analysis paper and themselves constitute reusable evaluation protocols. The open-domain QA case study (§5) translates the phenomenon into an actionable practical implication for retrieval-augmented generation: more retrieved documents past ~20 yield negligible gains. Code and data are released. The work has clear value as a benchmark/diagnostic framework even setting aside the headline interpretation.
major comments (4)
- [§1 / §2.3 framing vs. §4.1, Appendix E] The headline framing ('language models do not robustly make use of information in long input contexts') is in tension with the authors' own ablations. §4.1 shows Flan-UL2 is essentially flat within its 2048-token training window and only develops a U-shape beyond it; Appendix E shows Llama-2-7B is purely recency-biased while only 13B/70B exhibit primacy bias; Figure 7 shows Claude-1.3 is near-perfect on KV retrieval at all positions. Together these are consistent with the U-shape being substantially an out-of-training-length-distribution effect plus a prior over where 'relevant' content sits in pretraining documents, rather than an intrinsic limitation of long-context attention. The authors should either (a) soften the abstract/Figure 1 framing to match what the ablations support, or (b) provide an experiment that disentangles 'middle tokens are hard in principle' from 'middle positions are merely underrepresented at training time'.
- [§4.3 and Appendix E] The conclusion that 'instruction fine-tuning is not necessarily responsible' rests on a single base/instruct pair (MPT-30B vs. MPT-30B-Instruct, Figure 10) with overlapping shapes but ~6% absolute gap. Appendix E partially complicates this — the Llama-2 13B base shows a much larger primacy/recency disparity than its chat counterpart, while at 70B the gap is small. The §4.3 narrative would be more defensible if it explicitly summarized this scale-dependence in the main text rather than in an appendix, since the current main-text claim risks being read as stronger than the evidence supports.
- [§5, Figure 11] The open-domain QA case study is the paper's main practical recommendation, but the reader-accuracy curves are reported without confidence intervals or a statistical test for the saturation claim ('only marginally improves performance ~1.5%'). Given that the y-axis spans a wide range and only six k values are shown, please report bootstrap CIs or a paired test on per-question correctness so that 'saturation' is distinguishable from noise. This matters because the practical takeaway (rerank/truncate rather than feed more documents) is being inferred from a small number of points.
- [§3.1] The KV retrieval task uses 128-bit UUIDs to remove linguistic confounds, but UUID strings are tokenized into many sub-tokens by BPE tokenizers in highly model-specific ways (Table 4 shows ~4K–21K tokens for 75–300 pairs depending on tokenizer). This means the 'position' axis in Figure 7 is not commensurate across models — e.g., the 'middle' of a 300-pair context corresponds to different absolute token positions for Claude vs. LongChat. A short discussion or a supplementary plot indexing position by token offset rather than pair index would clarify whether cross-model differences in Figure 7 reflect retrieval ability or simply different absolute-token regimes.
minor comments (7)
- [Figure 1] The teaser figure shows only GPT-3.5-Turbo at 20 documents; consider either labeling it as illustrative or overlaying at least one additional model so the headline U-shape is not visually anchored to a single system.
- [§2.1] The accuracy metric ('any correct answer string appears in the predicted output') is a permissive substring match. Since closed-book GPT-3.5-Turbo scores 56.1%, some of the 'middle' degradation could partly reflect lexical-match noise rather than retrieval failure. A brief note on false-positive rates of the metric, or a spot-check with exact match, would help.
- [§4.2] The query-aware contextualization result on KV retrieval (near-perfect across all positions) is striking and arguably one of the more actionable findings, but is reported only narratively without a figure. Consider promoting a plot to the main text.
- [Appendix D] GPT-4 results are on a 500-question subsample and only at 20 documents. Stating sample size and that no significance test is performed against the 2655-question runs in the figure caption would prevent over-reading.
- [§6.3] The analogy to the human serial-position effect (Ebbinghaus, Murdock) is evocative but causally unsupported; consider hedging the connection.
- [Tables 5–7] Tabulated results report point accuracies without standard errors; given n≈2655 and accuracy near 55–75%, ~1% binomial SE is non-trivial when comparing adjacent positions (a quick check appears after this list). Adding SEs would strengthen the case that intermediate dips are real rather than noise.
- [Figure 8] The Flan-T5-XXL series is hard to distinguish from Flan-UL2 in the legend coloring; consider higher-contrast styles.
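The ~1% figure in the standard-error comment follows directly from the binomial formula SE = sqrt(p(1-p)/n):

```python
import math

def binomial_se(p, n):
    """Standard error of an accuracy estimate over n independent questions."""
    return math.sqrt(p * (1 - p) / n)

# With n = 2655 and accuracy in the 55-75% band:
for p in (0.55, 0.65, 0.75):
    print(f"p={p:.2f}: SE ≈ {binomial_se(p, 2655):.4f}")
# p=0.55: SE ≈ 0.0097; p=0.65: SE ≈ 0.0093; p=0.75: SE ≈ 0.0084
```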
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report, and in particular for recognizing the ablations (Appendices A–C, §4.1–§4.3, Appendix E) as load-bearing parts of the contribution. The four major comments all push in the same direction — that several of our framing choices are stronger than the evidence strictly licenses, and that some quantitative claims need uncertainty quantification or tokenizer-aware re-indexing. We accept all four points and will revise accordingly. Specifically, we will (i) soften the abstract and §1/§2.3 framing so it is consistent with the scale- and training-length-dependence shown in §4.1 and Appendix E, while preserving the within-training-window evidence that the U-shape is not solely an out-of-distribution-length artifact; (ii) promote the Llama-2 scale-dependence finding from Appendix E into the main text of §4.3 and rephrase the instruction-tuning conclusion more precisely; (iii) add bootstrap confidence intervals and a paired test to the open-domain QA case study in §5/Figure 11; and (iv) add a token-offset–indexed version of Figure 7 plus a caveat in §3.1 about cross-tokenizer comparisons. None of these revisions change the headline empirical finding, but they make the claims commensurate with the evidence and strengthen the paper as a diagnostic framework, which is the use the referee identifies as primary.
read point-by-point responses
-
Referee: Headline framing ('LMs do not robustly make use of long input contexts') is in tension with §4.1 and Appendix E, which suggest the U-shape may largely be an out-of-training-length effect plus a pretraining positional prior, rather than an intrinsic attention limitation. Soften the framing or add an experiment that disentangles these.
Authors: We agree the framing should be tightened to match what the ablations actually support. We will revise the abstract, the Figure 1 caption, and the §1/§2.3 introductory claims to state that current models exhibit substantial position sensitivity — most pronounced at sequence lengths beyond their training-time window and at sufficient scale — rather than asserting a blanket inability. Concretely: (i) the abstract will explicitly note that the effect interacts with training-time sequence length and model scale; (ii) the §2.3 paragraph headers will be hedged from 'cannot effectively reason' to 'show pronounced positional sensitivity'; and (iii) we will add a forward pointer from §1 to §4.1 and Appendix E so readers see the scope conditions before the headline claim. We do not, however, believe the phenomenon reduces entirely to an out-of-distribution length effect: GPT-3.5-Turbo and MPT-30B-Instruct show the U-shape on 10-document inputs (~2K tokens, Figure 5 left) that are well within their training windows, and Appendix E shows Llama-2-70B exhibits the U-shape on inputs (≤4K tokens) within its training length. We will state this explicitly as evidence that length extrapolation alone does not account for the effect, while acknowledging the referee's point that it is a substantial contributing factor. revision: yes
-
Referee: The §4.3 claim that instruction fine-tuning is 'not necessarily responsible' rests on one base/instruct pair (MPT-30B). Appendix E shows the picture is scale-dependent (large gap at Llama-2-13B base vs. chat; small at 70B). Surface this in the main text.
Authors: This is fair. The current main text understates the scale dependence we ourselves document in Appendix E. We will revise §4.3 to (i) explicitly state that the role of supervised fine-tuning / RLHF in shaping positional bias is scale-dependent, (ii) summarize the Llama-2 7B/13B/70B comparison in one paragraph in the main text with a small inline figure or table reference, and (iii) reword the conclusion from 'instruction fine-tuning is not necessarily responsible for these performance trends' to a more precise statement: at sufficient scale (≥30B for MPT, 70B for Llama-2) the U-shape is already present in the base model and is only modestly attenuated by alignment, whereas at smaller scales (≤13B) alignment can substantially reduce the worst-case gap. This better reflects the data and removes the over-generalization the referee correctly identifies. revision: yes
-
Referee: Figure 11 (open-domain QA saturation) lacks confidence intervals or a paired test, which weakens the practical 'rerank/truncate' takeaway given only six k values.
Authors: We agree and will add uncertainty quantification to Figure 11. Specifically, we will report bootstrap 95% confidence intervals (1000 resamples over the question set) on per-k reader accuracy, and add a paired bootstrap test on per-question correctness comparing k=20 vs. k=50 for each model. We will report the resulting p-values and effect sizes in the caption and in §5. We expect — based on the per-question correctness records we already have — that the k=20→k=50 differences for GPT-3.5-Turbo (~1.5%) and Claude-1.3 (~1%) are within or near the bootstrap CI width, which would actually strengthen the 'saturation' claim by showing the marginal gains are not statistically distinguishable from noise. If the test shows a significant but small gain, we will revise the practical recommendation accordingly to 'small and possibly not cost-justified' rather than 'marginal'. revision: yes
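The proposed paired bootstrap is a few lines once per-question correctness is recorded. A minimal sketch, assuming 0/1 correctness vectors for the two conditions (e.g., k=20 vs. k=50) scored on the same question set; this is a generic construction, not the authors' implementation.

```python
import random

def paired_bootstrap_p(correct_a, correct_b, n_resamples=1000, seed=0):
    """One-sided paired bootstrap on per-question 0/1 correctness.
    Returns the fraction of resamples in which condition B fails to
    beat condition A; large values support the 'saturation' reading."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:
            not_better += 1
    return not_better / n_resamples
```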
-
Referee: In §3.1 the KV retrieval position axis is indexed by pair number, but UUID tokenization is tokenizer-specific (Table 4: ~4K–21K tokens for 75–300 pairs), so the 'middle' is not commensurate across models in absolute token offset.
Authors: The referee is correct that pair index and absolute token offset diverge across tokenizers, and we will address this. We will add a supplementary figure to Appendix F (or a new appendix) replotting Figure 7 with the x-axis converted to fractional token offset within the input context, computed per model using each model's tokenizer. We will also add a sentence to §3.1 noting this caveat and pointing to the supplementary plot. Our expectation is that the qualitative U-shape is preserved under either parameterization, since the relative position of the queried key within the JSON object scales monotonically with both pair index and absolute token offset for a fixed total. However, the referee is right that direct cross-model comparisons of the location of the accuracy minimum are confounded by tokenizer differences, and we will explicitly caution against such comparisons in the revised text. revision: yes
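The proposed re-indexing is a small transformation. A sketch, assuming a tokenizer object exposing an `encode(text) -> list[int]` method (as in Hugging Face tokenizers); the authors' own replotting code may differ.

```python
def fractional_token_offset(prompt, key, tokenizer):
    """Convert the queried key's location from pair index to fractional
    token offset within the serialized context, per model tokenizer.
    Uses the key's first occurrence, i.e., its slot in the JSON object."""
    prefix = prompt[: prompt.index(key)]
    return len(tokenizer.encode(prefix)) / len(tokenizer.encode(prompt))
```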
Circularity Check
No significant circularity: an empirical study with controlled position/length manipulations and external benchmarks (NaturalQuestions, synthetic KV retrieval).
full rationale
This is an empirical analysis paper, not a derivation paper. The central claim — that language model accuracy follows a U-shaped curve as a function of the position of relevant information in the input context — is established by direct measurement on (i) multi-document QA built from NaturalQuestions-Open with controlled gold-document placement, and (ii) a synthetic key-value retrieval task with random UUIDs. Neither task fits a parameter from the same data it then "predicts"; the manipulated variable (position) and the measured variable (accuracy) are independent by construction, and the models evaluated (GPT-3.5-Turbo, Claude-1.3, MPT-30B-Instruct, LongChat-13B, Flan-T5/UL2, Llama-2, GPT-4) are external to the authors. The reader's skeptical concern — that the U-shape may be an out-of-training-distribution length artifact rather than an intrinsic property — is a question about interpretation and external validity, not circularity. The paper itself surfaces evidence consistent with that reading (Flan-UL2 flat within its 2048-token window in §4.1; MPT-30B base exhibits the curve in §4.3; Llama-2-7B is recency-only in Appendix E; Claude saturates KV retrieval). That is the opposite of circular reasoning: the paper reports data that complicates its own headline framing rather than concealing it. Self-citation is essentially absent in the load-bearing chain. The methodology cites external work for datasets (Kwiatkowski et al. 2019, Lee et al. 2019), retriever (Izacard et al. 2021 Contriever), evaluation metric (Kandpal et al. 2022; Mallen et al. 2023), and related needle-in-haystack setups (Ivgi et al. 2023; Li et al. 2023; Papailiopoulos et al. 2023). No "uniqueness theorem" or authors' prior ansatz is invoked to force a conclusion. The closed-book and oracle baselines (Table 1) provide independent reference points against which the middle-position degradation is compared, and the synthetic KV task removes lexical confounds entirely. There is no fitted-parameter-renamed-as-prediction step, no self-definitional loop, and no renaming of a prior result as a new finding. Score 1 rather than 0 only to acknowledge that the framing "lost in the middle" is a vivid relabeling of an effect partly anticipated by serial-position literature (Ebbinghaus 1913; Murdock 1962) and prior LM context studies (Khandelwal et al. 2018; Sun et al. 2021), which the paper explicitly cites — but this is honest contextualization, not circular renaming.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Foundation/EightTick.lean, Foundation/PhiForcing.lean (no_parallel) — RS 8-tick periodicity and primacy/recency in LLM attention are unrelated phenomena (status: unclear). Paper excerpt: "The U-shaped curve we observe in this work has a connection in psychology known as the serial-position effect (Ebbinghaus, 1913; Murdock Jr, 1962)... humans tend to best remember the first and last elements of the list."
Forward citations
Cited by 60 Pith papers
-
Submodular Ground-Set Pruning: Monotone Tightness and a Non-Monotone Separation
For monotone submodular maximization, containment pruning has a tight 1-1/e factor; for non-monotone objectives, 1/2-ε algorithms exist that exceed known optimization hardness bounds.
-
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis
Agentic interpretation uses lattices to track LLM judgments on decomposed program claims during analysis.
-
Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity
MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human...
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
-
AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation
AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.
-
Don't Be a Pot Stirrer! Authorized Vector Data Retrieval via Access-Aware Indexing
Veda and EffVeda build access-aware lattice indexes on role-partitioned vector blocks to support authorized top-k queries with controlled duplication and pruned search.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic pe...
-
Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.
-
Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension
Spiking attention is a universal approximator of permutation-equivariant functions with ε-approximation requiring Ω(L_f² nd / ε²) spikes, but low effective dimensions (47-89) allow T=4 timesteps in practice.
-
IE as Cache: Information Extraction Enhanced Agentic Reasoning
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
-
In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.
-
MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
MatClaw is a code-first LLM agent that autonomously executes end-to-end materials workflows by generating and running Python scripts on remote clusters, achieving reliable code generation via memory architecture and R...
-
Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
-
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
-
Extending Context Window of Large Language Models via Positional Interpolation
Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
LISA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management
LISA applies LLMs as primary decision-makers for signal-free intersection management, cutting mean control delay by up to 89.1% and maintaining better service levels than fixed-cycle, SCATS, AIM, or GLOSA baselines.
-
Do Language Models Encode Knowledge of Linguistic Constraint Violations?
Sparse autoencoder analysis of language model activations finds limited evidence for a unified set of features detecting linguistic constraint violations.
-
Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention
SPeCTrA-Sum uses hierarchical cross-modal fusion via DVP and DPP-distilled image selection via VRP to generate more accurate and visually grounded multimodal summaries.
-
Instructions Shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
-
Adversarial SQL Injection Generation with LLM-Based Architectures
RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
-
Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation
Primacy, anchoring, and order-dependence are architecturally necessary in autoregressive models due to causal masking constraints, with supporting evidence from theorems, LLM fits, and human experiments.
-
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
-
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
-
A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks
A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised no...
-
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
-
Focus and Dilution: The Multi-stage Learning Process of Attention
In one-layer Transformers trained on Markovian data, attention undergoes a cycle of rapid rank-one condensation, frequency-driven focus on high-frequency tokens, dilution via embedding perturbations, and restart from ...
-
M-CaStLe: Uncovering Local Causal Structures in Multivariate Space-Time Gridded Data
M-CaStLe generalizes local stencil-based causal discovery to the multivariate case and decomposes resulting graphs into reaction and spatial components for interpretation in space-time gridded data.
-
Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation
STC reduces tabular chunk counts by up to 56% versus baselines and raises hybrid MRR to 0.5945 and BM25 Recall@1 to 0.754 by preserving row structure during chunking.
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
NuggetIndex: Governed Atomic Retrieval for Maintainable RAG
NuggetIndex manages atomic nuggets with temporal validity and lifecycle metadata to filter outdated information before ranking, yielding 42% higher nugget recall, 9pp better temporal correctness, and 55% fewer conflic...
-
PRAG: End-to-End Privacy-Preserving Retrieval-Augmented Generation
PRAG delivers end-to-end private RAG with 72-74% recall via non-interactive homomorphic approximations, interactive client assistance, and operation-error estimation to preserve ranking quality.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
Dissociating Decodability and Causal Use in Bracket-Sequence Transformers
In Dyck-language transformers, attention patterns causally use top-of-stack information while residual-stream depth and distance signals are decodable yet causally inert.
-
R$^3$AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform sta...
-
Omission Constraints Decay While Commission Constraints Persist in Long-Context LLM Agents
Omission constraints in LLM agents decay with conversation length while commission constraints remain stable, creating an invisible security failure.
-
Pause or Fabricate? Training Language Models for Grounded Reasoning
GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task succe...
-
From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge
Focused, failure-specific contexts such as program slices produce more causal and actionable LLM bug explanations than large undifferentiated contexts, and higher-quality explanations correlate with better downstream ...
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
-
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
-
Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.
-
FocalLens: Visualizing Narratives through Focalization
FocalLens is a new visualization system that captures focalization to display character perceptions, direct/indirect involvement, and narration in narratives, evaluated qualitatively with writers and scholars.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
When Verification Fails: How Compositionally Infeasible Claims Escape Rejection
AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constr...
-
TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts wh...
-
Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.