Lost in the Middle: How Language Models Use Long Contexts
Pith reviewed 2026-05-08 22:47 UTC · model claude-opus-4-7
The pith
Language models reliably use information at the start and end of their input context but lose track of material placed in the middle, producing a U-shaped accuracy curve even in models built for long contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across multi-document question answering and a synthetic key-value lookup task, the authors show that current language models — including ones explicitly marketed as long-context — do not treat their input window uniformly. Accuracy is highest when the relevant passage sits at the very start or very end of the context and drops sharply, sometimes by more than 20 points, when the same passage is buried in the middle. In the worst case, GPT-3.5-Turbo with 20 or 30 retrieved documents performs worse than with no documents at all. The effect persists for extended-context variants, base (non-instruction-tuned) models, and most encoder-decoder models once sequences exceed their training length, su
What carries the argument
A controlled position-sweep experiment: hold the question and the gold document fixed, vary only where the gold document is placed among k distractors, and plot accuracy as a function of that position. The same protocol is run on a semantics-free key-value retrieval task built from random UUIDs, isolating retrieval from comprehension. The shape of the resulting curve — flat, monotone, or U-shaped — becomes the diagnostic for whether a model uses its context uniformly.
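The sweep described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' released code: `build_context`, `render_prompt`, `position_sweep`, the `answer_fn` and `is_correct` callbacks, and the "Document [i]" prompt layout are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, List, Sequence


def build_context(gold: str, distractors: Sequence[str], position: int) -> List[str]:
    """Return the distractors with `gold` inserted at 0-based `position`."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must lie in [0, len(distractors)]")
    docs = list(distractors)
    docs.insert(position, gold)
    return docs


def render_prompt(question: str, docs: Sequence[str]) -> str:
    """Assumed prompt layout: numbered documents followed by the question."""
    body = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
    return f"{body}\n\nQuestion: {question}\nAnswer:"


def position_sweep(
    examples: Sequence[Dict[str, Any]],
    answer_fn: Callable[[str], str],
    is_correct: Callable[[str, Sequence[str]], bool],
) -> Dict[int, float]:
    """Accuracy as a function of gold-document position.

    Every example must carry the same number of distractors, so that only
    the position of the gold document changes between runs.
    """
    k = len(examples[0]["distractors"])
    accuracy: Dict[int, float] = {}
    for pos in range(k + 1):
        hits = 0
        for ex in examples:
            docs = build_context(ex["gold"], ex["distractors"], pos)
            prediction = answer_fn(render_prompt(ex["question"], docs))
            hits += bool(is_correct(prediction, ex["answers"]))
        accuracy[pos] = hits / len(examples)
    return accuracy
```

Here `answer_fn` stands in for whatever model call is being evaluated. The returned mapping is the curve whose shape (flat, monotone, or U-shaped) serves as the diagnostic.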
If this is right
- Headline context-window numbers (4K, 16K, 100K) overstate usable capacity; the effective window is the region where position-conditioned accuracy is roughly flat.
Where Pith is reading between the lines
- The U-shape echoes the serial-position effect from human memory research; if the underlying cause is similar (rehearsal-like reinforcement of edges), it predicts the dip should worsen as the middle region grows, which is consistent with the encoder-decoder result that the curve only emerges past training-time sequence length.
Load-bearing premise
The diagnostic assumes that accuracy on these two tasks faithfully reflects how the model uses context in general; if real workloads have different prompt structure or distractor statistics, the U-shape might be milder or sharper than reported.
What would settle it
Run the same position-sweep on a model and observe flat accuracy across all positions of the gold document, with best-minus-worst gap under a few percent, on contexts well inside its advertised window. Claude-1.3 already does this on the synthetic key-value task, showing the curve is not inevitable; a model that did the same on multi-document QA at 20 and 30 documents would refute the generality of the lost-in-the-middle effect.
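The settling criterion above reduces to a best-minus-worst gap over the position curve. A minimal sketch follows; the function names and the 0.03 default tolerance are assumptions standing in for "a few percent", not values taken from the paper.

```python
from typing import Mapping


def position_gap(accuracy_by_position: Mapping[int, float]) -> float:
    """Best-minus-worst accuracy across gold-document positions."""
    values = list(accuracy_by_position.values())
    if not values:
        raise ValueError("need at least one position")
    return max(values) - min(values)


def worst_position(accuracy_by_position: Mapping[int, float]) -> int:
    """Position at which accuracy is lowest (first one on ties)."""
    return min(accuracy_by_position, key=lambda pos: accuracy_by_position[pos])


def is_flat(accuracy_by_position: Mapping[int, float], tolerance: float = 0.03) -> bool:
    """True when the gap stays within `tolerance` (an assumed threshold)."""
    return position_gap(accuracy_by_position) <= tolerance
```

A curve such as `{0: 0.75, 9: 0.55, 19: 0.70}` would fail the check with its minimum in the middle, while a curve that stays within a point or two at every position would pass.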
Original abstract
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates how decoder-only and encoder-decoder language models use information located at varying positions within their input contexts. Using two controlled tasks — multi-document question answering built from NaturalQuestions-Open with Contriever-retrieved distractors, and a synthetic JSON key-value retrieval task with random UUIDs — the authors vary (i) the position of the gold document/key and (ii) total context length, while holding the desired output fixed. The central empirical finding is a U-shaped accuracy curve: across GPT-3.5-Turbo, Claude-1.3, MPT-30B-Instruct, and LongChat-13B (16K), performance is highest when the relevant item is at the start or end of the context and degrades in the middle, sometimes below closed-book accuracy. The paper further (a) shows extended-context variants do not outperform their base counterparts on inputs both can fit, (b) compares decoder-only vs. encoder-decoder models (Flan-T5-XXL, Flan-UL2) and finds encoder-decoders are flat within their training-time length and develop a U-shape beyond it, (c) shows query-aware contextualization nearly solves KV retrieval but barely changes multi-doc QA, (d) shows base MPT-30B already exhibits the U-shape, and (e) presents an open-domain QA case study where reader accuracy saturates well before retriever recall.
Significance. The U-shaped positional sensitivity is a clean, reproducible empirical observation across both open and closed frontier-tier models at the time of writing, established with controlled interventions (position swap, length sweep) on two qualitatively different tasks. The paper's design explicitly preempts the most salient confounds — Contriever ordering bias (Appendix C), retrieved-vs-random distractors (Appendix B), and NaturalQuestions ambiguity (Appendix A) — which materially strengthens the claim. The accompanying ablations (encoder-decoder vs. decoder-only in §4.1, query-aware contextualization in §4.2, instruction-tuning in §4.3, Llama-2 scaling in Appendix E) are unusually thorough for an empirical analysis paper and themselves constitute reusable evaluation protocols. The open-domain QA case study (§5) translates the phenomenon into an actionable practical implication for retrieval-augmented generation: more retrieved documents past ~20 yield negligible gains. Code and data are released. The work has clear value as a benchmark/diagnostic framework even setting aside the headline interpretation.
major comments (4)
- [§1 / §2.3 framing vs. §4.1, Appendix E] The headline framing ('language models do not robustly make use of information in long input contexts') is in tension with the authors' own ablations. §4.1 shows Flan-UL2 is essentially flat within its 2048-token training window and only develops a U-shape beyond it; Appendix E shows Llama-2-7B is purely recency-biased while only 13B/70B exhibit primacy bias; Figure 7 shows Claude-1.3 is near-perfect on KV retrieval at all positions. Together these are consistent with the U-shape being substantially an out-of-training-length-distribution effect plus a prior over where 'relevant' content sits in pretraining documents, rather than an intrinsic limitation of long-context attention. The authors should either (a) soften the abstract/Figure 1 framing to match what the ablations support, or (b) provide an experiment that disentangles 'middle tokens are hard in principle' from 'middle positions
- [§4.3 and Appendix E] The conclusion that 'instruction fine-tuning is not necessarily responsible' rests on a single base/instruct pair (MPT-30B vs. MPT-30B-Instruct, Figure 10) with overlapping shapes but ~6% absolute gap. Appendix E partially complicates this — the Llama-2 13B base shows a much larger primacy/recency disparity than its chat counterpart, while at 70B the gap is small. The §4.3 narrative would be more defensible if it explicitly summarized this scale-dependence in the main text rather than in an appendix, since the current main-text claim risks being read as stronger than the evidence supports.
- [§5, Figure 11] The open-domain QA case study is the paper's main practical recommendation, but the reader-accuracy curves are reported without confidence intervals or a statistical test for the saturation claim ('only marginally improves performance ~1.5%'). Given that the y-axis spans a wide range and only six k values are shown, please report bootstrap CIs or a paired test on per-question correctness so that 'saturation' is distinguishable from noise. This matters because the practical takeaway (rerank/truncate rather than feed more documents) is being inferred from a small number of points.
- [§3.1] The KV retrieval task uses 128-bit UUIDs to remove linguistic confounds, but UUID strings are tokenized into many sub-tokens by BPE tokenizers in highly model-specific ways (Table 4 shows ~4K–21K tokens for 75–300 pairs depending on tokenizer). This means the 'position' axis in Figure 7 is not commensurate across models — e.g., the 'middle' of a 300-pair context corresponds to different absolute token positions for Claude vs. LongChat. A short discussion or a supplementary plot indexing position by token offset rather than pair index would clarify whether cross-model differences in Figure 7 reflect retrieval ability or simply different absolute-token regimes.
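The tokenizer concern in the last major comment can be made concrete with a small sketch: build a UUID key-value context, then express the queried key's position as a fraction of the total token count instead of a pair index. All names here are hypothetical, and `char_tokens` is a stand-in (one token per character); a real study would pass each model's own tokenizer as `count_tokens`. Counting tokens on a prefix is also only an approximation, since subword tokenizers may split a prefix slightly differently than the full string.

```python
import json
import uuid
from typing import Callable, Tuple


def make_kv_example(num_pairs: int, query_index: int) -> Tuple[str, str, str]:
    """Build a JSON object of random UUID pairs; return (context, key, value)."""
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(num_pairs)]
    key, value = pairs[query_index]
    # dict preserves insertion order, so pair index maps to position in the text.
    context = json.dumps(dict(pairs))
    return context, key, value


def fractional_token_offset(
    context: str, key: str, count_tokens: Callable[[str], int]
) -> float:
    """Tokens preceding the queried key, divided by total tokens in the context."""
    start = context.index(f'"{key}"')
    return count_tokens(context[:start]) / count_tokens(context)


def char_tokens(text: str) -> int:
    """Stand-in tokenizer (one token per character); swap in a real one."""
    return len(text)
```

Plotting accuracy against `fractional_token_offset` per model, instead of against pair index, is one way to produce the tokenizer-aware view the comment asks for.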
minor comments (7)
- [Figure 1] The teaser figure shows only GPT-3.5-Turbo at 20 documents; consider either labeling it as illustrative or overlaying at least one additional model so the headline U-shape is not visually anchored to a single system.
- [§2.1] The accuracy metric ('any correct answer string appears in the predicted output') is a permissive substring match. Since closed-book GPT-3.5-Turbo scores 56.1%, some of the 'middle' degradation could partly reflect lexical-match noise rather than retrieval failure. A brief note on false-positive rates of the metric, or a spot-check with exact match, would help.
- [§4.2] The query-aware contextualization result on KV retrieval (near-perfect across all positions) is striking and arguably one of the more actionable findings, but is reported only narratively without a figure. Consider promoting a plot to the main text.
- [Appendix D] GPT-4 results are on a 500-question subsample and only at 20 documents. Stating sample size and that no significance test is performed against the 2655-question runs in the figure caption would prevent over-reading.
- [§6.3] The analogy to the human serial-position effect (Ebbinghaus, Murdock) is evocative but causally unsupported; consider hedging the connection.
- [Tables 5–7] Tabulated results report point accuracies without standard errors; given n≈2655 and accuracy near 55–75%, ~1% binomial SE is non-trivial when comparing adjacent positions. Adding SEs would strengthen the case that intermediate dips are real rather than noise.
- [Figure 8] The Flan-T5-XXL series is hard to distinguish from Flan-UL2 in the legend coloring; consider higher-contrast styles.
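Two of the minor comments are easy to make concrete: the permissive substring metric from §2.1 and the binomial standard error relevant to Tables 5–7. The sketch below is illustrative only; the normalization (lowercasing and whitespace collapsing) is an assumption, not necessarily the paper's exact procedure, and the function names are hypothetical.

```python
import math
from typing import Iterable


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def substring_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """Permissive metric: any gold answer appears anywhere in the prediction."""
    pred = normalize(prediction)
    return any(normalize(ans) in pred for ans in gold_answers)


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """Stricter spot-check: the prediction equals some gold answer."""
    pred = normalize(prediction)
    return any(normalize(ans) == pred for ans in gold_answers)


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)
```

A prediction like "It was not Paris, it was Lyon." counts as correct for the gold answer "Paris" under `substring_match` but not under `exact_match`, which is the false-positive risk the referee raises. For n = 2655, `binomial_se` gives roughly 0.0097 at p = 0.55 and 0.0084 at p = 0.75, consistent with the "~1%" figure quoted above.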
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report, and in particular for recognizing the ablations (Appendices A–C, §4.1–§4.3, Appendix E) as load-bearing parts of the contribution. The four major comments all push in the same direction — that several of our framing choices are stronger than the evidence strictly licenses, and that some quantitative claims need uncertainty quantification or tokenizer-aware re-indexing. We accept all four points and will revise accordingly. Specifically, we will (i) soften the abstract and §1/§2.3 framing so it is consistent with the scale- and training-length-dependence shown in §4.1 and Appendix E, while preserving the within-training-window evidence that the U-shape is not solely an out-of-distribution-length artifact; (ii) promote the Llama-2 scale-dependence finding from Appendix E into the main text of §4.3 and rephrase the instruction-tuning conclusion more precisely; (iii) add bootstrap confidence intervals and a paired test to the open-domain QA case study in §5/Figure 11; and (iv) add a token-offset–indexed version of Figure 7 plus a caveat in §3.1 about cross-tokenizer comparisons. None of these revisions change the headline empirical finding, but they make the claims commensurate with the evidence and strengthen the paper as a diagnostic framework, which is the use the referee identifies as primary.
Point-by-point responses
- Referee: Headline framing ('LMs do not robustly make use of long input contexts') is in tension with §4.1 and Appendix E, which suggest the U-shape may largely be an out-of-training-length effect plus a pretraining positional prior, rather than an intrinsic attention limitation. Soften the framing or add an experiment that disentangles these.
Authors: We agree the framing should be tightened to match what the ablations actually support. We will revise the abstract, the Figure 1 caption, and the §1/§2.3 introductory claims to state that current models exhibit substantial position sensitivity — most pronounced at sequence lengths beyond their training-time window and at sufficient scale — rather than asserting a blanket inability. Concretely: (i) the abstract will explicitly note that the effect interacts with training-time sequence length and model scale; (ii) the §2.3 paragraph headers will be hedged from 'cannot effectively reason' to 'show pronounced positional sensitivity'; and (iii) we will add a forward pointer from §1 to §4.1 and Appendix E so readers see the scope conditions before the headline claim. We do not, however, believe the phenomenon reduces entirely to an out-of-distribution length effect: GPT-3.5-Turbo and MPT-30B-Instruct show the U-shape on 10-document inputs (~2K tokens, Figure 5 left) that are well within their training windows, and Appendix E shows Llama-2-70B exhibits the U-shape on inputs (≤4K tokens) within its training length. We will state this explicitly as evidence that length extrapolation alone does not account for the effect, while acknowledging the referee's point that it is a substantial contributing factor. revision: yes
- Referee: The §4.3 claim that instruction fine-tuning is 'not necessarily responsible' rests on one base/instruct pair (MPT-30B). Appendix E shows the picture is scale-dependent (large gap at Llama-2-13B base vs. chat; small at 70B). Surface this in the main text.
Authors: This is fair. The current main text understates the scale dependence we ourselves document in Appendix E. We will revise §4.3 to (i) explicitly state that the role of supervised fine-tuning / RLHF in shaping positional bias is scale-dependent, (ii) summarize the Llama-2 7B/13B/70B comparison in one paragraph in the main text with a small inline figure or table reference, and (iii) reword the conclusion from 'instruction fine-tuning is not necessarily responsible for these performance trends' to a more precise statement: at sufficient scale (≥30B for MPT, 70B for Llama-2) the U-shape is already present in the base model and is only modestly attenuated by alignment, whereas at smaller scales (≤13B) alignment can substantially reduce the worst-case gap. This better reflects the data and removes the over-generalization the referee correctly identifies. revision: yes
- Referee: Figure 11 (open-domain QA saturation) lacks confidence intervals or a paired test, which weakens the practical 'rerank/truncate' takeaway given only six k values.
Authors: We agree and will add uncertainty quantification to Figure 11. Specifically, we will report bootstrap 95% confidence intervals (1000 resamples over the question set) on per-k reader accuracy, and add a paired bootstrap test on per-question correctness comparing k=20 vs. k=50 for each model. We will report the resulting p-values and effect sizes in the caption and in §5. We expect — based on the per-question correctness records we already have — that the k=20→k=50 differences for GPT-3.5-Turbo (~1.5%) and Claude-1.3 (~1%) are within or near the bootstrap CI width, which would actually strengthen the 'saturation' claim by showing the marginal gains are not statistically distinguishable from noise. If the test shows a significant but small gain, we will revise the practical recommendation accordingly to 'small and possibly not cost-justified' rather than 'marginal'. revision: yes
- Referee: In §3.1 the KV retrieval position axis is indexed by pair number, but UUID tokenization is tokenizer-specific (Table 4: ~4K–21K tokens for 75–300 pairs), so the 'middle' is not commensurate across models in absolute token offset.
Authors: The referee is correct that pair index and absolute token offset diverge across tokenizers, and we will address this. We will add a supplementary figure to Appendix F (or a new appendix) replotting Figure 7 with the x-axis converted to fractional token offset within the input context, computed per model using each model's tokenizer. We will also add a sentence to §3.1 noting this caveat and pointing to the supplementary plot. Our expectation is that the qualitative U-shape is preserved under either parameterization, since the relative position of the queried key within the JSON object scales monotonically with both pair index and absolute token offset for a fixed total. However, the referee is right that direct cross-model comparisons of the location of the accuracy minimum are confounded by tokenizer differences, and we will explicitly caution against such comparisons in the revised text. revision: yes
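The uncertainty quantification promised in the third response (bootstrap 95% intervals with 1000 resamples, plus a paired comparison on per-question correctness) could look roughly like the sketch below. It is a percentile bootstrap using only the standard library; the function names, the fixed seed, and the choice to report a paired confidence interval for the accuracy difference are assumptions, not the authors' stated procedure.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    correct: Sequence[int], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy from 0/1 per-question outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(n_resamples * alpha / 2)
    return means[lo_idx], means[n_resamples - 1 - lo_idx]


def paired_bootstrap_diff(
    a: Sequence[int],
    b: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Observed mean(b) - mean(a) and its percentile CI.

    Questions are resampled by index so each question's two outcomes
    (for example k=20 and k=50) stay paired.
    """
    if len(a) != len(b):
        raise ValueError("a and b must cover the same questions")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(b[i] - a[i] for i in idx) / n)
    diffs.sort()
    lo_idx = int(n_resamples * alpha / 2)
    observed = (sum(b) - sum(a)) / n
    return observed, diffs[lo_idx], diffs[n_resamples - 1 - lo_idx]
```

If the interval returned by `paired_bootstrap_diff` contains zero, the gain from the larger k is not distinguishable from resampling noise at the chosen level, which is the reading of "saturation" the response anticipates.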
Circularity Check
No significant circularity: an empirical study with controlled position/length manipulations and external benchmarks (NaturalQuestions, synthetic KV retrieval).
Full rationale
This is an empirical analysis paper, not a derivation paper. The central claim, that language model accuracy follows a U-shaped curve as a function of the position of relevant information in the input context, is established by direct measurement on (i) multi-document QA built from NaturalQuestions-Open with controlled gold-document placement, and (ii) a synthetic key-value retrieval task with random UUIDs. Neither task fits a parameter from the same data it then "predicts"; the manipulated variable (position) and the measured variable (accuracy) are independent by construction, and the models evaluated (GPT-3.5-Turbo, Claude-1.3, MPT-30B-Instruct, LongChat-13B, Flan-T5/UL2, Llama-2, GPT-4) are external to the authors.
The reader's skeptical concern, that the U-shape may be an out-of-training-distribution length artifact rather than an intrinsic property, is a question about interpretation and external validity, not circularity. The paper itself surfaces evidence consistent with that reading (Flan-UL2 flat within its 2048-token window in §4.1; MPT-30B base exhibits the curve in §4.3; Llama-2-7B is recency-only in Appendix E; Claude saturates KV retrieval). That is the opposite of circular reasoning: the paper reports data that complicates its own headline framing rather than concealing it.
Self-citation is essentially absent in the load-bearing chain. The methodology cites external work for datasets (Kwiatkowski et al. 2019, Lee et al. 2019), retriever (Izacard et al. 2021 Contriever), evaluation metric (Kandpal et al. 2022; Mallen et al. 2023), and related needle-in-haystack setups (Ivgi et al. 2023; Li et al. 2023; Papailiopoulos et al. 2023). No "uniqueness theorem" or authors' prior ansatz is invoked to force a conclusion. The closed-book and oracle baselines (Table 1) provide independent reference points against which the middle-position degradation is compared, and the synthetic KV task removes lexical confounds entirely.
There is no fitted-parameter-renamed-as-prediction step, no self-definitional loop, and no renaming of a prior result as a new finding. Score 1 rather than 0 only to acknowledge that the framing "lost in the middle" is a vivid relabeling of an effect partly anticipated by serial-position literature (Ebbinghaus 1913; Murdock 1962) and prior LM context studies (Khandelwal et al. 2018; Sun et al. 2021), which the paper explicitly cites — but this is honest contextualization, not circular renaming.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Foundation/EightTick.lean, Foundation/PhiForcing.lean · no_parallel: RS 8-tick periodicity and primacy/recency in LLM attention are unrelated phenomena.
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The U-shaped curve we observe in this work has a connection in psychology known as the serial-position effect (Ebbinghaus, 1913; Murdock Jr, 1962)... humans tend to best remember the first and last elements of the list."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters
Single-layer two-head Transformers learn sparse XOR with O(polylog(d)) parameters in one gradient step, breaking the Omega(d) parameter bottleneck of FFNNs.
-
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation...
-
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
-
Towards Measuring the Representation of Subjective Global Opinions in Language Models
LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliab...
-
Extending Context Window of Large Language Models via Positional Interpolation
Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
Model Collapse as Cultural Evolution
Iterated learning theory predicts and LLM experiments confirm non-monotonic compositionality during self-training, reframing model collapse as cultural transmission with matching human regularization patterns.
-
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
Sparse autoencoders applied to GPT-2 and Llama models recover semantic features accounting for 94% of peak brain encoding performance and map onto distinct cortical semantic regions across three languages.
-
AMEL: Accumulated Message Effects on LLM Judgments
LLMs exhibit an accumulated message effect where conversation history saturated with positive or negative evaluations biases subsequent judgments, with larger shifts on uncertain items, a negativity asymmetry, and no ...
-
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.