Lost in the Middle: How Language Models Use Long Contexts
Canonical reference. 78% of citing Pith papers cite this work as background.
abstract
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
citing papers explorer
- Submodular Ground-Set Pruning: Monotone Tightness and a Non-Monotone Separation
For monotone submodular maximization, containment pruning has a tight 1-1/e factor; for non-monotone objectives, 1/2-ε algorithms exist that exceed known optimization hardness bounds.
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
- Comparing Transformers and Hybrid Models at the Token Level
Hybrid models outperform transformers on semantic state tracking tasks but underperform on syntactic bracket matching and n-gram copying at the token level.
- Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.
- Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation
SwiftTrans improves both functional correctness and runtime efficiency of LLM code translations via multi-perspective exploration with hierarchical guidance and difference-aware selection with ordinal guidance on extended benchmarks including new SwiftBench.
- LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
- Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning
Analysis of 15 calibration sources shows opposite-sign Spearman correlations between perplexity and retention across General vs. Math/Code dimensions in LLM pruning, and multi-source mixing via IGSP raises total retention from 40-50% to 58.8%.
- MemTrain: Self-Supervised Context Memory Training
MemTrain introduces two coupled self-supervised proxy tasks on Wikipedia corpora to train general context-memory capabilities in LLMs, reporting gains of up to 17.67 points on long-text and search-based QA benchmarks over direct post-training.
- Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.
- Brain-LLM Alignment Tracks Training Data, Not Typology
Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.
- On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.
- The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought
Low-precision softmax transformers with chain-of-thought simulate Turing machines at logarithmic depth and width; summarized CoT improves this to logarithmic space scaling.
- GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering
GRASP introduces a hierarchical graph-based agentic retrieval method that achieves top accuracy on MuSiQue, 2WikiMultihopQA, and HotpotQA while using 30-50% fewer tokens than strong baselines.
- Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis
Agentic interpretation uses lattices to track LLM judgments on decomposed program claims during analysis.
- Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity
MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.
- Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
- Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
- Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
- SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
- AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation
AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.
- Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.
- Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.
- Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension
Spiking attention is a universal approximator of permutation-equivariant functions with ε-approximation requiring Ω(L_f² nd / ε²) spikes, but low effective dimensions (47-89) allow T=4 timesteps in practice.
- IE as Cache: Information Extraction Enhanced Agentic Reasoning
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
- In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.
- MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
- MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interventions for tacit domain knowledge.
- Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
- Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows
This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.
- Multimodal Fact-Level Attribution for Verifiable Reasoning
MuRGAt benchmark reveals that strong multimodal models frequently hallucinate citations in complex reasoning tasks despite correct answers, exposing a gap between internal reasoning and verifiable attribution.
- KRONE: Scalable LLM-Augmented Log Anomaly Detection via Hierarchical Abstraction
KRONE derives semantic execution hierarchies from flat logs to enable modular multi-level anomaly detection with hybrid local and nested-aware detectors plus limited LLM use, delivering 10% F1 gains and over 100x data efficiency on benchmarks and industrial data.
- Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions
GenLoRA replaces explicit low-rank basis storage with RBF-generated vectors from latent codes, yielding higher effective ranks and stronger fine-tuning performance at lower parameter cost.
- E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory
E-mem uses a heterogeneous multi-agent setup for episodic context reconstruction in LLM agents, reaching over 54% F1 on LoCoMo while cutting token cost by over 70% compared to prior methods like GAM.
- Annotating Dimensions of Social Perception in Text: A Sentence-Level Dataset of Warmth and Competence
The paper introduces W&C-Sent, the first sentence-level dataset annotated for trust, sociability, and competence in text about individuals or social groups.
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
- Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG
TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.
- MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
- HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
- LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories
LogitTrace detects benchmark contamination by showing that contaminated inputs produce earlier stabilization in layerwise logit trajectories while clean inputs show more gradual accumulation.
- User-Assistant Bias in LLMs
LLMs show strong user bias in role-tagged contexts that is amplified by preference alignment and can be reduced or controlled through targeted fine-tuning and DPO.
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
- Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters
Single-layer two-head Transformers learn sparse XOR with O(polylog(d)) parameters in one gradient step, breaking the Ω(d) parameter bottleneck of FFNNs.
- Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
- Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.