For monotone submodular maximization, containment pruning has a tight 1-1/e factor; for non-monotone objectives, 1/2-ε algorithms exist that exceed known optimization hardness bounds.
super hub Canonical reference
Lost in the Middle: How Language Models Use Long Contexts
Canonical reference. 78% of citing Pith papers cite this work as background.
abstract
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest
authors
co-cited works
representative citing papers
SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.
Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.
Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.
Low-precision softmax transformers with chain-of-thought simulate Turing machines at logarithmic depth and width; summarized CoT improves this to logarithmic space scaling.
GRASP introduces a hierarchical graph-based agentic retrieval method that achieves top accuracy on MuSiQue, 2WikiMultihopQA, and HotpotQA while using 30-50% fewer tokens than strong baselines.
Agentic interpretation uses lattices to track LLM judgments on decomposed program claims during analysis.
MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.
Internal layer-wise entropy reshaping provides nonconformity scores that improve the validity-efficiency trade-off of conformal prediction for LLMs under cross-domain shift compared to text-level baselines.
Spiking attention is a universal approximator of permutation-equivariant functions with ε-approximation requiring Ω(L_f² nd / ε²) spikes, but low effective dimensions (47-89) allow T=4 timesteps in practice.
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
citing papers explorer
-
Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity
MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
-
Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions
GenLoRA replaces explicit low-rank basis storage with RBF-generated vectors from latent codes, yielding higher effective ranks and stronger fine-tuning performance at lower parameter cost.
-
E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory
E-mem uses a heterogeneous multi-agent setup for episodic context reconstruction in LLM agents, reaching over 54% F1 on LoCoMo while cutting token cost by over 70% compared to prior methods like GAM.
-
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
-
LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management
LIDSA applies LLMs as primary decision-makers for signal-free intersection management, achieving up to 89% lower control delay and 93% lower waiting time versus fixed-cycle and other baselines in simulation.
-
Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention
SPeCTrA-Sum uses hierarchical cross-modal fusion via DVP and DPP-distilled image selection via VRP to generate more accurate and visually grounded multimodal summaries.
-
Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation
Primacy, anchoring, and order-dependence are architecturally necessary in autoregressive models due to causal masking constraints, with supporting evidence from theorems, LLM fits, and human experiments.
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts while matching supervised ML on lung cancer and outperforming single-agent baselines.
-
Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction
Traj-CoA is a multi-agent LLM framework that sequentially processes noisy five-year EHR data via worker agents into EHRMem for manager-agent lung cancer risk prediction and outperforms four categories of baselines in zero-shot evaluation.
-
SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction
SynthPert fine-tunes LLMs using synthetic reasoning traces to reach state-of-the-art on the PerturbQA benchmark for cellular perturbation prediction, surpassing the generating frontier model while generalizing to unseen cell types with only 2% of filtered data.
-
Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
MemGPT: Towards LLMs as Operating Systems
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
-
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.
-
Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
-
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
-
What Deserves Memory: Adaptive Memory Distillation for LLM Agents
NEMORI is an adaptive memory distillation framework for LLM agents that transforms raw interactions into narratives and extracts insights via prediction error to decide what deserves retention.
-
BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs
BioBLP is a modular embedding framework for multimodal biomedical KGs supporting heterogeneous attributes and missing data, with a pretraining strategy that improves results on drug-protein interaction prediction especially for low-degree entities.
-
SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model
SOM uses a Structural Causal Model to create an explicit graph of opponent observation-to-action links, allowing LLMs to reason along those paths for more accurate and stable predictions in multi-agent settings.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
- AMEL: Accumulated Message Effects on LLM Judgments
- MMSkills: Towards Multimodal Skills for General Visual Agents
- Don't Make the LLM Read the Graph: Make the Graph Think