EntmaxKV enables exact sparse KV-cache decoding for entmax attention via support-aware page selection and a Gaussian threshold estimator, matching full attention quality at a fraction of the cache size with up to 5.43x speedup.
super hub Baseline reference
RULER: What's the Real Context Size of Your Long-Context Language Models?
Baseline reference. 57% of citing Pith papers use this work as a benchmark or comparison.
abstract
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations wi
authors
co-cited works
representative citing papers
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
RLMs allow LLMs to handle prompts up to 100x longer than their context window via recursive self-calls on prompt parts, outperforming standard long-context methods on benchmarks.
Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
Empirical study of 238 SKILL.md files finds over 99% contain skill smells that rarely disappear, revealing a gap between recommended and actual authoring practices for agent skills.
FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.
CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.
Block-GTQ performs RoPE-aware greedy bit allocation on KV caches using per-block energy scores, cutting logit MAE 32-80% versus uniform TQ-MSE and lifting long-context task scores substantially at 2-3 bits per dimension.
StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.
RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other benchmarks.
DICE aggregates independently encoded document chunks into a single vector to reduce evidence dilution in long-document dense retrieval, reporting gains on LongEmbed especially beyond 4k tokens.
MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.
Doc-to-Atom decomposes documents into composable micro-LoRA adapters selected by a query router for efficient long-context QA.
STAR-KV applies differentiable soft thresholding for per-head and per-block adaptive low-rank KV cache compression, combined with hybrid decomposition and low-rank-aware quantization, achieving up to 75% compression and 3.1x throughput gains.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
M³Eval is a new cognitively-grounded benchmark that evaluates memory dimensions in multi-modal video models and reports consistent model weaknesses in disentanglement, interference, spatial-temporal grounding, and symbolic recall.
Fixed block causal masks create reachability boundaries where representations depend only on block prefixes, formalized via dependency sets and phase-conditioned coverage functions, with a parameter-free boundary bridge repair.
Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
AgingBench demonstrates multi-dimensional degradation in deployed AI agents through four aging mechanisms diagnosed by temporal graphs and counterfactual probes across hundreds of runs.
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.
A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.
citing papers explorer
-
EntmaxKV: Support-Aware Decoding for Entmax Attention
EntmaxKV enables exact sparse KV-cache decoding for entmax attention via support-aware page selection and a Gaussian threshold estimator, matching full attention quality at a fraction of the cache size with up to 5.43x speedup.
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
-
Evidence-State Rewards for Long-Context Reasoning
Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.
-
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
-
From Anatomy to Smells: An Empirical Study of SKILL.md in Agent Skills
Empirical study of 238 SKILL.md files finds over 99% contain skill smells that rarely disappear, revealing a gap between recommended and actual authoring practices for agent skills.
-
Morphing into Hybrid Attention Models
FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.
-
CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention
CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.
-
RoPE-Aware Bit Allocation for KV-Cache Quantization
Block-GTQ performs RoPE-aware greedy bit allocation on KV caches using per-block energy scores, cutting logit MAE 32-80% versus uniform TQ-MSE and lifting long-context task scores substantially at 2-3 bits per dimension.
-
StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns
StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.
-
Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other benchmarks.
-
Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
DICE aggregates independently encoded document chunks into a single vector to reduce evidence dilution in long-document dense retrieval, reporting gains on LongEmbed especially beyond 4k tokens.
-
MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.
-
Doc-to-Atom: Learning to Compile and Compose Memory Atoms
Doc-to-Atom decomposes documents into composable micro-LoRA adapters selected by a query router for efficient long-context QA.
-
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
STAR-KV applies differentiable soft thresholding for per-head and per-block adaptive low-rank KV cache compression, combined with hybrid decomposition and low-rank-aware quantization, achieving up to 75% compression and 3.1x throughput gains.
-
Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
-
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
-
M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
M³Eval is a new cognitively-grounded benchmark that evaluates memory dimensions in multi-modal video models and reports consistent model weaknesses in disentanglement, interference, spatial-temporal grounding, and symbolic recall.
-
Locality Does Not Imply Reachability: Boundary Repair in Block-Sparse Causal Attention
Fixed block causal masks create reachability boundaries where representations depend only on block prefixes, formalized via dependency sets and phase-conditioned coverage functions, with a parameter-free boundary bridge repair.
-
Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture
Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
-
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
AgingBench demonstrates multi-dimensional degradation in deployed AI agents through four aging mechanisms diagnosed by temporal graphs and counterfactual probes across hundreds of runs.
-
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
-
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.
-
ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage
A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.
-
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
-
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
MEME: Multi-entity & Evolving Memory Evaluation
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.
-
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
-
HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray
HUGO-CS is a 4,383-experiment cold-spray dataset extracted from literature via a new hybrid LLM-manual framework that is 30 times larger than prior collections and released with code.
-
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correct answer, dollars per correct answer, and endpoint fidelity.
-
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
-
MemDLM: Memory-Enhanced DLM Training
MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
-
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
-
Hidden State Poisoning Attacks against Mamba-based Language Models
Short input phrases can irreversibly overwrite hidden states in Mamba models, impairing information retrieval on a new benchmark while leaving pure Transformer models unaffected.
-
A Hippocampus for Linear Attention: An Exact Memory for What the Recurrent State Forgets
HOLA pairs a compressive delta-rule recurrent state with a residual-selected exact KV cache and decoupled RMSNorm-gamma read, yielding lower perplexity than both standard linear attention and full-attention baselines on Wikitext and LAMBADA plus stronger needle-in-haystack recall.
-
PARTREP: Learning What to Repeat for Decoder-only LLMs
PartRep selects high-NLL tokens via a lightweight early-exit gate for partial prompt repetition, retaining most full-repetition gains at 59.4% KV cache and 79% prefill FLOPs on eight benchmarks.
-
Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory
A hybrid attention mechanism with editable request-local memory slots and sparse fallback achieves high accuracy on synthetic overwrite, version, and anti-pollution tasks where pure fixed-state or sparse methods fail, while identifying open-domain selection as the remaining bottleneck.
-
VeriPort: Automated and Verified Patch Backporting at Scale
VeriPort is an end-to-end agentic system that backports vulnerability patches to all affected versions of a package at scale while producing verification evidence, achieving 95.3% success on 128 benchmark tasks and generating over 5,000 verified patches across 169 CVEs.
-
Learning the ARTS of Search for Automated Discovery
ARTS improves automated scientific discovery by using reasoning LMs with test-time training to separate hypothesis merit from execution quality in tree search, achieving 15.3% relative gains on 22 MLGym and MLEBench tasks.
-
Test-Time Training with Next-Token Prediction
TTT-NTP adapts pretrained LLMs at test time by training fast weights to match next-position hidden states from the forward pass, yielding consistent gains on long-context benchmarks across Llama, Mistral, and Qwen models.
-
NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama
NarrativeWorldBench evaluates 21 LLMs on nine narrative metrics across horizons to 200 episodes and introduces N-VSSM, a 256-dimensional variational state-space model that achieves plot-beat F1 >=0.84 with 4x lower compute and wins writer preference on consistency.
-
RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways
RoVE rotates value embeddings simultaneously with keys in attention to make values position-dependent, reframing RoPE as attentive convolution and reporting gains on long-context tasks in 124M and 354M GPT-2 models.
-
End-to-End Context Compression at Scale
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.
-
Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions
Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.