arxiv: 2404.06654 · v3 · submitted 2024-04-09 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

RULER: What's the Real Context Size of Your Long-Context Language Models?

Boris Ginsburg, Cheng-Ping Hsieh, Dima Rekesh, Fei Jia, Samuel Kriman, Shantanu Acharya, Simeng Sun, Yang Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-11 03:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords long-context language modelsbenchmark evaluationneedle-in-a-haystackcontext lengthmulti-hop reasoningsynthetic tasksmodel performance

0 comments

The pith

Long-context language models lose accuracy on tasks beyond basic retrieval as context length grows, despite advertised sizes of 32K or more.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates RULER, a benchmark that broadens the needle-in-a-haystack test by varying the number and type of needles and by adding multi-hop tracing and aggregation tasks. It tests 17 models that claim context lengths of 32K tokens or greater and shows that near-perfect performance on simple retrieval does not hold up when length increases or when tasks require connecting information across the input. Only half the models keep satisfactory results at 32K, and even models with much larger claimed limits show clear gaps as complexity rises. If these findings hold, they indicate that current training and architecture choices leave models unable to use their full stated context for anything beyond surface-level lookup.

Core claim

Despite near-perfect scores on the standard needle-in-a-haystack test, nearly all evaluated long-context language models display large performance drops as input length increases and as tasks move beyond simple retrieval to multi-hop tracing and aggregation. While every model claims a context size of 32K tokens or greater, only half maintain acceptable accuracy at that length on the full set of RULER tasks.

What carries the argument

RULER benchmark, a configurable synthetic test suite that extends needle-in-a-haystack retrieval with controlled numbers of needles plus multi-hop tracing and aggregation categories.

If this is right

Models that pass basic retrieval tests still need separate validation on multi-hop and aggregation tasks before their context claims can be trusted.
Only half of the 17 tested models that advertise 32K or larger contexts actually sustain performance at 32K on the expanded tasks.
Even the Yi-34B model, which supports up to 200K tokens, shows substantial room for improvement when both length and task complexity are increased together.
Current approaches to extending context length do not automatically produce models that can use that length for reasoning steps spread across the input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation standards for long-context models may shift toward requiring these harder synthetic tasks rather than relying on needle retrieval alone.
Training objectives focused on maintaining and linking distant pieces of information could close more of the observed gap than simply increasing maximum length.
Applications that depend on synthesizing information from many sources may need to keep inputs shorter or add retrieval steps until model capabilities improve.

Load-bearing premise

That results on these synthetic tracing and aggregation tasks reliably indicate how well a model will handle the long-context demands of real applications.

What would settle it

A model that scores low on RULER at 32K yet achieves high accuracy on production tasks that require connecting facts across long documents or conversations.

read the original abstract

The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RULER, a synthetic benchmark extending the needle-in-a-haystack (NIAH) test with configurable sequence lengths and task complexities, including variations with multiple/diverse needles, multi-hop tracing, and aggregation tasks. It evaluates 17 long-context LMs on 13 tasks, claiming that near-perfect vanilla NIAH performance does not hold as context length grows, with only half the models maintaining satisfactory results at 32K tokens despite claiming 32K+ windows; it also analyzes Yi-34B (200K claimed) and releases RULER openly.

Significance. If the central results hold after addressing potential confounds, the work is significant for the field: it provides a more demanding, flexible benchmark than vanilla NIAH for assessing genuine long-context capabilities, demonstrates that claimed context windows often overestimate effective length, and supplies reproducible code and tasks. The broad model coverage and open release are explicit strengths that support follow-on work.

major comments (2)

[Abstract] Abstract and evaluation description: the central claim attributes large performance drops (and the 'only half maintain satisfactory performance at 32K' result) to insufficient effective context length. However, RULER is described as having 'flexible configurations for customized sequence length and task complexity,' and the Yi-34B analysis explicitly increases both length and complexity together. It is not stated that complexity parameters (needle count, hop count, aggregation difficulty) are held fixed while scaling length in the 13-task suite. If complexity rises with length, the observed degradation cannot be cleanly attributed to length alone and the 32K threshold interpretation is under-supported.
[Evaluation] Evaluation of 17 models: without explicit confirmation that the same task-complexity settings are used across the tested lengths (e.g., 4K vs 32K), the trends in degradation are difficult to interpret as length-specific. Adding a controlled experiment or table that fixes complexity parameters while varying only length would directly address this.

minor comments (2)

[Evaluation] Provide additional detail on the precise configurations (needle quantities, hop counts, distractor types) used for each of the 13 tasks at each tested length, and on any statistical controls or variance estimates for the reported accuracy drops.
[Abstract] The abstract states 'only half of them can maintain satisfactory performance at the length of 32K'; define 'satisfactory' quantitatively (e.g., accuracy threshold) and report per-model numbers to make the claim precise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on potential confounds between sequence length and task complexity. We clarify that the primary evaluation of the 17 models holds complexity fixed while varying length, and we will revise the manuscript for explicitness as detailed below.

read point-by-point responses

Referee: [Abstract] Abstract and evaluation description: the central claim attributes large performance drops (and the 'only half maintain satisfactory performance at 32K' result) to insufficient effective context length. However, RULER is described as having 'flexible configurations for customized sequence length and task complexity,' and the Yi-34B analysis explicitly increases both length and complexity together. It is not stated that complexity parameters (needle count, hop count, aggregation difficulty) are held fixed while scaling length in the 13-task suite. If complexity rises with length, the observed degradation cannot be cleanly attributed to length alone and the 32K threshold interpretation is under-supported.

Authors: We appreciate this observation. In the main 13-task evaluation of the 17 models, complexity parameters are held fixed (e.g., fixed needle counts/types, hop counts, and aggregation functions per task) while only the haystack length is scaled across 4K/8K/16K/32K. The Yi-34B analysis deliberately varies both dimensions to probe upper limits. The manuscript does not explicitly state the fixed-complexity design for the primary results, which we agree weakens the length-specific interpretation. We will revise the abstract, Section 3 (Benchmark), and the evaluation description to state this clearly, and add a table enumerating the exact fixed parameter values used for each task. revision: yes
Referee: [Evaluation] Evaluation of 17 models: without explicit confirmation that the same task-complexity settings are used across the tested lengths (e.g., 4K vs 32K), the trends in degradation are difficult to interpret as length-specific. Adding a controlled experiment or table that fixes complexity parameters while varying only length would directly address this.

Authors: We agree that explicit confirmation is needed. The existing results already use fixed complexity settings across lengths by design of RULER's configurable tasks. To directly address the concern, we will add a new table (or appendix table) that lists the complexity parameters for all 13 tasks and confirms they are unchanged while length varies. This makes the length-specific trends unambiguous without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from new benchmark tasks stand independently

full rationale

The paper introduces the RULER benchmark with 13 tasks (NIAH variants, multi-hop tracing, aggregation) and reports direct accuracy measurements from running 17 LMs at varying lengths. No equations, fitted parameters, or derivations are present; claims about performance drops are observational outputs from model evaluations on explicitly defined inputs, not reductions to self-defined quantities or self-citation chains. The evaluation is self-contained and falsifiable via external reproduction on the open-sourced tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the introduced synthetic tasks measure meaningful long-context capabilities beyond retrieval. No free parameters are fitted to data, and no new physical or theoretical entities are postulated.

axioms (1)

domain assumption Synthetic tasks such as multi-hop tracing and aggregation test genuine long-context understanding rather than narrow artifacts
Invoked when arguing that performance drops reveal real limitations in model context handling.

pith-pipeline@v0.9.0 · 5563 in / 1281 out tokens · 50016 ms · 2026-05-11T03:50:08.177345+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking unclear
RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
cs.CL 2026-05 unverdicted novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
MEME: Multi-entity & Evolving Memory Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
ProactBench: Beyond What The User Asked For
cs.LG 2026-05 unverdicted novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
VORT: Adaptive Power-Law Memory for NLP Transformers
cs.LG 2026-05 unverdicted novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
cs.CL 2026-05 unverdicted novelty 7.0

LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
cs.CL 2026-05 unverdicted novelty 7.0

SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray
cs.LG 2026-05 accept novelty 7.0

HUGO-CS is a 4,383-experiment cold-spray dataset extracted from literature via a new hybrid LLM-manual framework that is 30 times larger than prior collections and released with code.
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
cs.AI 2026-05 unverdicted novelty 7.0

TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
cs.LG 2026-04 unverdicted novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
cs.LG 2026-04 unverdicted novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
cs.AI 2026-04 unverdicted novelty 7.0

An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
cs.AI 2026-05 unverdicted novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
cs.DC 2026-05 unverdicted novelty 6.0

AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
cs.CL 2026-05 conditional novelty 6.0

EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
Remember to Forget: Gated Adaptive Positional Encoding
cs.LG 2026-05 unverdicted novelty 6.0

GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
cs.CL 2026-05 unverdicted novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
cs.AR 2026-05 unverdicted novelty 6.0

KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
cs.CL 2026-05 conditional novelty 6.0

ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
cs.MA 2026-05 unverdicted novelty 6.0

Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL 2026-05 unverdicted novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
cs.LG 2026-05 unverdicted novelty 6.0

LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
cs.CL 2026-05 unverdicted novelty 6.0

UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
cs.LG 2026-05 unverdicted novelty 6.0

A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...
Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
cs.AI 2026-05 unverdicted novelty 6.0

Frontier LLMs solve single-needle retrieval at 1M tokens on classical Chinese but show three distinct accuracy-decay patterns in three-hop reasoning between 256K and 1M tokens.
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
cs.CL 2026-04 unverdicted novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
cs.LG 2026-04 unverdicted novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
OPSDL: On-Policy Self-Distillation for Long-Context Language Models
cs.CL 2026-04 unverdicted novelty 6.0

OPSDL improves long-context LLM performance by having the model self-distill from its short-context capability using point-wise reverse KL divergence on generated tokens, outperforming SFT and DPO on benchmarks withou...
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
cs.LG 2026-04 unverdicted novelty 6.0

RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
cs.DC 2026-04 unverdicted novelty 6.0

In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.
Latent-Condensed Transformer for Efficient Long Context Modeling
cs.CL 2026-04 unverdicted novelty 6.0

Latent-Condensed Attention condenses context in MLA's latent space via query-aware semantic pooling and positional anchor selection, delivering up to 2.5x prefilling speedup and 90% KV cache reduction at 128K length w...
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
cs.LG 2026-04 unverdicted novelty 6.0

IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
cs.CL 2026-04 unverdicted novelty 6.0

StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.
In-Place Test-Time Training
cs.LG 2026-04 conditional novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks
cs.CR 2026-04 unverdicted novelty 6.0

State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.
Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL 2025-10 unverdicted novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
cs.CL 2025-05 conditional novelty 6.0

Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
cs.CL 2026-05 unverdicted novelty 5.0

MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables
cs.SE 2026-05 unverdicted novelty 5.0

A 1650-session factorial study found no measurable impact from config file size, instruction position, architecture, or conflicts on coding agent adherence, though compliance declined within sessions.
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
cs.CV 2026-05 unverdicted novelty 5.0

LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG 2026-05 accept novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
Budget-Aware Routing for Long Clinical Text
cs.CL 2026-05 unverdicted novelty 5.0

RCD balances relevance, coverage, and diversity in a knapsack-constrained selection framework, with experiments showing that selector choice and budget level determine optimal unitization strategies on clinical datasets.
Caracal: Causal Architecture via Spectral Mixing
cs.LG 2026-04 unverdicted novelty 5.0

Caracal is a Fourier-based sequence mixing architecture that achieves causal autoregressive modeling with standard operators and competitive performance on long sequences.
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
cs.LG 2026-04 unverdicted novelty 5.0

SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
cs.LG 2026-04 unverdicted novelty 5.0

FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 5.0

LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
A Decomposition Perspective to Long-context Reasoning for LLMs
cs.CL 2026-04 unverdicted novelty 5.0

Decomposing long-context reasoning into atomic skills, synthesizing targeted pseudo-datasets, and applying RL improves LLM performance on long-context benchmarks by an average of 7.7%.
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
cs.LG 2026-04 unverdicted novelty 5.0

Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
cs.AI 2026-04 unverdicted novelty 5.0

AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
cs.CL 2026-04 unverdicted novelty 5.0

JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
Qwen3 Technical Report
cs.CL 2025-05 unverdicted novelty 5.0

Pith review generated a malformed one-line summary.
Submodular Benchmark Selection
cs.AI 2026-05 unverdicted novelty 4.0

Submodular maximization under a Gaussian model selects small benchmark subsets that outperform random selection for imputing leaderboard scores, with mutual information better than entropy at small sizes.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
cs.CL 2026-05 unverdicted novelty 3.0

EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
XekRung Technical Report
cs.CR 2026-04 unverdicted novelty 3.0

XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
cs.LG 2025-05

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 59 Pith papers · 15 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Agrawal, P., Craig, N., Madden, A., and Lombera, I

Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Stephanie Chan, Ankesh Anand, Zaheer Abbas, Azade Nova, John D Co-Reyes, Eric Chu, et al. Many-shot in-context learning. arXiv preprint arXiv:2404.11018,

work page arXiv
[3]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai et al. LongBench: A bilingual, multitask benchmark for long context understand- ing. arXiv:2308.14508,

work page internal anchor Pith review arXiv
[4]

xLSTM: Extended Long Short-Term Memory

Maximilian Beck, Korbinian P¨oppel, Markus Spanring, Andreas Auer, Oleksandra Prud- nikova, Michael Kopp, G¨unter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. arXiv preprint arXiv:2405.04517,

work page arXiv
[5]

arXiv preprint arXiv:2405.00200 , year=

Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R Gormley, and Graham Neubig. In-context learning with long-context models: An in-depth exploration. arXiv preprint arXiv:2405.00200,

work page arXiv
[6]

Scaling transformer to 1m tokens and beyond with rmt

Aydar Bulatov, Yuri Kuratov, and Mikhail S Burtsev. Scaling Transformer to 1M tokens and beyond with RMT. arXiv:2304.11062,

work page arXiv
[7]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse Transformers. arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[8]

URL https://docs.cohere.com/docs/command-r-plus# model-details. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arxiv:2307.08691,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Introducing dbrx: A new state-of-the-art open llm,

11 Published as a conference paper at COLM 2024 Databricks. Introducing dbrx: A new state-of-the-art open llm,

work page 2024
[10]

LongNet: Scaling transformers to 1,000,000,000 tokens.arXiv preprint arXiv:2307.02486,

URL https://www. databricks.com/blog/introducing-dbrx-new-state-art-open-llm . Jiayu Ding et al. LongNet: Scaling Transformers to 1,000,000,000 tokens. arXiv:2307.02486,

work page arXiv
[11]

Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753, 2024

Yiran Ding et al. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv:2402.13753,

work page arXiv
[12]

Bamboo: A com- prehensive benchmark for evaluating long text modeling capacities of large language models

Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. Bamboo: A com- prehensive benchmark for evaluating long text modeling capacities of large language models. arXiv:2309.13345,

work page arXiv
[13]

Data engineering for scaling language models to 128K context.arXiv preprint arXiv:2402.10171, 2024

Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, and Christopher R´e. Hungry Hungry Hippos: Towards language modeling with state space models. In ICLR, 2023a. Daniel Y. Fu et al. Simple hardware-efficient long convolutions for sequence modeling. ICML, 2023b. Yao Fu et al. Data engineering for scaling language models to 128k context.arXi...

work page arXiv
[14]

Is it really long context if all you need is retrieval? towards genuinely difficult long context nlp

Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, and Reut Tsarfaty. Is it really long context if all you need is retrieval? towards genuinely difficult long context nlp. arXiv preprint arXiv:2407.00402,

work page arXiv
[15]

Neural Turing Machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv:1410.5401,

work page internal anchor Pith review arXiv
[16]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Lm-infinite: Simple on-the-fly length generalization for large language models

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv:2308.16137,

work page arXiv
[18]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs et al. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence Transformer models. arXiv:2309.14509,

work page internal anchor Pith review arXiv
[19]

Mixtral of Experts

Albert Q Jiang et al. Mixtral of experts. arXiv:2401.04088,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

12 Published as a conference paper at COLM 2024 Huiqiang Jiang et al. LongLlmLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. arXiv:2310.06839,

work page arXiv 2024
[21]

Karpinska, K

URL https://github.com/gkamradt/LLMTest NeedleInAHaystack/tree/main. Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. One thousand and one pairs: A” novel” challenge for long-context language models. arXiv preprint arXiv:2406.16264,

work page arXiv
[22]

arXiv:2406.10149

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. Babilong: Testing the limits of llms with long context reasoning-in- a-haystack. arXiv preprint arXiv:2406.10149,

work page arXiv
[23]

Can long-context language models subsume retrieval, rag, sql, and more? arXiv preprint arXiv:2406.13121,

Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, S ´ebastien MR Arnold, Vincent Perot, Siddharth Dalmia, et al. Can long-context language models subsume retrieval, rag, sql, and more? arXiv preprint arXiv:2406.13121,

work page arXiv
[24]

Same task, more tokens: the impact of input length on the reasoning performance of large language models

Mosh Levy, Alon Jacoby, and Yoav Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models. arXiv preprint arXiv:2402.14848,

work page arXiv
[25]

How long can open-source LLMs truly promise on context length?, 2023a

Dacheng Li, Rulin Shao, et al. How long can open-source LLMs truly promise on context length?, 2023a. URL https://lmsys.org/blog/2023-06-29-longchat . Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts? arXiv:2311.04939, 2023b. Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention ...

work page arXiv 2023
[26]

World model on million-length video and language with ringattention.arXiv preprint arXiv:2402.08268, 2024

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with Ring Attention. arxiv:2402.08268, 2024a. Jiaheng Liu et al. E2-LLM: Efficient and extreme length extension of large language models. arXiv:2401.06951, 2024b. Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun ...

work page arXiv
[27]

13 Published as a conference paper at COLM 2024 Amirkeivan Mohtashami and Martin Jaggi

URL https://mistral.ai/news/la-plateforme/. 13 Published as a conference paper at COLM 2024 Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for Transformers. In Workshop on Efficient Systems for Foundation Models @ ICML,

work page 2024
[28]

GPT-4 Technical Report

https://transformer-circuits.pub/2022/in-context-learning-and-induction- heads/index.html. OpenAI: Josh Achiam et al. GPT-4 technical report. arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with rotary position embedding. arXiv:2104.09864,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

ChapterBreak: A challenge dataset for long-range language models

Simeng Sun, Katherine Thai, and Mohit Iyyer. ChapterBreak: A challenge dataset for long-range language models. In Proc. of the 2022 Conference of the North American Chapter of the ACL: Human Language Technologies,

work page 2022
[32]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models. arXiv:2307.08621, 2023a. Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable Transformer. In Pr...

work page internal anchor Pith review arXiv
[33]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Together.AI. Preparing for the era of 32k context: Early learnings and explorations, 2023a. URL https://www.together.ai/blog/llama-2-7b-32k . 14 Published as a conference paper at COLM 2024 Together.AI. Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, 2023b. URL https://www.together.ai/blog/llama-2-7b-32k-instruct . Hugo Tou...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf et al. Huggingface’s Transformers: State-of-the-art natural language process- ing. arXiv:1910.03771,

work page internal anchor Pith review arXiv 1910
[35]

Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory.arXiv preprint arXiv:2402.04617, 3(7), 2024

URL https://x.ai/blog/grok-1.5. Chaojun Xiao et al. InfLLM: Unveiling the intrinsic capacity of LLMs for understanding extremely long sequences with training-free memory. arXiv:2402.04617, 2024a. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient stream- ing language models with attention sinks. In ICLR, 2024b. Wenhan Xiong et ...

work page arXiv
[36]

Retrieval meets long context large language models

Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. Retrieval meets long context large language models. In ICLR, 2024a. Xiaoyue Xu, Qinyuan Ye, and Xiang Ren. Stress-testing long-context language models with lifelong icl and task haystack. arXiv preprint arXi...

work page arXiv
[37]

Yi: Open Foundation Models by 01.AI

Alex Young et al. Yi: Open foundation models by 01.AI. arXiv:2403.04652,

work page internal anchor Pith review arXiv
[38]

Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k

Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, et al. Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k. arXiv preprint arXiv:2402.05136,

work page arXiv
[39]

arXiv preprint arXiv:2401.03462 , year=

Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Soaring from 4k to 400k: Extending LLM’s context with activation beacon. arXiv:2401.03462, 2024a. 15 Published as a conference paper at COLM 2024 Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and...

work page arXiv 2024
[40]

Our results in the main text only include aligned models (GPT-4, Gemini-1.5, and 15 open-source models)

16 Published as a conference paper at COLM 2024 A Models We select in total 37 models for evaluation and analysis. Our results in the main text only include aligned models (GPT-4, Gemini-1.5, and 15 open-source models). Besides the aligned models, we also evaluate 7 open-source base models using RULER . We use the performance of Llama2-7b (base) and Llama...

work page 2024
[41]

/ API GPT-4 (OpenAI: Josh Achiam et al., 2023)✓ - 128K gpt-4-1106-previewGemini-1.5 (Reid et al., 2024)✓ - 1M gemini-1.5-pro Llama3.1 (Meta.AI, 2024b) ✓ 70B 128K meta-llama/Meta-Llama-3.1-70B-InstructLlama3.1 (Meta.AI, 2024b) ✓ 8B 128K meta-llama/Meta-Llama-3.1-8B-InstructCommand-R-plus (Cohere, 2024)✓ 104B 128K CohereForAI/c4ai-command-r-plusQwen2 (Yang et al.,

work page 2023
[42]

✓ 34B 200K 01-ai/Yi-34B-200KMixtral-8x22B (Jiang et al., 2024)✓ 39B/141B 32K mistralai/Mixtral-8x22B-Instruct-v0.1Mistral-v0.2 (Mistral.AI, 2023)✓ 7B 32K mistralai/Mistral-7B-Instruct-v0.2GLM4 (GLM et al.,

work page 2024
[43]

✓ 9B 1M THUDM/glm-4-9b-chat-1mGradientAI/Llama3 (Meta.AI, 2024a)✓ 70B 1M gradientai/Llama-3-70B-Instruct-Gradient-1048kPhi3-medium (Abdin et al., 2024)✓ 14B 128K microsoft/Phi-3-medium-128k-instructLWM (Liu et al., 2024a) ✓ 7B 1M LargeWorldModel/LWM-Text-Chat-1MDBRX (Databricks,

work page 2024
[44]

✓ 36B/132B 1M databricks/dbrx-instructTogether (Together.AI, 2023b)✓ 7B 32K togethercomputer/Llama-2-7B-32K-InstructLongChat (Li et al., 2023a) ✓ 7B 32K lmsys/longchat-7b-v1.5-32kLongAlpaca (Chen et al., 2024)✓ 13B 32K Yukang/LongAlpaca-13B Mixtral-base (Jiang et al., 2024)✗ 8x7B 32K mistralai/Mixtral-8x7B-v0.1Mistral-base (Mistral.AI, 2023)✗ 7B 32K alpin...

work page 2024
[45]

✗ 52B 256K ai21labs/Jamba-v0.1 Llama2 (chat) (Touvron et al., 2023)✓ 7B 4K meta-llama/Llama-2-7b-chat-hfLlama2 (base) (Touvron et al., 2023)✗ 7B 4K meta-llama/Llama-2-7b-hf Yi series (Young et al.,

work page 2023
[46]

17 Published as a conference paper at COLM 2024 B Task Configurations RULER is designed to be configurable to allow for diverse sequence lengths and task complexities

✗ 7B 4K RWKV/v5-Eagle-7B-HF Table 4: Information of evaluated and analyzed models in RULER . 17 Published as a conference paper at COLM 2024 B Task Configurations RULER is designed to be configurable to allow for diverse sequence lengths and task complexities. For each task, there arises combinatorially large number of configurations one can adopt. In the...

work page 2024
[47]

Additionally, we change the value type to UUID, for the purpose of testing model robustness at retrieving long strings from context

and the vanilla NIAH (Kamradt, 2023), both use word-number as key-value and differ only by the background haystack. Additionally, we change the value type to UUID, for the purpose of testing model robustness at retrieving long strings from context. For MK-NIAH, we add three distractor needles into the haystack. We also include existing setups from previou...

work page 2023
[48]

They are representative of single-hop and multi-hop question answering tasks respectively

to simulate long-context scenario. They are representative of single-hop and multi-hop question answering tasks respectively. Task Configurations Subtask-1 Subtask-2 Subtask-3 Single NIAH type key = word type value = number type haystack = repeat ∼passkey retrieval type key = word type value = number type haystack = essay ∼vanilla NIAH type key = word typ...

work page 2024
[49]

To prevent models from refusing to answer our questions, we append the input with an answer prefix to elicit model responses

The model template is the model chat format while the task template combines instruction, context, and query. To prevent models from refusing to answer our questions, we append the input with an answer prefix to elicit model responses. For VT and CWE, we use one task sample as in-context demonstration. Model Template GPT-4 {task template} Do not provide a...

work page 2024
[50]

word-f ...... Question: What are the 10 most common words in the above list? Task Answer Prefix: Answer: The top 10 words that appear most often in the list are: FWE Task Template: Read the following coded text and track the frequency of each coded word. Find the three most frequently appeared coded words. ... ... word-a ... word-b ... ... ... word-c ... ...

work page 2024