Recognition: 2 theorem links
· Lean TheoremRULER: What's the Real Context Size of Your Long-Context Language Models?
Pith reviewed 2026-05-11 03:50 UTC · model grok-4.3
The pith
Long-context language models lose accuracy on tasks beyond basic retrieval as context length grows, despite advertised sizes of 32K or more.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Despite near-perfect scores on the standard needle-in-a-haystack test, nearly all evaluated long-context language models display large performance drops as input length increases and as tasks move beyond simple retrieval to multi-hop tracing and aggregation. While every model claims a context size of 32K tokens or greater, only half maintain acceptable accuracy at that length on the full set of RULER tasks.
What carries the argument
RULER benchmark, a configurable synthetic test suite that extends needle-in-a-haystack retrieval with controlled numbers of needles plus multi-hop tracing and aggregation categories.
If this is right
- Models that pass basic retrieval tests still need separate validation on multi-hop and aggregation tasks before their context claims can be trusted.
- Only half of the 17 tested models that advertise 32K or larger contexts actually sustain performance at 32K on the expanded tasks.
- Even the Yi-34B model, which supports up to 200K tokens, shows substantial room for improvement when both length and task complexity are increased together.
- Current approaches to extending context length do not automatically produce models that can use that length for reasoning steps spread across the input.
Where Pith is reading between the lines
- Evaluation standards for long-context models may shift toward requiring these harder synthetic tasks rather than relying on needle retrieval alone.
- Training objectives focused on maintaining and linking distant pieces of information could close more of the observed gap than simply increasing maximum length.
- Applications that depend on synthesizing information from many sources may need to keep inputs shorter or add retrieval steps until model capabilities improve.
Load-bearing premise
That results on these synthetic tracing and aggregation tasks reliably indicate how well a model will handle the long-context demands of real applications.
What would settle it
A model that scores low on RULER at 32K yet achieves high accuracy on production tasks that require connecting facts across long documents or conversations.
read the original abstract
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RULER, a synthetic benchmark extending the needle-in-a-haystack (NIAH) test with configurable sequence lengths and task complexities, including variations with multiple/diverse needles, multi-hop tracing, and aggregation tasks. It evaluates 17 long-context LMs on 13 tasks, claiming that near-perfect vanilla NIAH performance does not hold as context length grows, with only half the models maintaining satisfactory results at 32K tokens despite claiming 32K+ windows; it also analyzes Yi-34B (200K claimed) and releases RULER openly.
Significance. If the central results hold after addressing potential confounds, the work is significant for the field: it provides a more demanding, flexible benchmark than vanilla NIAH for assessing genuine long-context capabilities, demonstrates that claimed context windows often overestimate effective length, and supplies reproducible code and tasks. The broad model coverage and open release are explicit strengths that support follow-on work.
major comments (2)
- [Abstract] Abstract and evaluation description: the central claim attributes large performance drops (and the 'only half maintain satisfactory performance at 32K' result) to insufficient effective context length. However, RULER is described as having 'flexible configurations for customized sequence length and task complexity,' and the Yi-34B analysis explicitly increases both length and complexity together. It is not stated that complexity parameters (needle count, hop count, aggregation difficulty) are held fixed while scaling length in the 13-task suite. If complexity rises with length, the observed degradation cannot be cleanly attributed to length alone and the 32K threshold interpretation is under-supported.
- [Evaluation] Evaluation of 17 models: without explicit confirmation that the same task-complexity settings are used across the tested lengths (e.g., 4K vs 32K), the trends in degradation are difficult to interpret as length-specific. Adding a controlled experiment or table that fixes complexity parameters while varying only length would directly address this.
minor comments (2)
- [Evaluation] Provide additional detail on the precise configurations (needle quantities, hop counts, distractor types) used for each of the 13 tasks at each tested length, and on any statistical controls or variance estimates for the reported accuracy drops.
- [Abstract] The abstract states 'only half of them can maintain satisfactory performance at the length of 32K'; define 'satisfactory' quantitatively (e.g., accuracy threshold) and report per-model numbers to make the claim precise.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on potential confounds between sequence length and task complexity. We clarify that the primary evaluation of the 17 models holds complexity fixed while varying length, and we will revise the manuscript for explicitness as detailed below.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation description: the central claim attributes large performance drops (and the 'only half maintain satisfactory performance at 32K' result) to insufficient effective context length. However, RULER is described as having 'flexible configurations for customized sequence length and task complexity,' and the Yi-34B analysis explicitly increases both length and complexity together. It is not stated that complexity parameters (needle count, hop count, aggregation difficulty) are held fixed while scaling length in the 13-task suite. If complexity rises with length, the observed degradation cannot be cleanly attributed to length alone and the 32K threshold interpretation is under-supported.
Authors: We appreciate this observation. In the main 13-task evaluation of the 17 models, complexity parameters are held fixed (e.g., fixed needle counts/types, hop counts, and aggregation functions per task) while only the haystack length is scaled across 4K/8K/16K/32K. The Yi-34B analysis deliberately varies both dimensions to probe upper limits. The manuscript does not explicitly state the fixed-complexity design for the primary results, which we agree weakens the length-specific interpretation. We will revise the abstract, Section 3 (Benchmark), and the evaluation description to state this clearly, and add a table enumerating the exact fixed parameter values used for each task. revision: yes
-
Referee: [Evaluation] Evaluation of 17 models: without explicit confirmation that the same task-complexity settings are used across the tested lengths (e.g., 4K vs 32K), the trends in degradation are difficult to interpret as length-specific. Adding a controlled experiment or table that fixes complexity parameters while varying only length would directly address this.
Authors: We agree that explicit confirmation is needed. The existing results already use fixed complexity settings across lengths by design of RULER's configurable tasks. To directly address the concern, we will add a new table (or appendix table) that lists the complexity parameters for all 13 tasks and confirms they are unchanged while length varies. This makes the length-specific trends unambiguous without requiring new experiments. revision: yes
Circularity Check
No circularity: empirical results from new benchmark tasks stand independently
full rationale
The paper introduces the RULER benchmark with 13 tasks (NIAH variants, multi-hop tracing, aggregation) and reports direct accuracy measurements from running 17 LMs at varying lengths. No equations, fitted parameters, or derivations are present; claims about performance drops are observational outputs from model evaluations on explicitly defined inputs, not reductions to self-defined quantities or self-citation chains. The evaluation is self-contained and falsifiable via external reproduction on the open-sourced tasks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Synthetic tasks such as multi-hop tracing and aggregation test genuine long-context understanding rather than narrow artifacts
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcingalexander_duality_circle_linking unclearRULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context.
Forward citations
Cited by 60 Pith papers
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
MEME: Multi-entity & Evolving Memory Evaluation
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.
-
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
-
HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray
HUGO-CS is a 4,383-experiment cold-spray dataset extracted from literature via a new hybrid LLM-manual framework that is 30 times larger than prior collections and released with code.
-
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...
-
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
-
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
-
Remember to Forget: Gated Adaptive Positional Encoding
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
-
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...
-
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
-
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...
-
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
-
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...
-
Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
Frontier LLMs solve single-needle retrieval at 1M tokens on classical Chinese but show three distinct accuracy-decay patterns in three-hop reasoning between 256K and 1M tokens.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
-
OPSDL: On-Policy Self-Distillation for Long-Context Language Models
OPSDL improves long-context LLM performance by having the model self-distill from its short-context capability using point-wise reverse KL divergence on generated tokens, outperforming SFT and DPO on benchmarks withou...
-
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
-
Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.
-
Latent-Condensed Transformer for Efficient Long Context Modeling
Latent-Condensed Attention condenses context in MLA's latent space via query-aware semantic pooling and positional anchor selection, delivering up to 2.5x prefilling speedup and 90% KV cache reduction at 128K length w...
-
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
-
StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks
State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
-
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
-
Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables
A 1650-session factorial study found no measurable impact from config file size, instruction position, architecture, or conflicts on coding agent adherence, though compliance declined within sessions.
-
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
Budget-Aware Routing for Long Clinical Text
RCD balances relevance, coverage, and diversity in a knapsack-constrained selection framework, with experiments showing that selector choice and budget level determine optimal unitization strategies on clinical datasets.
-
Caracal: Causal Architecture via Spectral Mixing
Caracal is a Fourier-based sequence mixing architecture that achieves causal autoregressive modeling with standard operators and competitive performance on long sequences.
-
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
-
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
-
A Decomposition Perspective to Long-context Reasoning for LLMs
Decomposing long-context reasoning into atomic skills, synthesizing targeted pseudo-datasets, and applying RL improves LLM performance on long-context benchmarks by an average of 7.7%.
-
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
Qwen3 Technical Report
Pith review generated a malformed one-line summary.
-
Submodular Benchmark Selection
Submodular maximization under a Gaussian model selects small benchmark subsets that outperform random selection for imputing leaderboard scores, with mutual information better than entropy at small sizes.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
-
XekRung Technical Report
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
- RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
Reference graph
Works this paper leans on
-
[1]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Agrawal, P., Craig, N., Madden, A., and Lombera, I
Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Stephanie Chan, Ankesh Anand, Zaheer Abbas, Azade Nova, John D Co-Reyes, Eric Chu, et al. Many-shot in-context learning. arXiv preprint arXiv:2404.11018,
-
[3]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Yushi Bai et al. LongBench: A bilingual, multitask benchmark for long context understand- ing. arXiv:2308.14508,
work page internal anchor Pith review arXiv
-
[4]
xLSTM: Extended Long Short-Term Memory
Maximilian Beck, Korbinian P¨oppel, Markus Spanring, Andreas Auer, Oleksandra Prud- nikova, Michael Kopp, G¨unter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. arXiv preprint arXiv:2405.04517,
-
[5]
arXiv preprint arXiv:2405.00200 , year=
Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R Gormley, and Graham Neubig. In-context learning with long-context models: An in-depth exploration. arXiv preprint arXiv:2405.00200,
-
[6]
Scaling transformer to 1m tokens and beyond with rmt
Aydar Bulatov, Yuri Kuratov, and Mikhail S Burtsev. Scaling Transformer to 1M tokens and beyond with RMT. arXiv:2304.11062,
-
[7]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse Transformers. arXiv:1904.10509,
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[8]
URL https://docs.cohere.com/docs/command-r-plus# model-details. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arxiv:2307.08691,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Introducing dbrx: A new state-of-the-art open llm,
11 Published as a conference paper at COLM 2024 Databricks. Introducing dbrx: A new state-of-the-art open llm,
work page 2024
-
[10]
LongNet: Scaling transformers to 1,000,000,000 tokens.arXiv preprint arXiv:2307.02486,
URL https://www. databricks.com/blog/introducing-dbrx-new-state-art-open-llm . Jiayu Ding et al. LongNet: Scaling Transformers to 1,000,000,000 tokens. arXiv:2307.02486,
-
[11]
Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753, 2024
Yiran Ding et al. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv:2402.13753,
-
[12]
Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. Bamboo: A com- prehensive benchmark for evaluating long text modeling capacities of large language models. arXiv:2309.13345,
-
[13]
Data engineering for scaling language models to 128K context.arXiv preprint arXiv:2402.10171, 2024
Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, and Christopher R´e. Hungry Hungry Hippos: Towards language modeling with state space models. In ICLR, 2023a. Daniel Y. Fu et al. Simple hardware-efficient long convolutions for sequence modeling. ICML, 2023b. Yao Fu et al. Data engineering for scaling language models to 128k context.arXi...
-
[14]
Is it really long context if all you need is retrieval? towards genuinely difficult long context nlp
Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, and Reut Tsarfaty. Is it really long context if all you need is retrieval? towards genuinely difficult long context nlp. arXiv preprint arXiv:2407.00402,
-
[15]
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv:1410.5401,
work page internal anchor Pith review arXiv
-
[16]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Lm-infinite: Simple on-the-fly length generalization for large language models
Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv:2308.16137,
-
[18]
Sam Ade Jacobs et al. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence Transformer models. arXiv:2309.14509,
work page internal anchor Pith review arXiv
-
[19]
Albert Q Jiang et al. Mixtral of experts. arXiv:2401.04088,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression
12 Published as a conference paper at COLM 2024 Huiqiang Jiang et al. LongLlmLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. arXiv:2310.06839,
-
[21]
URL https://github.com/gkamradt/LLMTest NeedleInAHaystack/tree/main. Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. One thousand and one pairs: A” novel” challenge for long-context language models. arXiv preprint arXiv:2406.16264,
-
[22]
Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. Babilong: Testing the limits of llms with long context reasoning-in- a-haystack. arXiv preprint arXiv:2406.10149,
-
[23]
Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, S ´ebastien MR Arnold, Vincent Perot, Siddharth Dalmia, et al. Can long-context language models subsume retrieval, rag, sql, and more? arXiv preprint arXiv:2406.13121,
-
[24]
Mosh Levy, Alon Jacoby, and Yoav Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models. arXiv preprint arXiv:2402.14848,
-
[25]
How long can open-source LLMs truly promise on context length?, 2023a
Dacheng Li, Rulin Shao, et al. How long can open-source LLMs truly promise on context length?, 2023a. URL https://lmsys.org/blog/2023-06-29-longchat . Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts? arXiv:2311.04939, 2023b. Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention ...
-
[26]
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with Ring Attention. arxiv:2402.08268, 2024a. Jiaheng Liu et al. E2-LLM: Efficient and extreme length extension of large language models. arXiv:2401.06951, 2024b. Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun ...
-
[27]
13 Published as a conference paper at COLM 2024 Amirkeivan Mohtashami and Martin Jaggi
URL https://mistral.ai/news/la-plateforme/. 13 Published as a conference paper at COLM 2024 Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for Transformers. In Workshop on Efficient Systems for Foundation Models @ ICML,
work page 2024
-
[28]
https://transformer-circuits.pub/2022/in-context-learning-and-induction- heads/index.html. OpenAI: Josh Achiam et al. GPT-4 technical report. arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[29]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Machel Reid et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with rotary position embedding. arXiv:2104.09864,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
ChapterBreak: A challenge dataset for long-range language models
Simeng Sun, Katherine Thai, and Mohit Iyyer. ChapterBreak: A challenge dataset for long-range language models. In Proc. of the 2022 Conference of the North American Chapter of the ACL: Human Language Technologies,
work page 2022
-
[32]
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models. arXiv:2307.08621, 2023a. Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable Transformer. In Pr...
work page internal anchor Pith review arXiv
-
[33]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Together.AI. Preparing for the era of 32k context: Early learnings and explorations, 2023a. URL https://www.together.ai/blog/llama-2-7b-32k . 14 Published as a conference paper at COLM 2024 Together.AI. Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, 2023b. URL https://www.together.ai/blog/llama-2-7b-32k-instruct . Hugo Tou...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Thomas Wolf et al. Huggingface’s Transformers: State-of-the-art natural language process- ing. arXiv:1910.03771,
work page internal anchor Pith review arXiv 1910
-
[35]
URL https://x.ai/blog/grok-1.5. Chaojun Xiao et al. InfLLM: Unveiling the intrinsic capacity of LLMs for understanding extremely long sequences with training-free memory. arXiv:2402.04617, 2024a. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient stream- ing language models with attention sinks. In ICLR, 2024b. Wenhan Xiong et ...
-
[36]
Retrieval meets long context large language models
Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. Retrieval meets long context large language models. In ICLR, 2024a. Xiaoyue Xu, Qinyuan Ye, and Xiang Ren. Stress-testing long-context language models with lifelong icl and task haystack. arXiv preprint arXi...
-
[37]
Yi: Open Foundation Models by 01.AI
Alex Young et al. Yi: Open foundation models by 01.AI. arXiv:2403.04652,
work page internal anchor Pith review arXiv
-
[38]
Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k
Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, et al. Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k. arXiv preprint arXiv:2402.05136,
-
[39]
arXiv preprint arXiv:2401.03462 , year=
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Soaring from 4k to 400k: Extending LLM’s context with activation beacon. arXiv:2401.03462, 2024a. 15 Published as a conference paper at COLM 2024 Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and...
-
[40]
16 Published as a conference paper at COLM 2024 A Models We select in total 37 models for evaluation and analysis. Our results in the main text only include aligned models (GPT-4, Gemini-1.5, and 15 open-source models). Besides the aligned models, we also evaluate 7 open-source base models using RULER . We use the performance of Llama2-7b (base) and Llama...
work page 2024
-
[41]
/ API GPT-4 (OpenAI: Josh Achiam et al., 2023)✓ - 128K gpt-4-1106-previewGemini-1.5 (Reid et al., 2024)✓ - 1M gemini-1.5-pro Llama3.1 (Meta.AI, 2024b) ✓ 70B 128K meta-llama/Meta-Llama-3.1-70B-InstructLlama3.1 (Meta.AI, 2024b) ✓ 8B 128K meta-llama/Meta-Llama-3.1-8B-InstructCommand-R-plus (Cohere, 2024)✓ 104B 128K CohereForAI/c4ai-command-r-plusQwen2 (Yang et al.,
work page 2023
-
[42]
✓ 34B 200K 01-ai/Yi-34B-200KMixtral-8x22B (Jiang et al., 2024)✓ 39B/141B 32K mistralai/Mixtral-8x22B-Instruct-v0.1Mistral-v0.2 (Mistral.AI, 2023)✓ 7B 32K mistralai/Mistral-7B-Instruct-v0.2GLM4 (GLM et al.,
work page 2024
-
[43]
✓ 9B 1M THUDM/glm-4-9b-chat-1mGradientAI/Llama3 (Meta.AI, 2024a)✓ 70B 1M gradientai/Llama-3-70B-Instruct-Gradient-1048kPhi3-medium (Abdin et al., 2024)✓ 14B 128K microsoft/Phi-3-medium-128k-instructLWM (Liu et al., 2024a) ✓ 7B 1M LargeWorldModel/LWM-Text-Chat-1MDBRX (Databricks,
work page 2024
-
[44]
✓ 36B/132B 1M databricks/dbrx-instructTogether (Together.AI, 2023b)✓ 7B 32K togethercomputer/Llama-2-7B-32K-InstructLongChat (Li et al., 2023a) ✓ 7B 32K lmsys/longchat-7b-v1.5-32kLongAlpaca (Chen et al., 2024)✓ 13B 32K Yukang/LongAlpaca-13B Mixtral-base (Jiang et al., 2024)✗ 8x7B 32K mistralai/Mixtral-8x7B-v0.1Mistral-base (Mistral.AI, 2023)✗ 7B 32K alpin...
work page 2024
-
[45]
✗ 52B 256K ai21labs/Jamba-v0.1 Llama2 (chat) (Touvron et al., 2023)✓ 7B 4K meta-llama/Llama-2-7b-chat-hfLlama2 (base) (Touvron et al., 2023)✗ 7B 4K meta-llama/Llama-2-7b-hf Yi series (Young et al.,
work page 2023
-
[46]
✗ 7B 4K RWKV/v5-Eagle-7B-HF Table 4: Information of evaluated and analyzed models in RULER . 17 Published as a conference paper at COLM 2024 B Task Configurations RULER is designed to be configurable to allow for diverse sequence lengths and task complexities. For each task, there arises combinatorially large number of configurations one can adopt. In the...
work page 2024
-
[47]
and the vanilla NIAH (Kamradt, 2023), both use word-number as key-value and differ only by the background haystack. Additionally, we change the value type to UUID, for the purpose of testing model robustness at retrieving long strings from context. For MK-NIAH, we add three distractor needles into the haystack. We also include existing setups from previou...
work page 2023
-
[48]
They are representative of single-hop and multi-hop question answering tasks respectively
to simulate long-context scenario. They are representative of single-hop and multi-hop question answering tasks respectively. Task Configurations Subtask-1 Subtask-2 Subtask-3 Single NIAH type key = word type value = number type haystack = repeat ∼passkey retrieval type key = word type value = number type haystack = essay ∼vanilla NIAH type key = word typ...
work page 2024
-
[49]
The model template is the model chat format while the task template combines instruction, context, and query. To prevent models from refusing to answer our questions, we append the input with an answer prefix to elicit model responses. For VT and CWE, we use one task sample as in-context demonstration. Model Template GPT-4 {task template} Do not provide a...
work page 2024
-
[50]
word-f ...... Question: What are the 10 most common words in the above list? Task Answer Prefix: Answer: The top 10 words that appear most often in the list are: FWE Task Template: Read the following coded text and track the frequency of each coded word. Find the three most frequently appeared coded words. ... ... word-a ... word-b ... ... ... word-c ... ...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.