pith. machine review for the scientific record.

arxiv: 2605.06554 · v1 · submitted 2026-05-07 · 💻 cs.CL

Recognition: unknown

Long Context Pre-Training with Lighthouse Attention

Bowen Peng, Jeffrey Quesnelle, Subho Ghosh

Pith reviewed 2026-05-08 10:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords Lighthouse Attention · hierarchical attention · long context pre-training · causal transformers · subquadratic attention · two-stage training · LLM pre-training · attention compression

The pith

Lighthouse Attention enables faster pre-training of long-context transformers by using hierarchical compression for most training before a short full-attention recovery phase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Lighthouse Attention to address the quadratic time and memory costs that limit causal transformer training at extreme sequence lengths. It wraps standard scaled dot-product attention with a gradient-free hierarchical selection step that adaptively compresses and decompresses the sequence. The approach uses symmetrical pooling of queries, keys, and values to preserve left-to-right causality while improving parallelism. The majority of pre-training occurs under this lighter method, followed by a brief period of full attention to recover a standard model. Small-scale experiments show this yields shorter total training time and lower final loss than running full attention throughout with all other settings identical.
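
A minimal sketch of the wrap-and-compress idea, not the authors' implementation: it stands in for the hierarchical Pyramid Pool and selector with a single fixed mean-pool over blocks of size p, runs ordinary SDPA on the shorter sequence, and broadcasts the result back to full length. PyTorch and the block size are assumptions, and causality here is only enforced at block granularity, whereas the paper's construction keeps it at token level.

```python
import torch
import torch.nn.functional as F

def pooled_causal_attention(q, k, v, p=4):
    """Illustrative single-level stand-in for the hierarchical wrapper.
    q, k, v: (batch, heads, seq_len, dim) with seq_len divisible by p."""
    b, h, n, d = q.shape
    # Symmetric compression: queries, keys and values are pooled over the
    # same non-overlapping blocks, so left-to-right block order is kept.
    q_c = q.view(b, h, n // p, p, d).mean(dim=3)
    k_c = k.view(b, h, n // p, p, d).mean(dim=3)
    v_c = v.view(b, h, n // p, p, d).mean(dim=3)
    # Ordinary SDPA on the compressed sequence: cost drops from O(n^2 d)
    # to O((n/p)^2 d). The causal mask here acts at block granularity only.
    o_c = F.scaled_dot_product_attention(q_c, k_c, v_c, is_causal=True)
    # Decompression: broadcast each block's output back to its p positions.
    return o_c.repeat_interleave(p, dim=2)
```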

Core claim

Lighthouse Attention is a training-only symmetrical selection-based hierarchical attention algorithm that wraps ordinary scaled dot-product attention. It introduces a subquadratic pre- and post-processing step for adaptive compression and decompression of the sequence, using symmetrical pooling of queries, keys, and values that maintains causality. A two-stage training process pre-trains for the bulk of steps with Lighthouse Attention and concludes with a short full-attention recovery phase, producing a deployable full-attention model. Preliminary small-scale LLM pre-training runs demonstrate faster overall training time and lower final loss than matched full-attention baselines.
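
A schematic of that two-stage schedule, with hypothetical hooks: `set_attention` and `optimizer_step` are placeholders for however the model switches attention implementations and steps its optimizer, and the model is assumed to return its training loss directly; none of this is the authors' API.

```python
def two_stage_pretrain(model, batches, total_steps, recovery_steps):
    # Sketch only: `set_attention` and `optimizer_step` are hypothetical hooks.
    for step, batch in zip(range(total_steps), batches):
        # Stage 1 (most steps): train under the compressed Lighthouse path.
        # Stage 2 (final recovery_steps): switch back to plain full SDPA so
        # the final checkpoint is an ordinary full-attention model.
        in_recovery = step >= total_steps - recovery_steps
        model.set_attention("sdpa" if in_recovery else "lighthouse")
        loss = model(batch)       # assumed to return the training loss
        loss.backward()
        model.optimizer_step()
```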

What carries the argument

Lighthouse Attention, a gradient-free symmetrical selection-based hierarchical wrapper around scaled dot-product attention that performs adaptive sequence compression and decompression while preserving causality.
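
Because the selector is gradient-free, its score-and-pick step can run entirely under torch.no_grad(); only the indices it emits feed the gather that precedes SDPA. A minimal norm-based version, assuming pooled keys as input and an L2-norm scorer (a simplification of the paper's chunked bitonic top-k):

```python
import torch

@torch.no_grad()  # the selection step carries no gradients by construction
def select_blocks(k_pooled, keep):
    """k_pooled: (batch, heads, num_blocks, dim) pooled keys.
    Returns indices of the `keep` highest-norm blocks, re-sorted so the
    gathered sub-sequence stays in left-to-right (causal) order."""
    scores = k_pooled.norm(dim=-1)            # parameter-free L2-norm score
    idx = scores.topk(keep, dim=-1).indices   # keep the strongest blocks
    return idx.sort(dim=-1).values            # restore causal ordering
```

The sorted indices would then drive a dense gather into one contiguous causal sequence before stock SDPA, as the Figure 1 caption below describes.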

Load-bearing premise

The information lost during hierarchical compression can be reliably recovered in a short full-attention phase without introducing lasting biases or needing extensive extra training.

What would settle it

A controlled larger-scale replication against a full-attention baseline trained for the same total steps: if the final loss after the recovery phase is equal to or higher than the baseline's, the core claim fails.

Figures

Figures reproduced from arXiv: 2605.06554 by Bowen Peng, Jeffrey Quesnelle, Subho Ghosh.

Figure 1
Figure 1. Lighthouse Attention architecture. Forward (black): Ht is projected to Q, K, V, passed through the symmetric Pyramid Pool on the trunk, and guided by indices I from the Hierarchical Selector, is fed to a dense gather which topographically sorts the gathered hierarchies into a single contiguous and causal sequence, then stock SDPA, and scatter-back to produce Ot. Selection (green): the selector taps the p… view at source ↗
Figure 2
Figure 2. Pyramid Pool and the Hierarchical Selector. The Pyramid Pool is a fixed pre-selection stage that lives outside the selector. (1) Pyramid Pool mean-pools Q, K, V by p^ℓ; lines show which tokens feed each summary. The pooled tokens enter the selector, where (2) Norm Score computes parameter-free ℓ2 norms ∥Q^(ℓ)∥₂, ∥K^(ℓ)∥₂ (coarser levels reuse finer norms via max-pool) and (3) Chunked Bitonic Top-K keeps top… view at source ↗
Figure 3
Figure 3. Attention latency vs. context length for SDPA (cuDNN) and Lighthouse (w/ cuDNN) on a single B200, L=3, p=4, sparsity ≈ 1:64. SDPA scales as Θ(N²d); Lighthouse scales as Θ(S²d) with S ≪ N. At N=512K, Lighthouse is 21× faster on the forward pass and 17.3× faster on the backward pass; equivalently, Lighthouse at 512K takes the same runtime as if training SDPA at ∼113K / ∼122K context. The throughput story is cons… view at source ↗
Figure 4
Figure 4. Needle-in-a-Haystack at 98K training, step 16,000. Four Lighthouse → SDPA configurations (varying k ∈ {1536, 2048} and scorer ∈ {dilated, norm} at L=3, p=4) and the dense SDPA-from-scratch baseline (bottom). Each cell is the mean retrieval rate over 10 single-digit passkeys at the given (context, depth); the per-panel mean is shown in each title. Random chance is 10%. End-to-end speedup. Aggregating both … view at source ↗
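
The equivalent-context figures quoted in the Figure 3 caption can be sanity-checked from the caption's own scaling assumption: if SDPA runtime grows as Θ(N²d), then an s× speedup at N = 512K corresponds to an SDPA run at roughly N/√s. A back-of-envelope check in Python (the quadratic scaling is the only assumption; constants cancel in the ratio):

```python
from math import sqrt

N = 512_000
print(N / sqrt(21.0))   # ~111.7K, close to the ~113K quoted for the forward pass
print(N / sqrt(17.3))   # ~123.1K, close to the ~122K quoted for the backward pass
```
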
read the original abstract

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed towards the end of the training. Our hierarchical selection is also gradient-free, which exempts us from dealing with a complicated and potentially inefficient backward pass kernel. Our contribution is three-fold: (i) A subquadratic hierarchical pre- and post-processing step that does adaptive compression and decompression of the sequence. (ii) A symmetrical compression strategy that pools queries, keys and values at the same time, while preserving left-to-right causality, which greatly improves parallelism. (iii) A two stage training approach which we pre-train for the majority of the time with Lighthouse Attention and recover a full attention model at the end with a short training. We run preliminary small scale LLM pre-training experiments that show the effectiveness of our method compared to full attention training with all other settings matched, where we achieve a faster total training time and lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps around standard scaled dot-product attention (SDPA). It introduces subquadratic hierarchical pre- and post-processing for adaptive compression and decompression of the sequence, a gradient-free symmetrical pooling strategy for queries, keys, and values that preserves left-to-right causality, and a two-stage training procedure in which the majority of pre-training uses Lighthouse Attention followed by a short full-attention recovery phase. Preliminary small-scale LLM pre-training experiments are claimed to show faster total training time and lower final loss relative to matched full-attention baselines, with open-source code provided.

Significance. If the approach proves robust at larger scales, it could meaningfully reduce the computational burden of long-context pre-training by allowing most steps to avoid quadratic attention costs while recovering a deployable full-attention model. The explicit release of code is a clear strength that enables direct verification and extension.

major comments (3)
  1. [Abstract] The claim that the method achieves 'faster total training time and lower final loss after the recovery phase' is given without quantitative values for model size, sequence length, total steps, recovery-phase length, speedup factor, or loss numbers, and without loss curves or statistical details; this makes it impossible to assess whether the reported gains are load-bearing or artifacts of the small-scale setting.
  2. [Two-stage training approach] No analysis or ablation shows that the information discarded by the gradient-free symmetrical hierarchical pooling is reliably recoverable in a short full-attention phase without introducing persistent bias; the central efficiency claim rests on this unverified assumption.
  3. [Experimental results] The manuscript states only that results are 'preliminary' and 'small scale' with 'all other settings matched', supplying neither the exact experimental protocol, the number of runs, nor any scaling behavior; this leaves open whether the observed advantages hold beyond the tested regime.
minor comments (2)
  1. The description of the hierarchical selection mechanism would benefit from explicit pseudocode or a small diagram illustrating the simultaneous Q/K/V pooling and causality preservation.
  2. Notation for the compression levels and decompression step is introduced without a clear reference table or equation numbering, which reduces readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. The comments identify areas where greater specificity and analysis would strengthen the manuscript. We address each major comment below and describe the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The claim that the method achieves 'faster total training time and lower final loss after the recovery phase' is given without quantitative values for model size, sequence length, total steps, recovery-phase length, speedup factor, or loss numbers, and without loss curves or statistical details; this makes it impossible to assess whether the reported gains are load-bearing or artifacts of the small-scale setting.

    Authors: We agree that the abstract would be more informative with concrete numbers. The full manuscript reports preliminary small-scale results, and we will revise the abstract to include specific values drawn from those experiments (model size, sequence length, total steps, recovery-phase length, observed speedup, and loss comparisons). We will also ensure loss curves and basic statistical details are referenced in the abstract and prominently displayed in the main text. revision: yes

  2. Referee: [Two-stage training approach] No analysis or ablation shows that the information discarded by the gradient-free symmetrical hierarchical pooling is reliably recoverable in a short full-attention phase without introducing persistent bias; the central efficiency claim rests on this unverified assumption.

    Authors: This observation is correct; we currently rely on the empirical outcome that the recovered model reaches lower loss than a matched full-attention baseline rather than on explicit ablations of recoverability. In the revision we will add an analysis subsection with ablations that vary recovery-phase length and track loss trajectories, to demonstrate that the information loss is not persistent and that convergence occurs within a short full-attention phase. revision: yes

  3. Referee: [Experimental results] The manuscript states only that results are 'preliminary' and 'small scale' with 'all other settings matched', supplying neither the exact experimental protocol, the number of runs, nor any scaling behavior; this leaves open whether the observed advantages hold beyond the tested regime.

    Authors: We will expand the experimental section to supply the complete protocol, all hyperparameters, data details, and the precise matching conditions used with the baseline. We performed multiple runs with different seeds and will report means with standard deviations. Regarding scaling, our current results are limited to the small-scale regime; we will add a discussion of observed limitations and expected scaling behavior, while acknowledging that large-scale validation lies outside the scope of this preliminary study. revision: partial

standing simulated objections not resolved
  • Empirical demonstration that the observed advantages persist at larger model scales and longer sequences, which would require substantial additional compute beyond the resources available for this preliminary work.

Circularity Check

0 steps flagged

No circularity in algorithmic proposal or empirical validation

full rationale

The paper introduces Lighthouse Attention as a new training-only hierarchical attention wrapper around standard SDPA, with a symmetrical gradient-free compression step and a two-stage pre-train-then-recover procedure. No equations or first-principles derivations are presented that reduce claimed performance (faster training time, lower final loss) to quantities defined by the method's own fitted parameters or by construction. The experimental results are reported as independent small-scale validation runs with matched settings, not as outputs of any self-referential fitting or renaming process. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core claims. The derivation chain is therefore self-contained as an algorithmic contribution plus separate empirical check.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that a gradient-free symmetrical hierarchical compression can be inverted during recovery without loss of essential causal information; no explicit free parameters, axioms, or invented entities are named in the abstract beyond the new attention wrapper itself.

pith-pipeline@v0.9.0 · 5510 in / 1187 out tokens · 43939 ms · 2026-05-08T10:05:21.912416+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 21 canonical work pages · 13 internal anchors

  1. [1]

    The Claude 3 model family, 2024

    Anthropic. The Claude 3 model family, 2024. URL https://www.anthropic.com/news/claude-3-family

  2. [2]

    Zoology: Measuring and improving recall in efficient language models, 2024

    S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, and C. Ré. Zoology: Measuring and improving recall in efficient language models. In ICLR, 2024. arXiv:2312.04927

  3. [3]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  4. [4]

    Rethinking Attention with Performers

    K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with Performers. In ICLR, 2021

  5. [5]

    T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024. arXiv:2307.08691

  6. [6]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In ICML, 2024

  7. [7]

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022

  8. [8]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  9. [9]

    DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention

    DeepSeek-AI. DeepSeek-V3.2-Exp: Boosting long-context efficiency with DeepSeek sparse attention. arXiv preprint, 2025

  10. [10]

    Q. Fu, M. Cho, T. Merth, S. Mehta, M. Rastegari, and M. Najibi. LazyLLM: Dynamic token pruning for efficient long context LLM inference. arXiv preprint arXiv:2407.14057, 2024

  11. [11]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Google DeepMind. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  12. [12]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  13. [13]

    Log-Linear Attention

    H. Guo et al. Log-linear attention. arXiv preprint, 2025

  14. [14]

    E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017

  15. [15]

    MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

    H. Jiang, Y. Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C.-Y. Lin, Y. Yang, and L. Qiu. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. arXiv preprint arXiv:2407.02490, 2024

  16. [16]

    FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

    X. Lai et al. FlexPrefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint, 2025

  17. [17]

    Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024

  18. [18]

    Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning

    C. Lin et al. Twilight: Adaptive attention sparsity with hierarchical top-p pruning. arXiv preprint, 2025

  19. [19]

    H. Liu, M. Zaharia, and P. Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023

  20. [20]

    E. Lu, Z. Jiang, J. Liu, Y. Du, T. Jiang, C. Hong, S. Liu, W. He, E. Yuan, Y. Wang, et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025

  21. [21]

    C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017

  22. [22]

    The Llama 3 Herd of Models

    Meta AI. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  23. [23]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Moonshot AI. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  24. [24]

    DoubleP: Hierarchical Cluster-and-Refine Attention with Centroid Approximation

    Y. Ni et al. DoubleP: Hierarchical cluster-and-refine attention with centroid approximation. arXiv preprint, 2026

  25. [25]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2024

  26. [26]

    M. Oren, M. Hassid, Y. Adi, and R. Schwartz. Transformers are multi-state RNNs. arXiv preprint arXiv:2401.06104, 2024

  27. [27]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025

  28. [28]

    SparQ Attention: Bandwidth-Efficient LLM Inference

    L. Ribar, I. Chelombiev, L. Hudlass-Galley, C. Blake, C. Luschi, and D. Orr. SparQ attention: Bandwidth-efficient LLM inference. In International Conference on Machine Learning (ICML), 2024

  29. [29]

    J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  30. [30]

    Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  31. [31]

    J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han. Quest: Query-aware sparsity for efficient long-context LLM inference. In International Conference on Machine Learning (ICML), 2024

  32. [32]

    G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024

  33. [33]

    XAttention: Block Sparse Attention with Antidiagonal Scoring

    R. Xu et al. XAttention: Block sparse attention with antidiagonal scoring. arXiv preprint, 2025

  34. [34]

    S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2024

  35. [35]

    Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical attention networks for document classification. In NAACL, 2016

  36. [36]

    J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. Wei, L. Wang, Z. Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089, 2025

  37. [37]

    SpargeAttention: Accurate and Training-Free Sparse Attention Accelerating Any Model Inference

    J. Zhang et al. SpargeAttention: Accurate and training-free sparse attention accelerating any model inference. arXiv preprint, 2025

  38. [38]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  39. [39]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In NeurIPS, 2024. arXiv:2306.14048

  40. [40]

    HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

    L. Zhao et al. HISA: Efficient hierarchical indexing for fine-grained sparse attention. arXiv preprint arXiv:2603.28458, 2026

  41. [41]

    InfLLM-V2: Dense–Sparse Switchable Attention for Seamless Short-to-Long Adaptation

    L. Zhao et al. InfLLM-V2: Dense–sparse switchable attention for seamless short-to-long adaptation. arXiv preprint, 2026