pith. sign in

arxiv: 2606.07317 · v2 · pith:ZVLQNJWKnew · submitted 2026-06-05 · 💻 cs.IR

Gated Bidirectional Linear Attention for Generative Retrieval

Pith reviewed 2026-06-27 20:36 UTC · model grok-4.3

classification 💻 cs.IR
keywords generative retrievallinear attentionbidirectional attentionrecommender systemsuser history encodingattention efficiencyhybrid attention
0
0 comments X

The pith

A hybrid encoder interleaving self-attention and gated bidirectional linear attention in a 1:2 ratio matches full bidirectional self-attention quality on long user histories while delivering up to 8.2 times layer speedup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the quadratic latency bottleneck in encoders for generative retrieval when user histories grow very long in streaming services. It extends kernelized linear attention with three additions to create a bidirectional linear layer that supports soft forgetting and local mixing. Experiments on a large Yandex Music dataset demonstrate that replacing two-thirds of self-attention blocks with this layer preserves retrieval quality. The same hybrid pattern transfers to Amazon benchmarks and yields substantial wall-clock gains on H100 hardware at lengths up to 32768. A reader cares because real-time recommendation at scale requires both accuracy and sub-quadratic scaling for active users.

Core claim

Gated Bidirectional Linear Attention (GBLA) recovers the quality of bidirectional self-attention when self-attention and GBLA blocks are interleaved 1:2 in the encoder. GBLA is built by adding Conv1D local causal mixing, sequence-level key gating, and gated RMSNorm to kernelized linear attention. On the Yandex Music dataset this hybrid encoder matches full self-attention retrieval metrics; on H100 GPUs a single GBLA layer is up to 8.2 times faster than FlashAttention-v3 at history length 32768. The hybrid design also preserves quality on public Amazon retrieval benchmarks.

What carries the argument

Gated Bidirectional Linear Attention (GBLA) that augments kernelized linear attention with Conv1D local causal mixing, sequence-level key gating for soft forgetting, and gated RMSNorm.

If this is right

  • The 1:2 SA-GBLA hybrid encoder matches bidirectional self-attention retrieval quality on the Yandex Music dataset.
  • GBLA achieves up to 8.2 times single-layer speedup versus FlashAttention-v3 on H100 GPUs at sequence length 32768.
  • The same hybrid architecture preserves self-attention retrieval quality on the public Amazon benchmarks.
  • GBLA removes the quadratic latency term from the encoder, allowing history lengths to grow without proportional slowdown.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hybrid ratio may transfer to other long-context sequence tasks where full attention remains the quality ceiling.
  • Removing the remaining self-attention blocks entirely could be tested by increasing the GBLA proportion further.
  • The key-gating and Conv1D additions might be portable to causal linear attention in autoregressive language models.

Load-bearing premise

The three added components together suffice to recover full bidirectional self-attention quality when GBLA replaces two-thirds of the self-attention blocks, as shown only by end-to-end empirical results.

What would settle it

Training the identical 1:2 hybrid encoder on a fresh large-scale retrieval dataset with different interaction statistics and measuring whether its retrieval metrics fall measurably below those of a pure bidirectional self-attention encoder.

Figures

Figures reproduced from arXiv: 2606.07317 by Artem Matveev, Sergei Liamaev, Sergei Makeev, Vladislav Tytskiy.

Figure 1
Figure 1. Figure 1: Gated Bidirectional Linear Attention architecture. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Encoder latency for different sequence lengths. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
read the original abstract

In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes a user interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because softmax attention scales quadratically with sequence length. In our experiments, using bidirectional attention in the encoder substantially improves quality. However, most sub-quadratic attention methods focus on causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches bidirectional self-attention quality. On H100 GPUs, GBLA reaches up to an $8.2\times$ single-layer speedup at a history length of 32768, compared to FlashAttention-v3. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Gated Bidirectional Linear Attention (GBLA), which augments kernelized linear attention with three components (Conv1D local causal mixing, sequence-level key gating, and gated RMSNorm) to enable efficient bidirectional processing in the encoder of generative retrieval models. It reports that a hybrid encoder interleaving self-attention and GBLA blocks in a 1:2 ratio matches the retrieval quality of full bidirectional self-attention on a large proprietary Yandex Music dataset while delivering up to 8.2× single-layer speedup versus FlashAttention-v3 at sequence length 32768 on H100 GPUs; the same hybrid is shown to preserve quality on public Amazon benchmarks.

Significance. If the quality-parity result holds under controlled conditions, the work would be significant for large-scale recommender systems that must encode very long user histories without quadratic latency. The reported GPU speedup and the generalization experiment on public Amazon data are concrete strengths; the absence of component ablations and error bars, however, leaves the attribution of parity to the three GBLA extensions weakly supported.

major comments (2)
  1. [Experiments (Yandex Music)] Experiments section (Yandex Music results): the central claim that the 1:2 SA:GBLA hybrid recovers bidirectional self-attention quality rests solely on end-to-end metrics; no ablation removes any one of the three added components (Conv1D, key gating, gated RMSNorm) while retaining the hybrid ratio, nor compares against a hybrid using unmodified kernelized linear attention. This omission is load-bearing because the observed parity could be driven by the retained SA blocks rather than the proposed extensions.
  2. [Abstract and §4] Abstract and evaluation protocol description: no error bars, confidence intervals, or statistical significance tests accompany the quality metrics on the proprietary Yandex Music dataset, and training/evaluation details (optimizer, learning-rate schedule, negative sampling, exact metric definitions) are not provided. These omissions weaken the empirical support for the quality-parity claim given that the main result is reported on a single non-public corpus.
minor comments (2)
  1. [§3] Notation for the three GBLA components is introduced in the abstract but the precise mathematical definitions (especially the sequence-level key gating and gated RMSNorm) should be cross-referenced to the corresponding equations in §3 for clarity.
  2. [Experiments (speedup)] The speedup figure (8.2× at length 32768) should specify whether it is measured for a single GBLA layer in isolation or within the full hybrid encoder, and whether it includes the overhead of the interleaved SA blocks.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments and for acknowledging the potential impact of this work on large-scale recommender systems. We respond to each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: Experiments section (Yandex Music results): the central claim that the 1:2 SA:GBLA hybrid recovers bidirectional self-attention quality rests solely on end-to-end metrics; no ablation removes any one of the three added components (Conv1D, key gating, gated RMSNorm) while retaining the hybrid ratio, nor compares against a hybrid using unmodified kernelized linear attention. This omission is load-bearing because the observed parity could be driven by the retained SA blocks rather than the proposed extensions.

    Authors: We concur that the lack of ablations limits the strength of the claim. The GBLA components were developed to overcome limitations of kernelized linear attention in bidirectional contexts, specifically for local dependencies, long-sequence forgetting, and training stability. In the revised manuscript, we will add an experiment comparing the 1:2 hybrid with unmodified kernelized linear attention to better isolate the contribution of the proposed extensions. We will also elaborate on the design rationale for each component. revision: partial

  2. Referee: Abstract and evaluation protocol description: no error bars, confidence intervals, or statistical significance tests accompany the quality metrics on the proprietary Yandex Music dataset, and training/evaluation details (optimizer, learning-rate schedule, negative sampling, exact metric definitions) are not provided. These omissions weaken the empirical support for the quality-parity claim given that the main result is reported on a single non-public corpus.

    Authors: We will update the manuscript to provide complete details on the optimizer, learning-rate schedule, negative sampling, and metric definitions in the evaluation protocol section. For the Yandex Music results, the experiments were performed with a single training run due to the scale of the dataset and associated costs. We will add a statement clarifying this and highlight the corroborating results on the public Amazon datasets. Statistical significance testing can be added for the Amazon experiments in the revision. revision: partial

standing simulated objections not resolved
  • We cannot provide error bars or results from multiple independent runs on the Yandex Music dataset, as repeating the full-scale training is computationally prohibitive.

Circularity Check

0 steps flagged

No circularity: empirical quality claims rest on direct held-out comparisons

full rationale

The paper advances an empirical claim that a 1:2 SA+GBLA hybrid recovers bidirectional self-attention retrieval quality on Yandex Music and Amazon benchmarks. This is supported by end-to-end experimental results against FlashAttention-v3 and self-attention baselines on held-out data. No equations, fitted parameters, or self-citations are shown to reduce the reported metrics to quantities defined inside the same experiment. The three added components are motivated by extension of prior kernelized linear attention but the performance parity is not derived from them by construction; it is measured directly. No load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results appears in the provided text.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The method rests on the empirical observation that the added gating and mixing components recover quality; no first-principles derivation is supplied.

free parameters (1)
  • hybrid interleaving ratio
    1:2 ratio chosen to match self-attention quality on the target dataset
axioms (1)
  • domain assumption Kernelized linear attention can be made bidirectional by adding local causal mixing, sequence-level key gating, and gated RMSNorm
    Invoked as the basis for the GBLA layer design
invented entities (1)
  • Gated Bidirectional Linear Attention (GBLA) no independent evidence
    purpose: Provide linear-time bidirectional attention for long user histories
    New layer introduced in this work

pith-pipeline@v0.9.1-grok · 5770 in / 1228 out tokens · 20514 ms · 2026-06-27T20:36:16.269514+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 14 canonical work pages · 6 internal anchors

  1. [1]

    Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, and Volkan Cevher. 2025. Lin- ear Attention for Efficient Bidirectional Sequence Modeling.arXiv preprint arXiv:2502.16249(2025). arXiv:2502.16249 [cs.LG]

  2. [2]

    Zheng Chai, Qin Ren, Xijun Xiao, Huizhi Yang, Bo Han, Sijun Zhang, Di Chen, Hui Lu, Wenlin Zhao, Lele Yu, et al . 2025. Longer: Scaling up long sequence modeling in industrial recommenders. InProceedings of the Nineteenth ACM Conference on Recommender Systems. 247–256

  3. [3]

    Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. 2025. OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment.arXiv preprint arXiv:2502.18965(2025). arXiv:2502.18965 [cs.IR]

  4. [4]

    Yue Dong, Han Li, Shen Li, Nikhil Patel, Xing Liu, Xiaodong Wang, and Chuanhao Zhuge. 2025. Scaling Generative Recommendations with Context Parallelism on Hierarchical Sequential Transducers.arXiv preprint arXiv:2508.04711(2025). arXiv:2508.04711 [cs.IR]

  5. [5]

    Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selec- tive State Spaces.arXiv preprint arXiv:2312.00752(2023). arXiv:2312.00752 [cs.LG]

  6. [6]

    Clark Mingxuan Ju, Liam Collins, Leonardo Neves, Bhuvesh Kumar, Louis Yufeng Wang, Tong Zhao, and Neil Shah. 2025. Generative Recommendation with Semantic IDs: A Practitioner’s Handbook.arXiv preprint arXiv:2507.22224(2025). arXiv:2507.22224 [cs.IR]

  7. [8]

    Transformers are rnns: Fast autoregressive transformers with linear attention, 2020

    Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.arXiv preprint arXiv:2006.16236(2020). arXiv:2006.16236 [cs.LG]

  8. [9]

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret

  9. [10]

    InInternational conference on machine learning

    Transformers are rnns: Fast autoregressive transformers with linear atten- tion. InInternational conference on machine learning. PMLR, 5156–5165

  10. [11]

    Kirill Khrylchenko, Artem Matveev, Sergei Makeev, and Vladimir Baikalov. 2025. Scaling Recommender Transformers to One Billion Parameters.arXiv preprint arXiv:2507.15994(2025). arXiv:2507.15994 [cs.IR]

  11. [12]

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezh...

  12. [13]

    Nikil Pancha, Andrew Zhai, Jure Leskovec, and Charles Rosenberg. 2022. Pinner- Former: Sequence Modeling for User Representation at Pinterest.arXiv preprint arXiv:2205.04507(2022). arXiv:2205.04507 [cs.LG] doi:10.48550/arXiv.2205.04507

  13. [14]

    Shashank Rajput et al. 2023. Recommender Systems with Generative Retrieval. arXiv preprint arXiv:2305.05065(2023). arXiv:2305.05065 [cs.IR]

  14. [15]

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.arXiv preprint arXiv:2407.08608(2024). arXiv:2407.08608 [cs.LG] doi:10.48550/arXiv.2407.08608

  15. [16]

    Anima Singh, Trung Vu, Nikhil Mehta, Raghunandan Keshavan, Maheswaran Sathiamoorthy, Yilin Zheng, Lichan Hong, Lukasz Heldt, Li Wei, Devansh Tandon, et al. 2024. Better generalization with semantic ids: A case study in ranking for recommendations. InProceedings of the 18th ACM Conference on Recommender Systems. 1039–1044

  16. [17]

    Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. 2017. Hash embeddings for efficient word representations.Advances in neural information processing systems 30 (2017)

  17. [18]

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. 2024. Gated Delta Networks: Improving Mamba2 with Delta Rule.arXiv preprint arXiv:2412.06464(2024). arXiv:2412.06464 [cs.LG]

  18. [19]

    Jun Zhai et al . 2024. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations.arXiv preprint arXiv:2402.17152(2024). arXiv:2402.17152 [cs.IR]

  19. [20]

    Guorui Zhou et al. 2025. OneRec Technical Report.arXiv preprint arXiv:2506.13695 (2025). arXiv:2506.13695 [cs.IR]