pith. machine review for the scientific record.

arxiv: 2604.18603 · v1 · submitted 2026-04-09 · 🧬 q-bio.QM · cs.LG

Recognition: unknown

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

Jason P. Gleghorn, Logan Hallee

Pith reviewed 2026-05-10 16:46 UTC · model grok-4.3

classification 🧬 q-bio.QM cs.LG
keywords dual triangle attention · bidirectional attention · triangular mask · positional inductive bias · masked language modeling · protein sequences · transformer · causal attention

The pith

Dual Triangle Attention splits each head into past and future triangular masks to give bidirectional models implicit positional information without embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that bidirectional attention can acquire positional order for free by dividing the query-key subspace of every head into two complementary triangular masks, one covering past-and-self positions and the other future-and-self positions. A reader would care because ordinary bidirectional attention is permutation-invariant and therefore needs separate positional embeddings, whereas causal attention obtains positional bias automatically from its mask; if the split works, models could drop the extra embeddings while still seeing both directions. The authors test this on a synthetic position probe where standard bidirectional attention fails but the new mechanism succeeds, and on masked language modeling for both English text and protein sequences.
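A compact way to see why the mask matters, in standard attention notation (ours, not the paper's): unmasked attention commutes with any reordering of its inputs, while a triangular mask is pinned to absolute indices.

    \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
    \qquad
    \mathrm{Attn}(PQ, PK, PV) = P\,\mathrm{Attn}(Q, K, V)
    \quad \text{for any permutation matrix } P.

Reordering the tokens therefore just reorders the outputs, and no position is distinguishable. A triangular mask M, with M_ij = -∞ for j > i, is tied to absolute indices and does not commute with P, which is exactly the asymmetry that gives causal attention its positional signal for free.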

Core claim

Dual Triangle Attention separates the query-key subspace of each attention head into two complementary triangular masks: one that attends to past-and-self positions and one that attends to future-and-self positions. This design supplies full bidirectional context while preserving the causal mask's implicit positional inductive bias in both directions. The mechanism is realized as a single compiled kernel call with no added parameters. On an argmax position probe, it learns positional information like causal attention does, unlike standard bidirectional attention. In masked language modeling on natural language and on protein sequences it performs competitively, with the strongest context-extension behavior coming from the variant paired with rotary positional embeddings (RoPE).

What carries the argument

Dual Triangle Attention, which partitions each attention head's query-key subspace into two complementary triangular masks to supply bidirectional context plus positional bias.
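The paper says only that this compiles to a single flex_attention kernel call; the sketch below is one plausible reconstruction using PyTorch's flex_attention API, assuming each logical head is handled as two half-width sub-heads (even index = past-and-self, odd index = future-and-self). How the authors treat V is not specified in the text above, so V is split the same way here for simplicity.

    import torch
    from torch.nn.attention.flex_attention import flex_attention, create_block_mask

    # Even sub-heads attend to past-and-self (j <= i); odd sub-heads to
    # future-and-self (j >= i). The diagonal is visible to both.
    def dual_triangle_mask(b, h, q_idx, kv_idx):
        past = kv_idx <= q_idx
        future = kv_idx >= q_idx
        return ((h % 2 == 0) & past) | ((h % 2 == 1) & future)

    B, n_heads, seq_len, d_head = 2, 8, 128, 64
    # Each logical head becomes two sub-heads of width d_head // 2, so the
    # mask split adds no parameters on top of ordinary multi-head attention.
    shape = (B, 2 * n_heads, seq_len, d_head // 2)
    q, k, v = torch.randn(shape), torch.randn(shape), torch.randn(shape)

    # flex_attention is meant to be torch.compile'd on GPU; device="cpu" here
    # is only so the shapes can be checked eagerly.
    block_mask = create_block_mask(dual_triangle_mask, B=None, H=2 * n_heads,
                                   Q_LEN=seq_len, KV_LEN=seq_len, device="cpu")
    out = flex_attention(q, k, v, block_mask=block_mask)

    # Merge each (past, future) sub-head pair back into one d_head-wide head.
    out = (out.view(B, n_heads, 2, seq_len, d_head // 2)
              .transpose(2, 3)
              .reshape(B, n_heads, seq_len, d_head))

Because the masks are fixed and the split reuses the existing QK projection, nothing here is learned beyond the standard attention weights, consistent with the no-added-parameters claim.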

If this is right

  • Models using Dual Triangle Attention learn positional information in synthetic argmax tasks without explicit embeddings, matching causal attention.
  • The mechanism supports masked language modeling on natural language with performance comparable to or better than baselines.
  • On protein sequences the same attention produces strong masked language modeling results.
  • Pairing Dual Triangle Attention with rotary embeddings yields the best observed context-extension behavior.
  • Implementation requires no additional learned parameters beyond ordinary multi-head attention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same mask-splitting idea could be tested in other biological sequence domains such as DNA or small-molecule strings.
  • Removing positional embedding layers might reduce memory use and simplify scaling for very long protein or genomic sequences.
  • Performance in open-ended generation rather than masked reconstruction remains an open question for this attention variant.

Load-bearing premise

That dividing the attention subspace into two fixed triangular masks will reliably inject enough positional signal for effective learning in bidirectional masked language modeling tasks.
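Written out, the two fixed masks and the precise sense in which they are complementary (our notation, a sketch rather than the paper's equations):

    M^{\downarrow}_{ij} = \mathbf{1}[\,j \le i\,], \qquad
    M^{\uparrow}_{ij} = \mathbf{1}[\,j \ge i\,],
    \qquad
    M^{\downarrow} \lor M^{\uparrow} = \mathbf{1}\mathbf{1}^{\top}, \qquad
    M^{\downarrow} \land M^{\uparrow} = I.

Their union covers every query-key pair and their overlap is exactly the shared diagonal, so each sub-head sees an index-asymmetric mask of the same kind that lets causal models learn position without embeddings.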

What would settle it

If Dual Triangle Attention models fail to learn accurate position predictions on the argmax probe while causal-attention models succeed, or if they underperform standard bidirectional attention plus embeddings on masked language modeling, the claim would be falsified.
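For a concrete picture of that test, a minimal argmax position probe could be built as follows (our construction; the paper's exact probe specification is not reproduced in this summary):

    import torch

    def make_argmax_probe(n_samples: int, seq_len: int, vocab_size: int):
        # Random token sequences; the label is the index of the largest token.
        x = torch.randint(0, vocab_size, (n_samples, seq_len))
        y = x.argmax(dim=-1)
        return x, y

    x, y = make_argmax_probe(n_samples=4, seq_len=16, vocab_size=1000)

    # Shuffling positions changes the answer, so a model whose output is
    # invariant to input order (bidirectional attention with no positional
    # signal) cannot beat chance, while causal or Dual Triangle models can.
    perm = torch.randperm(16)
    y_shuffled = x[:, perm].argmax(dim=-1)  # generally differs from y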

Figures

Figures reproduced from arXiv: 2604.18603 by Jason P. Gleghorn, Logan Hallee.

Figure 1: Argmax position probe accuracy across attention types, positional encoding strategies, and model configurations. Each cell shows the best evaluation accuracy (mean across three seeds) for a given combination of hidden size (columns) and number of layers (rows). Dual Triangle Attention and causal attention achieve high accuracy across nearly all configurations, including without explicit positional embeddin…

Figure 2: Argmax probe accuracy trend, varying number of layers and hidden size. (a) Without positional embeddings, Dual Triangle and causal attention learn the task, while bidirectional attention cannot. Error bars show 95% confidence intervals across three seeds. (b) At the largest model size (768 hidden, 12 layers), Dual Triangle Attention matches or exceeds both causal and bidirectional attention across all posi…

Figure 3: Masked language modeling on natural language (FineWeb-Edu). (a) Validation loss, accuracy, MCC, and F1 at training context length (256 tokens). (b) Validation loss, accuracy, MCC, and F1 at extended context length (1,024 tokens). Shaded regions represent ±1 standard deviation across three seeds. (c) Best validation loss, accuracy, MCC, and F1 at training context length. (d) Test loss, accuracy, MCC, and F1…

Figure 4: Masked language modeling on protein sequences (OMG-Prot50). (a) Validation loss, accuracy, MCC, and F1 at training context length (256 tokens). (b) Validation loss, accuracy, MCC, and F1 at extended context length (1,024 tokens). Shaded regions represent ±1 standard deviation across three seeds. (c) Best validation loss, accuracy, MCC, and F1 at training context length. (d) Test loss, accuracy, MCC, and F1…

Figure 5: DroPE recovery analysis. (a) NLP extended-context validation loss, accuracy, MCC, and F1 before and after dropping positional embeddings at 70% of training. (b) Protein extended-context validation loss, accuracy, MCC, and F1. The vertical dashed line marks the drop point. Shaded regions represent ±1 standard deviation across three seeds. (c) NLP final test loss, accuracy, MCC, and F1 comparing RoPE (kept t…

Figure 6: Dual Triangle Attention. (a) Each logical head's Q and K tensors (l × d_h) are split at d_h/2: the down half (Q↓, K↓, blue) computes the lower triangle (j ≤ i) and the up half (Q↑, K↑, orange) computes the upper triangle (j ≥ i). The diagonal (i = j) is shared by both sub-heads. (b) Attention mask comparison. Bidirectional attention permits all position pairs. Causal attention masks future tokens. Dual Triangl…

Figure 7: U-Net transformer architecture for masked language modeling. The encoder (n_layers/2 blocks) feeds forward through the left path, and the decoder (n_layers/2 blocks) ascends through the right path. Dashed arrows indicate skip connections with learnable scalar weights w_skip. Per-layer value embeddings ve(x) (nn.Embedding) from raw token indices are injected into each block. The original input representation…

Figure 8: Argmax control experiment with random labels. All attention types achieve near-chance accuracy when labels are randomized, confirming that high accuracy in the main experiments reflects genuine positional learning.
original abstract

Bidirectional transformers are the foundation of many sequence modeling tasks across natural, biological, and chemical language domains, but they are permutation-invariant without explicit positional embeddings. In contrast, unidirectional attention inherently encodes positional information through its triangular mask, enabling models to operate without positional embeddings altogether. Here, we introduce Dual Triangle Attention, a novel bidirectional attention mechanism that separates the query-key subspace of each attention head into two complementary triangular masks: one that attends to past-and-self positions and one that attends to future-and-self positions. This design provides bidirectional context while maintaining the causal mask's implicit positional inductive bias in both directions. Using PyTorch's flex_attention, Dual Triangle Attention is implemented as a single compiled kernel call with no additional parameters beyond standard multi-head attention. We evaluated Dual Triangle Attention across three settings: (1) a synthetic argmax position probe, (2) masked language modeling (MLM) on natural language, and (3) MLM on protein sequences. In the argmax task, both Dual Triangle Attention and causal attention learn positional information without explicit positional embeddings, whereas standard bidirectional attention cannot. In the MLM experiments, Dual Triangle Attention with Rotary Positional Embeddings (RoPE) achieved the best context extension performance and strong performance across the board. These findings suggest that Dual Triangle Attention is a viable attention mechanism for bidirectional transformers, with or without positional embeddings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Dual Triangle Attention, a bidirectional attention variant that partitions each head's query-key subspace into two complementary triangular masks (past-and-self and future-and-self) to supply bidirectional context while retaining the implicit positional inductive bias of causal attention. This allows bidirectional transformers to operate without explicit positional embeddings. The mechanism is implemented as a single flex_attention kernel call with no added parameters. Evaluation consists of a synthetic argmax position probe (where Dual Triangle and causal attention succeed but standard bidirectional fails) plus masked language modeling on natural-language and protein sequences (where the Dual Triangle + RoPE variant performs best on context extension).

Significance. If the central claim holds, the work offers a parameter-free architectural route to positional awareness in bidirectional models, which could simplify training in domains such as protein language modeling where explicit positional encodings are often costly or brittle. The efficient single-kernel implementation and the synthetic probe result are concrete strengths that would be valuable if replicated with full quantitative reporting.

major comments (3)
  1. [Abstract / MLM experiments] Abstract and MLM-experiments paragraph: the claim that Dual Triangle Attention works 'with or without positional embeddings' on practical MLM tasks is unsupported because only the Dual Triangle + RoPE variant is reported as achieving best context-extension performance; no metrics, ablations, or even qualitative statements are given for the no-PE Dual Triangle model on the natural-language or protein MLM benchmarks. This is load-bearing for the central 'viable with or without' assertion.
  2. [Abstract] Abstract: the positive results on the argmax probe and MLM tasks are stated without any quantitative metrics, standard deviations, ablation tables, or implementation hyperparameters, leaving the magnitude and reliability of the claimed improvements impossible to assess from the provided text.
  3. [Method / Implementation] Implementation description: although the paper states that Dual Triangle Attention is realized as a single compiled flex_attention kernel, no pseudocode, mask-construction equations, or verification that the two triangular masks are applied to disjoint subspaces of the same QK projection are supplied, making reproducibility and correctness verification difficult.
minor comments (1)
  1. [Abstract] The abstract refers to 'three settings' but only describes the argmax probe and two MLM tasks; a brief enumeration of the exact datasets or sequence lengths used would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and valuable feedback. We have revised the manuscript to address the concerns on clarity, quantitative reporting, and implementation details. Point-by-point responses follow.

point-by-point responses
  1. Referee: The claim that Dual Triangle Attention works 'with or without positional embeddings' on practical MLM tasks is unsupported because only the Dual Triangle + RoPE variant is reported as achieving best context-extension performance; no metrics, ablations, or even qualitative statements are given for the no-PE Dual Triangle model on the natural-language or protein MLM benchmarks.

    Authors: We agree the abstract phrasing could be clarified. The MLM results emphasize the RoPE variant for best performance, while no-PE viability is shown via the synthetic probe and mechanism design. We have revised the abstract to qualify the statement accurately and added qualitative discussion plus metrics for the no-PE variant on MLM benchmarks in the experiments section. revision: yes

  2. Referee: The positive results on the argmax probe and MLM tasks are stated without any quantitative metrics, standard deviations, ablation tables, or implementation hyperparameters.

    Authors: We concur that the abstract benefits from more specifics. We have updated it to include key metrics (e.g., probe accuracy, MLM perplexity), references to full tables with standard deviations, and hyperparameters from the main text and appendix. revision: yes

  3. Referee: No pseudocode, mask-construction equations, or verification that the two triangular masks are applied to disjoint subspaces of the same QK projection are supplied.

    Authors: We have expanded the Methods section with pseudocode for the single flex_attention kernel, explicit equations for the complementary triangular masks, and verification that they partition the QK subspace disjointly per head. This improves reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: direct architectural definition with empirical support

full rationale

The paper defines Dual Triangle Attention directly as a split of the query-key subspace into complementary triangular masks, implemented via a single flex_attention kernel with no added parameters. No equations, derivations, or performance claims reduce by construction to fitted parameters, self-referential definitions, or a chain of self-citations. The argmax probe and MLM evaluations are presented as independent empirical tests rather than tautological restatements of the mechanism. The central claim of bidirectional context with preserved positional bias is therefore self-contained and does not rely on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard transformer attention assumptions plus the empirical observation that causal triangular masks confer positional bias; no free parameters or new physical entities are introduced in the abstract.

axioms (2)
  • standard math Standard multi-head attention formulation with masking
    The paper builds directly on existing attention mechanisms without re-deriving them.
  • domain assumption Triangular causal masks implicitly encode positional information
    Invoked to justify why the dual masks will preserve positional inductive bias.
invented entities (1)
  • Dual Triangle Attention no independent evidence
    purpose: Bidirectional attention mechanism with implicit positional bias
    Newly defined attention variant whose behavior is validated only through the described experiments.

pith-pipeline@v0.9.0 · 5547 in / 1339 out tokens · 40056 ms · 2026-05-10T16:46:55.855517+00:00 · methodology

