pith. machine review for the scientific record.

arxiv: 2605.08044 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Fast Byte Latent Transformer

Artidoro Pagnoni, Christopher Potts, Gargi Ghosh, Julie Kallini, Luke Zettlemoyer, Srinivasan Iyer, Tomasz Limisiewicz, Xiaochuang Han

Pith reviewed 2026-05-11 02:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords byte-level language models · Byte Latent Transformer · diffusion models · speculative decoding · parallel generation · memory bandwidth · autoregressive generation

The pith

New diffusion and speculative-decoding techniques cut byte-level language models' estimated memory-bandwidth cost of generation by more than 50%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Byte-level language models match token-based performance without subword vocabularies but remain limited by slow byte-by-byte autoregressive generation. The paper introduces BLT Diffusion (BLT-D), trained with an auxiliary block-wise diffusion objective so that multiple bytes can be generated in parallel per decoding step. It adds BLT Self-speculation (BLT-S), in which the local decoder drafts bytes beyond patch boundaries for full-model verification, and BLT Diffusion+Verification (BLT-DV), which combines diffusion with an autoregressive check. These approaches reduce forward passes and yield an estimated memory-bandwidth cost over 50% lower than the baseline Byte Latent Transformer. A reader would care because the changes address the main practical barrier to deploying vocabulary-free byte models.
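To make the parallel-decoding mechanism concrete, here is a minimal sketch of block-wise diffusion generation in the spirit of BLT-D, assuming a masked-denoising formulation with a confidence-based commit schedule. The `model` interface, the `MASK` id, and the schedule are illustrative stand-ins, not the authors' implementation.

```python
import torch

MASK = 256  # hypothetical id for a [MASK] byte (real bytes occupy 0-255)

@torch.no_grad()
def generate_block(model, prefix, block_size=8, steps=2):
    """Fill one block of block_size bytes in `steps` parallel denoising steps."""
    block = torch.full((block_size,), MASK, dtype=torch.long)
    for _ in range(steps):
        # One forward pass scores every position of the block at once.
        logits = model(torch.cat([prefix, block]))[-block_size:]
        conf, pred = logits.softmax(-1).max(-1)
        masked = block == MASK
        if not masked.any():
            break
        conf = conf.masked_fill(~masked, -1.0)  # only commit masked slots
        k = min(max(1, block_size // steps), int(masked.sum()))
        commit = conf.topk(k).indices
        block[commit] = pred[commit]
    block[block == MASK] = pred[block == MASK]  # commit any leftovers
    return torch.cat([prefix, block])
```

Each loop iteration is one forward pass that fills several byte positions at once, which is the source of the claimed reduction in passes per generated byte.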

Core claim

The authors establish that BLT-D, trained with a block-wise diffusion objective alongside next-byte prediction, supports parallel byte production that cuts the number of forward passes, while BLT-S and BLT-DV add speculative drafting and verification steps; together the variants deliver estimated memory-bandwidth costs more than 50% below standard BLT on generation tasks, each with distinct speed-quality trade-offs.

What carries the argument

The block-wise diffusion objective that trains the model to generate byte blocks in parallel, combined with self-speculation drafts from the local decoder and single-pass verification to maintain quality.
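A minimal sketch of what such a two-part training signal could look like, assuming the auxiliary objective is a masked-denoising cross-entropy over corrupted positions; the masking scheme, `MASK` id, and `aux_weight` are assumptions, since the paper's exact block-wise formulation is not visible from the abstract.

```python
import torch
import torch.nn.functional as F

MASK = 256  # hypothetical [MASK] byte id

def combined_loss(model, byte_seq, aux_weight=0.5):
    # (1) Standard next-byte prediction: position t predicts byte t+1.
    ar_logits = model(byte_seq[:-1])
    ar_loss = F.cross_entropy(ar_logits, byte_seq[1:])

    # (2) Auxiliary diffusion objective: mask a random fraction of the
    # sequence and recover the masked bytes in parallel from the rest.
    rate = torch.empty(()).uniform_(0.1, 0.9)  # random masking rate
    mask = torch.rand(byte_seq.shape) < rate
    mask[0] = True  # guarantee at least one masked position
    corrupted = byte_seq.masked_fill(mask, MASK)
    diff_logits = model(corrupted)
    diff_loss = F.cross_entropy(diff_logits[mask], byte_seq[mask])

    return ar_loss + aux_weight * diff_loss
```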

Load-bearing premise

The auxiliary block-wise diffusion objective and speculative verification steps preserve generation quality while enabling parallel byte production.

What would settle it

A benchmark comparison that measures both generation-quality metrics and actual memory-bandwidth usage on held-out sequences would settle it: a measured cost reduction smaller than 50%, or quality degradation beyond small thresholds, would falsify the claim.
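A hypothetical falsification check in that spirit, using full-model forward passes as a bandwidth proxy; the thresholds and result tuples are placeholders, not the paper's evaluation protocol.

```python
# Hypothetical falsification harness. Each results tuple is
# (forward_passes, quality_score) measured on the same held-out set.

def falsifies_claim(results_baseline, results_variant,
                    min_saving=0.50, max_quality_drop=0.02):
    passes_b, qual_b = results_baseline
    passes_v, qual_v = results_variant
    saving = 1.0 - passes_v / passes_b          # proxy for bandwidth saving
    quality_drop = (qual_b - qual_v) / qual_b   # relative quality loss
    return saving < min_saving or quality_drop > max_quality_drop

# e.g. baseline emits 1 byte per pass; variant averages 4 bytes per pass
print(falsifies_claim((1000, 0.60), (250, 0.59)))  # -> False (claim survives)
```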

read the original abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
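The draft-then-verify pattern that BLT-S instantiates follows standard speculative decoding. Below is a minimal greedy sketch, with `draft_model` standing in for BLT's local decoder and `full_model` for the full model; both interfaces and the acceptance rule are generic assumptions, not the authors' code.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, full_model, prefix, n_draft=8):
    # The cheap drafter extends the prefix byte by byte.
    seq = prefix.clone()
    for _ in range(n_draft):
        nxt = draft_model(seq)[-1].argmax()
        seq = torch.cat([seq, nxt.view(1)])

    # A single full-model pass re-predicts every drafted position.
    full_pred = full_model(seq)[len(prefix) - 1 : -1].argmax(-1)
    draft = seq[len(prefix):]
    agree = (full_pred == draft).long().cumprod(0)  # 1s until first mismatch
    n_ok = int(agree.sum())

    accepted = draft[:n_ok]
    if n_ok < n_draft:
        # The full model's byte at the first mismatch comes free
        # from the same verification pass.
        accepted = torch.cat([accepted, full_pred[n_ok].view(1)])
    return torch.cat([prefix, accepted])
```

The payoff is that one full-model forward pass can accept several bytes, so the expensive model is consulted once per accepted run rather than once per byte.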

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes three extensions to the Byte Latent Transformer (BLT) to accelerate byte-level autoregressive generation: BLT-D, which augments next-byte prediction with a block-wise diffusion objective to enable parallel multi-byte sampling; BLT-S, which uses the local decoder for self-speculative drafting beyond patch boundaries followed by full-model verification; and BLT-DV, which combines diffusion generation with an autoregressive verification step. The central claim is that these methods can deliver an estimated memory-bandwidth cost more than 50% lower than baseline BLT on generation tasks while preserving quality.

Significance. If the empirical measurements and cost breakdowns confirm the claimed bandwidth reductions without hidden overheads from diffusion sampling or verification passes, the work would meaningfully address a key practical barrier for byte-level language models, providing concrete speed-quality trade-offs that could increase their adoption over subword-tokenized alternatives.

major comments (1)
  1. [Abstract] The headline claim that 'all methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT' is presented without any quantitative breakdown, activation-footprint accounting, or amortized cost per generated byte that includes the diffusion sampler's multiple steps and the additional full-model forward pass(es) required for verification in BLT-S and BLT-DV. Because memory-bandwidth cost is dominated by activation transfers per byte, this omission is load-bearing for the practical-utility argument.
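For scale, a back-of-envelope amortization of the kind this comment asks for, assuming bandwidth-bound decoding where each full forward pass streams the model weights once; every number below (model size, diffusion steps, draft cost, acceptance length) is an illustrative placeholder, not a measurement from the paper.

```python
def traffic_per_byte(weight_bytes, passes_per_step, bytes_per_step):
    """Weight traffic amortized per generated byte, assuming each full
    forward pass streams the model weights once (bandwidth-bound decode)."""
    return weight_bytes * passes_per_step / bytes_per_step

W = 8e9 * 2  # placeholder: an 8B-parameter model in bf16

baseline = traffic_per_byte(W, passes_per_step=1.0, bytes_per_step=1)  # byte-by-byte BLT
blt_d    = traffic_per_byte(W, passes_per_step=2.0, bytes_per_step=8)  # 2 diffusion steps per 8-byte block
blt_s    = traffic_per_byte(W, passes_per_step=1.2, bytes_per_step=5)  # 1 verify pass + ~0.2 for cheap drafts, ~5 bytes accepted

for name, cost in [("BLT", baseline), ("BLT-D", blt_d), ("BLT-S", blt_s)]:
    print(f"{name}: {cost / baseline:.0%} of baseline traffic")
```

Under these placeholder numbers both variants land well below half the baseline traffic, which is the shape of accounting the abstract's claim would need to make explicit.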

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the presentation of our cost claims. We address the major comment below and have revised the abstract accordingly.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that 'all methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT' is presented without any quantitative breakdown, activation-footprint accounting, or amortized cost per generated byte that includes the diffusion sampler's multiple steps and the additional full-model forward pass(es) required for verification in BLT-S and BLT-DV. Because memory-bandwidth cost is dominated by activation transfers per byte, this omission is load-bearing for the practical-utility argument.

    Authors: We agree that the abstract would benefit from a concise quantitative breakdown to support the headline claim. The detailed cost model, activation footprints, and amortized per-byte costs (accounting for diffusion sampling steps and verification passes) are already provided in Section 4 of the manuscript. In the revised version, we have updated the abstract to include a brief summary of these elements, clarifying that the net reduction in forward passes still yields over 50% memory-bandwidth savings after amortization. This change directly addresses the concern without altering the manuscript's core claims. revision: yes

Circularity Check

0 steps flagged

No equations or derivations reduce to fitted parameters; claims rest on estimates of new procedures

full rationale

The paper introduces novel training objectives (block-wise diffusion) and inference procedures (parallel byte generation, self-speculation, verification) for byte-level LMs without presenting any mathematical derivations, uniqueness theorems, or first-principles predictions. The central performance claim is explicitly an 'estimated' memory-bandwidth reduction rather than a computed result from equations or fitted inputs. No self-definitional loops, fitted-input-as-prediction, or ansatz-smuggling via self-citation appear in the abstract or described methods; prior BLT work serves only as baseline context. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone supplies no information on free parameters, background axioms, or invented entities; the work introduces new training objectives whose details are not visible from the abstract.

pith-pipeline@v0.9.0 · 5553 in / 868 out tokens · 29798 ms · 2026-05-11T02:17:56.066344+00:00 · methodology

