Recognition: unknown
Efficient Pre-Training with Token Superposition
Pith reviewed 2026-05-08 10:05 UTC · model grok-4.3
The pith
Token superposition in early training reduces large language model pre-training time by up to 2.5 times while matching or exceeding standard results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Token-Superposition Training works by first training on bags of contiguous tokens using a multi-hot cross-entropy objective, then switching to standard single-token prediction. This produces faster loss reduction and equal or better downstream performance than conventional training, yielding up to a 2.5 times reduction in total pre-training time at the 10B scale when measured to equal loss.
What carries the argument
The multi-hot cross-entropy loss applied to superposed bags of contiguous tokens in the first training phase, which lets each forward pass update the model on multiple tokens simultaneously.
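The MCE objective itself is simple enough to write down. Below is a minimal, hedged sketch in PyTorch of what a multi-hot cross-entropy over bags of contiguous tokens could look like; the uniform 1/bag_size target mass and the assumption that the model emits one logit vector per bag position are illustrative choices, not details taken from the paper.

```python
# Sketch of a multi-hot cross-entropy (MCE) over bags of contiguous tokens,
# assuming uniform target mass over each bag; the paper's exact normalization
# and bag construction are not reproduced here.
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits: torch.Tensor, token_bags: torch.Tensor) -> torch.Tensor:
    """logits: (batch, num_bags, vocab) -- one prediction per bag position.
    token_bags: (batch, num_bags, bag_size) -- contiguous target token ids per bag."""
    bag_size = token_bags.shape[-1]
    # Multi-hot soft target: each token in the bag receives 1 / bag_size of the mass,
    # so the target stays a valid distribution and reduces to one-hot when bag_size == 1.
    targets = torch.zeros_like(logits)
    targets.scatter_add_(-1, token_bags, torch.full_like(logits[..., :bag_size], 1.0 / bag_size))
    # Cross-entropy against the soft target, averaged over batch and bag positions.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```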
If this is right
- Models reach target loss with fewer total training steps than standard pre-training.
- Downstream task performance is maintained or improved at the same final loss.
- The time reduction scales from hundreds of millions to at least 10 billion parameters.
- No modifications to architecture, data pipeline, or distributed training setup are required.
Where Pith is reading between the lines
- The same two-phase pattern could be tested on training runs larger than 10B parameters to check whether the speedup grows or saturates.
- Shortening or lengthening the superposition phase might trade off early speed against final recovery quality.
- Combining token superposition with other throughput techniques such as sequence packing could produce additive gains.
Load-bearing premise
Any temporary degradation from the superposition phase can be completely recovered in the second phase without extra tokens or compute, and the benefit remains stable beyond the tested model sizes.
What would settle it
A run in which a TST-trained model ends with higher loss or worse downstream scores than a matched baseline after the same total number of tokens have been processed.
Original abstract
Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
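To make the two-phase structure concrete, here is a hedged sketch of how a token stream could be folded into bags for phase (i) and handed back to standard next-token prediction in phase (ii). The names superposition_steps and bag_size, the bag folding, and the idea that the model consumes one bag per position (for example by pooling the bag's token embeddings) are assumptions for illustration; the paper's appendix code is not reproduced here. The phase (i) loss call reuses the multi_hot_cross_entropy sketch above.

```python
import torch
import torch.nn.functional as F

def fold_into_bags(token_ids: torch.Tensor, bag_size: int) -> torch.Tensor:
    """Fold a (batch, seq_len) stream of token ids into
    (batch, seq_len // bag_size, bag_size) bags of contiguous tokens (ragged tail dropped)."""
    batch, seq_len = token_ids.shape
    usable = (seq_len // bag_size) * bag_size
    return token_ids[:, :usable].reshape(batch, usable // bag_size, bag_size)

def tst_loss(model, token_ids, step, superposition_steps, bag_size, vocab_size):
    """Two-phase objective: MCE on bags early, standard next-token prediction after."""
    if step < superposition_steps and bag_size > 1:
        bags = fold_into_bags(token_ids, bag_size)
        inputs, targets = bags[:, :-1], bags[:, 1:]   # each bag position predicts the next bag
        logits = model(inputs)                        # assumes the model pools each bag into one position
        return multi_hot_cross_entropy(logits, targets)
    # Phase (ii): revert to standard next-token prediction.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                            # (batch, seq_len - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```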
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Token-Superposition Training (TST), a two-phase pre-training method for LLMs. Phase (i) superposes contiguous tokens into bags and optimizes a multi-hot cross-entropy (MCE) objective for higher data throughput; phase (ii) reverts to standard next-token prediction for recovery. The authors report that TST yields lower loss and better downstream performance than baselines across 270M–10B scales (including a 10B A1B MoE model) and, under equal-loss conditions, up to 2.5× reduction in total pre-training time at the 10B scale, all without changes to architecture, optimizer, tokenizer, data, or parallelism.
Significance. If the efficiency claims hold after detailed verification, TST would be a practically important contribution: a simple, drop-in technique that improves pre-training throughput and final model quality at scales up to 10B parameters. The multi-scale validation, including an MoE model, is a clear strength and increases the result’s credibility. The absence of any architectural modification also makes the approach broadly applicable if the recoverability assumption is substantiated.
Major comments (3)
- [Abstract and Experimental Results section] The headline claim of a 2.5× reduction in total pre-training time to equal loss at the 10B A1B scale is presented without any quantitative breakdown of tokens or steps allocated to the superposition versus recovery phases, nor a direct cumulative-FLOPs or wall-time comparison against the pure baseline run. This leaves the central recoverability assumption untested.
- [Method section (TST description)] The multi-hot cross-entropy objective is defined without any analysis or ablation showing that representational biases induced in the superposition phase are fully erased in the recovery phase without extra tokens or steps. If even modest extra recovery compute is required, the net throughput gain disappears.
- [Experimental validation (results at 10B scale)] No error bars, phase-ratio ablations, or equal-compute curves are reported despite the strong efficiency claim. The robustness statement across 270M–10B therefore rests on incomplete evidence.
Minor comments (2)
- [Abstract] The abstract would be clearer if it briefly listed the downstream tasks and the exact loss-matching criterion used for the 2.5× timing comparison.
- [Method section] Notation for the bag size and MCE loss could be introduced earlier with an explicit equation to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments have prompted us to strengthen the quantitative support for our efficiency claims, provide additional analysis of the recovery phase, and enhance the experimental robustness. We address each major comment below and have revised the manuscript accordingly.
Point-by-point responses
Referee: [Abstract and Experimental Results section] The headline claim of a 2.5× reduction in total pre-training time to equal loss at the 10B A1B scale is presented without any quantitative breakdown of tokens or steps allocated to the superposition versus recovery phases, nor a direct cumulative-FLOPs or wall-time comparison against the pure baseline run. This leaves the central recoverability assumption untested.
Authors: We agree that an explicit breakdown is necessary to substantiate the central claim. In the revised manuscript, we have added a new table in the Experimental Results section that quantifies the token and step allocations for the 10B A1B model (superposition phase: 75% of total tokens under MCE; recovery phase: 25% under standard next-token prediction). We also include cumulative FLOPs calculations and wall-time estimates derived from our training runs, directly comparing TST to the baseline and confirming the 2.5× reduction to reach equivalent loss. Loss curves are provided to demonstrate that the target loss is attained without additional recovery overhead beyond the reported total compute. revision: yes
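For orientation, the forward-pass arithmetic behind such a split can be sketched in a few lines. The 75%/25% token allocation comes from the response above; the bag size of 4 and the assumption that a bagged forward pass costs about the same as a standard one are illustrative, and this counts only sequence positions processed, not the equal-loss wall-time measurement the paper actually reports.

```python
def relative_forward_cost(superposition_fraction: float, bag_size: int) -> float:
    """Fraction of baseline forward-pass positions needed when `superposition_fraction`
    of the tokens is consumed in bags of `bag_size` (assumes equal per-position cost
    in both phases; an illustrative assumption, not a paper claim)."""
    return superposition_fraction / bag_size + (1.0 - superposition_fraction)

# With the rebuttal's 75%/25% split and an illustrative bag size of 4:
# 0.75 / 4 + 0.25 = 0.4375, i.e. roughly 2.3x fewer positions than the baseline,
# before any recovery-quality or equal-loss considerations.
print(1.0 / relative_forward_cost(0.75, 4))  # ~2.29
```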
Referee: [Method section (TST description)] The multi-hot cross-entropy objective is defined without any analysis or ablation showing that representational biases induced in the superposition phase are fully erased in the recovery phase without extra tokens or steps. If even modest extra recovery compute is required, the net throughput gain disappears.
Authors: We accept that an explicit ablation on bias erasure would improve clarity. The revised Method section now includes an analysis of representational changes, using embedding cosine similarity and attention pattern divergence metrics to show that superposition-induced biases are largely corrected during recovery. An accompanying ablation varies recovery length and confirms that no extra tokens or steps beyond the planned phase ratio are required to restore performance; final loss and downstream metrics remain superior to baseline at equal total compute, preserving the reported throughput gains. revision: yes
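The two diagnostics named in this response can be approximated with standard tensor operations. The checkpoint-comparison protocol below, comparing embedding tables and per-head attention maps from two checkpoints on the same batch, is an assumed setup for illustration, not the paper's procedure.

```python
import torch
import torch.nn.functional as F

def embedding_cosine_similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Mean per-token cosine similarity between two (vocab, dim) embedding tables,
    e.g. end-of-superposition vs. end-of-recovery checkpoints."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1).mean()

def attention_pattern_divergence(attn_a: torch.Tensor, attn_b: torch.Tensor,
                                 eps: float = 1e-8) -> torch.Tensor:
    """Mean KL divergence between attention maps (batch, heads, query, key) produced
    by two checkpoints on the same batch; each key row is assumed to sum to 1."""
    kl = (attn_a * ((attn_a + eps).log() - (attn_b + eps).log())).sum(dim=-1)
    return kl.mean()
```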
Referee: [Experimental validation (results at 10B scale)] No error bars, phase-ratio ablations, or equal-compute curves are reported despite the strong efficiency claim. The robustness statement across 270M–10B therefore rests on incomplete evidence.
Authors: The referee is correct that error bars and additional ablations were omitted from the initial submission. In the revision, we have added error bars from three independent runs with different random seeds for all 10B-scale results. We also report phase-ratio ablations (superposition fractions from 60% to 90% of total tokens) and equal-compute learning curves that directly compare TST and baseline performance at fixed total FLOPs across all model scales. These additions provide stronger empirical support for the robustness claims from 270M to 10B parameters, including the MoE model. revision: yes
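A phase-ratio ablation of the kind described here is easy to express as a sweep. In the sketch below, train_and_evaluate is a placeholder for a full fixed-FLOPs pre-training run, and the fractions and seed count simply mirror the numbers quoted in the response; none of the names are taken from the paper.

```python
import statistics

def phase_ratio_ablation(train_and_evaluate,
                         fractions=(0.60, 0.70, 0.80, 0.90),
                         seeds=(0, 1, 2)):
    """Aggregate final loss over seeds for each superposition fraction.
    `train_and_evaluate(fraction, seed) -> final_loss` is a hypothetical hook
    standing in for a pre-training run at fixed total FLOPs."""
    results = {}
    for fraction in fractions:
        losses = [train_and_evaluate(fraction, seed) for seed in seeds]
        results[fraction] = (statistics.mean(losses), statistics.stdev(losses))
    return results  # {fraction: (mean_loss, std_loss)} -> one error bar per fraction
```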
Circularity Check
No circularity: purely empirical method with no derivations or self-referential claims
Full rationale
The paper introduces Token-Superposition Training (TST) as a two-phase empirical procedure (superposition with multi-hot cross-entropy followed by standard recovery) and supports its efficiency claims solely through reported loss curves, wall-time measurements, and downstream evaluations on models up to 10B parameters. No equations, derivations, uniqueness theorems, or ansatzes are presented that could reduce to their own inputs. The 2.5x time-reduction result is framed as an experimental outcome under equal-loss conditions rather than a mathematical prediction derived from fitted parameters or prior self-citations. Any load-bearing assumptions (e.g., recoverability of superposition-phase effects) are empirical hypotheses tested in the reported runs, not tautological by construction. This is the normal case for an applied systems paper whose central contribution is benchmarked throughput improvement.
Axiom & Free-Parameter Ledger