CoFrGeNet: Continued Fraction Architectures for Language Generation

Amit Dhurandhar; Dennis Wei; Karthikeyan Natesan Ramamurthy; Rahul Nair; Tejaswini Pedapati; Vijil Chenthamarakshan

arxiv: 2601.21766 · v4 · pith:23ZEPV2Ynew · submitted 2026-01-29 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-25 07:24 UTC · model grok-4.3

keywords continued fractions · transformer architectures · parameter efficiency · language generation · attention replacement · generative networks · pre-training

The pith

Continued-fraction components can replace the attention and feed-forward layers in transformers while using one-half to two-thirds of the parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoFrGeNets, a family of architectures based on a continued-fraction function class for language generation. It replaces the multi-head attention and feed-forward blocks inside transformers with new components that need far fewer parameters. The approach is tested by modifying GPT2-xl and Llama3, pre-training the smaller versions on large text collections, and measuring results on classification, question answering, reasoning, and text understanding tasks. The modified models stay competitive with, or better than, the originals despite the smaller size and shorter training time. The design works as a drop-in replacement that needs almost no changes to existing training pipelines.

Core claim

CoFrGeNets implement a continued-fraction function class whose architectural components substitute directly for multi-head attention and feed-forward networks inside transformer blocks. Custom gradient rules allow more accurate optimization of these components. When the replacements are applied to GPT2-xl and Llama3 and the resulting models are pre-trained on OpenWebText, GneissWeb, or the docling mix, they reach performance on downstream tasks that is competitive with or exceeds the original models while using only two-thirds to one-half the parameters and less pre-training time.

What carries the argument

Continued-fraction components that substitute for multi-head attention and feed-forward networks inside each transformer block.

If this is right

Models with two-thirds to half the original parameter count reach competitive or superior accuracy on classification, Q&A, reasoning, and text-understanding tasks.
Pre-training finishes in less wall-clock time while using the same data mixes.
The new blocks plug into existing transformer code with minimal changes to training or inference routines.
The same replacement works across different base architectures, as shown on both GPT2-xl and Llama3.
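The parameter range quoted above can be made concrete with a little arithmetic. The sketch below is only illustrative: it applies the one-half to two-thirds range to the baseline sizes quoted in the abstract (1.5B and 3.2B), and the helper name `reduced_range` is hypothetical. The exact sizes of the CoFrGeNet variants are not stated on this page.

```python
from fractions import Fraction

# Baseline sizes quoted in the abstract, in billions of parameters.
BASELINES = {"GPT2-xl": Fraction(15, 10), "Llama3": Fraction(32, 10)}


def reduced_range(baseline_b):
    """Return (low, high) sizes for one-half and two-thirds of the baseline."""
    return baseline_b * Fraction(1, 2), baseline_b * Fraction(2, 3)


for name, size in BASELINES.items():
    low, high = reduced_range(size)
    print(f"{name}: {float(low):.2f}B to {float(high):.2f}B parameters")
```

Under these assumptions the GPT2-xl variants would fall between 0.75B and 1.00B parameters and the Llama3 variants between 1.60B and roughly 2.13B.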

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Hardware-specific kernels for the continued-fraction blocks could widen the efficiency gap beyond what software-only experiments show.
The same substitution pattern might be tried in non-transformer sequence models such as state-space or recurrent architectures.
Smaller parameter counts could make it practical to train and serve capable language models on more modest compute clusters.

Load-bearing premise

The continued-fraction components preserve enough modeling capacity to stand in for multi-head attention and feed-forward networks across the full range of language-generation tasks.

What would settle it

A head-to-head run in which a CoFrGeNet-modified transformer is pre-trained on the same data as the baseline and then scores substantially lower on a standard downstream suite such as GLUE or a reasoning benchmark.

Figures

Figures reproduced from arXiv: 2601.21766 by Amit Dhurandhar, Dennis Wei, Karthikeyan Natesan Ramamurthy, Rahul Nair, Tejaswini Pedapati, Vijil Chenthamarakshan.

**Figure 2.** Two CoFrNet architectures to simulate attention, a.k.a. causal token-token mixing. For…

**Figure 3.** CoFrNet architecture simulating FFNs – Cffn – in a transformer block. We create a gated non-expanded (i.e. α = 1) representation that we pass to the CoFrNet ladders. No transpose is taken and hence feature mixing in either direction does not interfere with causal generation which is why we have a linear layer on top. Again the collapsed implementation is described in section 4.2. For FFNs we simply require…

**Figure 4.** Architecture for implementing a linear combination of CoFrNet ladders (CF stands for continued fraction). To take advantage of Proposition 1, we implement the CF layer in…

**Figure 5.** GPT2-xl example generation when pre-trained on OWT.

**Figure 6.** CoFrGeNet-F example generation when pre-trained on OWT.

**Figure 7.** CoFrGeNet-A example generation when pre-trained on OWT.

**Figure 8.** CoFrGeNet example generation when pre-trained on OWT.

**Figure 9.** GPT2-xl example generation when pre-trained on GneissWeb.

**Figure 10.** CoFrGeNet-F example generation when pre-trained on GneissWeb.

**Figure 11.** CoFrGeNet-A example generation when pre-trained on GneissWeb.

**Figure 12.** CoFrGeNet example generation when pre-trained on GneissWeb.

**Figure 13.** Validation loss of the different GPT2-xl variants on OWT as a function of training steps.

Original abstract

Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring much fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different transformer architectures GPT2-xl (1.5B) and Llama3 (3.2B), where the former we pre-train on OpenWebText and GneissWeb, while the latter we pre-train on the docling data mix which consists of nine different datasets. Results show that the performance on downstream classification, Q&A, reasoning and text understanding tasks of our models is competitive and sometimes even superior to the original models with $\frac{2}{3}$ to $\frac{1}{2}$ the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The continued-fraction replacements are claimed to give competitive performance at one-half to two-thirds of the parameters, but mismatched pre-training data between CoFrGeNet and the baselines prevents crediting the architecture.

The one thing to know is that the data used for the new models differs from the originals, which undercuts the main claim. GPT2-xl CoFrGeNet was pre-trained on OpenWebText plus GneissWeb while the reference GPT2 used WebText; the Llama3 variant used a nine-dataset mix instead of the original larger corpus. Without matched-data runs, any parity in downstream results could trace to data volume or distribution rather than the continued-fraction layers preserving capacity for attention and FFN replacement.

Referee Report

2 major / 1 minor

Summary. The paper introduces CoFrGeNet, a family of architectures based on a continued-fraction function class whose components replace multi-head attention and feed-forward networks inside Transformer blocks. The central claim is that these replacements are plug-in compatible, admit custom gradients, require 1/2–2/3 the parameters of the original blocks, and yield competitive or superior downstream performance on classification, QA, reasoning and text-understanding tasks after pre-training GPT2-xl (1.5 B) on OpenWebText+GneissWeb and a Llama3-scale model (3.2 B) on a nine-dataset docling mix, with shorter pre-training time than the reference models.

Significance. If the capacity-preservation claim can be isolated from data-distribution effects, the work would supply a novel, parameter-efficient function class for generative modeling that could be adopted with minimal disruption to existing Transformer training pipelines. The explicit provision of custom gradient formulations and the demonstration on two architecturally dissimilar base models are positive features.

major comments (2)

[Abstract, §4] Abstract (final paragraph) and §4 (experimental setup): the reported performance parity is obtained after pre-training the CoFrGeNet GPT2-xl variant on OpenWebText+GneissWeb while the reference GPT2-xl was trained on WebText, and the Llama3 variant on the docling nine-dataset mix while the reference Llama3 used its own substantially larger corpus. No matched-data ablation or data-volume normalization is described, so the results do not isolate the effect of the continued-fraction substitution from differences in pre-training distribution or volume. This directly undermines the claim that the new components “preserve sufficient modeling capacity.”
[§3, §5] §3 (architectural definition) and §5 (gradient derivation): the manuscript states that custom gradient formulations are derived for the continued-fraction components, yet no explicit equations for the forward pass, the custom backward pass, or the parameter count reduction are supplied in the sections that would allow a reader to verify that the claimed 1/2–2/3 parameter reduction is achieved without loss of expressivity. The absence of these derivations makes it impossible to assess whether the substitution is mathematically well-founded or merely an empirical fit.

minor comments (1)

[Abstract] The abstract refers to “GneissWeb” and “docling data mix” without a citation or brief description of their composition, size, or overlap with the reference corpora; a one-sentence footnote or table entry would clarify the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract, §4] Abstract (final paragraph) and §4 (experimental setup): the reported performance parity is obtained after pre-training the CoFrGeNet GPT2-xl variant on OpenWebText+GneissWeb while the reference GPT2-xl was trained on WebText, and the Llama3 variant on the docling nine-dataset mix while the reference Llama3 used its own substantially larger corpus. No matched-data ablation or data-volume normalization is described, so the results do not isolate the effect of the continued-fraction substitution from differences in pre-training distribution or volume. This directly undermines the claim that the new components “preserve sufficient modeling capacity.”

Authors: We agree that the differing pre-training corpora constitute a confound that prevents full isolation of the architectural effect. The manuscript already states the datasets used (OpenWebText+GneissWeb for the GPT2-xl variant and the nine-dataset docling mix for the Llama3-scale model), but does not contain a matched-data ablation. In the revised manuscript we will add an explicit limitations paragraph in §4 and moderate the capacity-preservation language in the abstract and conclusion to reflect this limitation. We retain the claim that the components are practically viable under the reported training regimes. revision: yes
Referee: [§3, §5] §3 (architectural definition) and §5 (gradient derivation): the manuscript states that custom gradient formulations are derived for the continued-fraction components, yet no explicit equations for the forward pass, the custom backward pass, or the parameter count reduction are supplied in the sections that would allow a reader to verify that the claimed 1/2–2/3 parameter reduction is achieved without loss of expressivity. The absence of these derivations makes it impossible to assess whether the substitution is mathematically well-founded or merely an empirical fit.

Authors: We acknowledge that the explicit forward-pass, custom backward-pass, and parameter-count equations are not presented with sufficient detail in the main text. In the revised manuscript we will expand §3 and §5 to include the complete mathematical derivations, the custom gradient expressions, and the step-by-step parameter-count calculations that establish the ½–⅔ reduction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical performance claims rest on external benchmarks rather than self-referential fits or derivations.

The paper proposes a continued-fraction-inspired function class and reports that CoFrGeNet variants achieve competitive downstream results versus GPT2-xl and Llama3 baselines at reduced parameter counts. No equations, gradient derivations, or architectural substitutions are shown to reduce by construction to quantities fitted from the same evaluation data. The central claim is an empirical comparison against independently trained reference models; any data-distribution differences affect evidential strength but do not create a definitional or self-citation loop within the derivation itself. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone supplies no explicit free parameters, axioms, or invented entities, so none can be extracted.

pith-pipeline@v0.9.0 · 5806 in / 1044 out tokens · 37087 ms · 2026-05-25T07:24:35.519728+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J uniqueness) matches

MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

canonical form ... $a_0 + \frac{1}{a_1 + \frac{1}{a_2 + \cdots}}$ ... reciprocal of the function thus far is applied as a nonlinearity in each layer ... $w_0 x + \frac{1}{w_1 x + \frac{1}{w_2 x + \cdots}}$
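The quoted form can be evaluated from the innermost term outward, taking a reciprocal at each level. The following is a minimal sketch under stated assumptions, not the authors' implementation: each coefficient is taken to be an inner product of a weight vector with the input, and the clamped reciprocal `safe_reciprocal` is an assumed safeguard against division by values near zero. The names `dot`, `safe_reciprocal`, and `ladder` are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def safe_reciprocal(z, eps=1e-6):
    """Reciprocal with |z| clamped to at least eps (assumed safeguard against poles)."""
    if abs(z) < eps:
        z = eps if z >= 0 else -eps
    return 1.0 / z


def ladder(weights, x, eps=1e-6):
    """Evaluate w0.x + 1/(w1.x + 1/(w2.x + ...)) from the innermost term outward."""
    terms = [dot(w, x) for w in weights]  # a_k = w_k . x
    value = terms[-1]
    for a in reversed(terms[:-1]):
        # The reciprocal of the value so far acts as the nonlinearity at this level.
        value = a + safe_reciprocal(value, eps)
    return value


x = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical depth-2 ladder
print(ladder(weights, x))
```

Here the three coefficients are 1, 2, and 3, so the ladder evaluates 1 + 1/(2 + 1/3), which is approximately 1.4286.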
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero / J_uniquely_calibrated_via_higher_derivative echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

continuants $K_0 = 1$, $K_1 = a_d$, $K_k = a_{d-k+1} K_{k-1} + K_{k-2}$ ... $\partial \tilde{f} / \partial a_k = (-1)^k (K_{d-k}/K_d)^2$
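The continuant recurrence and the closed-form gradient in the passage above can be checked numerically. The sketch below assumes that $\tilde{f}$ denotes the ratio $K_{d-1}/K_d$, that is, the fraction $1/(a_1 + 1/(a_2 + \cdots + 1/a_d))$; the quoted passage elides the definition, so this reading is an assumption, and all function names are hypothetical. It compares the closed form against central finite differences.

```python
def continuants(a):
    """Given a = [a_1, ..., a_d], return K where K[k] is built from the last k coefficients."""
    d = len(a)
    K = [1.0, a[-1]]  # K_0 = 1, K_1 = a_d
    for k in range(2, d + 1):
        # K_k = a_{d-k+1} * K_{k-1} + K_{k-2}; a_{d-k+1} is a[d-k] with 0-based indexing.
        K.append(a[d - k] * K[k - 1] + K[k - 2])
    return K


def f_tilde(a):
    """Assumed definition: K_{d-1} / K_d = 1/(a_1 + 1/(a_2 + ... + 1/a_d))."""
    K = continuants(a)
    return K[-2] / K[-1]


def grad_closed_form(a):
    """Closed form from the passage: d f_tilde / d a_k = (-1)^k (K_{d-k} / K_d)^2."""
    K = continuants(a)
    d = len(a)
    return [(-1) ** k * (K[d - k] / K[d]) ** 2 for k in range(1, d + 1)]


def grad_numeric(a, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for i in range(len(a)):
        up = a[:i] + [a[i] + h] + a[i + 1:]
        down = a[:i] + [a[i] - h] + a[i + 1:]
        grads.append((f_tilde(up) - f_tilde(down)) / (2 * h))
    return grads


a = [1.5, 2.0, 0.7, 3.0]  # arbitrary positive coefficients, chosen to stay away from poles
for closed, numeric in zip(grad_closed_form(a), grad_numeric(a)):
    print(f"{closed:+.8f}  {numeric:+.8f}")
```

The closed form needs only one pass of the recurrence to obtain every partial derivative, which is presumably part of the appeal of a custom backward pass over generic automatic differentiation.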
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection (bilinear branch forced by coupling combiner) refines

Relation between the paper passage and the cited Recognition theorem.

depth d and number of ladders L ... parameter savings ... no expansion (α=1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 3 internal anchors

[1] Winogrande: An adversarial winograd schema challenge at scale. 2019
[2] L. B. Allal, A. Lozhkov, E. Bakouch, G. M. Blázquez, G. Penedo, L. Tunstall, A. Marafioti, H. Kydlíček, A. P. Lajarín, V. Srivastav, J. Lochner, C. Fahlgren, X.-S. Nguyen, C. Fourrier, B. Burtenshaw, H. Larcher, H. Zhao, C. Zakka, M. Morlon, C. Raffel, L. von Werra, and T. Wolf. Smollm2: When smol goes big – data-centric training of a small language ...
[3] L. Ben Allal, A. Lozhkov, G. Penedo, T. Wolf, and L. von Werra. Cosmopedia, 2024
[4] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020
[5] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. In H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie, editors, 15th Annual Conference of the International Speech Communication Association, INTERSPEECH 2014, Singapore, September 14-18, 2014, page...
[6] C. Christopher, L. Kenton, C. Ming-Wei, K. Tom, C. Michael, and T. Kristina. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019
[7] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024
[8] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019
[10] Z. Fu, W. Song, Y. Wang, X. Wu, Y. Zheng, Y. Zhang, D. Xu, X. Wei, T. Xu, and X. Zhao. Sliding window attention training for efficient large language models, 2025
[11] A. Gadhikar, S. K. Majumdar, N. Popp, P. Saranrittichai, M. Rapp, and L. Schott. Attention is all you need for mixture-of-depths routing, 2024
[12] H. E. Gohari, S. R. Kadhe, S. Y. S. C. Adam, A. Adebayo, P. Adusumilli, F. Ahmed, N. B. Angel, S. Borse, Y.-C. Chang, X.-H. Dang, N. Desai, R. Eres, R. Iwamoto, A. Karve, Y. Koyfman, W.-H. Lee, C. Liu, B. Lublinsky, T. Ohko, P. Pesce, M. Touma, S. Wang, S. Witherspoon, H. Woisetschlager, D. Wood, K.-L. Wu, I. Yoshida, S. Zawad, P. Zerfos, Y. Zhou, and...
[13] A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019
[14] N. Graef and A. Wasielewski. Slim attention: cut your context memory in half without loss – k-cache is all you need for mha, 2025
[15] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024
[16] A. Gu, K. Goel, and C. Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022
[17] X. Han, Y. Jian, X. Hu, H. Liu, Y. Wang, Q. Fan, Y. Ai, H. Huang, R. He, Z. Yang, and Q. You. Infimm-webmath-40b: Advancing multimodal pre-training for enhanced mathematical reasoning, 2024
[18] S. Huang, T. Cheng, J. K. Liu, J. Hao, L. Song, Y. Xu, J. Yang, J. H. Liu, C. Zhang, L. Chai, R. Yuan, Z. Zhang, J. Fu, Q. Liu, G. Zhang, Z. Wang, Y. Qi, Y. Xu, and W. Chu. Opencoder: The open cookbook for top-tier code large language models. 2024
[19] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE transactions on Systems, Man, and Cybernetics, (4):364–378, 1971
[20] W. B. Jones and W. Thron. Continued fractions. Analytic theory and applications. Encyclopedia of Mathematics and its Applications. Addison-Wesley, 1980
[21] A. Joshua, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. Empirical Method in Natural Language Processing, 2023
[22] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling, 2016
[23] J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Gadre, H. Bansal, E. Guha, S. Keh, K. Arora, S. Garg, R. Xin, N. Muennighoff, R. Heckel, J. Mercat, M. Chen, S. Gururangan, M. Wortsman, A. Albalak, Y. Bitton, M. Nezhurina, A. Abbas, C.-Y. Hsieh, D. Ghosh, J. Gardner, M. Kilian, H. Zhang, R. Shao, S. Pratt, S. Sanyal, G. Ilharco, G. Daras, K. Marathe, ...
[24] S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976
[25] A. Lozhkov, L. Ben Allal, L. von Werra, and T. Wolf. Fineweb-edu, May 2024
[26] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Comput. Linguistics, 19(2):313–330, 1993
[27] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943
[28] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017
[29] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018
[30] K. Milton. Summation techniques, Padé approximants, and continued fractions. 2011. http://www.nhn.ou.edu/~milton/p5013/chap8.pdf
[31] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. Th...
[32] I. Puri, A. Dhurandhar, T. Pedapati, K. Shanmugam, D. Wei, and K. R. Varshney. Cofrnets: Interpretable neural architecture inspired by continued fractions. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 21668–21680. Curran Associates, Inc., 2021
[33] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. 2018
[34] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
[35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020
[36] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958
[37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986
[38] S. S. Sahoo, M. Arriola, A. Gokaslan, E. M. Marroquin, A. M. Rush, Y. Schiff, J. T. Chiu, and V. Kuleshov. Simple and effective masked diffusion language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
[39] S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov. Simple and effective masked diffusion language models, 2024
[40] N. Shazeer. Fast transformer decoding: One write-head is all you need, 2019
[41] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France, 07–09 Jul 2015. PMLR
[42] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014
[43] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng. Synthesizer: Rethinking self-attention in transformer models. In Intl. Conference on Machine Learning, 2021
[44] D. S. Team. Docling technical report. Technical report, 8 2024
[45] P. Tillet, H.-T. Kung, and D. Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019
[46] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision. In Computer Vision and Pattern Recognition, 2021
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017
[48] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 24th International Conference on Learning Representations, 2019
[49] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768, 2020
[50] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, 2022
[51] J. Welbl, N. F. Liu, and M. Gardner. Crowdsourcing multiple choice science questions. 2017
[52] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: transformers for longer sequences. NeurIPS ’24, 2024
[53] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
[54] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 649–657, Cambridge, MA, USA, 2015. MIT Press. 7 Brief Historical Perspective One of the starting points of artificial neural networks was ...
[55] and demonstrated for weight update and learning representation in neural networks [37]. 8 Lemma 2 [32] We have $\frac{\partial}{\partial a_k} \frac{K_{d+1}(a_0, \ldots, a_d)}{K_d(a_1, \ldots, a_d)} = (-1)^k \left( \frac{K_{d-k}(a_{k+1}, \ldots, a_d)}{K_d(a_1, \ldots, a_d)} \right)^2$. Proof. To compute the partial derivative of the ratio of continuants above, we first determine the partial derivative of a single continuant $K_k$(a...

[1] [1]

Winogrande: An adversarial winograd schema challenge at scale. 2019

work page 2019

[2] [2]

L. B. Allal, A. Lozhkov, E. Bakouch, G. M. Blázquez, G. Penedo, L. Tunstall, A. Marafioti, H. Kydlíˇcek, A. P. Lajarín, V . Srivastav, J. Lochner, C. Fahlgren, X.-S. Nguyen, C. Fourrier, 10 B. Burtenshaw, H. Larcher, H. Zhao, C. Zakka, M. Morlon, C. Raffel, L. von Werra, and T. Wolf. Smollm2: When smol goes big – data-centric training of a small language ...

work page 2025

[3] [3]

Ben Allal, A

L. Ben Allal, A. Lozhkov, G. Penedo, T. Wolf, and L. von Werra. Cosmopedia, 2024

work page 2024

[4] [4]

Y . Bisk, R. Zellers, R. L. Bras, J. Gao, and Y . Choi. Piqa: Reasoning about physical com- monsense in natural language. InThirty-Fourth AAAI Conference on Artificial Intelligence, 2020

work page 2020

[5] [5]

Chelba, T

C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. In H. Li, H. M. Meng, B. Ma, E. Chng, and L. Xie, editors,15th Annual Conference of the International Speech Communication Association, INTERSPEECH 2014, Singapore, September 14-18, 2014, page...

work page 2014

[6] [6]

Christopher, L

C. Christopher, L. Kenton, C. Ming-Wei, K. Tom, C. Michael, and T. Kristina. Boolq: Exploring the surprising difficulty of natural yes/no questions. InNAACL, 2019

work page 2019

[7] [7]

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024

work page 2024

[8] [8]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457v1, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technolo- gies, volume 1 (long and short papers), pages 4171–4186, 2019

[10] Z. Fu, W. Song, Y. Wang, X. Wu, Y. Zheng, Y. Zhang, D. Xu, X. Wei, T. Xu, and X. Zhao. Sliding window attention training for efficient large language models, 2025.

[11] A. Gadhikar, S. K. Majumdar, N. Popp, P. Saranrittichai, M. Rapp, and L. Schott. Attention is all you need for mixture-of-depths routing, 2024.

[12] H. E. Gohari, S. R. Kadhe, S. Y. S. C. Adam, A. Adebayo, P. Adusumilli, F. Ahmed, N. B. Angel, S. Borse, Y.-C. Chang, X.-H. Dang, N. Desai, R. Eres, R. Iwamoto, A. Karve, Y. Koyfman, W.-H. Lee, C. Liu, B. Lublinsky, T. Ohko, P. Pesce, M. Touma, S. Wang, S. Witherspoon, H. Woisetschlager, D. Wood, K.-L. Wu, I. Yoshida, S. Zawad, P. Zerfos, Y. Zhou, and...

[13] A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

[14] N. Graef and A. Wasielewski. Slim attention: cut your context memory in half without loss – k-cache is all you need for mha, 2025.

[15] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.

[16] A. Gu, K. Goel, and C. Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022.

[17] X. Han, Y. Jian, X. Hu, H. Liu, Y. Wang, Q. Fan, Y. Ai, H. Huang, R. He, Z. Yang, and Q. You. Infimm-webmath-40b: Advancing multimodal pre-training for enhanced mathematical reasoning, 2024.

[18] S. Huang, T. Cheng, J. K. Liu, J. Hao, L. Song, Y. Xu, J. Yang, J. H. Liu, C. Zhang, L. Chai, R. Yuan, Z. Zhang, J. Fu, Q. Liu, G. Zhang, Z. Wang, Y. Qi, Y. Xu, and W. Chu. Opencoder: The open cookbook for top-tier code large language models. 2024.

[19] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE transactions on Systems, Man, and Cybernetics, (4):364–378, 1971.

[20] W. B. Jones and W. Thron. Continued fractions. Analytic theory and applications. Encyclopedia of Mathematics and its Applications. Addison-Wesley, 1980.

[21] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. Empirical Methods in Natural Language Processing, 2023.

[22] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling, 2016.

[23] J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Gadre, H. Bansal, E. Guha, S. Keh, K. Arora, S. Garg, R. Xin, N. Muennighoff, R. Heckel, J. Mercat, M. Chen, S. Gururangan, M. Wortsman, A. Albalak, Y. Bitton, M. Nezhurina, A. Abbas, C.-Y. Hsieh, D. Ghosh, J. Gardner, M. Kilian, H. Zhang, R. Shao, S. Pratt, S. Sanyal, G. Ilharco, G. Daras, K. Marathe, ...

[24] S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.

[25] A. Lozhkov, L. Ben Allal, L. von Werra, and T. Wolf. Fineweb-edu, May 2024.

[26] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Comput. Linguistics, 19(2):313–330, 1993.

[27] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.

[28] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

[29] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.

[30] K. Milton. Summation techniques, Padé approximants, and continued fractions. 2011. http://www.nhn.ou.edu/~milton/p5013/chap8.pdf

[31] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. Th...

[32] I. Puri, A. Dhurandhar, T. Pedapati, K. Shanmugam, D. Wei, and K. R. Varshney. Cofrnets: Interpretable neural architecture inspired by continued fractions. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 21668–21680. Curran Associates, Inc., 2021.

[33] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. 2018.

[34] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.

[36] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[38] S. S. Sahoo, M. Arriola, A. Gokaslan, E. M. Marroquin, A. M. Rush, Y. Schiff, J. T. Chiu, and V. Kuleshov. Simple and effective masked diffusion language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[39] S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov. Simple and effective masked diffusion language models, 2024.

[40] N. Shazeer. Fast transformer decoding: One write-head is all you need, 2019.

[41] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France, 07–09 Jul 2015. PMLR.

[42] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014.

[43] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng. Synthesizer: Rethinking self-attention in transformer models. In Intl. Conference on Machine Learning, 2021.

[44] D. S. Team. Docling technical report. Technical report, 8 2024.

[45] P. Tillet, H.-T. Kung, and D. Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.

[46] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision. In Computer Vision and Pattern Recognition, 2021.

[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[48] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 24th International Conference on Learning Representations, 2019.

[49] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768, 2020.

[50] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, 2022.

[51] J. Welbl, N. F. Liu, and M. Gardner. Crowdsourcing multiple choice science questions. 2017.

[52] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: transformers for longer sequences. NeurIPS ’24, 2024.

[53] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

[54] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 649–657, Cambridge, MA, USA, 2015. MIT Press.

7 Brief Historical Perspective

One of the starting points of artificial neural networks was ...

and demonstrated for weight update and learning representation in neural networks [37].

8 Lemma 2 [32]

We have

$$
\frac{\partial}{\partial a_k} \, \frac{K_{d+1}(a_0, \ldots, a_d)}{K_d(a_1, \ldots, a_d)} = (-1)^k \left( \frac{K_{d-k}(a_{k+1}, \ldots, a_d)}{K_d(a_1, \ldots, a_d)} \right)^2 .
$$

Proof. To compute the partial derivative of the ratio of continuants above, we first determine the partial derivative of a single continuant K_k(a...
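The derivative identity in Lemma 2 can be checked numerically. The sketch below is a sanity check only, not the authors' implementation. It assumes the standard Euler continuant recursion, which the text above does not spell out: K_0 = 1, K_1(x) = x, and K_n(x_1, ..., x_n) = x_1 K_{n-1}(x_2, ..., x_n) + K_{n-2}(x_3, ..., x_n). The function names `continuant`, `fraction`, `lemma2_gradient`, and `finite_difference` are hypothetical, chosen for illustration.

```python
def continuant(a):
    """Euler continuant K_n(a[0], ..., a[n-1]); assumed standard recursion."""
    # Walk from the last coefficient to the first, tracking the continuants
    # of the two most recent suffixes (K of the empty suffix is 1).
    k_cur, k_prev = 1.0, 0.0
    for x in reversed(a):
        k_cur, k_prev = x * k_cur + k_prev, k_cur
    return k_cur


def fraction(a):
    """a[0] + 1/(a[1] + 1/(... + 1/a[d])) written as K_{d+1} / K_d."""
    return continuant(a) / continuant(a[1:])


def lemma2_gradient(a, k):
    """Closed-form partial derivative w.r.t. a[k] as stated in Lemma 2."""
    ratio = continuant(a[k + 1:]) / continuant(a[1:])
    return (-1) ** k * ratio ** 2


def finite_difference(a, k, eps=1e-6):
    """Central-difference estimate of the same partial derivative."""
    up, down = list(a), list(a)
    up[k] += eps
    down[k] -= eps
    return (fraction(up) - fraction(down)) / (2 * eps)


if __name__ == "__main__":
    coeffs = [0.5, 2.0, 3.0, 1.5, 4.0]  # arbitrary positive coefficients
    for k in range(len(coeffs)):
        print(k, lemma2_gradient(coeffs, k), finite_difference(coeffs, k))
```

For positive coefficients the denominator continuant stays away from zero, so the closed-form and finite-difference values should agree to within the finite-difference error. The alternating sign `(-1) ** k` shows up directly in the printed column.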
