Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes
Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3
The pith
Language models achieve comparable performance using fixed minimal binary token codes instead of a trainable input embedding table.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a 65,536-token vocabulary, fixed 16-bit binary codes, lifted to model dimension 1024 by zero-parameter tiling (optionally after an invertible affine recoding over GF(2)^16), let a 32-layer decoder-only transformer reach a mean validation perplexity of 2.36, compared with 2.44 for a standard learned-input baseline; the gap lies inside seed-to-seed variation, indicating that a trainable input embedding table is not necessary in this regime while the output projection remains trainable.
What carries the argument
Fixed minimal binary token codes (K = ceil(log2 V) bits per token, optionally recoded by an invertible affine transform over GF(2)^K, then lifted to d_model by zero-parameter tiling) that carry token identity directly to the first transformer layer.
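A minimal NumPy sketch of this construction follows. It is not the authors' code: the function names and the raw {0, 1} bit convention are assumptions (the paper may scale or center the bits).

```python
import numpy as np

def binary_codes(token_ids: np.ndarray, vocab_size: int) -> np.ndarray:
    """Fixed minimal binary codes: K = ceil(log2 V) bits read off the token ID."""
    K = int(np.ceil(np.log2(vocab_size)))               # 16 for V = 65,536
    bits = (token_ids[..., None] >> np.arange(K)) & 1   # little-endian bit planes
    return bits.astype(np.float32)                      # (..., K), zero parameters

def tile_to_width(codes: np.ndarray, d_model: int) -> np.ndarray:
    """Zero-parameter lift: repeat the K-bit code until it fills d_model."""
    K = codes.shape[-1]
    reps = -(-d_model // K)                             # ceil(d_model / K) = 64 here
    return np.concatenate([codes] * reps, axis=-1)[..., :d_model]

ids = np.array([[0, 1, 42, 65_535]])                    # toy batch of token IDs
x = tile_to_width(binary_codes(ids, 65_536), 1024)      # (1, 4, 1024) layer-1 input
```

The first transformer layer then consumes x directly; there is no V x d_model matrix to train.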
If this is right
- The input embedding matrix, normally 67.1 million parameters (65,536 × 1,024) for this vocabulary and width, is eliminated while perplexity remains comparable.
- A fully table-free variant that generates and affinely recodes codes on the fly reaches 2.39 validation perplexity, within 0.03 of the tiled fixed-code result and below the 2.44 baseline (a GF(2) recoding sketch follows this list).
- The output projection stays fully trainable and standard, isolating the effect to the input side.
- Performance holds across three independent training seeds, with the fixed-code mean actually slightly lower than the baseline mean.
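To make the affine-recoded variant concrete, here is a minimal NumPy sketch. This is not the authors' code: gf2_inverse and random_affine_recode are illustrative names, and rejection-sampling A until it is invertible is one plausible construction; the paper specifies only that the recode is an invertible affine map over GF(2)^K.

```python
import numpy as np

rng = np.random.default_rng(0)

def gf2_inverse(A: np.ndarray):
    """Invert a square 0/1 matrix over GF(2) by Gauss-Jordan; None if singular."""
    K = A.shape[0]
    M = np.concatenate([A % 2, np.eye(K, dtype=np.uint8)], axis=1)
    for col in range(K):
        pivot = next((r for r in range(col, K) if M[r, col]), None)
        if pivot is None:
            return None                      # singular: no inverse over GF(2)
        M[[col, pivot]] = M[[pivot, col]]    # move a 1 into the pivot position
        for r in range(K):
            if r != col and M[r, col]:
                M[r] ^= M[col]               # row addition mod 2 is XOR
    return M[:, K:]

def random_affine_recode(K: int):
    """Sample an invertible A in GF(2)^{KxK} plus offset b; return recode/decode."""
    while True:                              # rejection-sample until invertible
        A = rng.integers(0, 2, size=(K, K), dtype=np.uint8)
        A_inv = gf2_inverse(A)
        if A_inv is not None:
            break
    b = rng.integers(0, 2, size=K, dtype=np.uint8)
    recode = lambda c: (A @ c + b) % 2                # token code -> recoded bits
    decode = lambda y: (A_inv @ ((y + b) % 2)) % 2    # exact inverse (-b = b mod 2)
    return recode, decode

recode, decode = random_affine_recode(16)
c = rng.integers(0, 2, size=16, dtype=np.uint8)       # a 16-bit token code
assert np.array_equal(decode(recode(c)), c)           # identity recovered exactly
```

Because the map is exactly invertible, decode(recode(c)) returns the original code bit for bit, which is what lets token identity be recovered without reference to learned weights.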
Where Pith is reading between the lines
- The same fixed-code construction could be tested on encoder-decoder or non-transformer architectures to check whether the result generalizes beyond decoder-only models.
- For vocabularies much larger than 65,536, the bit-length K grows only logarithmically, so the parameter saving would increase while the input representation cost stays modest (illustrated in the sketch after this list).
- Because the codes are deterministic and invertible, any downstream analysis that needs exact token identity can recover it exactly without reference to learned weights.
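A back-of-envelope sketch of that scaling under the paper's d_model = 1024; the two larger vocabularies are hypothetical extrapolations, not settings the paper tests.

```python
import math

d_model = 1024
for V in (65_536, 262_144, 1_048_576):
    K = math.ceil(math.log2(V))      # bits needed for exact token identity
    table = V * d_model              # learned input table the codes replace
    print(f"V={V:>9,}  K={K:>2} bits  input table removed: {table/1e6:>6.1f}M params")
# V=   65,536  K=16 bits  input table removed:   67.1M params
# V=  262,144  K=18 bits  input table removed:  268.4M params
# V=1,048,576  K=20 bits  input table removed: 1073.7M params
```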
Load-bearing premise
The transformer layers can recover token identity and semantics from the fixed binary representation alone, without any learned projection at the input.
What would settle it
A statistically significant rise in validation perplexity for the fixed-code models relative to the learned baseline when both are trained on the same data and architecture with additional random seeds or a larger training set would falsify the claim.
Original abstract
Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size $V$, exact token identity requires only $K=\lceil \log_2 V\rceil$ bits. We replace the usual trainable $V\times d_{\text{model}}$ input embedding matrix with fixed minimal binary token codes and a zero-parameter lift to model width. In our main setting, $V=65{,}536$, so $K=16$, and tokens are represented by fixed 16-dimensional binary codes tiled to $d_{\text{model}}=1024$. We also evaluate a fully table-free variant in which codes are generated from token IDs on the fly and randomly recoded by an invertible affine transform over $\mathbb{F}_2^K$. Across matched 32-layer decoder-only models trained on approximately 17B tokens and evaluated over three independent training seeds, fixed minimal codes achieve comparable held-out validation perplexity to a standard learned-input baseline while removing 67.1M trainable input parameters. The fixed-code runs have a lower mean validation perplexity in our experiments, 2.36 versus 2.44, but the observed gap is within the measured seed-to-seed variation of 4.8\%; we therefore interpret the result as evidence that the trainable input table is not necessary, rather than as a statistically resolved superiority claim. The table-free affine-recoded variant remains close at 2.39 despite a slightly shorter training run. These results show that, in this regime, a trainable input embedding table is not necessary for useful language modeling. The output projection remains standard and trainable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that trainable input embedding tables are not necessary for language modeling. By replacing the V x d_model embedding matrix (V=65536) with fixed minimal binary token codes (K=16 bits, tiled to d_model=1024 or generated on-the-fly via invertible affine recoding over F_2^K), 32-layer decoder-only models trained on approximately 17B tokens achieve validation perplexities of 2.36 (tiled codes) and 2.39 (affine variant) versus 2.44 for the learned-embedding baseline. The gap falls within the observed 4.8% seed-to-seed variation across three independent runs, while removing 67.1M input parameters; the output projection remains trainable and standard.
Significance. If the result holds, the work demonstrates that a substantial fraction of model parameters can be removed at the input interface without harming performance in this regime, offering a concrete route to more efficient transformers. The direct matched-run comparison, use of three seeds, and explicit reporting of seed variation provide a falsifiable empirical test of the necessity of learned embeddings, which is a strength.
major comments (1)
- [Abstract (experimental results paragraph)] The central claim that fixed codes suffice (and that the input table is not necessary) rests on the reported perplexity gap lying inside the 4.8% seed-to-seed variation. The abstract states that three seeds were run and gives mean values, but does not report the individual per-seed perplexities, the exact training data splits, the precise token count, or any statistical test of the difference. This information is load-bearing for interpreting the result as evidence rather than an inconclusive trend.
minor comments (1)
- [Abstract] The description 'approximately 17B tokens' and 'slightly shorter training run' for the affine-recoded variant would benefit from exact figures and a statement of whether the shorter run was due to early stopping or a fixed budget, to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their careful review and for highlighting the importance of detailed reporting in the experimental results. We address the major comment point by point below and have updated the manuscript to incorporate the suggested improvements for greater transparency.
Point-by-point responses
Referee: [Abstract (experimental results paragraph)] The central claim that fixed codes suffice (and that the input table is not necessary) rests on the reported perplexity gap lying inside the 4.8% seed-to-seed variation. The abstract states that three seeds were run and gives mean values, but does not report the individual per-seed perplexities, the exact training data splits, the precise token count, or any statistical test of the difference. This information is load-bearing for interpreting the result as evidence rather than an inconclusive trend.
Authors: We agree that the abstract would benefit from more granular reporting to allow readers to fully assess the variability. In the revised manuscript, we have added the individual per-seed validation perplexities for all conditions, updated the description to specify the exact number of training tokens (17 billion), and detailed the training/validation data splits used from the C4 dataset. We have also included a brief discussion of the lack of a formal statistical test, explaining that with three seeds the primary evidence is the direct comparison to the observed seed-to-seed variation of 4.8%, which encompasses the mean difference. These revisions strengthen the presentation without changing the core findings or conclusions.
Revision: yes
Circularity Check
No significant circularity; empirical result is self-contained
full rationale
The paper reports a direct head-to-head empirical comparison of matched 32-layer decoder-only models trained on ~17B tokens, with fixed minimal binary codes (tiled or affinely recoded) versus a standard learned V×d_model input embedding table. Validation perplexity is shown to be comparable (within measured 4.8% seed-to-seed variation), supporting the claim that the trainable table is not necessary under the stated conditions. No equations, derivations, or self-citations are invoked that reduce the result to a fitted parameter, prior ansatz, or self-referential definition; the weakest assumption is tested by the experiment itself. The output projection remains trainable and standard, preserving an independent component.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A transformer can extract token identity from deterministically tiled or affinely transformed binary codes of length ceil(log2 V).