Real-Time Text Transmission via LLM-Based Entropy Coding over Fixed-Rate Channels
Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3
The pith
Larger language models cut bits per character by 38 percent, making zero-delay Huffman coding optimal for real-time text over fixed-rate channels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a causal language model supplies sequential predictions inside a predict-then-code pipeline for text sent over a fixed-rate channel, increasing model size from 124 million to 3 billion parameters reduces bits per character by approximately 38 percent. This reduction effectively over-provisions the channel relative to the compressed rate, so that Huffman coding, which has zero algorithmic delay, becomes the practical choice while arithmetic coding's near-optimal compression is offset by added decodability delay.
What carries the argument
The predict-then-code architecture in which a causal language model supplies per-character probabilities to variable-length entropy coders, whose output bit-length statistics then determine queueing delay on the fixed-rate channel.
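The coupling can be sketched minimally: the predictor emits a per-character distribution, and the entropy coder assigns code lengths to it. The sketch below (the symbols and probabilities are hypothetical, not the paper's data) shows a Huffman code built from one predicted distribution and its small gap to the entropy bound:

```python
import heapq
import math

def huffman_lengths(probs):
    """Code lengths of a Huffman code for the given symbol probabilities."""
    # Each heap entry: (probability, unique tiebreak, {symbol: depth so far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Hypothetical per-character distribution from a causal predictor.
probs = {"e": 0.5, "t": 0.25, "a": 0.15, "q": 0.10}
lengths = huffman_lengths(probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_bits = sum(probs[s] * lengths[s] for s in probs)
print(f"entropy {entropy:.3f} bits/char, Huffman {avg_bits:.3f} bits/char")
```

For this toy distribution the Huffman average (1.75 bits) sits just above the entropy (about 1.74 bits), illustrating the "modest compression overhead, zero algorithmic delay" tradeoff the paper attributes to Huffman coding.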
If this is right
- For channels over-provisioned by strong predictors, Huffman coding delivers near-optimal compression with no algorithmic delay.
- Arithmetic coding and block-based rANS reach closer to the entropy limit but add decodability delay that grows with block size.
- Scaling the language model improves compression enough to change which coder minimizes total delay.
- Computational delay shrinks with faster hardware while algorithmic delay stays fixed for each coder type.
- Real-time text can approach information-theoretic efficiency without large latency penalties when the predictor is sufficiently accurate.
Where Pith is reading between the lines
- The same predict-then-code approach could be tested on other streaming sequences such as speech or sensor data if comparable causal predictors exist.
- Hardware that cuts per-token inference time would widen the region where higher-compression coders with modest algorithmic latency become attractive.
- Even larger models might push the operating point further, potentially making zero-delay schemes dominant across a wider range of channel rates.
- The queue analysis assumes statistics derived from mean and variance; real traffic burstiness could change the measured delay distributions.
Load-bearing premise
Per-token delay is captured by the mean and variance of the coder's bit lengths together with its algorithmic latency, and the language model's predictions remain accurate and strictly causal under real-time constraints.
What would settle it
Run the 3-billion-parameter model with Huffman coding on a fixed-rate channel at the measured bits-per-character rate and check whether the observed average and tail delays are lower than those produced by arithmetic coding at the same throughput.
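A minimal version of that check can be simulated: stream variable-length codewords over a fixed-rate channel, track the queue, and compare mean delays for a zero-latency coder against a more efficient coder with block latency. The bit-length means, variances, and latency values below are illustrative assumptions, not the paper's measurements:

```python
import random

def simulate_delays(bit_lengths, rate_bits_per_tick, algorithmic_delay=0.0):
    """Per-token delay over a fixed-rate channel: one token arrives per tick,
    waits for the queue to drain, then transmits its codeword."""
    delays, t_exit = [], 0.0
    for n, b in enumerate(bit_lengths):
        t_arr = float(n)                              # one token per tick
        wait = max(0.0, t_exit - t_arr)               # queueing delay
        service = max(0.0, b) / rate_bits_per_tick    # transmission time
        t_exit = t_arr + wait + service
        delays.append(wait + service + algorithmic_delay)
    return delays

random.seed(0)
# Hypothetical codeword-length streams: a near-entropy coder paying a block
# latency vs. a slightly less efficient zero-delay coder.
arith = [random.gauss(2.0, 0.4) for _ in range(10_000)]
huff = [random.gauss(2.2, 0.6) for _ in range(10_000)]
rate = 3.0  # channel over-provisioned relative to both coders
d_arith = simulate_delays(arith, rate, algorithmic_delay=5.0)
d_huff = simulate_delays(huff, rate, algorithmic_delay=0.0)
print(sum(d_huff) / len(d_huff) < sum(d_arith) / len(d_arith))
```

With the channel over-provisioned, the queueing term stays small and the fixed algorithmic latency dominates, which is the regime where the paper argues Huffman wins; tail delays can be read off the same `delays` lists.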
Original abstract
Learning, prediction, and compression are intimately connected: a model that accurately predicts the next symbol in a sequence can be coupled with a source coder to compress that sequence near its information-theoretic limit. When tokenized characters arriving at a fixed reading pace are encoded into variable-length codewords and streamed over a fixed-rate channel, a queue forms whose per-token delay depends on the mean and variance of the bit lengths and on the coder's algorithmic latency. This paper investigates the compression–delay tradeoff that arises when a causal language model serves as the sequential predictor within a predict-then-code architecture for real-time text transmission. Several coding schemes are compared: Shannon (ideal), Huffman, arithmetic coding, rANS at various block sizes, and gzip. The analysis separates algorithmic delay, inherent to the coder, from computational delay, which shrinks as hardware improves. Huffman is the practical choice for over-provisioned channels, with zero algorithmic delay and modest compression overhead. Arithmetic coding achieves near-optimal compression at the cost of decodability delay. Findings are validated across two scales: GPT-2 (124M) and Llama 3.2 (3B), a twenty-five-fold parameter range. This scaling yields an approximately 38% reduction in bits per character, effectively over-provisioning the channel and thereby changing which coder is optimal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores using causal language models as sequential predictors in a predict-then-code architecture for real-time text transmission over fixed-rate channels. It compares entropy coders including Shannon, Huffman, arithmetic coding, rANS at various block sizes, and gzip, separating algorithmic delay from computational delay. The central empirical claim is that scaling the LLM from GPT-2 (124M parameters) to Llama 3.2 (3B parameters) yields an approximately 38% reduction in bits per character, which over-provisions the channel and shifts the optimal coder to Huffman (zero algorithmic delay) rather than arithmetic coding.
Significance. If the queueing analysis holds, the result usefully shows how stronger predictors can relax real-time constraints and favor low-latency coders in streaming applications. The explicit separation of algorithmic versus computational delay and the scaling experiment across a 25-fold parameter range are concrete strengths that could inform practical system design.
major comments (2)
- [Abstract] The scaling claim of an approximately 38% bpc reduction and the resulting change in optimal coder are stated without any accompanying quantitative bpc values, tables, figures, error bars, or experimental-setup details, preventing verification of the magnitude or the conditions for the crossover.
- [Analysis of delay tradeoff] The central argument that lower bpc over-provisions the channel and makes Huffman optimal rests on an idealized queueing model (mean/variance of bit lengths plus fixed algorithmic latency determining per-token delay). This model is not supported by end-to-end queue simulations, measured tail latencies, or empirical anchoring to actual LLM forward-pass times, which are input-dependent and grow with context length.
minor comments (2)
- [Methods] Specify the exact block sizes used for rANS and the precise definition of algorithmic latency for each coder.
- [Related work] Add references to standard queueing models (e.g., M/G/1) used for the delay analysis.
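For context on that last point, the standard M/G/1 result the referee alludes to, the Pollaczek–Khinchine formula, makes explicit how both the mean and the variance of service times enter the mean wait. A minimal sketch (the numbers in the example are illustrative):

```python
def mg1_mean_wait(arrival_rate, service_mean, service_var):
    """Pollaczek-Khinchine mean waiting time for an M/G/1 queue.

    In the paper's setting, service time would be (codeword bits) / (channel
    rate), so both the mean and variance of bit lengths enter the delay.
    """
    rho = arrival_rate * service_mean  # utilization, must be < 1
    if rho >= 1:
        raise ValueError("unstable queue: utilization >= 1")
    second_moment = service_var + service_mean ** 2  # E[S^2]
    return arrival_rate * second_moment / (2 * (1 - rho))

# Deterministic one-tick service at 80% load: mean wait is 2 ticks (M/D/1 case).
print(round(mg1_mean_wait(arrival_rate=0.8, service_mean=1.0, service_var=0.0), 6))
```

Note how, at fixed utilization, a coder with higher bit-length variance raises `second_moment` and hence the mean wait, which is the mechanism behind the mean/variance claim in the abstract.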
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight opportunities to strengthen the presentation of empirical results and to clarify the assumptions underlying the queueing analysis. We address each major comment below and outline the planned revisions.
Point-by-point responses
- Referee: [Abstract] The scaling claim of an approximately 38% bpc reduction and the resulting change in optimal coder are stated without any accompanying quantitative bpc values, tables, figures, error bars, or experimental-setup details, preventing verification of the magnitude or the conditions for the crossover.
  Authors: We agree that the abstract would be clearer with explicit quantitative support. In the revised version we will insert the measured bpc values for the GPT-2 (124M) and Llama 3.2 (3B) models, cite the corresponding tables and figures that report the experimental setup (dataset, tokenization, and block-size configurations), and include error bars or standard deviations across the evaluated sequences to indicate variability. These additions will allow readers to verify both the magnitude of the 38% reduction and the conditions under which the optimal coder shifts. revision: yes
- Referee: [Analysis of delay tradeoff] The central argument that lower bpc over-provisions the channel and makes Huffman optimal rests on an idealized queueing model (mean/variance of bit lengths plus fixed algorithmic latency determining per-token delay). This model is not supported by end-to-end queue simulations, measured tail latencies, or empirical anchoring to actual LLM forward-pass times, which are input-dependent and grow with context length.
  Authors: The referee correctly notes that our analysis employs an analytical mean-field queueing model rather than full end-to-end packet-level simulations or hardware-timed measurements. The model is deliberately chosen to isolate algorithmic delay (inherent to the coder) from computational delay (hardware-dependent and rapidly improving). We do not claim empirical tail-latency results or context-length scaling of inference time, as these lie outside the paper's scope. In the revision we will add an explicit limitations paragraph acknowledging that the model assumes average bit-length statistics and fixed per-token computation, and we will discuss how input-dependent inference times could affect the over-provisioning margin in practice. We believe the analytical separation remains useful for system design even without the additional simulations. revision: partial
Circularity Check
No circularity: empirical comparison of coders with LLMs
full rationale
The paper performs an empirical evaluation of existing entropy coders (Huffman, arithmetic, rANS, gzip, Shannon) paired with causal LLMs (GPT-2 124M and Llama 3.2 3B) for real-time text transmission. The reported 38% bpc reduction is a direct measurement from scaling the models on text data, not a fitted parameter renamed as a prediction. The queueing delay model is presented as a standard M/G/1-style analysis based on observed bit-length statistics and known algorithmic latencies of the coders; no equations derive these quantities from the target results or from self-citations. No self-definitional loops, ansatzes smuggled via citation, or uniqueness theorems imported from the authors' prior work appear in the derivation chain. The central claims rest on reproducible experiments across two model scales and standard coding theory, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Causal language models provide sufficiently accurate next-token predictions to enable effective entropy coding.
Lean theorems connected to this paper
- Cost.FunctionalEquation (J = ½(x+x⁻¹)−1 uniqueness) · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: H(p, q) = H(p) + D_KL(p‖q), where H(p) is the source entropy rate and the Kullback–Leibler (KL) divergence D_KL(p‖q) is the penalty for model mismatch.
- Foundation.Atomicity (sequential schedule / queueing on finite histories) · atomic_tick · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: D(n) = max(0, t_exit(n−1) − t_arr(n)) + b(x_n)/C, with t_exit(0) = 0. ... Lindley's recursion ... per-token delay depends on the mean and variance of the bit lengths and on the coder's algorithmic latency.
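The identity quoted in the first entry, H(p, q) = H(p) + D_KL(p‖q), can be verified numerically on a toy pair of distributions (the values below are illustrative):

```python
import math

def entropy(p):
    """Shannon entropy of distribution p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected code length when source p is coded with model q, in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q): the excess bits paid for model mismatch."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # true source distribution (toy values)
q = [0.4, 0.4, 0.2]  # model's predicted distribution (toy values)
# Cross-entropy decomposes exactly into entropy plus the mismatch penalty.
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```

This is the sense in which a better predictor (smaller D_KL) lowers the achievable bits per character toward the source entropy.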
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, 2019.
- [2] Meta AI, "Llama 3 model card," https://github.com/meta-llama/llama3, 2024.
- [3] C. E. Shannon, "Prediction and entropy of printed English," Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.
- [4] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, 1984.
- [5] J. G. Cleary and I. H. Witten, "Data compression using adaptive coding and partial string matching," IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, 1984.
- [6] G. Delétang, A. Ruoss, P.-A. Duquenne, E. Catt, T. Genewein, C. Mattern, J. Grau-Moya, L. K. Wenliang, M. Aitchison, L. Orseau, M. Hutter, and J. Veness, "Language modeling is compression," in International Conference on Learning Representations (ICLR), Vienna, Austria, 2024.
- [7] C. S. K. Valmeekam, K. Narayanan, D. Kalathil, J.-F. Chamberland, and S. Shakkottai, "LLMZip: Lossless text compression using large language models," arXiv preprint arXiv:2306.04050, 2023.
- [8] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743–2760, 1998.
- [9] J. Rissanen and G. G. Langdon, Jr., "Arithmetic coding," IBM Journal of Research and Development, vol. 23, no. 2, pp. 149–162, 1979.
- [10] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
- [11] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.
- [12] J. Duda, "Asymmetric numeral systems," arXiv preprint arXiv:0902.0271, 2009.
- [13] L. P. Deutsch, "DEFLATE compressed data format specification version 1.3," RFC 1951, 1996.
- [14] R. S. Sutton, "The bitter lesson," http://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019.
- [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, Long Beach, California, USA, 2017.
- [16] "Project Gutenberg," https://www.gutenberg.org/, accessed 2025.
- [17] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, "Transformers: State-of-the-art natural language processing," Proceedings of the 2020 Conference on Empirical Method..., 2020.
- [18] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Annual Meeting of the Association for Computational Linguistics, vol. 1, 2016, pp. 1715–1725.