pith. machine review for the scientific record.

arxiv: 2605.07266 · v1 · submitted 2026-05-08 · 💻 cs.IT · cs.LG · math.IT

Recognition: 2 theorem links · Lean Theorem

How Big Should a Wireless Foundation Model Be?

Wanjiun Liao, Wei-Lun Cheng

Pith reviewed 2026-05-11 01:32 UTC · model grok-4.3

classification 💻 cs.IT · cs.LG · math.IT
keywords wireless foundation models · intrinsic dimensionality · channel scaling laws · test-time training · nonlinear manifold dimension · NTN satellite channels · physical-layer AI

The pith

Wireless foundation models are capped at modest sizes by the channel's intrinsic dimensionality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that wireless foundation models do not follow the same scaling rules as language models because the wireless channel itself has a low intrinsic dimensionality imposed by physics. This dimensionality, denoted dNL and ranging from 5 to 35 in both measurements and standard channel models, sets a hard ceiling on useful model growth once sufficient data is available. For satellite channels with dNL near 14, performance gains flatten after roughly 30 million parameters and become negligible beyond 70 million. Test-time adaptation of smaller models then outperforms much larger static ones by substantial margins in channel estimation tasks.

Core claim

The intrinsic dimensionality dNL of the nonlinear manifold describing the wireless channel acts as the fundamental bottleneck that defines the scaling ceiling for foundation models once a data-sufficient regime is reached. Maxwell's equations together with finite scatterers and finite antenna apertures constrain every propagation environment to a limited number of degrees of freedom, spanning only 5-35 across real OTA measurements and 3GPP models, orders of magnitude below the semantic dimensionality of language. Consequently, scaling gains for NTN satellite channels diminish rapidly beyond 30 million parameters and enter a stochastic asymptote above 70 million, while pilot-aided test-time training lets a compact 12M-parameter model surpass a static 96M one.
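The shape of the claimed ceiling is easy to sanity-check numerically. A minimal sketch, using invented data points standing in for the paper's NMSE-versus-parameters curve (none of these numbers are from the paper), fits a saturating law NMSE(N) = floor + a·N^(−b) and reads off the marginal gain from 96M to 150M parameters:

```python
import numpy as np

# Invented illustration points (params in millions, NMSE in dB), shaped like
# the qualitative story above: fast gains up to ~30M, near-flat beyond 70M.
n = np.array([3, 6, 12, 24, 48, 96, 150], dtype=float)
nmse = np.array([-8.0, -11.5, -14.0, -15.8, -16.8, -17.2, -17.4])

def capped_law(n, floor, a, b):
    """Saturating scaling law: NMSE(N) = floor + a * N**(-b)."""
    return floor + a * n ** (-b)

# Crude fit: grid-search the exponent b; solve floor and a by least squares.
best = (np.inf, 0.0, 0.0, 0.0)
for b in np.linspace(0.1, 2.0, 191):
    X = np.column_stack([np.ones_like(n), n ** (-b)])
    coef, *_ = np.linalg.lstsq(X, nmse, rcond=None)
    err = ((X @ coef - nmse) ** 2).sum()
    if err < best[0]:
        best = (err, coef[0], coef[1], b)
_, floor, a, b = best

# Marginal gain of scaling 96M -> 150M under the fitted law: small once the
# curve sits near its floor, which is the paper's 0.52 dB observation in spirit.
gain = capped_law(96, floor, a, b) - capped_law(150, floor, a, b)
print(f"floor ≈ {floor:.1f} dB, exponent b ≈ {b:.2f}, 96M->150M gain ≈ {gain:.2f} dB")
```

Under any such capacity-limited law, the knee location is set by the floor, which is where the paper's dNL argument enters: a lower-dimensional channel manifold implies an earlier floor.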

What carries the argument

The nonlinear manifold dimension dNL of the wireless channel, which quantifies the effective physical degrees of freedom and thereby limits model scaling.

Load-bearing premise

That the dNL estimated from measurements and models truly bounds foundation-model performance and that the scaling and test-time-training behavior seen in NTN simulations will generalize to other wireless environments.

What would settle it

Continued large performance gains in channel estimation or related tasks when scaling a static model from 70 million to several hundred million parameters in a data-sufficient regime without any inference-time adaptation.

Figures

Figures reproduced from arXiv: 2605.07266 by Wanjiun Liao, Wei-Lun Cheng.

Figure 1: Dimensionality-guided design framework. Operating points for NTN …
Figure 2: Scaling behavior of wireless foundation models on NTN satellite …
Figure 3: Intrinsic complexity across data domains. Wireless channels have …
Figure 4: Test-time training performance. (a) Convergence across model scales (SNR = 20 dB): performance improves monotonically and saturates at ∼20 steps with no overfitting; a randomly initialized model shows zero TTT improvement, confirming pre-training is essential. Even 5–10 steps capture 65–80% of the full gain. (b) NMSE vs. SNR: 12M+TTT20 vs. classical LS+interpolation (same 12.5% pilot overhead). A crossover …
Original abstract

Wireless foundation models are rapidly emerging as a key enabler of AI-native communication systems, yet a fundamental question remains unanswered: how large should these models be? We present a principled, physics-grounded answer, showing that the intrinsic dimensionality (dNL, the nonlinear manifold dimension of the channel) acts as the fundamental bottleneck, defining the scaling ceiling once a data-sufficient regime is reached. This dimensionality is not a design choice but a physical constraint: Maxwell's equations, finite scatterers, and antenna aperture inherently constrain wireless propagation environments to a limited number of degrees of freedom -- spanning 5-35 across both real-world OTA measurements and 3GPP-standardized channel models we evaluate -- orders of magnitude below the ~1,000-dimensional semantic space of language. As a consequence, we propose a scaling framework for wireless AI: taking NTN satellite channels as a representative case (dNL ~= 14), scaling gains diminish rapidly beyond ~30 million parameters, entering a stochastic asymptote above 70M where a further 1.6x increase (96M->150M) yields only 0.52 dB. Beyond this ceiling, inference-time adaptation via pilot-aided test-time training (TTT) is far more effective: a compact 12M-parameter model surpasses a static 96M model by 9.9 dB (NMSE, SNR = 20 dB) / 7.6 dB (MCM, SNR = 10 dB) at one-eighth the parameters. With dNL distributions validated across real-world indoor massive MIMO measurements, our scaling laws and TTT gains are demonstrated through NTN satellite simulations, reframing wireless AI design: channel geometry -- not model size -- fundamentally governs the scaling laws of physical-layer wireless AI.
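The pilot-aided TTT recipe in the abstract, a few self-supervised gradient steps on the received pilots at inference time with no labels needed, can be sketched in miniature. The linear pilot-to-channel model, the tap counts, and all numbers below are toy stand-ins (only the 12.5% pilot overhead is taken from the abstract), not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 64                                # subcarriers
pilots = np.arange(0, F, 8)           # 8 pilot tones = 12.5% overhead

def channel(n_taps):
    """Toy frequency response: a few random complex taps at random delays."""
    gains = rng.standard_normal(n_taps) + 1j * rng.standard_normal(n_taps)
    delays = rng.integers(0, 8, n_taps)
    f = np.arange(F)
    return sum(g * np.exp(-2j * np.pi * d * f / F) for g, d in zip(gains, delays))

# "Pre-training": ridge regression from pilot tones to the full channel,
# fit on channels drawn from a 3-tap training distribution.
H = np.stack([channel(3) for _ in range(4000)])
Yp = H[:, pilots]
A = Yp.conj().T @ Yp + 1e-3 * np.eye(len(pilots))
W = np.linalg.solve(A, Yp.conj().T @ H).T      # (F, P): h_hat = W @ y_pilot

# Test-time sample from a *shifted* distribution (more taps than training).
h = channel(6)
yp = h[pilots] + 0.01 * (rng.standard_normal(len(pilots))
                         + 1j * rng.standard_normal(len(pilots)))

def pilot_loss(W):
    """Self-supervised loss: prediction error at pilot positions only."""
    return np.linalg.norm((W @ yp)[pilots] - yp) ** 2

def nmse_db(h_hat, h):
    return 10 * np.log10(np.linalg.norm(h_hat - h) ** 2 / np.linalg.norm(h) ** 2)

before, est_before = pilot_loss(W), (W @ yp).copy()

# Pilot-aided TTT: a few gradient steps on the pilot loss. In this linear toy
# only the pilot rows of W receive gradient; the paper's 9.9 dB gap comes from
# a nonlinear model, which this sketch does not capture.
lr = 0.5 / np.linalg.norm(yp) ** 2
for _ in range(20):
    r = (W @ yp)[pilots] - yp
    W[pilots] -= lr * np.outer(r, yp.conj())
after = pilot_loss(W)

print(f"pilot loss: {before:.3e} -> {after:.3e}")
print(f"NMSE: {nmse_db(est_before, h):.1f} dB -> {nmse_db(W @ yp, h):.1f} dB")
```

The point of the sketch is the mechanism, not the magnitudes: the adaptation signal is free (the pilots are received anyway), so a small adapted model can track a distribution shift that a large static model cannot.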

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the nonlinear manifold dimension dNL of wireless channels (measured as 5-35 from indoor massive MIMO OTA data and 3GPP models) is the fundamental physical bottleneck on wireless foundation model size once data is sufficient. Using NTN satellite channels (dNL≈14) as an example, it reports that scaling gains saturate beyond ~30M parameters (with only 0.52 dB improvement from 96M to 150M) and that pilot-aided test-time training (TTT) allows a 12M model to outperform a static 96M model by 9.9 dB NMSE / 7.6 dB MCM.

Significance. If validated, the result would supply a physics-based rule for sizing wireless foundation models and prioritize inference-time adaptation over parameter scaling, with clear implications for efficient AI-native physical-layer systems. The grounding in Maxwell's equations plus finite scatterers and the direct dNL measurements on real OTA data are strengths that distinguish this from purely empirical scaling studies.

major comments (2)
  1. [NTN simulation results and scaling framework] NTN simulation results (scaling curves and TTT comparisons): all quantitative evidence for the claimed scaling knee at ~30M parameters, the 0.52 dB gain from 96M to 150M, and the 9.9 dB / 7.6 dB TTT advantage is obtained at a single fixed dNL≈14 in NTN simulations. No ablation varies dNL across the reported 5-35 range while holding architecture, training objective, and data volume fixed, nor are the same scaling sweeps repeated on the OTA datasets whose dNL was measured. This leaves open whether the observed saturation is caused by dNL or by NTN-specific channel generation, pilot structure, or loss function.
  2. [dNL measurement and validation] dNL estimation procedure: the central claim treats dNL as the causal, measurable limit, yet the manuscript provides insufficient detail on the exact estimation algorithm, hyperparameters, manifold-learning method, or error bars used to obtain the 5-35 range from OTA measurements and 3GPP models. Without this, it is impossible to assess whether the reported dNL values are robust or whether the NTN dNL≈14 is representative.
minor comments (2)
  1. [Simulation setup] The abstract and main text should explicitly state the precise simulation parameters (SNR values, pilot overhead, training data volume, optimizer settings) used to generate the scaling curves and TTT results.
  2. [Introduction and scaling framework] Notation for dNL should be introduced with a short formal definition (e.g., as the dimension of the nonlinear manifold on which the channel realizations lie) before its use in the scaling argument.
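On minor comment 2, one candidate formal definition, in our phrasing rather than the authors':

```latex
% Candidate definition of d_NL (our phrasing, not the paper's): channel
% realizations concentrate near the image of a smooth low-dimensional embedding.
\[
  \mathcal{M} \;=\; \bigl\{\, f(\theta) : \theta \in \Theta \subset \mathbb{R}^{d_{\mathrm{NL}}} \,\bigr\}
  \;\subset\; \mathbb{C}^{N_{\mathrm{ant}} \times N_{\mathrm{sub}}},
  \qquad
  d_{\mathrm{NL}} \;=\; \dim \mathcal{M},
\]
with $f$ smooth and injective, so that $d_{\mathrm{NL}}$ counts the effective
physical degrees of freedom and is independent of the ambient dimension
$N_{\mathrm{ant}} N_{\mathrm{sub}}$.
```

Any such definition would also pin down what "data-sufficient regime" means: enough samples to cover $\mathcal{M}$, not the ambient space.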

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful comments on our manuscript. We address each major comment below and indicate the revisions we will make to improve the clarity and completeness of the paper.

read point-by-point responses
  1. Referee: [NTN simulation results and scaling framework] NTN simulation results (scaling curves and TTT comparisons): all quantitative evidence for the claimed scaling knee at ~30M parameters, the 0.52 dB gain from 96M to 150M, and the 9.9 dB / 7.6 dB TTT advantage is obtained at a single fixed dNL≈14 in NTN simulations. No ablation varies dNL across the reported 5-35 range while holding architecture, training objective, and data volume fixed, nor are the same scaling sweeps repeated on the OTA datasets whose dNL was measured. This leaves open whether the observed saturation is caused by dNL or by NTN-specific channel generation, pilot structure, or loss function.

    Authors: We agree that the quantitative scaling results are demonstrated for the NTN case with dNL ≈ 14. This choice was made because NTN represents a practical scenario with well-defined channel models, allowing controlled large-scale training experiments. The dNL range of 5-35 is established from both OTA measurements and 3GPP models to show the general physical constraint. However, performing full scaling ablations across multiple dNL values while keeping all other factors fixed would require substantial additional computational resources. We will revise the manuscript to explicitly state that the scaling knee is observed for dNL ≈ 14 and to discuss how the saturation point is expected to shift with different dNL values based on the theoretical framework. We will also add a note on the limitations regarding direct validation on OTA data due to the scale of training required. revision: partial

  2. Referee: [dNL measurement and validation] dNL estimation procedure: the central claim treats dNL as the causal, measurable limit, yet the manuscript provides insufficient detail on the exact estimation algorithm, hyperparameters, manifold-learning method, or error bars used to obtain the 5-35 range from OTA measurements and 3GPP models. Without this, it is impossible to assess whether the reported dNL values are robust or whether the NTN dNL≈14 is representative.

    Authors: We acknowledge that the dNL estimation procedure requires more detailed description. In the revised manuscript, we will expand the relevant section to include the specific manifold learning method employed (e.g., the nonlinear dimensionality estimation technique), the hyperparameters used, the algorithm steps, and any error bars or robustness checks performed on the OTA data and 3GPP models. This will allow readers to better evaluate the reliability of the dNL range and its applicability to the NTN simulations. revision: yes
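The paper's reference list names two standard intrinsic-dimension estimators (the Levina-Bickel MLE and the TwoNN estimator of Facco et al.), so a TwoNN sketch illustrates the kind of procedure the promised revision would need to document. The synthetic data below is ours, not the paper's: a known 5-dimensional manifold embedded in a 64-dimensional ambient space, loosely mimicking low-dNL channel snapshots.

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimator (Facco et al. style).

    Uses only each point's two nearest-neighbor distances r1 <= r2; under
    the model P(mu) = d * mu^(-(d+1)) for mu = r2/r1, the maximum-likelihood
    estimate is d_hat = N / sum(log mu).
    """
    sq = (X ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(D2, np.inf)           # exclude self-distances
    D2.sort(axis=1)
    mu = np.sqrt(D2[:, 1] / D2[:, 0])      # r2 / r1 per point
    return len(X) / np.log(mu).sum()

rng = np.random.default_rng(0)
# Known ground truth: 5 latent degrees of freedom, nonlinearly embedded in
# 64 ambient dimensions, plus slight measurement noise.
latent = rng.standard_normal((1000, 5))
X = np.tanh(latent @ rng.standard_normal((5, 64))) \
    + 1e-3 * rng.standard_normal((1000, 64))
d_hat = twonn_id(X)
print(f"estimated intrinsic dimension: {d_hat:.1f}")   # ≈ 5, not 64
```

Exactly this kind of known-dimension sanity check, plus the estimator's hyperparameters and error bars on the OTA data, is what the referee is asking the revision to report.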

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper grounds its scaling-ceiling claim in an external physical constraint (Maxwell's equations plus finite scatterers and aperture limiting degrees of freedom), with dNL values (5-35) obtained from independent OTA measurements and 3GPP models. The concrete thresholds (~30 M parameters, 70 M asymptote, 0.52 dB gain from 96 M to 150 M) and TTT gains are presented as empirical observations from NTN simulations at a representative dNL ≈ 14, not as quantities algebraically computed from the measured dNL or fitted on the same data used to define the bottleneck. No self-definitional loop, fitted-input-renamed-as-prediction, or load-bearing self-citation chain appears; the physics argument and simulation results supply independent content.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on dNL as an empirically quantified physical limit plus the assumption that scaling laws observed in NTN simulations follow directly from it.

free parameters (1)
  • dNL = 5-35
    Nonlinear manifold dimension estimated from OTA measurements and 3GPP models; values 5-35 and ~14 for NTN are used to set scaling ceilings.
axioms (1)
  • domain assumption Maxwell's equations, finite scatterers, and antenna aperture constrain wireless propagation to a limited number of degrees of freedom.
    Invoked in the abstract to explain why dNL remains low compared with language models.
invented entities (1)
  • dNL (nonlinear manifold dimension of the channel) no independent evidence
    purpose: Quantifies the physical bottleneck that caps wireless foundation-model scaling.
    Introduced and estimated within the paper; no independent falsifiable prediction outside the reported measurements is given.

pith-pipeline@v0.9.0 · 5622 in / 1652 out tokens · 76746 ms · 2026-05-11T01:32:42.963483+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    Scaling Laws for Neural Language Models

    J. Kaplan et al., “Scaling laws for neural language models,” arXiv:2001.08361, 2020

  2. [2]

Training compute-optimal large language models

J. Hoffmann et al., “Training compute-optimal large language models,” in Proc. NeurIPS, 2022

  3. [3]

Large wireless model (LWM): A foundation model for wireless channels

A. Alkhateeb et al., “Large wireless model (LWM),” arXiv:2411.08872, 2024

  4. [4]

WiFo: Wireless foundation model for channel prediction

    B. Liu et al., “WiFo: Wireless foundation model for channel prediction,” Sci. China Inf. Sci., 2025

  5. [5]

Tiny-WiFo: Lightweight wireless foundation model via knowledge distillation

B. Liu et al., “Tiny-WiFo: Lightweight wireless foundation model via knowledge distillation,” arXiv:2511.04015, 2025

  6. [6]

WiMamba: Linear-scale wireless foundation model

    T. Raviv et al., “WiMamba: Linear-scale wireless foundation model,” arXiv:2603.26367, 2026

  7. [7]

Study on NR to support non-terrestrial networks

3GPP, “Study on NR to support non-terrestrial networks,” TR 38.811, v15.4.0, 2020

  8. [8]

Study on channel model for frequencies from 0.5 to 100 GHz

    3GPP, “Study on channel model for frequencies from 0.5 to 100 GHz,” TR 38.901, v16.1.0, 2020

  9. [9]

Estimating the intrinsic dimension of datasets by a minimal neighborhood information

E. Facco et al., “Estimating the intrinsic dimension of datasets by a minimal neighborhood information,” Scientific Reports, vol. 7, 2017

  10. [10]

Maximum likelihood estimation of intrinsic dimension

E. Levina and P. Bickel, “Maximum likelihood estimation of intrinsic dimension,” in Proc. NeurIPS, 2005

  11. [11]

DeepMIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications

A. Alkhateeb, “DeepMIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications,” in Proc. ITA Workshop, 2019

  12. [12]

A distributed massive MIMO channel sounder

    F. Euchner et al., “A distributed massive MIMO channel sounder,” in Proc. Asilomar, 2022

  13. [13]

Ultra-dense indoor massive MIMO CSI dataset

S. Jaeckel et al., “Ultra-dense indoor massive MIMO CSI dataset,” IEEE DataPort, 2024

  14. [14]

Intrinsic dimensionality explains the effectiveness of language model fine-tuning

A. Aghajanyan et al., “Intrinsic dimensionality explains the effectiveness of language model fine-tuning,” in Proc. ACL, 2021, pp. 7319–7328