arxiv: 2605.04651 · v2 · submitted 2026-05-06 · 💻 cs.LG · cs.CL

Recognition: no theorem link

FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

Guangsheng Bao, Han Cui, Hongbo Zhang, Juncai He, Yanbin Zhao, Yue Zhang

Pith reviewed 2026-05-11 01:46 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords fast weightstest-time adaptationforward-only learningassociative learningsupervised adaptationpretrained modelsefficient inferenceclosed-form adaptation

0 comments

The pith

FAAST performs supervised adaptation of pretrained models by analytically compiling labeled examples into fast weights from a single forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FAAST as an alternative to costly backpropagation or memory-heavy context methods for adapting pretrained models to new supervised tasks at test time. It computes fast weights in closed form during one forward pass over labeled data, then uses those weights for inference without further optimization or example storage. This decouples the adaptation step from the original model parameters and yields constant-time inference. If the approach holds, adaptation becomes feasible on devices with tight compute and memory limits while preserving accuracy on image classification and language modeling tasks.

Core claim

FAAST analytically computes fast weights from a single forward pass over labeled examples to perform supervised adaptation at test time. This forward-only method matches or exceeds the performance of backpropagation-based fine-tuning while cutting adaptation time by more than 90 percent and uses up to 95 percent less memory than context or memory-based approaches, across image classification and language modeling benchmarks.

What carries the argument

Closed-form fast weights computed associatively from a single forward pass over labeled examples, which encode task-specific input-to-label mappings without iterative gradients or stored context.

If this is right

Adaptation time drops by more than 90 percent relative to backpropagation methods.
Memory footprint shrinks by up to 95 percent compared with memory or context-based adaptation.
Inference runs in constant time independent of the number of adaptation examples.
Task-specific information is separated from the pretrained model's representations.
The same procedure applies to both image classification and language modeling without task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method suggests that associative compilation of examples can substitute for gradient-based updates in many adaptation settings.
Constant-time inference could support repeated adaptation on the same device without accumulating costs.
Decoupling adaptation from the base model may simplify combining multiple specialized tasks.
The approach opens a route to test-time adaptation on hardware that cannot support backpropagation.

Load-bearing premise

That fast weights calculated analytically from one forward pass on labeled examples hold enough task-specific information to match results from iterative backpropagation or stored memory methods.

What would settle it

A benchmark result where FAAST accuracy falls well below backpropagation-based adaptation on the same set of labeled examples for a standard image classification or language modeling task.

Figures

Figures reproduced from arXiv: 2605.04651 by Guangsheng Bao, Han Cui, Hongbo Zhang, Juncai He, Yanbin Zhao, Yue Zhang.

**Figure 1.** Figure 1: Comparison of downstream task adaptation paradigms. tion, where task-specific associations are encoded as learned weights via iterative gradient descent. Figure (b) illustrates memory- or context-based adaptation, which injects task information through memory lookup or in-context attention at inference time, incurring costs that scale with the number of examples. Figure (c) presents FAAST, which compiles … view at source ↗

**Figure 2.** Figure 2: FAAST module and the integration with pretrained neural networks. For classification problems, the y is a class label, which probability is computed via an attention head p(y | xi) = exp(h ⊤ i vy) PK c=1 exp(h ⊤ i vc) . (5) The projection matrix W is learned by minimizing crossentropy loss using gradient-based optimization. This linear projection functions as an implicit associative memory (Hopfield, 1982… view at source ↗

**Figure 3.** Figure 3: GPT2 model size and mem layers vs. perplexity on WikiText-103. improves accuracy from 59.6% (ICL, 1-shot) to 78.5%, and further to 80.8% with 5-shot adaptation; similar trends are observed on IMDB, where FAAST achieves 86.7% accuracy in the 1-shot setting, surpassing both zero-shot and ICL baselines by a large margin. Under full-data adaptation, FAAST consistently exceeds zero-shot GPT2-XL, reaching 87.5%… view at source ↗

**Figure 4.** Figure 4: FAAST vs. Linear (backprop). The std represents the variance of accuracy across episodes. Generalization in Few-Shot Settings. FAAST demonstrates a clear advantage in few-shot scenarios, where backpropagation-based training often suffers from severe overfitting due to limited data. As illustrated in view at source ↗

**Figure 5.** Figure 5: Filter noisy components under a threshold ϵ. Experiments are conducted with N0 = 0 to avoid the influence of prior. Generalization to Arbitrarily Defined Labels. CLIP relies on a pretrained semantic alignment between image and text embeddings. When this alignment is broken by assigning arbitrary class names, zero-shot performance degrades to near chance. On mini-ImageNet, using WordNet IDs (e.g., n02119789… view at source ↗

**Figure 6.** Figure 6: Training dynamics for different memory size and update discount settings batch of memory items removed – only when the memory reaches its maximum capacity. The update discount plays a crucial role in preventing older fast weights from dominating the model. During training, as memory grows, the relative contribution of newly added batches diminishes. To mitigate the influence of outdated fast weights comput… view at source ↗

read the original abstract

Adapting pretrained models typically involves a trade-off between the high training costs of backpropagation and the heavy inference overhead of memory-based or in-context learning. We propose FAAST, a forward-only associative adaptation method that analytically compiles labeled examples into fast weights in a single pass. By eliminating memory or context dependence, FAAST achieves constant-time inference and decouples task adaptation from pretrained representation. Across image classification and language modeling benchmarks, FAAST matches or exceeds backprop-based adaptation while reducing adaptation time by over 90% and is competitive to memory/context-based adaptation while saving memory usage by up to 95%. These results demonstrate FAAST as a highly efficient, scalable solution for supervised task adaptation, particularly for resource-constrained models. We release the code and models at https://github.com/baoguangsheng/faast.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces FAAST, a forward-only associative adaptation method that analytically compiles labeled examples into closed-form fast weights via a single forward pass on pretrained models. It claims to match or exceed backprop-based test-time adaptation in performance on image classification and language modeling benchmarks while reducing adaptation time by over 90%, and to remain competitive with memory/context-based methods while using up to 95% less memory, with constant-time inference and decoupling of adaptation from the base representation. Code and models are released.

Significance. If the central analytical construction and empirical results hold, FAAST offers a notable efficiency advance for supervised test-time adaptation in resource-constrained settings. The parameter-free closed-form derivation, single-pass nature, and released code provide direct support for the claimed time and memory savings and enhance reproducibility.

minor comments (2)

[Abstract] Abstract: the claim of 'matches or exceeds' performance would be strengthened by a brief parenthetical reference to the specific closed-form expression used for the fast weights.
[§4] §4 (Experiments): confirm that all reported gains include standard deviations across multiple runs or seeds, particularly for the >90% time reduction and 95% memory savings figures.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of the efficiency claims, and recommendation for minor revision. The referee's assessment aligns with our intended contributions regarding the closed-form fast weights and resource savings. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core method is an analytical closed-form computation of fast weights from a single forward pass on labeled examples, presented as parameter-free and independent of iterative optimization or memory storage. No equations or steps reduce predictions to fitted inputs by construction, nor rely on load-bearing self-citations whose validity is internal to the work. Benchmarks and released code provide external verifiability for efficiency claims. The derivation remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review prevents identification of specific free parameters, axioms, or invented entities beyond the high-level concept of fast weights.

invented entities (1)

fast weights no independent evidence
purpose: Analytically derived parameters that encode task adaptation for constant-time inference
Core mechanism introduced in the method; no independent evidence or falsifiable prediction given in abstract.

pith-pipeline@v0.9.0 · 5454 in / 1109 out tokens · 46930 ms · 2026-05-11T01:46:13.524105+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 7 internal anchors

[1]

Understanding intermediate layers using linear classifier probes

Alain, G. and Bengio, Y . Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644,

work page Pith review arXiv
[2]

Z., Rae, J., Wierstra, D., and Hass- abis, D

Blundell, C., Uria, B., Pritzel, A., Li, Y ., Ruderman, A., Leibo, J. Z., Rae, J., Wierstra, D., and Hass- abis, D. Model-free episodic control.arXiv preprint arXiv:1606.04460,

work page arXiv
[3]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

work page 1901
[4]

Overview of the iwslt 2017 evaluation campaign

Cettolo, M., Federico, M., Bentivogli, L., Niehues, J., St¨uker, S., Sudoh, K., Yoshino, K., and Federmann, C. Overview of the iwslt 2017 evaluation campaign. InPro- ceedings of the 14th International Conference on Spoken Language Translation, pp. 2–14,

work page 2017
[5]

Bert: Pre-training of deep bidirectional transformers for lan- guage understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 confer- ence of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186,

work page 2019
[6]

Neural Turing Machines

Graves, A., Wayne, G., and Danihelka, I. Neural turing machines.arXiv preprint arXiv:1410.5401,

work page internal anchor Pith review arXiv
[7]

World Models

Ha, D. and Schmidhuber, J. World models.arXiv preprint arXiv:1803.10122, 2(3),

work page internal anchor Pith review arXiv
[8]

The forward-forward algorithm: Some preliminary investigations.arXiv preprint arXiv:2212.13345,

Hinton, G. The forward-forward algorithm: Some prelimi- nary investigations.arXiv preprint arXiv:2212.13345, 2 (3):5,

work page arXiv
[9]

Hopfield, J. J. Hopfield network.Scholarpedia, 2(5):1977,

work page 1977
[10]

Universal language model fine-tuning for text classification

Howard, J. and Ruder, S. Universal language model fine-tuning for text classification.arXiv preprint arXiv:1801.06146,

work page arXiv
[11]

Generalization through memorization: Nearest neighbor language models

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through memorization: Nearest neighbor language models.arXiv preprint arXiv:1911.00172,

work page arXiv 1911
[12]

Revisiting self- supervised visual representation learning

Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting self- supervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1920–1929,

work page 1920
[13]

The Power of Scale for Parameter-Efficient Prompt Tuning

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning.arXiv preprint arXiv:2104.08691,

work page internal anchor Pith review arXiv
[14]

Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation.arXiv preprint arXiv:2101.00190,

work page internal anchor Pith review arXiv
[15]

Pointer Sentinel Mixture Models

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review arXiv
[16]

Oord, A. v. d., Li, Y ., and Vinyals, O. Representation learn- ing with contrastive predictive coding.arXiv preprint arXiv:1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation.arXiv preprint arXiv:2108.12409,

work page internal anchor Pith review arXiv
[18]

D., Ng, A

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y ., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642,

work page 2013
[19]

11 Published as a conference paper at ICLR 2020 A M ULTI-ROUND LSH A TTENTION In this section we describe in more detail the multi-hash version of our LSH attention mechanism

Weston, J., Chopra, S., and Bordes, A. Memory networks. arXiv preprint arXiv:1410.3916,

work page arXiv
[20]

11 Forward-Only Associative Learning for Test-Time Adaptation A. Related Work While individual components of FAAST – associative memory, fast weights, frozen representations, and pseudoinverse solutions – have been studied in isolation, prior work does not simultaneously achieve forward-only learning, closed-form associative memory, non-parametric storage...

work page 2019
[21]

However, these methods still require gradient-based optimization

reduce the cost of downstream adaptation by introducing small trainable parameter sets. However, these methods still require gradient-based optimization. FAAST computes task-specific mappings analytically via forward-only associative memory, eliminating the need for any parameter training for downstream adaptation. In-Context Learning and Test-Time Adapta...

work page 2020
[22]

Fast weights and associative memories.Associative memory has a long history, from Hopfield networks (Hopfield, 1982

exist but rely on stochastic search rather than deterministic closed-form fast weights. Fast weights and associative memories.Associative memory has a long history, from Hopfield networks (Hopfield, 1982

work page 1982
[23]

and Hebbian learning (Hebb, 1949; Kanter & Sompolinsky, 1987; Personnaz et al.,

work page 1949
[24]

Traditional approaches rely on iterative updates or learned plasticity rules

to modern fast-weight models (Schmidhuber, 1992; Ba et al., 2016). Traditional approaches rely on iterative updates or learned plasticity rules. FAAST differs by computing task-specific fast weights analytically from stored key-value pairs in a single forward pass, yielding deterministic, optimizer-free adaptation. Pseudoinverse-based associative memories...

work page 1992
[25]

support high-fidelity retrieval but have not been combined with pretrained representations or inference-time compression. Non-parametric memory and retrieval-augmented models.Memory-augmented neural networks, such as Neural Turing Machines (Graves et al., 2014), Memory Networks (Weston et al., 2014), Differentiable Neural Computers (Graves et al., 2016), ...

work page 2014
[26]

FAAST compresses all stored associations into a single fast-weight matrix, eliminating memory queries at inference while retaining the ability to adapt to new tasks

and RAG (Lewis et al., 2020), also rely on querying stored key-value pairs at inference time. FAAST compresses all stored associations into a single fast-weight matrix, eliminating memory queries at inference while retaining the ability to adapt to new tasks. 12 Forward-Only Associative Learning for Test-Time Adaptation B. Theoretical Foundations This sec...

work page 2020
[27]

A photo of a {label}

as the backbone model, using frozen image and text encoders. Image embeddings serve as keys, and text embeddings of class prompts “A photo of a {label}.” serve as values. All adaptation is performed on these fixed representations. Baselines.All methods operate on identical frozen features to isolate the effect of associative memory.CLIP zero- Shotmakes pr...

work page 2021
[28]

For k-NN, softmax memory, and FAAST, predictions are linearly interpolated with CLIP zero-shot predictions using the same prior count N0

with k= min(n,10) .Softmax memorydoes attention-based retrieval (Vaswani et al., 2017). For k-NN, softmax memory, and FAAST, predictions are linearly interpolated with CLIP zero-shot predictions using the same prior count N0. We set N0 to 40 times the number of classes, yielding N0 = 400 for CIFAR-10 and N0 = 800 for mini-ImageNet. All other hyperparamete...

work page 2017
[29]

We use the 20-class test split only, randomly dividing each class into equal size, obtaining 6,000 samples for support set and another 6,000 for query set

datasets.CIFAR-10contains 10 classes with 50,000 training and 10,000 test images; we use the training split as support set and the test split as query set.mini-ImageNetcontains 100 classes. We use the 20-class test split only, randomly dividing each class into equal size, obtaining 6,000 samples for support set and another 6,000 for query set. We evaluate...

work page 2016
[30]

Backprop Model Training.Table 7 summarizes the training configurations for backpropagation-based baselines in language modeling

During training, to avoid the dominance of historical fast weights computed using outdated readout and weighting, we apply a discount to incremental updateN t before each update, with an empirical value of 0.9. Backprop Model Training.Table 7 summarizes the training configurations for backpropagation-based baselines in language modeling. Both linear proje...

work page 1955