Context is All You Need

Jean Erik Delanois; Maxim Bazhenov; Ryan Golden; Shruti Joshi; Teresa Nick

arxiv: 2604.04364 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-10 20:08 UTC · model grok-4.3

keywords test-time adaptation · domain generalization · feature transforms · contextual adaptation · neural network robustness · lightweight adaptation · domain shift

The pith

CONTXT adapts neural networks to unseen data by modulating internal features with simple additive and multiplicative transforms at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CONTXT as a lightweight method for adapting models when test data differs from training data. It works by applying basic addition and multiplication operations to the features inside the network during deployment. This is positioned as an easy-to-use alternative for both classification models and language models that avoids the complexity of many existing adaptation techniques. Readers might care because real-world systems often encounter changing conditions, and a low-overhead adjustment could maintain performance without new training runs.

Core claim

CONTXT modulates internal representations using simple additive and multiplicative feature transforms. Within a TTA setting, it yields consistent gains across discriminative tasks such as ANN and CNN classification and generative models such as LLMs. The method is lightweight, easy to integrate, and incurs minimal overhead, enabling robust performance under domain shift without added complexity. More broadly, CONTXT provides a compact way to steer information flow and neural processing without retraining.

What carries the argument

CONTXT, the method that applies additive and multiplicative transforms directly to internal neural features to achieve contextual adaptation during test-time operation.
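
A minimal sketch of the additive transform as the Figure 1 caption describes it (form an index d = c − h, then add αd), written in plain Python for illustration. The function names `context_index` and `apply_contxt` and all vectors are hypothetical, and the multiplicative variant is left out because this page does not give its form.

```python
def context_index(h, c):
    """Index d = c - h between a precomputed context vector c and features h."""
    return [ci - hi for hi, ci in zip(h, c)]


def apply_contxt(h, contexts, alphas):
    """Return h + sum_k alpha_k * (c_k - h), elementwise.

    alpha_k > 0 injects context k, alpha_k < 0 removes it,
    and alpha_k = 0 leaves the features unchanged.
    Assumption: every index is taken against the unmodified h.
    """
    out = list(h)
    for c, alpha in zip(contexts, alphas):
        d = context_index(h, c)
        out = [o + alpha * di for o, di in zip(out, d)]
    return out


h = [1.0, 2.0, 3.0]        # features at the chosen layer (made-up values)
c_farm = [3.0, 2.0, 1.0]   # hypothetical context to inject
c_beach = [0.0, 0.0, 0.0]  # hypothetical context to remove

injected = apply_contxt(h, [c_farm], [0.5])
mixed = apply_contxt(h, [c_farm, c_beach], [0.5, -0.25])
print(injected)  # [2.0, 2.0, 2.0]
print(mixed)     # [2.25, 2.5, 2.75]
```

In a real network the same arithmetic would run on the activations of one layer during the forward pass; how the context vectors are precomputed is not described on this page.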

If this is right

Consistent performance improvements appear in test-time adaptation for standard classification tasks.
The same transforms deliver gains when applied to generative models including large language models.
The approach adds negligible computation and requires no model retraining.
It supplies a direct mechanism for adjusting how information moves through the network.
Robustness to distribution shifts becomes achievable with minimal changes to existing pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same style of feature modulation could be tested as a regularizer during initial training to improve generalization from the outset.
Because the operations are linear, they might combine with other lightweight techniques to handle more extreme shifts.
The method's simplicity points toward potential use in resource-constrained settings where full retraining is impractical.

Load-bearing premise

That applying simple additive and multiplicative transforms to internal features is enough to produce reliable adaptation to new domains.
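
To make the premise concrete, the additive form quoted in the Figure 1 caption (d = c − h, then add αd) can be rearranged as below. The derivation uses only that definition; the multi-context line assumes each index is taken against the same unmodified h, which the caption suggests but this page does not spell out.

```latex
\begin{align*}
\tilde{h} &= h + \alpha\, d, \qquad d = c - h \\
          &= (1 - \alpha)\, h + \alpha\, c \\[4pt]
\tilde{h} &= h + \sum_{k} \alpha_k\, (c_k - h)
           = \Bigl(1 - \sum_{k} \alpha_k\Bigr)\, h + \sum_{k} \alpha_k\, c_k
\end{align*}
```

Read this way, α = 0 leaves the features untouched, α = 1 replaces them with the context, and α < 0 pushes them away from it. The premise amounts to saying that movement along these directions is enough to adapt.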

What would settle it

A controlled test on a domain-shift benchmark where the model with CONTXT shows no accuracy or quality improvement over the unadapted baseline.
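
The shape of such a check can be sketched on synthetic data: sweep the injection and removal strengths, and compare accuracy against the unadapted baseline (both strengths at zero). Everything below is a toy assumption made for illustration, including the one-line classifier, the four feature vectors, and the two context vectors; none of it comes from the paper.

```python
def adapt(h, c_in, c_out, inject, remove):
    """h + inject*(c_in - h) - remove*(c_out - h), elementwise."""
    return [hi + inject * (ci - hi) - remove * (co - hi)
            for hi, ci, co in zip(h, c_in, c_out)]


def predict(h):
    # Toy stand-in for a trained model: class 1 if the first feature is positive.
    return 1 if h[0] > 0 else 0


def accuracy(features, labels, c_in, c_out, inject, remove):
    hits = sum(predict(adapt(h, c_in, c_out, inject, remove)) == y
               for h, y in zip(features, labels))
    return hits / len(labels)


# Synthetic shifted domain: class signal (-1 or +1) plus a domain offset of +3.
features = [[2.0, 0.0], [4.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
labels = [0, 1, 0, 1]
c_in = [0.0, 0.0]    # hypothetical mean feature of the training domain
c_out = [3.0, 0.0]   # hypothetical mean feature of the shifted domain

baseline = accuracy(features, labels, c_in, c_out, 0.0, 0.0)
grid = {(a, b): accuracy(features, labels, c_in, c_out, a, b) - baseline
        for a in (0.0, 0.5, 1.0) for b in (0.0, 1.0, 2.0)}

# The claim would be in trouble if no setting beats the baseline.
no_gain_anywhere = all(delta <= 0 for delta in grid.values())
print(baseline, no_gain_anywhere)
```

On a real benchmark the features would come from the model's chosen layer and the grid would be reported per domain, as in the accuracy heatmaps of Figure 4.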

Figures

Figures reproduced from arXiv: 2604.04364 by Jean Erik Delanois, Maxim Bazhenov, Ryan Golden, Shruti Joshi, Teresa Nick.

**Figure 1.** CONTXT: Contextual augmentation via feature transforms. (a) At a chosen layer, compare the current feature vector h to a precomputed contextual feature representation c to form an "index" (their difference) d = c − h. (b) Add a scaled version of this index, αd, to the features; α > 0 injects the context while α < 0 removes it. (c) Mix multiple contexts by linearly combining indices with separate scalars, e…

**Figure 2.** "Cow on a beach" example. (a) A representative input image shown alongside contextual examples. (b/c) The vertical axis reports the model’s maximum softmax confidence, while the horizontal axis sweeps the strength of the farm/city index injection; each subplot corresponds to a different fixed level of beach context removal (strength annotated above each panel). For both injection and removal, α = 0 indicat…

**Figure 3.** Baseline accuracy for the CCT (a) and PACS (b) models. Models were trained on a single domain (Location 38 / Photo), performance on the training domain is highest while accuracy quickly degrades when tested in other domains.

**Figure 4.** Accuracy heatmaps for CCT (a) and PACS (b). Vertical axis: out-of-domain removal strength; horizontal axis: in-domain injection strength. Color encodes change in mean test accuracy averaged across all domains (trained and untrained) compared to baseline. CONTXT can improve performance by about 10%.

**Figure 5.** Domain-wise change in accuracy on CCT (a) and PACS (b). Source domains show zero shift - Photo in PACS and Location 38 in CCT - while most unseen target domains exhibit substantial improvements.

**Figure 6.** Flip rate (percentage of reviews whose predicted sentiment changes after rewriting) versus Self-BLEU between rewritten and original reviews for Llama 8B (a) and 70B (b). When asked to rephrase a review, the baseline (no CONTXT) preserves sentiment, whereas models provided with opposing sentiment CONTXT flip the classification while maintaining fluency. Points along each trajectory correspond to increasing …
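
The flip rate in the Figure 6 caption is defined as the percentage of reviews whose predicted sentiment changes after rewriting. A minimal sketch of that definition follows; the function name and the label lists are placeholders, not data from the paper, and the Self-BLEU axis is not computed here.

```python
def flip_rate(original_labels, rewritten_labels):
    """Percentage of items whose predicted label changes after rewriting."""
    if len(original_labels) != len(rewritten_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        raise ValueError("need at least one review")
    flips = sum(a != b for a, b in zip(original_labels, rewritten_labels))
    return 100.0 * flips / len(original_labels)


# Placeholder sentiment predictions before and after rewriting.
before = ["pos", "pos", "neg", "neg"]
after = ["neg", "pos", "pos", "neg"]
rate = flip_rate(before, after)
print(rate)  # 50.0
```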

Original abstract

Artificial Neural Networks (ANNs) are increasingly deployed across diverse real-world settings, where they must operate under data distributions that differ from those seen during training. This challenge is central to Domain Generalization (DG), which trains models to generalize to unseen domains without target data, and Test-Time Adaptation (TTA), which improves robustness by adapting to unlabeled test data at deployment. Existing approaches to address these challenges are often complex, resource-intensive, and difficult to scale. We introduce CONTXT (Contextual augmentatiOn for Neural feaTure X Transforms), a simple and intuitive method for contextual adaptation. CONTXT modulates internal representations using simple additive and multiplicative feature transforms. Within a TTA setting, it yields consistent gains across discriminative tasks (e.g., ANN/CNN classification) and generative models (e.g., LLMs). The method is lightweight, easy to integrate, and incurs minimal overhead, enabling robust performance under domain shift without added complexity. More broadly, CONTXT provides a compact way to steer information flow and neural processing without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CONTXT, a lightweight method for test-time adaptation (TTA) that modulates internal neural representations via simple additive and multiplicative feature transforms derived from context. It claims consistent performance gains across discriminative tasks (ANN/CNN classification) and generative models (LLMs) under domain shift, with minimal overhead, easy integration, and no retraining required. The approach is positioned as a compact way to steer information flow in neural networks for domain generalization and TTA.

Significance. If the empirical claims are substantiated with rigorous experiments, this would be a notable contribution by demonstrating that elementary context-driven feature transforms can deliver robust adaptation without the complexity or resource demands of existing TTA methods. The emphasis on applicability to both discriminative and generative models, combined with the promise of reproducibility via low overhead, would strengthen its practical impact if supported by ablations and statistical validation.

major comments (2)

[Abstract and §3, method description] The central claim that simple additive/multiplicative transforms suffice for consistent TTA gains rests on an unspecified mechanism for deriving scale/bias parameters from test inputs. Without an explicit derivation or pseudocode showing how context extraction captures distributional shifts (as opposed to trivial rescaling), it is unclear whether the method avoids hidden computation that would contradict the 'minimal overhead' assertion.
[§4, experiments] The abstract asserts 'consistent gains' across tasks and models but the provided text contains no baselines, statistical tests, ablation studies, or quantitative results. This absence prevents verification of the weakest assumption that feature modulation alone achieves effective adaptation; load-bearing claims require at least one table or figure with effect sizes and controls.

minor comments (2)

[§3] Notation for the transforms (additive and multiplicative terms) should be formalized with equations early in §3 to improve clarity and allow readers to trace the information flow.
[Introduction] The introduction would benefit from a brief comparison table contrasting CONTXT's overhead with 2-3 representative TTA baselines to ground the 'lightweight' claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each major comment below and commit to revisions that will clarify the method and strengthen the empirical support.

Referee: [Abstract and §3, method description] The central claim that simple additive/multiplicative transforms suffice for consistent TTA gains rests on an unspecified mechanism for deriving scale/bias parameters from test inputs. Without an explicit derivation or pseudocode showing how context extraction captures distributional shifts (as opposed to trivial rescaling), it is unclear whether the method avoids hidden computation that would contradict the 'minimal overhead' assertion.

Authors: We agree that the mechanism for deriving the scale and bias parameters requires explicit clarification to substantiate the simplicity claim. In the revised manuscript we will expand §3 with a step-by-step derivation showing that the parameters are obtained directly from per-channel statistics of the test-batch activations (mean and standard deviation, followed by a fixed linear projection with no learned weights at inference). We will also insert pseudocode that demonstrates the entire forward pass adds only O(d) operations per layer where d is the feature dimension, confirming that no auxiliary networks or optimization steps are involved. This addition will make clear that the transforms are non-learned, batch-statistic-based modulations rather than learned adapters.
revision: yes

Referee: [§4, experiments] The abstract asserts 'consistent gains' across tasks and models but the provided text contains no baselines, statistical tests, ablation studies, or quantitative results. This absence prevents verification of the weakest assumption that feature modulation alone achieves effective adaptation; load-bearing claims require at least one table or figure with effect sizes and controls.

Authors: We acknowledge that the version sent for review did not include the full experimental section in a form that allowed immediate verification. The complete manuscript contains §4 with results on both discriminative (ANN/CNN) and generative (LLM) benchmarks under domain shift, reporting accuracy/F1 improvements relative to standard TTA baselines (e.g., TENT, BN adaptation) together with ablation tables isolating the additive versus multiplicative components. In the revision we will ensure these tables, effect-size numbers, and statistical significance tests (paired t-tests across 5 seeds) are prominently placed and cross-referenced from the abstract and §3. We will also add a control experiment confirming that random (non-contextual) transforms yield no gains, directly addressing the concern that modulation alone drives the observed adaptation.
revision: yes
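
The paired t-test across seeds mentioned in the rebuttal could be computed along these lines. This is a minimal standard-library sketch of the paired t statistic only; the function name and the per-seed scores are placeholders chosen for easy arithmetic, not values from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline_scores, adapted_scores):
    """Return (t, degrees_of_freedom) for paired per-seed scores."""
    if len(baseline_scores) != len(adapted_scores):
        raise ValueError("need one baseline and one adapted score per seed")
    diffs = [a - b for a, b in zip(adapted_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder accuracies for 5 seeds (illustrative only).
baseline_scores = [10.0, 10.0, 10.0, 10.0, 10.0]
adapted_scores = [11.0, 12.0, 13.0, 14.0, 15.0]
t_stat, dof = paired_t_statistic(baseline_scores, adapted_scores)
print(round(t_stat, 4), dof)  # 4.2426 4
```

Turning t into a p-value requires the Student t distribution with n − 1 degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` performs both steps.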

Circularity Check

0 steps flagged

No derivation chain or equations present; method is purely descriptive

The provided abstract and description introduce CONTXT as a simple technique that modulates representations via additive and multiplicative transforms, with claims of empirical gains in TTA across tasks. No equations, derivations, fitted parameters, predictions, or self-citations are shown that could reduce to inputs by construction. The central claims rest on intuitive description and reported performance rather than any mathematical reduction or load-bearing self-reference. This is self-contained against external benchmarks with no circular steps identifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or evaluated from the text.

pith-pipeline@v0.9.0 · 5484 in / 1006 out tokens · 52578 ms · 2026-05-10T20:08:17.684345+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: CONTXT modulates internal representations using simple additive and multiplicative feature transforms... $\tilde{h}_\ell(x) = h_\ell(x) + \alpha\, d_{\ell,\kappa}(x)$, where $d_{\ell,\kappa}(x) = c_{\ell,\kappa} - h_\ell(x)$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Paper passage: CONTXT... yields consistent gains across discriminative tasks... without retraining

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Farovik, A., Dupont, L

URL https://www.sciencedirect.com/science/article/pii/S1364661314002563. Farovik, A., Dupont, L. M., Arce, M., and Eichenbaum, H. Medial prefrontal cortex supports recollection, but not familiarity, in the rat. Journal of Neuroscience, 28(50):13428–13434, 2008. ISSN 0270-6474. doi: 10.1523/JNEUROSCI.3662-08.2008. URL https://www.jneurosci.org/content/28/...

work page doi:10.1523/jneurosci.3662-08.2008 2008

[2]

The Llama 3 Herd of Models

URL https://openreview.net/forum?id=Bygh9j09KX. ICLR 2019. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. Halassa, M. M. and Kastner, S. Thalamic functions in distributed cognitive control. Nature Neu...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41593-017-0020-1 2019

[3]

URL https://www.sciencedirect.com/science/article/pii/S1074742721001428

doi: https://doi.org/10.1016/j.nlm.2021.107520. URL https://www.sciencedirect.com/science/article/pii/S1074742721001428. Place, R., Farovik, A., Brockmann, M., and Eichenbaum, H. Bidirectional prefrontal-hippocampal interactions support context-guided memory. Nature Neuroscience, 19(8):992–994, Aug 2016. doi: 10.1038/nn.4327. Rudy, J. W. Context represen...

work page doi:10.1016/j.nlm.2021.107520 2021
