pith. machine review for the scientific record.

arxiv: 2605.09727 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords non-linear transformers · in-context reinforcement learning · RKHS · kernel temporal difference learning · cross-domain generalization · meta reinforcement learning · value function representation

The pith

Non-linear transformers represent value functions from different RL domains with shared weights when those functions lie in the same RKHS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects non-linear transformers to kernel-based temporal difference learning by treating the transformer as a functional operator that performs regression inside a Reproducing Kernel Hilbert Space. This view shows that value functions drawn from separate task domains can be expressed using one common set of weights provided they belong to that shared RKHS. The connection supplies a concrete mechanism for in-context adaptation in reinforcement learning without any parameter updates at test time. A reader cares because the same operator that lets a transformer solve new tasks from a prompt can now be understood as unifying value functions across domains rather than learning them separately.
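
To make the operator view concrete: a softmax-style attention read over a context of (state, value-target) pairs is exactly Nadaraya-Watson kernel regression, the textbook bridge between attention and RKHS methods. The sketch below illustrates that bridge only; the Gaussian kernel, the temperature delta, and the toy data are assumptions, not the paper's exact construction.

```python
# Minimal sketch: one softmax-style attention read over a context equals
# Nadaraya-Watson kernel regression. Kernel choice and data are
# illustrative assumptions, not the paper's exact construction.
import numpy as np

def kappa(x, c, delta=1.0):
    """Gaussian kernel kappa(x, c) = exp(-||x - c||^2 / (2 delta^2))."""
    d = x - c
    return np.exp(-np.dot(d, d) / (2.0 * delta**2))

def attention_read(query, keys, values, delta=1.0):
    """Kernel-weighted (softmax-style) average of the context values."""
    scores = np.array([kappa(query, k, delta) for k in keys])
    weights = scores / scores.sum()    # normalized kernel similarities
    return weights @ values            # predicted value at `query`

# Context prompt: states paired with stand-in value targets for one task.
rng = np.random.default_rng(0)
states = rng.normal(size=(32, 4))
targets = np.sin(states.sum(axis=1))
print(attention_read(rng.normal(size=4), states, targets))
```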

Core claim

By interpreting the transformer as performing regression in a Reproducing Kernel Hilbert Space (RKHS), we show that value functions from different domains can be represented using a shared set of weights, provided they lie within the same RKHS. Experiments on multiple MetaWorld domains support this interpretation, demonstrating convergence of the temporal-difference objective.

What carries the argument

The transformer viewed as an RKHS regressor that maps a context prompt to a task-specific value function, allowing weight sharing across domains.

If this is right

  • Value functions across domains share a single set of weights inside one RKHS.
  • In-context learning with transformers produces task-specific value functions without gradient updates.
  • The temporal-difference objective converges when the transformer operates under the shared RKHS view (a minimal sketch of the kernel TD iteration follows this list).
  • Cross-domain generalization follows directly from the functional operator being domain-agnostic inside the RKHS.
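
A minimal sketch of the kernel TD machinery behind the second and third bullets: the value function is held as a finite RKHS expansion V(s) = sum_j alpha_j * kappa(c_j, s), and TD(0) updates its coefficients. The Gaussian kernel, fixed centroids, step size, and toy Markov reward process are all illustrative assumptions; the paper's claim is that fixed transformer weights implement the analogous iteration in-context.

```python
# Hedged sketch of kernel TD(0): V(s) = sum_j alpha_j * kappa(c_j, s), with
# the TD iteration acting on the coefficients alpha. Centroids, kernel,
# step size, and the toy MRP are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(1)
centroids = rng.normal(size=(16, 2))   # fixed RKHS expansion centers
alpha = np.zeros(16)                   # coefficients of V in the RKHS
gamma, eta, delta = 0.95, 0.05, 1.0

def features(s):
    """Kernel sections kappa(c_j, s) for all centroids c_j."""
    sq = ((centroids - s) ** 2).sum(axis=1)
    return np.exp(-sq / (2.0 * delta**2))

def V(s):
    return alpha @ features(s)

s = rng.normal(size=2)
for _ in range(5000):                  # toy MRP: contracting drift plus noise
    s_next = 0.9 * s + 0.1 * rng.normal(size=2)
    r = -np.linalg.norm(s_next)        # stand-in reward
    td_error = r + gamma * V(s_next) - V(s)
    alpha += eta * td_error * features(s)   # kernel TD(0) coefficient update
    s = s_next
print(V(np.zeros(2)))                  # learned value near the origin
```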

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures could be explicitly regularized toward RKHS properties to enlarge the set of domains that can share weights.
  • The same unification might extend to other in-context settings where underlying functions share a reproducing kernel.
  • Measuring the RKHS distance between value functions of real environments would give a practical test for when the shared-weight regime holds.
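
A minimal sketch of that measurement, assuming both value functions are already available as kernel expansions under one candidate Gaussian kernel; the expansion points and coefficients below are synthetic stand-ins. For f = sum_i alpha_i kappa(x_i, .) and g = sum_j beta_j kappa(y_j, .), the squared RKHS distance is alpha^T K_xx alpha - 2 alpha^T K_xy beta + beta^T K_yy beta.

```python
# Hedged sketch of the RKHS-distance test: squared distance between two
# kernel expansions computed from Gram matrices. Kernel choice and the
# sample data are illustrative assumptions.
import numpy as np

def gram(A, B, delta=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 delta^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * delta**2))

def rkhs_distance_sq(alpha, X, beta, Y, delta=1.0):
    return (alpha @ gram(X, X, delta) @ alpha
            - 2.0 * alpha @ gram(X, Y, delta) @ beta
            + beta @ gram(Y, Y, delta) @ beta)

rng = np.random.default_rng(2)
X, Y = rng.normal(size=(20, 3)), rng.normal(size=(25, 3))
alpha, beta = rng.normal(size=20), rng.normal(size=25)
print(rkhs_distance_sq(alpha, X, beta, Y))  # small => shared-weight regime plausible
```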

Load-bearing premise

That non-linear transformers actually perform regression inside an RKHS, and that value functions from distinct domains truly inhabit one common RKHS.

What would settle it

Run the same transformer on value functions from two MetaWorld domains that cannot be expressed in a shared RKHS; if the temporal-difference loss fails to converge to a common solution or requires separate weights, the claim is false.
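
A hedged miniature of this falsification logic, with kernel ridge regression standing in for the transformer's RKHS regression: two synthetic "domains" whose targets differ deliberately in smoothness cannot both be fit well by one shared model under one bandwidth. Nothing here follows the paper's MetaWorld protocol; bandwidths, targets, and the ridge penalty are illustrative assumptions.

```python
# Hedged sketch: a shared kernel ridge model fails when the two "domains"
# do not share an effective RKHS (here, very different smoothness scales).
import numpy as np

rng = np.random.default_rng(3)

def gram(A, B, delta):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * delta**2))

def fit_and_score(X, y, delta, lam=1e-1):
    """Kernel ridge fit under bandwidth `delta`; mean squared residual."""
    K = gram(X, X, delta)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return float(np.mean((K @ alpha - y) ** 2))

X_a, X_b = rng.normal(size=(200, 2)), rng.normal(size=(200, 2))
y_a = np.sin(X_a.sum(-1))                  # smooth target: wide-kernel RKHS
y_b = np.sign(np.sin(8.0 * X_b.sum(-1)))   # rough target: effectively outside it

shared = fit_and_score(np.vstack([X_a, X_b]),
                       np.concatenate([y_a, y_b]), delta=1.0)
separate = 0.5 * (fit_and_score(X_a, y_a, delta=1.0)
                  + fit_and_score(X_b, y_b, delta=0.1))
print(shared, separate)   # shared >> separate suggests no common RKHS regime
```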

Figures

Figures reproduced from arXiv: 2605.09727 by Bowen He, Juncheng Dong, Lin Lin, Xiang Cheng.

Figure 1. Task domains from MetaWorld. From left to right: Pick-Place, Pick-Place-Wall, Shelf-Place, Plate-Slide, and Button-Press. view at source ↗
Figure 2. Cross-task TD errors for multiple model checkpoints during training. view at source ↗
Figure 3. Comparison between the model-approximated and ground-truth state value functions. view at source ↗
Figure 4. Learning curves across five MetaWorld domains. Each curve is averaged over five random seeds. view at source ↗
Figure 5. Models trained on a single domain are applied to all domains simultaneously. view at source ↗
Figure 6. Illustration of the induced state-value function under the original MetaWorld reward settings. The value function exhibits large discontinuities, violating the smoothness assumptions implicit in the RKHS approximation. view at source ↗
Figure 7. Comparison between the model-approximated and ground-truth state value functions. view at source ↗
Figure 8. Selected centroid points and the induced state value function. view at source ↗
read the original abstract

A central challenge in reinforcement learning (RL) is to learn models that generalize beyond the tasks on which they are trained, a goal traditionally pursued through multi-task and meta RL. Recently, transformer architectures have emerged as a promising approach, enabling adaptation to new tasks via in-context learning without explicit parameter updates. From a functional perspective, a transformer can be viewed as a functional operator that maps a context to a task-specific function. It is thus fundamental to understand and design this operator to support stronger generalization in RL. In this work, we address this resulting question of generalization from a kernel-based perspective by establishing a connection between non-linear transformers and kernel-based temporal difference learning. By interpreting the transformer as performing regression in a Reproducing Kernel Hilbert Space (RKHS), we show that value functions from different domains can be represented using a shared set of weights, provided they lie within the same RKHS. Experiments on multiple MetaWorld domains support this interpretation, demonstrating convergence of the temporal-difference objective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that non-linear transformers can be interpreted as performing regression in a Reproducing Kernel Hilbert Space (RKHS), establishing a link to kernel-based temporal difference learning. This interpretation implies that value functions from different domains can be represented using a shared set of transformer weights provided they lie in the same RKHS, enabling cross-domain generalization in in-context RL without parameter updates. Experiments on multiple MetaWorld domains are presented as support, showing convergence of the temporal-difference objective.

Significance. If the transformer-RKHS equivalence and the shared-RKHS condition for cross-domain value functions hold, the work would provide a principled functional view of in-context adaptation in RL, potentially unifying transformer-based meta-RL with kernel methods and guiding the design of more generalizable operators. The reported TD convergence on MetaWorld tasks is a useful empirical signal, though its explanatory power depends on verification of the underlying assumptions.

major comments (2)
  1. Abstract: the statement that experiments demonstrate convergence of the TD objective is presented without derivation details for the transformer-to-RKHS mapping, without error bars or statistical tests, and without any account of how RKHS membership was verified or enforced.
  2. Experiments section: the reported results show TD-objective convergence across MetaWorld domains but contain no diagnostic checks for the central shared-RKHS claim (e.g., no kernel Gram-matrix analysis, no RKHS-norm comparisons across domains, and no ablation that would be expected to fail if the RKHSs were disjoint). Convergence alone therefore does not distinguish the proposed mechanism from alternative explanations.
minor comments (3)
  1. Add error bars, multiple random seeds, and statistical significance tests to all experimental plots and tables.
  2. Provide a clearer, step-by-step derivation in the theoretical section that shows how the non-linear transformer layers implement the RKHS regression operator independently of the fitted weights.
  3. Define the specific kernel and feature map used, and ensure consistent notation for the RKHS inner product and norm throughout the manuscript.
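
One illustrative way to satisfy the third point, assuming a Gaussian kernel; this is a sketch of the requested notation, not the manuscript's actual definitions, and the paper's kernel may differ.

```latex
% Illustrative notation only; the Gaussian kernel is an assumption.
\[
  \kappa(x, y) = \exp\!\Big(-\tfrac{\lVert x - y\rVert^2}{2\delta^2}\Big),
  \qquad
  \mathcal{H} = \overline{\operatorname{span}}\,\{\kappa(x, \cdot) : x \in \mathcal{S}\},
\]
\[
  \langle f, \kappa(x, \cdot)\rangle_{\mathcal{H}} = f(x)
  \quad \text{(reproducing property)},
  \qquad
  \lVert f\rVert_{\mathcal{H}}^2 = \textstyle\sum_{i,j} \alpha_i \alpha_j\, \kappa(x_i, x_j)
  \ \text{for}\ f = \textstyle\sum_i \alpha_i\, \kappa(x_i, \cdot).
\]
```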

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our theoretical and empirical contributions. We address each major comment below and will revise the manuscript accordingly to strengthen the experimental support and clarity.

read point-by-point responses
  1. Referee: Abstract: the statement that experiments demonstrate convergence of the TD objective is presented without derivation details for the transformer-to-RKHS mapping, without error bars or statistical tests, and without any account of how RKHS membership was verified or enforced.

    Authors: The transformer-to-RKHS mapping is derived in Section 3 of the manuscript. We will revise the abstract to explicitly reference this section and add a brief summary of the key steps. In the experiments section, we will include error bars computed over multiple random seeds and report statistical significance tests. RKHS membership follows from the universal kernel and the non-linear transformer architecture as established in the theory; we will add a short explanatory paragraph in the revised experiments section describing this enforcement. revision: yes

  2. Referee: Experiments section: the reported results show TD-objective convergence across MetaWorld domains but contain no diagnostic checks for the central shared-RKHS claim (e.g., no kernel Gram-matrix analysis, no RKHS-norm comparisons across domains, and no ablation that would be expected to fail if the RKHSs were disjoint). Convergence alone therefore does not distinguish the proposed mechanism from alternative explanations.

    Authors: We agree that additional diagnostics would more directly isolate the shared-RKHS mechanism. The current results demonstrate successful cross-domain in-context TD learning, which is predicted by the theory when value functions share an RKHS. In the revision we will add (i) Gram-matrix visualizations and RKHS-norm comparisons across domains in an appendix and (ii) an ablation using domains with deliberately mismatched feature spaces (hence disjoint RKHS) to show that generalization fails when the shared-RKHS condition is violated. These additions will help distinguish our account from alternatives. revision: yes
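
An illustrative version of diagnostic (i): evaluate the kernel each domain would induce (here, Gaussian kernels with per-domain temperatures, echoing the paper's observation that four MetaWorld domains share a kernel temperature) on common probe states and report their centered alignment; high alignment is consistent with a shared RKHS. Probe states, temperatures, and thresholds are stand-ins, not the authors' planned protocol.

```python
# Hedged sketch of a Gram-matrix diagnostic: centered kernel alignment (CKA)
# between the Gram matrices induced by two per-domain kernel temperatures
# on shared probe states. All inputs are illustrative stand-ins.
import numpy as np

def gram(X, delta):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * delta**2))

def centered_alignment(K1, K2):
    """Centered kernel alignment between two same-shape Gram matrices."""
    n = len(K1)
    H = np.eye(n) - np.ones((n, n)) / n
    K1c, K2c = H @ K1 @ H, H @ K2 @ H
    return (K1c * K2c).sum() / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

probes = np.random.default_rng(4).normal(size=(128, 4))  # shared probe states
print(centered_alignment(gram(probes, 1.0), gram(probes, 1.1)))   # near 1: shared
print(centered_alignment(gram(probes, 1.0), gram(probes, 0.05)))  # low: mismatched
```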

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper establishes an interpretive connection between non-linear transformers and kernel-based TD learning by viewing the transformer as RKHS regression. This leads to the claim that value functions from different domains share weights when in the same RKHS. The experiments report TD objective convergence across MetaWorld domains without any fitted parameters being renamed as predictions or any self-citation chains that reduce the central claim to unverified prior results by the same authors. No self-definitional loops, smuggled ansatzes, or uniqueness theorems imported from self-citations are present. The derivation remains self-contained as a proposed functional perspective backed by independent empirical convergence results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on the interpretive axiom that transformers implement RKHS regression and on the domain assumption that value functions from different MetaWorld tasks share one RKHS.

axioms (2)
  • domain assumption: A non-linear transformer can be interpreted as performing regression in a Reproducing Kernel Hilbert Space.
    This is the bridge used to connect transformers to kernel TD learning.
  • domain assumption: Value functions from different domains lie in the same RKHS.
    Required for the shared-weight representation to hold.

pith-pipeline@v0.9.0 · 5479 in / 1215 out tokens · 39431 ms · 2026-05-12T03:41:54.882345+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
