pith. machine review for the scientific record.

arxiv: 2604.25898 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.AI

Recognition: unknown

TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

Dominik Żurek, Kamil Faber, Marcin Pietron, Paweł Gajewski, Roberto Corizzo

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: continual learning · offline reinforcement learning · parameter reuse · subnetworks · decision transformer · catastrophic forgetting · task similarity · reinforcement learning

The pith

Similarity-guided routing of sparse subnetworks in decision transformers enables continual offline RL without replay buffers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method for learning a sequence of reinforcement learning tasks from fixed datasets collected over time, while retaining performance on earlier tasks. It builds on decision transformers by inserting tiny subnetworks that can be reused across tasks when their action patterns and internal representations are sufficiently similar. A routing mechanism decides whether to share parameters or allocate new ones based on compatibility measures computed from the offline data. Experiments on Atari games and simulated robotic manipulation show that this reuse preserves performance on earlier tasks and improves joint multi-task results. The work positions the approach as an alternative to replay buffers, which add memory costs and create mismatches between old data and new policies.
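The routing step described above can be sketched as a simple gate: summarize each task's offline data, score compatibility against stored tasks, and reuse the best match only above a threshold. The histogram-intersection and cosine metrics, their equal weighting, and the fixed threshold below are illustrative assumptions, not the paper's exact compatibility measures.

```python
import math

def route_task(new_task, stored_tasks, threshold=0.7):
    """Return the index of the most compatible stored subnetwork, or None
    to allocate a new one. Each task is summarized by:
      'action_hist' : normalized action-frequency histogram (list of floats)
      'latent_mean' : mean latent embedding of its offline trajectories
    Both metrics and the threshold are hypothetical stand-ins for the
    paper's action-compatibility and latent-similarity measures.
    """
    def compatibility(a, b):
        # Action compatibility: histogram intersection, in [0, 1].
        action_sim = sum(min(x, y) for x, y in zip(a['action_hist'], b['action_hist']))
        # Latent similarity: cosine of mean embeddings, rescaled to [0, 1].
        dot = sum(x * y for x, y in zip(a['latent_mean'], b['latent_mean']))
        na = math.sqrt(sum(x * x for x in a['latent_mean']))
        nb = math.sqrt(sum(x * x for x in b['latent_mean']))
        latent_sim = 0.5 * (dot / (na * nb + 1e-12) + 1.0)
        return 0.5 * action_sim + 0.5 * latent_sim

    scores = [compatibility(new_task, s) for s in stored_tasks]
    if scores and max(scores) >= threshold:
        return scores.index(max(scores))  # reuse that task's subnetwork parameters
    return None                           # low compatibility: allocate fresh parameters
```

The key design point this illustrates is that the gate is computed entirely from the fixed offline datasets, so no environment interaction is needed to decide between sharing and allocation.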

Core claim

TSN-Affinity enables task-specific parameterization and controlled knowledge sharing through an RL-aware reuse strategy that routes tasks according to action compatibility and latent similarity, using sparse subnetworks inside a decision transformer architecture. This produces strong retention on previous tasks and further gains in multi-task settings across both discrete and continuous control benchmarks.

What carries the argument

TinySubNetworks inside a Decision Transformer, with routing that matches tasks by action compatibility and latent similarity to control parameter reuse.
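A minimal sketch of the mask-based parameter reuse that subnetwork methods of this family rely on: each task keeps a binary mask over a shared weight matrix, and its forward pass uses only the selected weights, so tasks with disjoint masks cannot interfere. The magnitude-based selection and the density value are hypothetical simplifications, not the paper's pruning procedure.

```python
def top_k_mask(weights, density=0.25):
    """Binary mask keeping the top `density` fraction of weights by magnitude.
    `weights` is a list of rows (one row per output unit)."""
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    k = max(1, int(density * len(flat)))
    cutoff = flat[k - 1]
    return [[1 if abs(w) >= cutoff else 0 for w in row] for row in weights]

def forward(x, weights, mask):
    """Linear layer that uses only the weights selected by this task's mask."""
    return [sum(xi * w * m for xi, w, m in zip(x, w_row, m_row))
            for w_row, m_row in zip(weights, mask)]
```

When two tasks are routed together they would share one mask; when kept apart, each mask carves out its own sparse slice of the shared matrix, which is what makes retention cheap relative to storing a replay buffer.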

Load-bearing premise

Routing decisions based on action compatibility and latent similarity will reliably prevent negative interference between tasks without requiring task-specific hyperparameter tuning.

What would settle it

A new sequence of tasks where the method produces clear drops in performance on earlier tasks or fails to match replay-buffer baselines despite similar action statistics.

Figures

Figures reproduced from arXiv: 2604.25898 by Dominik Żurek, Kamil Faber, Marcin Pietron, Paweł Gajewski, Roberto Corizzo.

Figure 1: Overview of the proposed TSN-Affinity method.
Figure 2: Additional Atari learning curves. The affinity-based TSN variants exhibit the character…
Figure 3: Additional Panda learning curves. These plots are intended mainly for qualitative compar…
read the original abstract

Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live environment interactions is expensive, risky, or impossible. However, CORL inherits the dual difficulty of offline reinforcement learning and adapting while preventing catastrophic forgetting. Replay-based continual learning approaches remain a strong baseline but incur memory overhead and suffer from a distribution mismatch between replayed samples and newly learned policies. At the same time, architectural continual learning methods have shown strong potential in supervised learning but remain underexplored in CORL. In this work, we propose TSN-Affinity, a novel CORL method based on TinySubNetworks and Decision Transformer. The method enables task-specific parameterization and controlled knowledge sharing through a RL-aware reuse strategy that routes tasks according to action compatibility and latent similarity. We evaluate the approach on benchmarks based on Atari games and simulations of manipulation tasks with the Franka Emika Panda robotic arm, covering both discrete and continuous control. Results show strong retention from sparse SubNetworks, with routing further improving multi-task performance. Our findings suggest that similarity-guided architectural reuse is a strong and viable alternative to replay-based strategies in a CORL setting. Our code is available at: https://github.com/anonymized-for-submission123/tsn-affinity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TSN-Affinity, a continual offline reinforcement learning method that combines TinySubNetworks with a Decision Transformer backbone. It introduces an RL-aware routing mechanism that reuses parameters across tasks by measuring action compatibility and latent similarity, aiming to enable controlled knowledge sharing while avoiding catastrophic forgetting. The approach is evaluated on Atari-based and Franka Emika Panda manipulation benchmarks for both discrete and continuous control, with claims of strong retention from sparse subnetworks and further gains from routing, positioning the method as a viable memory-efficient alternative to replay-based CORL strategies.

Significance. If the empirical results hold under scrutiny, the work provides a promising architectural route for CORL that sidesteps replay-buffer memory costs and distribution mismatch. The similarity-driven reuse of sparse subnetworks could scale better to long task sequences in settings where online adaptation is infeasible, and the open-sourced code link supports reproducibility.

major comments (2)
  1. [Abstract] The central claim that 'similarity-guided architectural reuse is a strong and viable alternative to replay-based strategies' rests on the routing module reliably selecting compatible subnetworks. The manuscript must demonstrate that the latent/action similarity metric and reuse threshold are robust to low-similarity or anti-correlated task sequences; without explicit stress-test results on such sequences, negative transfer remains a risk and the 'no task-specific hyperparameter' aspect of the claim is not yet supported.
  2. [Evaluation] The reported retention and multi-task gains on the Atari and Franka benchmarks may be benchmark-specific. The paper should include ablations that vary task similarity (e.g., anti-correlated action spaces or dissimilar latent representations) and report whether routing still prevents interference or requires threshold retuning; otherwise the generalization of the routing strategy is not established.
minor comments (2)
  1. [Abstract] The description of how similarity is computed (latent vs. action-based) and how the reuse threshold is chosen is absent; add a concise paragraph or a reference to the method section.
  2. [Abstract] The code repository link is anonymized; replace with the final public URL in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of robustness and generalization that warrant further clarification and support. We address each major comment in detail below, outlining the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'similarity-guided architectural reuse is a strong and viable alternative to replay-based strategies' rests on the routing module reliably selecting compatible subnetworks. The manuscript must demonstrate that the latent/action similarity metric and reuse threshold are robust to low-similarity or anti-correlated task sequences; without explicit stress-test results on such sequences, negative transfer remains a risk and the 'no task-specific hyperparameter' aspect of the claim is not yet supported.

    Authors: We agree that explicit evidence of robustness to low-similarity and anti-correlated sequences is required to fully support the routing mechanism's reliability and the lack of task-specific tuning. Our Atari and Franka benchmarks already encompass task sequences with a spectrum of action compatibility and latent similarities, and the RL-aware routing explicitly gates reuse via a fixed similarity threshold to limit negative transfer. To directly address the concern, the revised manuscript will add a dedicated stress-test subsection featuring deliberately anti-correlated sequences (e.g., opposing action directions in the Panda arm and mechanically dissimilar Atari games). These experiments will report retention, interference, and multi-task metrics under the same fixed threshold, confirming that no per-task retuning is needed. This addition will substantiate the central claim without altering the method's core design. revision: yes

  2. Referee: [Evaluation] The reported retention and multi-task gains on the Atari and Franka benchmarks may be benchmark-specific. The paper should include ablations that vary task similarity (e.g., anti-correlated action spaces or dissimilar latent representations) and report whether routing still prevents interference or requires threshold retuning; otherwise the generalization of the routing strategy is not established.

    Authors: We acknowledge that the current evaluation, while covering both discrete and continuous control, does not exhaustively isolate the effect of task similarity. The existing results already show that routing improves retention over non-routed sparse subnetworks across the chosen task orders. In the revision we will insert a new ablation study that systematically varies similarity by constructing task sequences with controlled anti-correlation in action spaces and latent representations. For each configuration we will report (i) whether interference is prevented, (ii) the resulting retention and multi-task performance, and (iii) whether the similarity threshold requires any adjustment. These results will clarify the operating regime of the routing strategy and any associated limitations, thereby establishing its generalization beyond the primary benchmarks. revision: yes
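The anti-correlated stress test proposed in these responses could be prototyped along the following lines: derive a variant of a task by negating its continuous actions, then check that a compatibility metric now rejects reuse. The mean-action cosine metric and the negation construction are hypothetical stand-ins for whatever compatibility measure the revised experiments would use.

```python
import math

def action_cosine(acts_a, acts_b):
    """Cosine similarity between the mean action vectors of two offline datasets.
    Each dataset is a list of action vectors (lists of floats)."""
    def mean(acts):
        return [sum(col) / len(acts) for col in zip(*acts)]
    a, b = mean(acts_a), mean(acts_b)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def anti_correlated(acts):
    """Stress-test variant of a task's dataset: same structure, opposite action directions."""
    return [[-x for x in a] for a in acts]
```

A routing threshold that admits the anti-correlated variant of a task would be direct evidence of the negative-transfer risk the referee raises; a threshold that rejects it without retuning supports the 'no task-specific hyperparameter' claim.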

Circularity Check

0 steps flagged

No significant circularity: an empirical method with no load-bearing derivations or self-referential reductions.

full rationale

The paper proposes TSN-Affinity as an empirical CORL architecture combining TinySubNetworks with Decision Transformer and a routing module based on action compatibility and latent similarity. Evaluation relies on benchmark experiments (Atari, Franka) rather than any claimed first-principles derivation chain. No equations, uniqueness theorems, or fitted-parameter predictions are presented that reduce the central performance claims to the method's own inputs by construction. External code link further supports independent reproducibility. This is the normal non-circular outcome for an applied ML methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review prevents exhaustive extraction; standard RL assumptions are implicit.

axioms (1)
  • domain assumption Offline datasets remain representative of the tasks even after policy updates during continual learning.
    Core premise of the CORL setting described in the abstract.
invented entities (1)
  • TSN-Affinity routing strategy (no independent evidence)
    purpose: Decides parameter reuse across tasks using action compatibility and latent similarity.
    New component introduced to combine subnetwork sparsity with controlled sharing.

pith-pipeline@v0.9.0 · 5566 in / 1248 out tokens · 56417 ms · 2026-05-07T16:36:25.922853+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1] D. Abel, A. Barreto, B. Van Roy, D. Precup, H. P. van Hasselt, and S. Singh, A definition of continual reinforcement learning, Advances in Neural Information Processing Systems, 36 (2023), pp. 50377--50407.
  2. [2] R. Agarwal, D. Schuurmans, and M. Norouzi, An optimistic perspective on offline reinforcement learning, in Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 104--114.
  3. [3] G. An, S. Moon, J.-H. Kim, and H. O. Song, Uncertainty-based offline reinforcement learning with diversified Q-ensemble, Advances in Neural Information Processing Systems, 34 (2021), pp. 7436--7447.
  4. [4] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, Decision transformer: Reinforcement learning via sequence modeling, in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 15084--15097.
  5. [5] K. Faber, C. Kanan, V. Lomonaco, and R. Corizzo, Continual anomaly detection: A comprehensive survey and research roadmap, Preprints, (2026), https://doi.org/10.20944/preprints202601.1931.v1.
  6. [6] K. Faber, D. Zurek, M. Pietron, N. Japkowicz, A. Vergari, and R. Corizzo, From MNIST to ImageNet and back: benchmarking continual curriculum learning, Machine Learning, 113 (2024), pp. 8137--8164, https://doi.org/10.1007/s10994-024-06524-z.
  7. [7] S. Gai, D. Wang, and L. He, OER: Offline experience replay for continual offline reinforcement learning, in Proceedings of the 26th European Conference on Artificial Intelligence (ECAI 2023), IOS Press, 2023, pp. 772--779, https://doi.org/10.3233/FAIA230343.
  8. [8] J. Hu, S. Huang, L. Shen, Z. Yang, S. Hu, S. Tang, H. Chen, L. Sun, Y. Chang, and D. Tao, Tackling continual offline RL through selective weights activation on aligned spaces, in Advances in Neural Information Processing Systems, 2025.
  9. [9] M. Janner, Q. Li, and S. Levine, Offline reinforcement learning as one big sequence modeling problem, in Advances in Neural Information Processing Systems, vol. 34, 2021.
  10. [10] A. Kumar, A. Zhou, G. Tucker, and S. Levine, Conservative Q-learning for offline reinforcement learning, in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1179--1191.
  11. [11] K.-H. Lee, O. Nachum, M. S. Yang, L. Lee, D. Freeman, S. Guadarrama, I. Fischer, W. Xu, E. Jang, H. Michalewski, et al., Multi-game decision transformers, Advances in Neural Information Processing Systems, 35 (2022), pp. 27921--27936.
  12. [12] S. Levine, A. Kumar, G. Tucker, and J. Fu, Offline reinforcement learning: Tutorial, review, and perspectives on open problems, arXiv preprint arXiv:2005.01643, (2020).
  13. [13] A. Mallya, D. Davis, and S. Lazebnik, Piggyback: Adapting a single network to multiple tasks by learning to mask weights, in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  14. [14] A. Mallya and S. Lazebnik, PackNet: Adding multiple tasks to a single network by iterative pruning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  15. [15] M. Pietron, K. Faber, D. Zurek, and R. Corizzo, TinySubNets: An efficient and low capacity continual learning strategy, Proceedings of the AAAI Conference on Artificial Intelligence, 39 (2025), pp. 19913--19920, https://doi.org/10.1609/aaai.v39i19.34193.
  16. [16] M. Pietron, D. Zurek, K. Faber, and R. Corizzo, Ada-QPackNet: multi-task forget-free continual learning with quantization driven adaptive pruning, in Proceedings of the 26th European Conference on Artificial Intelligence (ECAI 2023), 2023, pp. 1882--1889.
  17. [17] D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne, Experience replay for continual learning, in Advances in Neural Information Processing Systems, vol. 32, 2019.
  18. [18] T. Schmied, M. Hofmarcher, F. Paischer, R. Pascanu, and S. Hochreiter, Learning to modulate pre-trained models in RL, in Advances in Neural Information Processing Systems, 2023.
  19. [19] J. Serra, D. Suris, M. Miron, and A. Karatzoglou, Overcoming catastrophic forgetting with hard attention to the task, in Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, 2018, pp. 4548--4557.
  20. [20] L. Wang, X. Zhang, H. Su, and J. Zhu, A comprehensive survey of continual learning: Theory, method and application, IEEE Transactions on Pattern Analysis and Machine Intelligence, 46 (2024), pp. 5362--5383.
  21. [21] Z. Wang, X. Qu, J. Xiao, B. Chen, and J. Wang, P2DT: Mitigating forgetting in task-incremental learning with progressive prompt decision transformer, in 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 7265--7269, https://doi.org/10.1109/ICASSP48485.2024.10447775.
  22. [22] T. Zhang, K. Z. Shen, Z. Lin, B. Yuan, X. Wang, X. Li, and D. Ye, Replay-enhanced continual reinforcement learning, Transactions on Machine Learning Research, (2023).