Progressive Neural Networks

Andrei A. Rusu; Guillaume Desjardins; Hubert Soyer; James Kirkpatrick; Koray Kavukcuoglu; Neil C. Rabinowitz; Raia Hadsell; Razvan Pascanu

arxiv: 1606.04671 · v4 · submitted 2016-06-15 · 💻 cs.LG


Pith reviewed 2026-05-12 16:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords progressive neural networks · catastrophic forgetting · transfer learning · reinforcement learning · Atari games · lifelong learning · neural network columns · lateral connections


The pith

Progressive neural networks learn sequences of tasks without forgetting by adding task-specific columns with lateral connections to prior features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes progressive neural networks to address learning multiple tasks in sequence while avoiding the loss of earlier knowledge. Each new task receives its own column of layers that connects laterally to all previous columns, permitting reuse of useful features without overwriting old ones. The authors claim that, unlike standard neural network training, this architecture is immune to catastrophic forgetting. They test it on reinforcement learning problems including Atari games and 3D mazes, reporting better results than pretraining or finetuning approaches. A sensitivity analysis indicates that beneficial transfer happens at both low-level sensory and high-level control layers.

Core claim

Progressive networks are immune to forgetting and can leverage prior knowledge via lateral connections to previously learned features, outperforming common baselines based on pretraining and finetuning across a wide variety of reinforcement learning tasks in Atari and 3D maze games.

What carries the argument

The progressive network architecture consisting of task-specific columns linked by lateral connections to features in all earlier columns.
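
To make the column-and-lateral structure concrete, here is a minimal sketch in plain Python. It assumes small dense layers, ReLU hidden units, and lateral weights that feed each layer of a new column from the previous-layer activations of every earlier column. The class name `ProgressiveNet`, its methods, and the layer sizes are illustrative assumptions rather than the authors' implementation, and the sketch leaves out training entirely.

```python
import random


def dense(weights, inputs):
    """Multiply a weight matrix (a list of rows) by an input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ProgressiveNet:
    """Toy progressive network (hypothetical): one dense column per task."""

    def __init__(self, layer_sizes, seed=0):
        self.layer_sizes = layer_sizes  # e.g. [4, 8, 3]: input, hidden, output
        self.rng = random.Random(seed)
        self.columns = []   # columns[k][i]: own weights of layer i in column k
        self.laterals = []  # laterals[k][i][j]: weights from column j < k into layer i of column k

    def add_column(self):
        sizes = self.layer_sizes
        k = len(self.columns)  # number of earlier columns
        own = [random_matrix(sizes[i + 1], sizes[i], self.rng)
               for i in range(len(sizes) - 1)]
        # The first layer reads the raw input, so laterals start at the second layer.
        lat = [[] if i == 0 else
               [random_matrix(sizes[i + 1], sizes[i], self.rng) for _ in range(k)]
               for i in range(len(sizes) - 1)]
        self.columns.append(own)
        self.laterals.append(lat)

    def forward(self, x, task):
        """Output of column `task`; only columns 0..task are evaluated."""
        n_layers = len(self.layer_sizes) - 1
        prev = [x for _ in range(task + 1)]  # previous-layer activations, per column
        for i in range(n_layers):
            current = []
            for k in range(task + 1):
                pre = dense(self.columns[k][i], prev[k])
                # Add contributions from the same depth of every earlier column.
                for j, u in enumerate(self.laterals[k][i]):
                    pre = [a + b for a, b in zip(pre, dense(u, prev[j]))]
                is_last = i == n_layers - 1
                current.append(pre if is_last else relu(pre))
            prev = current
        return prev[task]


net = ProgressiveNet([4, 8, 3], seed=1)
net.add_column()                      # column for the first task
x = [0.1, -0.2, 0.3, 0.4]
before = net.forward(x, task=0)
net.add_column()                      # column for the second task
after = net.forward(x, task=0)
second = net.forward(x, task=1)
```

In this sketch `before` and `after` are identical, because the first column never reads from later ones; the second column's output `second` depends on both its own weights and the first column's hidden activations.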

If this is right

The network can accumulate skills across a sequence of tasks without interference between them.
Transfer of knowledge occurs at both low-level sensory features and high-level control policies.
The approach outperforms standard pretraining and finetuning on Atari games and 3D navigation tasks.
A sensitivity measure confirms the locations of useful feature reuse within the policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This column-based design may extend to domains outside reinforcement learning where tasks arrive over time.
It could reduce the need to restart training from scratch when environments or goals change gradually.
Scaling the number of columns might eventually require mechanisms to manage computational cost.

Load-bearing premise

Lateral connections between columns will reliably produce positive transfer across tasks without introducing harmful interference.

What would settle it

If progressive networks exhibit significant forgetting of prior tasks or underperform fine-tuning on a sequence of reinforcement learning tasks, the central claim would be falsified.

Original abstract

Learning to solve complex sequences of tasks--while both leveraging transfer and avoiding catastrophic forgetting--remains a key obstacle to achieving human-level intelligence. The progressive networks approach represents a step forward in this direction: they are immune to forgetting and can leverage prior knowledge via lateral connections to previously learned features. We evaluate this architecture extensively on a wide variety of reinforcement learning tasks (Atari and 3D maze games), and show that it outperforms common baselines based on pretraining and finetuning. Using a novel sensitivity measure, we demonstrate that transfer occurs at both low-level sensory and high-level control layers of the learned policy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Progressive networks add columns per task with lateral connections to avoid forgetting and beat pretrain/finetune baselines on Atari and mazes, but gains may partly trace to extra capacity rather than transfer.


Progressive networks add a new column for each task in a sequence, freeze the previous columns to prevent forgetting, and use lateral connections to pull in features from earlier tasks. The paper reports that this setup outperforms pretraining and finetuning on a range of Atari games and 3D navigation tasks, while a sensitivity measure shows the lateral links carry useful information at both early and late layers.

The architecture itself is the main contribution. It gives a simple way to grow the network as new tasks arrive without retraining everything or risking interference with old skills. The experiments are extensive for the setting, covering multiple games and including the sensitivity analysis to locate the transfer. That part is useful because it moves beyond just showing better scores to indicating how the mechanism works.

The results look reliable on the tasks they chose. The no-forgetting property follows directly from freezing the old columns, and the performance edge over baselines is presented with enough detail to be convincing.

The softer part is separating the effect of the lateral connections from the simple fact of having more parameters. Each new column adds a lot of capacity, and the comparison methods keep the network size fixed across tasks. Without an experiment that adds equivalent parameters but without the structured lateral connections, or a version where the connections are present but randomized, it's hard to know how much of the win comes from transfer versus just having room to learn the new task from scratch. The sensitivity numbers help by showing nonzero influence from prior columns, but they don't fully address the capacity confound. This matches the stress-test concern.

Overall, this is a paper for researchers focused on continual or lifelong reinforcement learning. It provides a workable baseline architecture and some evidence that transfer can be made reliable in practice. I would take it to a reading group to discuss how to tighten the controls in follow-up work. It should go through peer review because the idea is clear, the problem is important, and the empirical support is on standard benchmarks even if some questions remain about the exact source of the gains.
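
The point that no-forgetting follows directly from freezing can be illustrated with a toy example. The sketch below assumes scalar weights, linear units, and squared-error regression instead of reinforcement learning; the parameter names (`a1`, `b1`, `a2`, `b2`, `u`) and both toy tasks are hypothetical. It contrasts training a second column on top of a frozen first column with finetuning the first column in place.

```python
def column1(params, x):
    """First column: hidden feature h1 and output for task 1."""
    h1 = params["a1"] * x
    return h1, params["b1"] * h1


def column2(params, x):
    """Second column: own path plus a lateral weight u on column 1's feature."""
    h1, _ = column1(params, x)
    h2 = params["a2"] * x
    return params["b2"] * h2 + params["u"] * h1


def train_column2(params, data, lr=0.01, steps=300):
    """SGD on squared error; only a2, b2 and u are updated (a1, b1 stay frozen)."""
    p = dict(params)
    for _ in range(steps):
        for x, target in data:
            h1, _ = column1(p, x)
            h2 = p["a2"] * x
            err = p["b2"] * h2 + p["u"] * h1 - target
            grad_a2 = 2 * err * p["b2"] * x
            grad_b2 = 2 * err * h2
            grad_u = 2 * err * h1
            p["a2"] -= lr * grad_a2
            p["b2"] -= lr * grad_b2
            p["u"] -= lr * grad_u
    return p


def finetune_column1(params, data, lr=0.01, steps=300):
    """Baseline: reuse column 1 for the new task by updating a1 and b1 in place."""
    p = dict(params)
    for _ in range(steps):
        for x, target in data:
            h1, y = column1(p, x)
            err = y - target
            grad_a1 = 2 * err * p["b1"] * x
            grad_b1 = 2 * err * h1
            p["a1"] -= lr * grad_a1
            p["b1"] -= lr * grad_b1
    return p


# Column 1 already solves toy task 1 (y = 2x); toy task 2 is y = -3x.
init = {"a1": 1.0, "b1": 2.0, "a2": 0.5, "b2": 0.5, "u": 0.0}
task2 = [(x, -3.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]

progressive = train_column2(init, task2)
finetuned = finetune_column1(init, task2)

task1_after_progressive = column1(progressive, 1.0)[1]  # unchanged by construction
task1_after_finetuning = column1(finetuned, 1.0)[1]     # drifts toward task 2
```

Because `train_column2` never touches `a1` or `b1`, the task-1 output is preserved exactly, whereas the finetuned column overwrites it. The sketch says nothing about the capacity confound discussed in the letter: the second column also adds parameters.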

Referee Report

2 major / 2 minor

Summary. The paper introduces progressive neural networks for continual learning in RL: a new column is added per task, prior columns are frozen to prevent forgetting, and lateral connections from previous columns to the new one enable transfer of features. The architecture is evaluated on Atari games and 3D maze navigation tasks, with claims of outperformance over pretraining and finetuning baselines plus a sensitivity analysis showing transfer at sensory and control layers.

Significance. If the performance gains can be shown to arise from the lateral transfer mechanism rather than capacity scaling, the approach offers a concrete, scalable architecture for avoiding catastrophic forgetting while reusing knowledge across tasks. This would be a useful contribution to multi-task and lifelong RL, with the sensitivity measure providing a starting point for analyzing where transfer occurs.

major comments (2)

[Evaluation] Evaluation section: the central claim that lateral connections enable positive transfer (and thus outperformance) is not isolated from the fact that total model capacity grows linearly with the number of tasks. No capacity-matched baseline (e.g., a single larger network with equivalent total parameters) or lateral-connection ablation is reported, so the evidence that gains are due to transfer rather than extra parameters remains indirect.
[Sensitivity analysis] The sensitivity measure is introduced to quantify cross-column influence, but without reported numerical values, error bars, or controls for task difficulty, it is unclear how strongly it supports the claim of transfer at both low- and high-level layers.
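
As a rough aid for the capacity question raised in the first major comment, the following sketch counts parameters in a progressive network. It assumes dense layers with biases, lateral weight matrices without biases, and full lateral connectivity from every earlier column into every layer after the first. The function names and the layer sizes are hypothetical and are not taken from the paper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def column_params(layer_sizes, n_prev_columns):
    """Return (own, lateral) parameter counts for one column."""
    n_layers = len(layer_sizes) - 1
    own = sum(dense_params(layer_sizes[i], layer_sizes[i + 1])
              for i in range(n_layers))
    # One lateral matrix per earlier column for every layer after the first.
    per_prev = sum(layer_sizes[i] * layer_sizes[i + 1]
                   for i in range(1, n_layers))
    return own, n_prev_columns * per_prev


def total_params(layer_sizes, n_columns):
    """Return (own_total, lateral_total) summed over all columns."""
    own_total = 0
    lateral_total = 0
    for k in range(n_columns):
        own, lateral = column_params(layer_sizes, k)
        own_total += own
        lateral_total += lateral
    return own_total, lateral_total


sizes = [4, 8, 3]  # illustrative: input, hidden, output
for n in (1, 2, 3):
    own_total, lateral_total = total_params(sizes, n)
    print(n, own_total, lateral_total)
```

Under these assumptions the columns' own weights grow in proportion to the number of tasks, while the lateral weights grow with the number of column pairs. A capacity-matched single-network baseline of the kind the referee asks for would need to be sized against the sum of both terms.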

minor comments (2)

[Abstract] The abstract states outperformance but supplies no quantitative metrics, task counts, or statistical details; these should be summarized with key numbers and error bars for readers.
[Methods] Notation for the lateral connections and column indexing could be clarified with a single diagram or equation set early in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, acknowledging where the concerns are valid and indicating the revisions we will make to strengthen the manuscript.


Referee: [Evaluation] Evaluation section: the central claim that lateral connections enable positive transfer (and thus outperformance) is not isolated from the fact that total model capacity grows linearly with the number of tasks. No capacity-matched baseline (e.g., a single larger network with equivalent total parameters) or lateral-connection ablation is reported, so the evidence that gains are due to transfer rather than extra parameters remains indirect.

Authors: We agree that the current evaluation does not fully isolate the contribution of lateral connections from the increase in total model capacity, as progressive networks add new columns (and thus parameters) for each task. The pretraining and finetuning baselines use fixed-capacity networks equivalent to a single column, which is the standard comparison in this setting, but a capacity-matched single-network baseline would indeed provide stronger evidence. We will add a dedicated discussion of this limitation in the revised manuscript and include an ablation or capacity-matched comparison where feasible with existing compute resources. This revision will clarify the role of the lateral transfer mechanism while preserving the core result that the architecture avoids catastrophic forgetting.
revision: partial

Referee: [Sensitivity analysis] The sensitivity measure is introduced to quantify cross-column influence, but without reported numerical values, error bars, or controls for task difficulty, it is unclear how strongly it supports the claim of transfer at both low- and high-level layers.

Authors: The sensitivity analysis is presented via figures in the manuscript showing relative influence across layers. To address this, we will revise the relevant section to explicitly report the numerical sensitivity values, include error bars from multiple runs, and add a brief discussion of how task difficulty was accounted for in the analysis. These additions will provide quantitative support for the observation that transfer occurs at both sensory and control layers.
revision: yes
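
One way such a numerical report could be assembled is sketched below. It assumes that each run yields one raw, non-negative sensitivity score per column at a given layer, that scores are normalised to shares within a run, and that the error bar is the standard error of the mean across runs. The function names are hypothetical, the input numbers are made-up placeholders rather than results from the paper, and this is not claimed to be the paper's sensitivity measure.

```python
import statistics


def normalise(scores):
    """Turn raw per-column scores into shares that sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def summarise(runs):
    """runs: list of per-run lists, one raw sensitivity score per column.

    Returns one (mean share, standard error) pair per column.
    """
    shares = [normalise(run) for run in runs]
    n_columns = len(runs[0])
    summary = []
    for k in range(n_columns):
        values = [share[k] for share in shares]
        mean = statistics.mean(values)
        sem = statistics.stdev(values) / len(values) ** 0.5
        summary.append((mean, sem))
    return summary


# Placeholder scores for three runs and two columns (not real data).
placeholder_runs = [[3.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
report = summarise(placeholder_runs)
for column, (mean, sem) in enumerate(report):
    print(f"column {column}: share {mean:.3f} +/- {sem:.3f}")
```

Reporting shares with a standard error per layer would let a reader judge whether the influence of earlier columns is distinguishable from run-to-run noise.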

Circularity Check

0 steps flagged

No circularity: empirical architecture evaluated on external benchmarks


The paper introduces progressive neural networks as an architecture for continual RL, with lateral connections for transfer and frozen columns to prevent forgetting. It reports performance on Atari and 3D maze tasks against pretraining/finetuning baselines, plus a sensitivity analysis for transfer. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The central claims rest on external empirical comparisons rather than internal definitions or tautological reductions, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, free parameters, or explicit axioms; the architecture is presented conceptually without formal assumptions listed.

pith-pipeline@v0.9.0 · 5418 in / 1021 out tokens · 68591 ms · 2026-05-12T16:07:34.759637+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

HierarchyEmergence hierarchy_emergence_forces_phi contradicts

CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

the addition of new capacity alongside pretrained networks gives these models the flexibility to both reuse old computations and learn new ones

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReConText3D: Replay-based Continual Text-to-3D Generation
cs.CV 2026-04 conditional novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
cs.AI 2023-06 conditional novelty 8.0

LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery
cs.CV 2026-05 unverdicted novelty 7.0

MedCRP-CL discovers semantic modalities online via CRP from text prompts and maintains modality-specific LoRA adapters with intra-modality EWC, achieving 73.3% Dice and 4.1% forgetting on 16 tasks while using 6x fewer...
Continual Learning of Domain-Invariant Representations
cs.LG 2026-05 unverdicted novelty 7.0

Introduces replay-based continual learning with sequential invariance alignment to learn domain-invariant representations, outperforming baselines on generalization to unseen domains across six datasets in vision, med...
Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry
cs.LG 2026-05 unverdicted novelty 7.0

MSRL represents trajectory segments as PSD matrices to prove additive composition properties and bootstrap value functions for better transfer, reaching 0.73 AUC versus 0.57-0.65 baselines.
KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
cs.LG 2026-05 conditional novelty 7.0

KAN-CL cuts catastrophic forgetting by 88-93% on Split-CIFAR-10/5T and Split-CIFAR-100/10T by anchoring KAN parameters at per-knot granularity while matching baseline accuracy.
MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound
cs.LG 2026-05 unverdicted novelty 7.0

MIST fixes unreliable splits in streaming decision trees for class-incremental learning by using a K-independent McDiarmid bound on Gini impurity, Bayesian moment projection for knowledge transfer, and KLL quantile sk...
MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound
cs.LG 2026-05 unverdicted novelty 7.0

MIST fixes unreliable splits in streaming decision trees for class-incremental learning by replacing Hoeffding-style bounds with a K-independent McDiarmid radius on Gini, plus Bayesian parent-to-child inheritance and ...
Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers
cs.CV 2026-05 unverdicted novelty 7.0

A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher ...
Beyond Forgetting in Continual Medical Image Segmentation: A Comprehensive Benchmark Study
cs.CV 2026-05 unverdicted novelty 7.0

Benchmark experiments in continual medical image segmentation reveal that no single method satisfies all clinical requirements, with replay-based approaches offering the best stability-plasticity trade-off while forwa...
Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay
q-bio.TO 2026-04 conditional novelty 7.0

A structure-aware VAE generates realistic FC matrices for replay, combined with multi-level knowledge distillation and hierarchical contextual bandit sampling, to enable continual fMRI-based brain disorder diagnosis a...
EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture
cs.AI 2026-04 unverdicted novelty 7.0

A hybrid SNN-LLM system uses learned spiking dynamics and lateral STDP propagation to trigger LLM actions without external prompts, producing the first autonomous action after 7 exchanges from a clean start.
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators
cs.LG 2026-03 unverdicted novelty 7.0

SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.
Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
cs.LG 2026-03 unverdicted novelty 7.0

PRISM transfers RL policies zero-shot by aligning causally validated discrete concepts from agent encoders, achieving 69-76% win rates in Go 7x7 but random performance in Atari Breakout.
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
cs.CV 2025-08 unverdicted novelty 7.0

The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
cs.CL 2023-11 unverdicted novelty 7.0

LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Dota 2 with Large Scale Deep Reinforcement Learning
cs.LG 2019-12 accept novelty 7.0

OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
NetTailor: Tuning the Architecture, Not Just the Weights
cs.CV 2019-06 unverdicted novelty 7.0

NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for s...
Understanding Goal Generalisation in Sequential Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable...
Expandable, Compressible, Mineable: Open-World Thermal Image Restoration
cs.CV 2026-05 unverdicted novelty 6.0

ECMRNet is a continual-learning restoration network that decomposes features into isolated groups, expands new groups for novel degradations, prunes via structural entropy, and mines historical components for compound...
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

NeuroMAS reframes multi-agent language systems as neural architectures where LLM agents learn coordination via reinforcement learning rather than predefined roles.
TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
cs.LG 2026-05 unverdicted novelty 6.0

TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.
MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data
cs.LG 2026-05 unverdicted novelty 6.0

MoRe identifies modular structure in representations themselves to enable principled reuse, alignment, and expansion of modules during continual adaptation on sequential data.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALB...
MILE: Mixture of Incremental LoRA Experts for Continual Semantic Segmentation across Domains and Modalities
cs.CV 2026-05 unverdicted novelty 6.0

MILE combines incremental LoRA experts with prototype-guided gating to support continual semantic segmentation across domains and modalities while adding only a small number of parameters per task.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
cs.LG 2026-05 unverdicted novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
NORACL: Neurogenesis for Oracle-free Resource-Adaptive Continual Learning
cs.LG 2026-04 unverdicted novelty 6.0

NORACL dynamically grows network capacity via neurogenesis-inspired signals to achieve oracle-level continual learning performance without pre-specifying architecture size.
Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks
cs.LG 2026-04 unverdicted novelty 6.0

FTN achieves near-zero forgetting on continual learning benchmarks by isolating task subnetworks via self-organizing binary masks generated through gradient descent, smoothing, and k-winner-take-all.
Learning Without Losing Identity: Capability Evolution for Embodied Agents
cs.RO 2026-04 unverdicted novelty 6.0

Embodied agents maintain a persistent identity while evolving capabilities via modular ECMs, raising simulated task success from 32.4% to 91.3% over 20 iterations with zero policy drift or safety violations.
Learning Without Losing Identity: Capability Evolution for Embodied Agents
cs.RO 2026-04 unverdicted novelty 6.0

Embodied agents maintain persistent identity while evolving modular capabilities through a closed-loop process, raising simulated task success from 32.4% to 91.3% with zero policy drift.
Information as Structural Alignment: A Dynamical Theory of Continual Learning
cs.LG 2026-04 unverdicted novelty 6.0

IBF achieves near-zero forgetting and positive backward transfer in continual learning by driving configurations toward coherence through motion and modification dynamics without storing raw data.
When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs
cs.CL 2026-04 unverdicted novelty 6.0

MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.
Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments
cs.LG 2026-04 unverdicted novelty 6.0

AMC models memory consolidation via a Liquid-Glass-Crystal process governed by an SDE with proven convergence to a Beta distribution, yielding 34-43% better forward transfer and 67-80% less forgetting on standard cont...
Evidence of an Emergent "Self" in Continual Robot Learning
cs.RO 2026-03 unverdicted novelty 6.0

Continual learning robots form a significantly more stable invariant subnetwork than constant-task controls, and preserving it improves adaptation while damaging it hurts performance.
Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning
cs.LG 2026-03 unverdicted novelty 6.0

CPNS regularization with dual counterfactual generators mitigates intra-task and inter-task spurious correlations in class-incremental learning feature expansion.
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
cs.LG 2026-02 unverdicted novelty 6.0

CrispEdit edits LLMs via low-curvature projections using Bregman divergence and K-FAC approximations, achieving high edit success with under 1% average capability degradation.
Robust Policy Optimization to Prevent Catastrophic Forgetting
cs.LG 2026-02 unverdicted novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
cs.RO 2026-01 unverdicted novelty 6.0

CLARE is an exemplar-free continual learning framework for VLAs that autonomously expands modular adapters based on feature similarity and uses autoencoder routing for label-free deployment.
Continually Evolving Skill Knowledge in Vision Language Action Model
cs.RO 2025-11 unverdicted novelty 6.0

Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results wit...
Routing-Based Continual Learning for Multimodal Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
A Survey of Continual Reinforcement Learning
cs.LG 2025-06 accept novelty 6.0

The paper surveys CRL literature, proposes a taxonomy of methods into four categories based on knowledge storage and transfer, reviews metrics and benchmarks, and outlines challenges and future research directions.
Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts
cs.LG 2025-06 unverdicted novelty 6.0

MoRAM frames continual learning as incremental addition of rank-1 adapters viewed as self-activating key-value associative memory units in a mixture-of-experts setup.
No Forgetting Learning: Buffer-free Continual Learning Classification
cs.LG 2025-03 unverdicted novelty 6.0

NFL is a buffer-free continual learning framework that decomposes networks, applies stepwise freezing with knowledge distillation, and adds an auto-encoder in NFL+ to match replay-based performance on image benchmarks...
Continual Domain Randomization
cs.RO 2024-03 unverdicted novelty 6.0

Continual Domain Randomization trains RL policies sequentially on randomization parameter subsets with continual learning to achieve robust sim-to-real transfer in robotic reaching and grasping.
Scalable and Efficient Continual Learning from Demonstration via a Hypernetwork-generated Stable Dynamics Model
cs.RO 2023-11 unverdicted novelty 6.0

A hypernetwork generates clock-augmented stable neural ODEs (sNODEs) for scalable continual learning from demonstration, achieving O(N) training time via stochastic regularization while outperforming baselines on LfD ...
Attentive Multi-Task Deep Reinforcement Learning
cs.LG 2019-07 unverdicted novelty 6.0

Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.
Lifelong Learning Starting From Zero
cs.LG 2019-06 unverdicted novelty 6.0

A blank-slate neural network grows via expansion, generalization, forgetting, and backpropagation for lifelong learning with claimed gains in accuracy, efficiency, and versatility.
Continual Reinforcement Learning with Diversity Exploration and Adversarial Self-Correction
cs.LG 2019-06 unverdicted novelty 6.0

CDAN framework uses diversity exploration and adversarial self-correction for continual RL in continuous control, evaluated on new CAM environment with NSD metric showing 18.35% NSD improvement over baseline.
HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering
cs.CV 2026-05 unverdicted novelty 5.0

HyLoVQA combines an anchor memory bank with hypernetwork-generated LoRA adapters and an alignment loss to adapt to new VQA tasks while limiting interference with prior knowledge.
Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning
cs.LG 2026-05 unverdicted novelty 5.0

Tunable MAGMAX adds a tunable preference vector to model merging for continual learning, enabling automatic adaptation to target environments using small amounts of data while maintaining or improving task-wise performance.
CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning
cs.LG 2026-05 unverdicted novelty 5.0

CP-MoE uses a transient expert, consistency-preserving routing bias, and guided regularization to reduce catastrophic forgetting in MoE-based LLMs and VLMs while preserving cross-task transfer, reporting SOTA on Super...
On the Stability of Growth in Structural Plasticity
cs.LG 2026-05 unverdicted novelty 5.0

Newborn units in growing neural networks are forward-active but backward-starved, receiving weaker gradients than existing units and creating integration challenges that make growth less reliable than pruning in compl...
MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data
cs.LG 2026-05 unverdicted novelty 5.0

MoRe identifies modular representations in sequential data for continual learning with identifiability guarantees, enabling principled adaptation without disrupting old modules.
MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data
cs.LG 2026-05 unverdicted novelty 5.0

MoRe decomposes representations into identifiable hierarchical modules to enable principled continual adaptation on sequential data.
FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning
cs.LG 2026-05 unverdicted novelty 5.0

FLAME is an MoE architecture using modality-specific routers and low-rank compression of expert knowledge to support efficient continual multimodal multi-task learning while reducing catastrophic forgetting.
Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation
cs.LG 2026-05 unverdicted novelty 5.0

A learned context-energy term in port-Hamiltonian policies creates selective risk navigation that activates evasive forces only when safer paths are available.
A Domain Incremental Continual Learning Benchmark for ICU Time Series Model Transportability
cs.LG 2026-05 unverdicted novelty 5.0

Proposes a domain incremental continual learning benchmark for ICU time series model transportability across US regions and evaluates data replay and EWC methods.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 74 Pith papers · 1 internal anchor

[1]

Adaptive multi-column deep neural networks with application to robust image denoising

Forest Agostinelli, Michael R Anderson, and Honglak Lee. Adaptive multi-column deep neural networks with application to robust image denoising. In Advances in Neural Information Processing Systems, 2013

work page 2013
[2]

Natural gradient works efficiently in learning

Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 1998

work page 1998
[3]

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, 2013

work page 2013
[4]

Deep learning of representations for unsupervised and transfer learning

Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In JMLR: Workshop on Unsupervised and Transfer Learning, 2012

work page 2012
[5]

Multi-column deep neural networks for image classification

Dan C. Ciresan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Conf. on Computer Vision and Pattern Recognition, 2012

work page 2012
[6]

The cascade-correlation learning architecture

Scott E. Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems, 1990

work page 1990
[7]

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006

work page 2006
[8]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[9]

Optimal brain damage

Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1990

work page 1990
[10]

Network in network

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In Proc. of Int’l Conference on Learning Representations (ICLR), 2013

work page 2013
[11]

Unsupervised and transfer learning challenge: a deep learning approach

G. Mesnil, Y. Dauphin, X. Glorot, S. Rifai, Y. Bengio, I. Goodfellow, E. Lavoie, X. Muller, G. Desjardins, D. Warde-Farley, P. Vincent, A. Courville, and J. Bergstra. Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. of the Unsupervised and Transfer Learning challenge and workshop, volume 27, 2012

work page 2012
[12]

Human-level control through deep reinforcement learning

V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

work page 2015
[13]

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Int’l Conf. on Machine Learning (ICML), 2016

work page 2016
[14]

Actor-mimic: Deep multitask and transfer reinforcement learning

Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Proc. of Int’l Conference on Learning Representations (ICLR), 2016

work page 2016
[15]

Mark B. Ring. Continual Learning in Reinforcement Environments. R. Oldenbourg Verlag, 1995

work page 1995
[16]

Beyond sharing weights for deep domain adaptation

Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. CoRR, abs/1603.06432, 2016

work page arXiv 2016
[17]

A. Rusu, S. Colmenarejo, Ç. Gülçehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. CoRR, abs/1511.06295, 2016

work page Pith review arXiv 2016
[18]

Ella: An efficient lifelong learning algorithm

Paul Ruvolo and Eric Eaton. Ella: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), June 2013

work page 2013
[19]

Lifelong machine learning systems: Beyond learning algorithms

Daniel L. Silver, Qiang Yang, and Lianghao Li. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, 2013

work page 2013
[20]

An introduction to inter-task transfer for reinforcement learning

Matthew E. Taylor and Peter Stone. An introduction to inter-task transfer for reinforcement learning. AI Magazine, 32(1):15–34, 2011

work page 2011
[21]

Knowledge Transfer in Deep Block-Modular Neural Networks

Alexander V. Terekhov, Guglielmo Montone, and J. Kevin O’Regan. Knowledge Transfer in Deep Block-Modular Neural Networks, pages 268–279. Springer International Publishing, Cham, 2015

work page 2015
[22]

A Deep Hierarchical Approach to Lifelong Learning in Minecraft

C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A Deep Hierarchical Approach to Lifelong Learning in Minecraft. ArXiv e-prints, 2016

work page 2016
[23]

How transferable are features in deep neural networks?

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014

work page 2014
[24]

Online incremental feature learning with denoising autoencoders

Guanyu Zhou, Kihyuk Sohn, and Honglak Lee. Online incremental feature learning with denoising autoencoders. In Proc. of Int’l Conf. on Artificial Intelligence and Statistics (AISTATS), pages 1453–1461, 2012.

Supplementary Material, A Perturbation Analysis: We explored two related methods for analysing transfer in progressive networks. One based on Fisher i...

work page 2012
[25]

(b-c) Comparison of per-layer sensitivities obtained using the APS method (b) and the AFS method (c; as per main text)

Grey line determines critical noise magnitude for each representation, $\sigma_i^2$. (b-c) Comparison of per-layer sensitivities obtained using the APS method (b) and the AFS method (c; as per main text). These are highly similar. Define $\Lambda_i^{(k)} = 1/\sigma_i^{2(k)}$ as the precision of the noise injected at layer $i$ of column $k$, which results in a 50% drop in performance. The ...

work page 2000
