pith. machine review for the scientific record.

arxiv: 2306.03310 · v2 · submitted 2023-06-05 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Chongkai Gao, Peter Stone, Qiang Liu, Yifeng Zhu, Yihao Feng, Yuke Zhu

Pith reviewed 2026-05-12 20:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords lifelong learning · robot manipulation · knowledge transfer · benchmark · procedural knowledge · forward transfer · visual encoders · pretraining

The pith

The LIBERO benchmark shows that plain sequential finetuning achieves stronger forward knowledge transfer than specialized lifelong learning methods for robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LIBERO, a benchmark with four task suites totaling 130 procedurally generated robot manipulation tasks, to study lifelong decision-making where agents transfer declarative knowledge of objects and concepts alongside procedural knowledge of actions and behaviors. It supplies human-teleoperated demonstrations for all tasks and examines five research questions concerning knowledge transfer efficiency, policy architectures, algorithms, task ordering robustness, and pretraining effects. Experiments demonstrate that sequential finetuning achieves stronger forward transfer than existing lifelong learning techniques, that no visual encoder architecture performs best across all knowledge types, and that naive supervised pretraining can reduce performance in later lifelong phases. These findings address the unique demands of building robots that accumulate and adapt skills over time, distinct from lifelong learning in static image or text domains.

Core claim

LIBERO establishes a benchmark for lifelong decision-making in robot manipulation by providing an extendible procedural generation pipeline and four task suites that isolate declarative, procedural, and mixed knowledge transfer. The benchmark supports investigation of policy architectures, lifelong algorithms, ordering effects, and pretraining through standardized tasks with demonstration data. Experiments across these dimensions show sequential finetuning outperforming prior lifelong methods on forward transfer, variation in visual encoder effectiveness by transfer type, and negative impacts from naive supervised pretraining on subsequent lifelong performance.

What carries the argument

The LIBERO benchmark, consisting of procedurally generated task suites that separate declarative, procedural, and mixed knowledge transfer in robot manipulation policies.

If this is right

  • Sequential finetuning provides a strong baseline for forward knowledge transfer in lifelong robot learning.
  • Policy visual encoders require type-specific evaluation because effectiveness differs across declarative and procedural transfer.
  • Pretraining approaches need refinement since naive supervised pretraining can impair agents during later lifelong learning stages.
  • Task sequence influences lifelong learner robustness, requiring algorithms that handle ordering variations.
  • Provided demonstration data enables sample-efficient testing of transfer methods across the benchmark tasks.
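As a concrete reference point for the forward-transfer claims above, one common formulation of the metric (the GEM-style definition from the continual-learning literature; LIBERO's exact metric may differ) can be sketched in a few lines. The success-rate matrix `R` and baseline `b` below are illustrative values, not numbers from the paper:

```python
def forward_transfer(R, b):
    """Forward transfer (FWT), GEM-style definition.

    R[i][j]: test performance on task j after finishing training on task i
             (tasks learned in order 0..T-1, 0-indexed).
    b[j]:    performance of a randomly initialized model on task j.

    FWT averages, over each task j >= 1, how much learning the earlier
    tasks helped on task j before task j itself was ever trained on.
    """
    T = len(R)
    gains = [R[j - 1][j] - b[j] for j in range(1, T)]
    return sum(gains) / len(gains)

# Toy example: 3 tasks, where zero-shot performance on later tasks
# improves as earlier tasks are learned.
R = [
    [0.8, 0.3, 0.1],   # after training on task 0
    [0.7, 0.9, 0.4],   # after training on task 1
    [0.6, 0.8, 0.9],   # after training on task 2
]
b = [0.1, 0.1, 0.1]    # random-init baselines
print(forward_transfer(R, b))  # ≈ 0.25
```

A positive FWT under this definition is what the review means by "stronger forward transfer": sequential finetuning's earlier training measurably helps later, unseen tasks.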

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The focus on procedural knowledge indicates that lifelong learning techniques developed for non-action domains require targeted modifications when applied to robotics.
  • Applying the benchmark to physical robot hardware could test whether simulation results on transfer and pretraining generalize to real environments.
  • Hybrid methods that blend sequential finetuning with selective retention of procedural skills may address the observed limitations of existing approaches.
  • The generation pipeline supports scaling to larger task sets for more rigorous evaluation of knowledge transfer limits.

Load-bearing premise

The four task suites and procedural pipeline sufficiently represent the main challenges of declarative, procedural, and mixed knowledge transfer in lifelong robot decision-making.

What would settle it

It would settle the matter to rerun the same comparisons of finetuning, lifelong algorithms, encoders, and pretraining on an independent collection of manipulation tasks: if specialized lifelong methods there outperformed sequential finetuning in forward transfer, the reported discoveries would be refuted.

read the original abstract

Lifelong learning offers a promising paradigm of building a generalist agent that learns and adapts over its lifespan. Unlike traditional lifelong learning problems in image and text domains, which primarily involve the transfer of declarative knowledge of entities and concepts, lifelong learning in decision-making (LLDM) also necessitates the transfer of procedural knowledge, such as actions and behaviors. To advance research in LLDM, we introduce LIBERO, a novel benchmark of lifelong learning for robot manipulation. Specifically, LIBERO highlights five key research topics in LLDM: 1) how to efficiently transfer declarative knowledge, procedural knowledge, or the mixture of both; 2) how to design effective policy architectures and 3) effective algorithms for LLDM; 4) the robustness of a lifelong learner with respect to task ordering; and 5) the effect of model pretraining for LLDM. We develop an extendible procedural generation pipeline that can in principle generate infinitely many tasks. For benchmarking purpose, we create four task suites (130 tasks in total) that we use to investigate the above-mentioned research topics. To support sample-efficient learning, we provide high-quality human-teleoperated demonstration data for all tasks. Our extensive experiments present several insightful or even unexpected discoveries: sequential finetuning outperforms existing lifelong learning methods in forward transfer, no single visual encoder architecture excels at all types of knowledge transfer, and naive supervised pretraining can hinder agents' performance in the subsequent LLDM. Check the website at https://libero-project.github.io for the code and the datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces LIBERO, a benchmark for lifelong learning in robot manipulation (LLDM) featuring an extendible procedural generation pipeline and four fixed task suites (130 tasks total) with human-teleoperated demonstrations. It positions the benchmark to study five LLDM topics: efficient transfer of declarative/procedural/mixed knowledge, policy architectures, algorithms, robustness to task ordering, and pretraining effects. Experiments report three main findings: sequential finetuning outperforms existing lifelong methods on forward transfer, no single visual encoder excels across all transfer types, and naive supervised pretraining can hinder subsequent LLDM performance.

Significance. If the task suites validly isolate the targeted knowledge-transfer distinctions, LIBERO would be a useful standardized resource for reproducible study of lifelong robot learning, filling a gap between declarative transfer benchmarks in vision/language and the procedural demands of decision-making. The provision of high-quality demos, procedural extensibility, and public code/datasets supports reproducibility and could accelerate work on the five listed topics.

major comments (3)
  1. [§3.2–3.3] (Task Suites and Procedural Pipeline): The claim that the four suites differentially stress declarative knowledge (entities/concepts), procedural knowledge (actions/behaviors), or mixtures is asserted without quantitative support (such as a per-task breakdown of required skills, horizon lengths, or visual similarity metrics) and without any sensitivity analysis showing that observed performance gaps arise from these distinctions rather than incidental factors.
  2. [§4] (Experiments): The three reported discoveries (sequential finetuning superiority, visual-encoder non-universality, pretraining hindrance) are presented without error bars, statistical significance tests, full hyperparameter protocols, or baseline implementation details, making it impossible to assess whether the results are robust or sensitive to random seeds and implementation choices.
  3. [§4.3] (Pretraining and Transfer): The finding that naive supervised pretraining can hinder LLDM performance is load-bearing for topic 5, yet it lacks controls for pretraining dataset size, domain gap, or fine-tuning schedule, leaving open whether the hindrance is general or specific to the chosen pretraining regime and task ordering.
minor comments (3)
  1. [Figures/Tables] Figure 3 and Table 2: Axis labels and legend entries are too small for readability; consider increasing font size and adding explicit task-suite identifiers.
  2. [§2] §2 (Related Work): A few recent lifelong robot learning benchmarks (e.g., those using simulation suites like RLBench or Meta-World) are cited but not compared on metrics such as task diversity or demonstration quality.
  3. [Throughout] Notation: The distinction between “declarative” and “procedural” knowledge is used throughout but never given an operational definition tied to the task suites; a short clarifying paragraph would help.
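The error-bar and significance reporting requested in major comment 2 amounts to a paired comparison across seeds. A minimal stdlib-only sketch; the five per-seed success rates below are hypothetical, not results from the paper:

```python
import math

def paired_t(xs, ys):
    """Paired t statistic for per-seed scores of two methods.

    xs[k], ys[k]: success rates of methods A and B on random seed k.
    Returns (t, dof); a large |t| means the per-seed differences are
    unlikely to be mean-zero noise (compare against Student's t with dof
    degrees of freedom for a p-value).
    """
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((di - mean) ** 2 for di in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical per-seed success rates over five seeds.
seq_ft  = [0.62, 0.66, 0.64, 0.61, 0.67]   # sequential finetuning
ll_meth = [0.55, 0.58, 0.57, 0.54, 0.60]   # a lifelong-learning baseline
t, dof = paired_t(seq_ft, ll_meth)
print(f"t = {t:.2f}, dof = {dof}")
```

Pairing by seed is the right structure here because both methods share the same seed-dependent task instances; an unpaired test would discard that correlation and lose power.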

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments have helped us identify areas where additional quantitative support and experimental rigor will strengthen the presentation of LIBERO. We address each major comment below and indicate the revisions made.

read point-by-point responses
  1. Referee: [§3.2–3.3] (Task Suites and Procedural Pipeline): The claim that the four suites differentially stress declarative knowledge (entities/concepts), procedural knowledge (actions/behaviors), or mixtures is asserted without quantitative support (such as a per-task breakdown of required skills, horizon lengths, or visual similarity metrics) and without any sensitivity analysis showing that observed performance gaps arise from these distinctions rather than incidental factors.

    Authors: We agree that the original manuscript would benefit from explicit quantitative validation of the intended knowledge distinctions. In the revised version, we have added a new table in §3.2 that provides a per-task breakdown of required skills (e.g., object manipulation vs. spatial reasoning), average horizon lengths across suites, and visual similarity metrics computed as average cosine distances between task image features extracted from a frozen CLIP encoder. We have also included a sensitivity analysis in §3.3: by systematically ablating declarative elements (e.g., object identity changes) or procedural elements (e.g., action sequence modifications) in selected tasks and measuring the resulting transfer performance gaps, we show that the observed differences align with the targeted knowledge types rather than incidental factors. These additions directly support the differential stress claims. revision: yes

  2. Referee: [§4] (Experiments): The three reported discoveries (sequential finetuning superiority, visual-encoder non-universality, pretraining hindrance) are presented without error bars, statistical significance tests, full hyperparameter protocols, or baseline implementation details, making it impossible to assess whether the results are robust or sensitive to random seeds and implementation choices.

    Authors: We acknowledge that the lack of statistical reporting and implementation details limits reproducibility assessment. We have revised §4 and the appendix to include error bars computed from five independent runs with distinct random seeds for all reported metrics. Paired t-tests with p-values are now provided for the key comparisons underlying the three discoveries. The appendix has been expanded with complete hyperparameter tables for every method and baseline (including learning rates, batch sizes, and network dimensions), along with explicit references to the public code repository where the exact implementations can be inspected. revision: yes

  3. Referee: [§4.3] (Pretraining and Transfer): The finding that naive supervised pretraining can hinder LLDM performance is load-bearing for topic 5, yet it lacks controls for pretraining dataset size, domain gap, or fine-tuning schedule, leaving open whether the hindrance is general or specific to the chosen pretraining regime and task ordering.

    Authors: The original experiments used a standard ImageNet-supervised pretrained visual encoder as the naive baseline. To address the referee's concern, we have added controlled experiments in the revised §4.3: (1) varying pretraining dataset size via random subsets of ImageNet, (2) reducing domain gap by comparing against a model pretrained on a large robot manipulation dataset, and (3) testing multiple fine-tuning schedules (different learning-rate decays and epoch counts). The hindrance effect remains consistent across these controls, supporting that it is not an artifact of the specific regime or ordering. We have updated the text to clarify the scope of the claim while noting that exhaustive exploration of all pretraining variants lies beyond a single benchmark paper. revision: partial
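The visual-similarity metric proposed in response 1 reduces to an average pairwise cosine distance between per-task feature vectors. A minimal self-contained sketch; the two-dimensional "features" below are toy stand-ins for frozen CLIP image embeddings, not data from the paper:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_pairwise_distance(feats):
    """Average cosine distance over all unordered pairs of task features."""
    n = len(feats)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine_distance(feats[i], feats[j]) for i, j in pairs) / len(pairs)

# Toy stand-ins for per-task image features (a real pipeline would run a
# frozen CLIP image encoder over renders of each task's scene).
suite_a = [[1.0, 0.0], [0.9, 0.1]]   # visually similar tasks  -> low distance
suite_b = [[1.0, 0.0], [0.0, 1.0]]   # visually dissimilar tasks -> high distance
print(mean_pairwise_distance(suite_a) < mean_pairwise_distance(suite_b))  # True
```

Per-suite averages of this kind would let the authors show quantitatively that, say, a declarative-transfer suite varies objects (high visual distance) while a procedural-transfer suite varies behaviors over near-identical scenes (low visual distance).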

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential predictions

full rationale

The paper introduces LIBERO as an empirical benchmark for lifelong decision-making in robot manipulation, supported by procedural task generation, four task suites, and human demonstration data. It reports experimental findings on transfer, architectures, algorithms, ordering, and pretraining without any equations, fitted parameters renamed as predictions, or derivation chains. No self-citations are load-bearing for core claims, and the work is self-contained against external benchmarks via provided code and datasets. The primary assumption (suite representativeness) is an empirical design choice open to external validation rather than a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that the generated tasks sufficiently probe the five listed research topics in lifelong decision-making; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The four task suites and procedural generation pipeline capture the essential challenges of declarative, procedural, and mixed knowledge transfer in robot lifelong learning.
    Invoked to justify the benchmark's relevance to the five research topics highlighted in the abstract.

pith-pipeline@v0.9.0 · 5589 in / 1315 out tokens · 43268 ms · 2026-05-12T20:59:50.455365+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  2. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  3. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  4. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  5. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  6. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  7. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  8. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  9. STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.

  10. TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

    cs.RO 2026-05 unverdicted novelty 6.0

    TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.

  11. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  12. BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    BEACON uses discrepancy-aware importance reweighting to jointly train diffusion-based robot policies and source sample weights, improving performance over target-only and fixed-ratio baselines in cross-domain manipula...

  13. BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    BEACON uses discrepancy-aware importance reweighting to co-train generative robot policies from abundant source and limited target demonstrations, yielding better robustness and implicit feature alignment.

  14. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  15. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  16. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  17. Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

    cs.RO 2026-05 unverdicted novelty 6.0

    VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.

  18. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  19. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  20. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  21. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  22. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  23. RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.

  24. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  25. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    cs.RO 2024-06 unverdicted novelty 6.0

    RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.

  26. ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

  27. Understanding Asynchronous Inference Methods for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Controlled benchmarks show per-step residual correction (A2C2) as most effective for VLA asynchronous inference up to d=8 delays on Kinetix with over 90% solve rate, outperforming inpainting and conditioning while tra...

  28. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  29. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  30. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  31. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  32. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  33. Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

    cs.RO 2026-04 unverdicted novelty 3.0

    A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 30 Pith papers · 7 internal anchors

  1. [1]

    Causalworld: A robotic manipulation benchmark for causal structure and transfer learning

    Ossama Ahmed, Frederik Träuble, Anirudh Goyal, Alexander Neitz, Yoshua Bengio, Bernhard Schölkopf, Manuel Wüthrich, and Stefan Bauer. Causalworld: A robotic manipulation benchmark for causal structure and transfer learning. arXiv preprint arXiv:2010.04296, 2020

  2. [2]

    Few-shot continual active learning by a robot

    Ali Ayub and Carter Fendley. Few-shot continual active learning by a robot. arXiv preprint arXiv:2210.04137, 2022

  3. [3]

    F-siol-310: A robotic dataset and benchmark for few-shot incremental object learning

    Ali Ayub and Alan R Wagner. F-siol-310: A robotic dataset and benchmark for few-shot incremental object learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13496–13502. IEEE, 2021

  4. [4]

    A framework for behavioural cloning

    Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pages 103–129, 1995

  5. [5]

    Lifelong reinforcement learning with modulating masks

    Eseoghene Ben-Iwhiwhu, Saptarshi Nath, Praveen K Pilly, Soheil Kolouri, and Andrea Soltoggio. Lifelong reinforcement learning with modulating masks. arXiv preprint arXiv:2212.11110, 2022

  6. [6]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009

  7. [7]

    Continual lifelong learning in natural language processing: A survey

    Magdalena Biesialska, Katarzyna Biesialska, and Marta R Costa-Jussa. Continual lifelong learning in natural language processing: A survey. arXiv preprint arXiv:2012.09823, 2020

  8. [8]

    Mixture density networks

    Christopher M Bishop. Mixture density networks. 1994

  9. [9]

    Dark experience for general continual learning: a strong, simple baseline

    Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33:15920–15930, 2020

  10. [10]

    Multitask learning

    Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997

  11. [11]

    Riemannian walk for incremental learning: Understanding forgetting and intransigence

    Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018

  12. [12]

    Efficient lifelong learning with A-GEM

    Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018

  13. [13]

    On tiny episodic memories in continual learning

    Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019

  14. [14]

    Superposition of many models into one

    Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, and Bruno Olshausen. Superposition of many models into one. Advances in neural information processing systems, 32, 2019

  15. [15]

    Leveraging procedural generation to benchmark reinforcement learning

    Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International conference on machine learning, pages 2048–2056. PMLR, 2020

  16. [16]

    A continual learning survey: Defying forgetting in classification tasks

    Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

  17. [17]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  18. [18]

    The mnist database of handwritten digit images for machine learning research

    Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012

  19. [19]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  20. [20]

    Don’t forget, there is more than forgetting: new metrics for continual learning

    Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don’t forget, there is more than forgetting: new metrics for continual learning. arXiv preprint arXiv:1810.13166, 2018

  21. [21] Beyza Ermis, Giovanni Zappella, Martin Wistuba, and Cédric Archambeau. Memory efficient continual learning with transformers. 2022.

  22. [22] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.

  23. [23] Sam Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and understanding Atari agents. arXiv preprint arXiv:1711.00138, 2017.

  24. [24] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. ManiSkill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023.

  25. [25] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.

  26. [26] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. Advances in Neural Information Processing Systems, 32, 2019.

  27. [27] Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.

  28. [28] Leslie Pack Kaelbling. The foundation of efficient robot learning. Science, 369(6506):915–916, 2020.

  29. [29] Minsoo Kang, Jaeyoo Park, and Bohyung Han. Class-incremental learning by knowledge distillation with adaptive feature consolidation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16071–16080, 2022.

  30. [30] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2016.

  31. [31] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.

  32. [32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  33. [33] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

  34. [34] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

  35. [35] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHAVIOR-1K: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023.

  36. [36] Bo Liu, Qian Liu, and Peter Stone. Continual learning and private unlearning. In CoLLAs, 2022.

  37. [37] Hao Liu and Huaping Liu. Continual learning with recursive gradient optimization. arXiv preprint arXiv:2201.12522, 2022.

  38. [38] Vincenzo Lomonaco and Davide Maltoni. CORe50: A new dataset and benchmark for continuous object recognition. In Conference on Robot Learning, pages 17–26. PMLR, 2017.

  39. [39] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 2017.

  40. [40] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28–51, 2022.

  41. [41] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.

  42. [42] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021.

  43. [43] Drew McDermott, Malik Ghallab, Adele Howe, Craig Knoblock, Ashwin Ram, Manuela Veloso, Daniel Weld, and David Wilkins. PDDL: The planning domain definition language. 1998.

  44. [44] Jorge A Mendez, Marcel Hussing, Meghna Gummadi, and Eric Eaton. CompoSuite: A compositional reinforcement learning benchmark. arXiv preprint arXiv:2207.04136, 2022.

  45. [45] Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Timothy Nguyen, Razvan Pascanu, Dilan Gorur, and Mehrdad Farajtabar. Architecture matters in continual learning. arXiv preprint arXiv:2202.00275, 2022.

  46. [46] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

  47. [47] Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. ManiSkill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021.

  48. [48] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. arXiv preprint arXiv:2003.04960, 2020.

  49. [49] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.

  50. [50] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

  51. [51] Sam Powers, Eliot Xing, Eric Kolve, Roozbeh Mottaghi, and Abhinav Gupta. CORA: Benchmarks, baselines, and metrics as a platform for continual reinforcement learning agents. arXiv preprint arXiv:2110.10067, 2021.

  52. [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  53. [53] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

  54. [54] Amanda Rios and Laurent Itti. Lifelong learning without a task oracle. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pages 255–263. IEEE, 2020.

  55. [55] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

  56. [56] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

  57. [57] Gobinda Saha, Isha Garg, Aayush Ankit, and Kaushik Roy. SPACE: Structured compression and sharing of representational space for continual learning. IEEE Access, 9:150480–150494, 2021.

  58. [58] Mikayel Samvelyan, Robert Kirk, Vitaly Kurin, Jack Parker-Holder, Minqi Jiang, Eric Hambro, Fabio Petroni, Heinrich Küttler, Edward Grefenstette, and Tim Rocktäschel. MiniHack the planet: A sandbox for open-ended reinforcement learning research. arXiv preprint arXiv:2109.13202, 2021.

  59. [59] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.

  60. [60] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pages 4528–4537. PMLR, 2018.

  61. [61] Qi She, Fan Feng, Xinyue Hao, Qihan Yang, Chuanlin Lan, Vincenzo Lomonaco, Xuesong Shi, Zhengwei Wang, Yao Guo, Yimin Zhang, et al. OpenLORIS-Object: A robotic vision dataset and benchmark for lifelong deep learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 4767–4773. IEEE, 2020.

  62. [62] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020.

  63. [63] Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, et al. BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pages 477–490. PMLR, 2022.

  64. [64] Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, et al. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808, 2021.

  65. [65] Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25–46, 1995.

  66. [66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  67. [67] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

  68. [68] Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. MimicPlay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023.

  69. [69] Maciej Wołczyk, Michał Zając, Razvan Pascanu, Łukasz Kuciński, and Piotr Miłoś. Continual World: A robotic benchmark for continual reinforcement learning. In Neural Information Processing Systems, 2021.

  70. [70] Maciej Wołczyk, Michał Zając, Razvan Pascanu, Łukasz Kuciński, and Piotr Miłoś. Disentangling transfer in continual reinforcement learning. arXiv preprint arXiv:2209.13900, 2022.

  71. [71] Lemeng Wu, Bo Liu, Peter Stone, and Qiang Liu. Firefly neural architecture descent: A general approach for growing neural networks. Advances in Neural Information Processing Systems, 33:22373–22383, 2020.

  72. [72] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.

  73. [73] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020.

  74. [74] Da-Wei Zhou, Fu-Yun Wang, Han-Jia Ye, Liang Ma, Shiliang Pu, and De-Chuan Zhan. Forward compatible few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9046–9056, 2022.

  75. [75] Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. VIOLA: Imitation learning for vision-based manipulation with object proposal priors. arXiv preprint arXiv:2210.11339, 2022.

  76. [76] Yuke Zhu, Josiah Wong, Ajay Mandlekar, and Roberto Martín-Martín. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.
