pith. machine review for the scientific record.

arxiv: 2604.26689 · v3 · submitted 2026-04-29 · 💻 cs.RO · cs.AI


Atomic-Probe Governance for Skill Updates in Compositional Robot Policies


Pith reviewed 2026-05-08 03:21 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords atomic-quality probe · skill-update governance · compositional robot policies · dominant-skill effect · hybrid selector · paired-sampling protocol · robosuite manipulation tasks

The pith

An atomic-quality probe predicts how skill replacements affect compositional robot task success using only individual skill tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that skill libraries in deployed robots change over time through updates, yet existing composition methods assume a frozen library and ignore how swapping one skill alters overall outcomes. By introducing a paired-sampling protocol on robosuite tasks, the authors identify a dominant-skill effect in which one skill controls most of the composition's success rate, while cheap off-policy distance metrics miss this effect entirely. They then define an atomic-quality probe that evaluates each skill in isolation at zero per-decision cost and combine it with selective full-composition revalidation in a Hybrid Selector. Across 144 update events the probe alone stays within 3 percentage points of full oracle revalidation on average, and the hybrid version closes most of the remaining gap at roughly half the cost.

Core claim

In compositional robot policies, replacing a skill inside a composition can shift task success by up to 50 percentage points because of a dominant-skill effect; an atomic-quality probe that samples only the replaced skill's standalone performance predicts the new composition outcome sufficiently well to govern updates, while a Hybrid Selector further trades a modest amount of revalidation cost for higher accuracy.

What carries the argument

The atomic-quality probe, which uses paired sampling of a skill's atomic success rate to forecast its contribution inside any composition that contains it.
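
In code terms, the probe reduces to a Bernoulli estimate plus a forecast rule. The sketch below is a minimal illustration, not the paper's implementation: the `run_skill_trial` rollout interface and the min-over-skills forecast are assumptions (the paper's exact forecast rule is not reproduced on this page).

```python
def atomic_probe(run_skill_trial, n_trials=30):
    """Estimate a skill's standalone (atomic) success rate.

    run_skill_trial: hypothetical callable that executes one standalone
    rollout of the skill in isolation and returns True on success. The
    cost is paid once per update, not per composition decision.
    """
    successes = sum(run_skill_trial() for _ in range(n_trials))
    return successes / n_trials

def forecast_composition_success(atomic_rates):
    """Forecast a composition's success from its skills' atomic rates.

    Under a dominant-skill effect the composition is bottlenecked by its
    weakest member, so min() is one simple zero-per-decision predictor --
    an illustrative assumption, not the paper's stated rule.
    """
    return min(atomic_rates)
```

Under this min-rule reading, swapping the dominant ECM (86.7% atomic) for a peer (≤26.7%) would move the forecast by tens of percentage points, on the scale of the +50pp swing the paper reports.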

If this is right

  • Robot skill libraries can accept updates without exhaustive re-testing of every possible composition that uses the updated skill.
  • Dominant skills can be identified at deployment time so that updates to them receive higher priority or stricter validation.
  • A Hybrid Selector lets operators choose operating points on a cost-accuracy curve, using zero-cost atomic probes for most decisions and full revalidation only when the probe is uncertain.
  • Off-policy behavioral distance metrics are ruled out as reliable predictors for composition outcomes in these settings.
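
The operating-point idea in the third bullet can be sketched as a simple gating rule. The threshold, uncertainty band, and function names below are illustrative assumptions; the paper's Algorithm 1 (Hybrid Skill-Update Selector) is only partially visible on this page and is not reproduced here.

```python
def hybrid_select(probe_score, threshold=0.5, band=0.15, revalidate=None):
    """Accept or reject a skill update from a zero-cost probe score.

    probe_score: atomic success-rate estimate for the candidate skill.
    When the score is far from the accept threshold, decide from the
    probe alone; only inside the uncertainty band pay for full
    composition revalidation via `revalidate` (a callable returning the
    measured composition success rate). Threshold and band values are
    illustrative assumptions, not the paper's settings.
    """
    if probe_score >= threshold + band:
        return "accept", "probe"
    if probe_score <= threshold - band:
        return "reject", "probe"
    measured = revalidate()  # full-cost composition rollouts
    return ("accept" if measured >= threshold else "reject"), "revalidation"
```

Widening `band` sends more decisions to full revalidation, trading cost for oracle agreement along the Pareto frontier the paper characterizes; `band=0` recovers the atomic-only probe.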

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same probe could be applied to continual-learning pipelines where new skills arrive from human demonstrations or domain adaptation without requiring a full library re-composition pass each time.
  • If the dominant-skill pattern holds in contact-rich or long-horizon tasks, it would suggest that composition governance can be reduced to a small number of high-impact atomic checks rather than combinatorial search.
  • Deployed systems could log atomic probe scores over time to detect when an update has degraded a previously dominant skill and trigger targeted re-training.
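
The logging idea in the last bullet amounts to change-point monitoring on probe scores. A minimal sketch; the log format and the drop threshold are assumptions, not from the paper:

```python
def detect_degradation(probe_log, drop_threshold=0.2):
    """Flag skills whose latest atomic probe score has degraded.

    probe_log: dict mapping skill name -> chronological list of probe
    scores. A skill is flagged when its newest score falls more than
    drop_threshold below its historical best, which could trigger
    targeted revalidation or re-training. The 20pp threshold is an
    illustrative assumption.
    """
    flagged = []
    for skill, scores in probe_log.items():
        if len(scores) >= 2 and max(scores[:-1]) - scores[-1] > drop_threshold:
            flagged.append(skill)
    return flagged
```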

Load-bearing premise

The dominant-skill effect and the probe's ability to forecast composition changes will continue to appear in tasks and update patterns beyond the three robosuite environments and 144 events examined.

What would settle it

A new manipulation task in which the atomic probe's predicted success rate for a skill swap deviates by more than 20 percentage points from the measured composition success rate.

Figures

Figures reproduced from arXiv: 2604.26689 by Cong Yang, John See, Simin Luan, Xue Qin, Zeyd Boukhers, Zhijun Li.

Figure 1. Cost–accuracy Pareto frontier for the seven selectors, per task and in the cross-task view.
Figure 2. T6 atomic success rate per (seed, phase) ECM (visual companion to the corresponding table).
Figure 3. Pairwise action L2 distance between (seed_i, seed_j) ECMs, per (task, phase). All twelve panels are visually uniform; the dominant ECM on T6 (row 1, column 1; seed=2024 reach) is not behaviorally distant from its peers.

Table residue accompanying Figure 3 (lift task; rows: primary seed, columns: swap seed; the col. mean column was empty in the source):

                 swap=42  swap=7  swap=123  swap=2024
  primary=42        13.3    13.3      20.0       26.7
  primary=7         23.3    26.7      13.3       26.7
  primary=123       16.7    30.0      46.7       46.7
  primary=2024      76.7    83.3      76.7       70.0
Original abstract

Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typed-composition methods (BLADE, SymSkill, Generative Skill Chaining) treat the library as frozen at test time and do not analyze how composition outcomes change when a skill is replaced. We introduce a paired-sampling cross-version swap protocol on robosuite manipulation tasks to characterize this dimension of compositional skill learning. On a dual-arm peg-in-hole task we discover a dominant-skill effect: one ECM achieves 86.7% atomic success rate while every other ECM is at or below 26.7%, and whether this dominant ECM enters a composition shifts the success rate by up to +50pp. We characterize the boundary on a simpler pick task where all atomic policies saturate at 100% and the effect is undefined. Across three tasks we further find that off-policy behavioral distance metrics fail to identify the dominant ECM, ruling out the natural cheap predictor. We propose an atomic-quality probe and a Hybrid Selector combining per-skill probes (zero per-decision cost) with selective composition revalidation (full cost), and characterize its Pareto frontier on 144 skill-update decisions. On T6 the atomic-only probe sits 23pp below full revalidation (64.6% vs 87.5% oracle match) at zero per-decision cost; a Hybrid Selector with m=10 closes most of that gap to ~12pp at 46% of full-revalidation cost. On the cross-task average over 144 events, atomic-only is within 3pp of full revalidation under a mixed-oracle caveat. The atomic-quality probe is, to our knowledge, the first principled, deployment-ready primitive for skill-update governance in compositional robot policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript introduces a paired-sampling cross-version swap protocol to study how skill updates affect compositional outcomes in robot policies on robosuite tasks. It reports a dominant-skill effect on dual-arm peg-in-hole (one ECM at 86.7% atomic success, others ≤26.7%, shifting composition success by up to +50pp), shows off-policy behavioral distance fails to identify it, and proposes an atomic-quality probe (zero per-decision cost) plus Hybrid Selector (selective revalidation) that on 144 update events achieves 64.6% vs 87.5% oracle match on T6 (23pp gap) or ~12pp gap at 46% cost with m=10, and within 3pp cross-task average under a mixed-oracle caveat. The atomic-quality probe is claimed to be the first principled, deployment-ready primitive for skill-update governance.

Significance. If the dominant-skill effect and probe reliability hold, the work supplies a practical, low-overhead mechanism for governing continual updates to skill libraries in deployed compositional policies, avoiding full revalidation costs while maintaining high fidelity to oracle outcomes. The empirical protocol and Pareto characterization of the hybrid approach are concrete contributions to robot learning. The significance is limited by the narrow task set and lack of statistical detail, but the core idea of an atomic probe as a governance primitive has clear potential utility if validated more broadly.

major comments (4)
  1. [Abstract] The reported figures (86.7% dominant ECM, ≤26.7% others, +50pp shift, 23pp gap on T6, 12pp hybrid gap, 46% cost, 3pp cross-task average) are given without error bars, trial counts, or statistical tests, so the robustness of the dominant-skill effect and the claimed closeness to oracle cannot be assessed.
  2. [Abstract] The cross-task average claim of being 'within 3pp of full revalidation' is qualified by an undefined 'mixed-oracle caveat'; without an explicit definition of how the oracle is modified or why the caveat is needed, this quantitative equivalence is not load-bearing for the deployment-ready assertion.
  3. [Across three tasks] The dominant-skill effect is demonstrated only on dual-arm peg-in-hole; it is explicitly undefined on the pick task due to 100% saturation, and no analysis is supplied for the third task or for whether the effect is an artifact of the chosen robosuite environments rather than a general property of compositional policies.
  4. [144 skill-update decisions] The Hybrid Selector Pareto frontier and the atomic probe's 64.6% vs 87.5% result are evaluated only against full revalidation within the same narrow distribution; no cross-simulator, real-robot, or expanded task-suite results are reported, leaving the generalization required for the 'deployment-ready' claim untested.
minor comments (2)
  1. [Abstract] The acronym ECM is used without expansion on first appearance.
  2. [Method] The precise computation of the atomic-quality probe (how per-skill success rates are turned into a governance signal) should be stated explicitly so that the failure of behavioral distance can be contrasted mechanistically.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the recognition of the potential utility of the atomic probe for skill-update governance. We address each of the major comments point by point below.

Point-by-point responses
  1. Referee: [Abstract] The reported figures (86.7% dominant ECM, ≤26.7% others, +50pp shift, 23pp gap on T6, 12pp hybrid gap, 46% cost, 3pp cross-task average) are given without error bars, trial counts, or statistical tests, so the robustness of the dominant-skill effect and the claimed closeness to oracle cannot be assessed.

    Authors: We agree that the reported figures in the abstract lack error bars, trial counts, and statistical tests, which hinders assessment of robustness. We will revise the abstract and main text to include the number of trials (30 per condition), standard errors, and results of statistical significance tests for key effects such as the dominant-skill phenomenon. revision: yes
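
For scale: with the 30 trials per condition promised in the response, a standard Wilson score interval shows how wide the error bars would be. This is generic binomial statistics, not the authors' analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# The reported 86.7% dominant-ECM rate corresponds to 26/30 successes;
# its 95% interval spans roughly 0.70 to 0.95.
low, high = wilson_interval(26, 30)
```

An interval of roughly [70%, 95%] still separates the dominant ECM from peers at ≤26.7%, so the headline effect plausibly survives error bars; the tighter 3pp and 12pp margins are the claims most at risk.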

  2. Referee: [Abstract] The cross-task average claim of being 'within 3pp of full revalidation' is qualified by an undefined 'mixed-oracle caveat'; without an explicit definition of how the oracle is modified or why the caveat is needed, this quantitative equivalence is not load-bearing for the deployment-ready assertion.

    Authors: We agree that the mixed-oracle caveat requires explicit definition to make the claim clear. This caveat accounts for cases of atomic saturation where full revalidation provides no additional information. We will add a precise definition in the abstract and elaborate in the methods section of the revised manuscript. revision: yes

  3. Referee: [Across three tasks] The dominant-skill effect is demonstrated only on dual-arm peg-in-hole; it is explicitly undefined on the pick task due to 100% saturation, and no analysis is supplied for the third task or for whether the effect is an artifact of the chosen robosuite environments rather than a general property of compositional policies.

    Authors: The dominant-skill effect is analyzed in depth for the dual-arm peg-in-hole task where atomic success rates vary sufficiently. We already note its undefined nature on the saturated pick task. For the third task, supporting results on metric failures are present but the dominant effect analysis is lighter. We will expand the cross-task discussion, include additional details on the third task, and add a limitations paragraph addressing potential environment-specific artifacts and the need for broader validation. revision: partial

  4. Referee: [144 skill-update decisions] The Hybrid Selector Pareto frontier and the atomic probe's 64.6% vs 87.5% result are evaluated only against full revalidation within the same narrow distribution; no cross-simulator, real-robot, or expanded task-suite results are reported, leaving the generalization required for the 'deployment-ready' claim untested.

    Authors: We acknowledge that the evaluation is limited to the robosuite tasks and does not include cross-simulator or real-robot experiments, which would be required for full generalization. This is a genuine scope limitation of the present work. We will revise the abstract and conclusions to moderate the 'deployment-ready' phrasing to emphasize the probe as a promising, low-cost primitive supported by the current evidence, while highlighting the need for future broader testing. No new experiments are added at this stage. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on fixed robosuite tasks

Full rationale

The paper reports experimental results from a paired-sampling protocol on three robosuite manipulation tasks and 144 skill-update events. It discovers the dominant-skill effect, shows off-policy metrics fail, and evaluates the atomic-quality probe and Hybrid Selector directly against full revalidation oracle. No equations, derivations, or fitted parameters are presented that reduce to quantities defined by the authors' own choices; the central claims rest on measured success rates rather than any self-definitional or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on empirical observations from three robosuite tasks and the introduction of new methodological primitives. No explicit free parameters, axioms, or invented physical entities are stated in the abstract.

invented entities (1)
  • atomic-quality probe — no independent evidence
    purpose: per-skill quality assessment at zero per-decision cost to govern updates
    A newly proposed primitive whose effectiveness is demonstrated via the reported Pareto frontier.

pith-pipeline@v0.9.0 · 5639 in / 1336 out tokens · 70946 ms · 2026-05-08T03:21:18.279452+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 36 canonical work pages · 15 internal anchors

  1. [1] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. https://arxiv.org/abs/2406.09246

  2. [2] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. In Robotics: Science and Systems (RSS), 2024. https://arxiv.org/abs/2405.12213

  3. [3] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. In Robotics: Science and Systems (RSS), 2025. https://arxiv.org/abs/2410.24164

  4. [4] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems (RSS), 2024. https://arxiv.org/abs/2403.12945

  5. [5] Open X-Embodiment Collaboration, A. O’Neill, A. Rehman, A. Maddukuri, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In ICRA, 2024. https://arxiv.org/abs/2310.08864

  6. [6] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023. https://arxiv.org/abs/2307.15818

  7. [8] CoRL 2025. https://arxiv.org/abs/2505.21981. LLM extracts preconditions/effects for high-level actions; neural controllers per action. Closest neighbor to the typed-composition idea.

  8. [9] Y. Shao et al. SymSkill: Symbol and skill co-invention for data-efficient and reactive long-horizon manipulation. arXiv preprint arXiv:2510.01661, 2025. https://arxiv.org/abs/2510.01661. Jointly learns predicates, operators, and skills from unsegmented demos; RoboCasa 6-step composition with real-time recovery.

  9. [10] U. A. Mishra, S. Xue, Y. Chen, and D. Xu. Generative skill chaining: Long-horizon skill planning with diffusion models. In Conference on Robot Learning (CoRL), 2023. https://arxiv.org/abs/2401.03360. Learns a joint (precondition, skill params, effect) diffusion per skill; conditional sampling for chaining.

  10. [11] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. https://arxiv.org/abs/2305.16291. GPT-4 auto-generates executable code into a skill library with self-verification; append-only, no typed interfaces.

  11. [12] J. Zhang, J. Zhang, K. Pertsch, Z. Liu, X. Ren, M. Chang, S.-H. Sun, and J. J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance. In CoRL, 2023. https://clvrai.github.io/boss/. BOSS: LLM-guided growing of a skill library; chains base skills into long-horizon behaviors.

  12. [14] https://arxiv.org/abs/2311.02058. Continual skill discovery from an open-vocabulary VLM; append-only skill library.

  13. [15] L. Keller, D. Tanneberg, and J. Peters. Neuro-symbolic imitation learning: Discovering symbolic abstractions for skill learning. arXiv preprint arXiv:2503.21406, 2025. https://arxiv.org/abs/2503.21406. Learns PDDL predicates + neural skills from demos; symbolic planning for abstract plans refined by neural skills.

  14. [16] Y. Liang, N. Kumar, H. Tang, A. Weller, J. B. Tenenbaum, T. Silver, J. F. Henriques, and K. Ellis. VisualPredicator: Learning abstract world models with neuro-symbolic predicates for robot planning. arXiv preprint arXiv:2410.23156, 2024. https://arxiv.org/abs/2410.23156. Neuro-symbolic predicates for an abstract world model + planning.

  15. [17] M. Ahn, A. Brohan, et al. Do as I can, not as I say: Grounding language in robotic affordances. In CoRL, 2022. https://arxiv.org/abs/2204.01691. LLM suggests actions weighted by a learned affordance value function; foundational LLM+robotics work.

  16. [18] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. In ICRA, 2023. https://arxiv.org/abs/2209.07753. LLMs generate Python code that composes perception and control primitives.

  17. [19] Y. Lee, J. J. Lim, A. Anandkumar, and Y. Zhu. Adversarial skill chaining for long-horizon robot manipulation via terminal state regularization. In Conference on Robot Learning (CoRL), 2021. https://arxiv.org/abs/2111.07999. T-STAR: terminal-state regularization for skill chaining; closest neighbor on hand-off-state mismatch.

  18. [20] K. Pertsch, Y. Lee, and J. J. Lim. Accelerating reinforcement learning with learned skill priors. In CoRL, 2020. https://arxiv.org/abs/2010.11944. SPiRL: learns a skill embedding + prior from offline data; foundational skill-prior work.

  19. [21] L. X. Shi, J. J. Lim, and Y. Lee. Skill-based model-based reinforcement learning. In CoRL, 2022. https://arxiv.org/abs/2207.07560. SkiMo: skill dynamics model + skill repertoire; 5x more sample-efficient than SPiRL.

  20. [22] Z. Feng, H. Luan, K. Y. Ma, and H. Soh. Diffusion meets options: Hierarchical generative skill composition for temporally-extended tasks. arXiv preprint arXiv:2410.02389, 2024. https://arxiv.org/abs/2410.02389. DOPPLER: LTL-specified planning + HRL + diffusion options; navigation and manipulation.

  21. [23] C. L. Shek and P. Tokekar. Option discovery using LLM-guided semantic hierarchical reinforcement learning. arXiv preprint arXiv:2503.19007, 2025. https://arxiv.org/abs/2503.19007. LDSC: LLM subgoal selection + option reuse; outperforms baselines by 55.9%.

  22. [24–25] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fang, I. S. Wang, N. Yokoyama, D. Sadigh, S. Levine, J. Wu, and C. Finn. Evaluating real-world robot manipulation policies in simulation. In Conference on Robot Learning (CoRL), 2024. https://arxiv.org/abs/2405.05941

  23. [26] P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Prasad, O. Mees, H. Walke, J. Fu, S. Belkhale, et al. RoboArena: Distributed real-world evaluation of generalist robot policies. arXiv preprint arXiv:2506.18123, 2025. https://arxiv.org/abs/2506.18123

  24. [27–28] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023. https://arxiv.org/abs/2303.04137

  25. [29] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023. https://arxiv.org/abs/2304.13705. ACT: Action Chunking Transformer; canonical bimanual imitation-learning baseline.

  26. [30] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (PNAS), 114(13):3521–3526, 2017. doi:10.1073/pnas.1611835114

  27. [31] J. A. Mendez, M. Hussing, M. Gummadi, and E. Eaton. CompoSuite: A compositional reinforcement learning benchmark. In CoLLAs, 2022. https://arxiv.org/abs/2207.04136. 256 tasks = robot × obstacle × object × objective; canonical compositional RL benchmark.

  28. [32–33] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In NeurIPS Datasets and Benchmarks, 2023. https://arxiv.org/abs/2306.03310. 130 language-conditioned manipulation tasks; 4 suites including LIBERO-Long for skill chaining.

  29. [34] X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025. https://arxiv.org/abs/2510.03827. Extended LIBERO eval across objects/init-states/instructions/environments; SOTA fails near-completely und…

  30. [35–36] S. Haresh, D. Dijkman, A. Bhattacharyya, and R. Memisevic. ClevrSkills: Compositional language and visual reasoning in robotics. In NeurIPS Datasets and Benchmarks Track, 2024. https://arxiv.org/abs/2411.09052. 33 tasks over 3 compositional levels (L0/L1/L2) on ManiSkill2; even pretrained VLMs fail on L1/L2.

  31. [37] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 2022. https://arxiv.org/abs/2112.03227. Long-horizon language-conditioned benchmark; chains of up to 5 sub-goals.

  32. [38] Y. Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, S. Nasiriany, and Y. Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020. https://arxiv.org/abs/2009.12293. Standard manipulation benchmark used in this paper.

  33. [39] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018. https://arxiv.org/abs/1812.05905

  34. [40] R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2108.13264

  35. [41] A. Kumar, A. Raghunathan, R. Jones, T. Ma, and P. Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations (ICLR), 2022. https://arxiv.org/abs/2202.10054

  36. [42] S. Cheng and D. Xu. LEAGUE: Guided skill learning and abstraction for long-horizon manipulation. IEEE Robotics and Automation Letters, 2023. https://arxiv.org/abs/2210.12631

  37. [43] Y. Zhu, P. Stone, and Y. Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 2022.

  38. [44] Z. Chen, Z. Gao, J. Huo, and T. Ji. SCaR: Refining skill chaining for long-horizon robotic manipulation via dual regularization. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  39. [45] Y. Wang, Y. Zhang, M. Huo, R. Tian, X. Zhang, Y. Xie, C. Xu, P. Ji, W. Zhan, M. Ding, and M. Tomizuka. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. In Conference on Robot Learning (CoRL), 2024.

  40. [46] G. M. van de Ven, T. Tuytelaars, and A. S. Tolias. Three types of incremental learning. Nature Machine Intelligence, 4:1185–1197, 2022. doi:10.1038/s42256-022-00568-3

  41. [47] Y. Ding, Y. Liu, Y. Wang, and H. Wang. Evaluating forgetting in pretrained robotic policy networks: A continual learning study with Octo. In DICTA, 2025.

  42. [48] M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith, and L. Schmidt. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning (ICML), 2022. https://arxiv.org/abs/2203.05482

  43. [49] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations (ICLR), 2023. https://arxiv.org/abs/2212.04089
    G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic. InInternational Conference on Learning Representations (ICLR), 2023. URLhttps://arxiv.org/abs/2212.04089. A Hybrid Selector Pseudocode Algorithm 1Hybrid Skill-Update Selector Require:Old ECMc p, candidatec a, atomic probes...