pith. machine review for the scientific record.

arxiv: 2605.02867 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · cs.RO

Recognition: 3 Lean theorem links

Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO
keywords reinforcement learning · SHAP · generalizability · robotics · hyperparameters · algorithm selection · explainable AI · configuration optimization

The pith

SHAP analysis of RL algorithms and hyperparameters reveals consistent patterns that improve generalization across robotic environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning agents in robotics often fail to generalize because performance depends heavily on the chosen algorithm and its hyperparameters. The paper applies SHAP to measure exactly how much each configuration choice contributes to the gap between training and new environments. It shows that these contributions form stable patterns that hold across different tasks, allowing the authors to select better configurations in advance. This leads to agents that perform more reliably when deployed in unseen robotic settings without exhaustive re-testing. The approach turns explainability into a practical tool for reducing sensitivity to configuration decisions.
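As a concrete illustration of that workflow, the sketch below runs the whole loop on a toy problem: sample configurations, score each with a stand-in generalization gap, and pick the configuration a crude per-value surrogate favors. Everything here (the configuration space, the gap function, the mean-difference surrogate) is invented for illustration; it is not the paper's code, and a real study would train agents and use a SHAP explainer instead of per-value means.

```python
import random

# Hypothetical configuration space; names mirror common RL hyperparameters.
SPACE = {
    "algorithm": ["PPO", "SAC", "DDPG", "TD3"],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "gamma": [0.90, 0.95, 0.99],
}

def sample_config(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def generalization_gap(cfg):
    """Stand-in for (source return - target return); a real study trains in
    the source environment and evaluates in a distinct target environment."""
    gap = 100.0
    gap -= 30.0 * (cfg["gamma"] < 0.99)          # toy effect: lower gamma transfers better
    gap -= 20.0 * (cfg["learning_rate"] > 1e-4)  # toy effect: larger steps transfer better
    gap += 10.0 * (cfg["algorithm"] == "SAC")    # toy penalty for one algorithm
    return gap

rng = random.Random(0)
runs = [(c, generalization_gap(c)) for c in (sample_config(rng) for _ in range(50))]

# Crude surrogate: mean observed gap per configuration value, the kind of
# additive signal a SHAP explainer would refine and rank.
def mean_gap(key, value):
    gaps = [g for c, g in runs if c[key] == value]
    return sum(gaps) / len(gaps)

# Pick, per dimension, the value with the lowest mean gap.
best = {k: min(vs, key=lambda v, k=k: mean_gap(k, v)) for k, vs in SPACE.items()}
print(best)
```

With the toy effects above, the surrogate recovers the planted preferences (low gamma, higher learning rate) without exhaustively testing every combination, which is the practical point of the framework.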

Core claim

The authors establish a theoretical link between Shapley values and generalizability, then use SHAP to empirically decompose the effects of algorithms and hyperparameters on RL performance, identify consistent impact patterns across tasks and environments, and demonstrate that selecting configurations according to these patterns yields improved generalization in robotic domains.

What carries the argument

SHAP-guided configuration selection, which quantifies the additive contribution of each algorithm and hyperparameter to generalization performance via Shapley values and applies the resulting rankings to choose robust settings.
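The additive decomposition SHAP approximates is the exact Shapley value. A self-contained toy, with an invented two-choice "performance" function standing in for measured RL returns, shows how an interaction effect gets split between configuration choices:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution
    value(S | {i}) - value(S), averaged over all coalitions S with the
    standard combinatorial weights."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Invented "performance" of a subset of configuration choices, with an
# explicit interaction term of the kind the paper's interaction plots show.
def perf(subset):
    score = 0.0
    if "low_gamma" in subset:
        score += 3.0
    if "high_lr" in subset:
        score += 1.0
    if {"low_gamma", "high_lr"} <= subset:
        score += 2.0  # joint bonus only when both choices are made
    return score

phi = shapley_values(["low_gamma", "high_lr"], perf)
print(phi)  # prints {'low_gamma': 4.0, 'high_lr': 2.0}
```

The interaction bonus of 2.0 is split evenly between the two choices, and the values sum to the full performance, which is the efficiency property the additive decomposition relies on.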

If this is right

  • SHAP-derived rankings can be used to pre-select algorithms and hyperparameters that reduce the generalization gap in new robotic environments.
  • Impact patterns remain consistent enough across diverse tasks to transfer configuration guidance without per-task re-analysis.
  • Practitioners receive concrete, data-driven rules for choosing among common RL algorithms and their hyperparameters instead of relying on trial-and-error.
  • The framework supplies both empirical evidence and a theoretical basis for treating configuration choice as an explainable component of RL generalizability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same SHAP decomposition could be applied to RL domains outside robotics, such as navigation or manipulation in simulated warehouses, to test whether the consistency of patterns generalizes.
  • If the patterns prove stable, the method could be turned into an automated recommender that suggests configurations for a new environment after only a small number of runs.
  • Combining the SHAP approach with other attribution techniques might isolate whether the stability arises from algorithm properties or from the structure of the robotic state spaces.

Load-bearing premise

The impact patterns derived from SHAP remain stable enough across varied tasks and environments that they can guide configuration choices without needing fresh validation in the target setting.
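One way to probe this premise is to check whether per-task attribution rankings agree. The sketch below computes a Spearman rank correlation between hypothetical mean-|SHAP| values on two tasks; the numbers are invented, and the premise would only survive if such correlations stayed high across many task pairs:

```python
def spearman(xs, ys):
    """Spearman rank correlation for lists of distinct values (no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical mean |SHAP| per hyperparameter on two tasks.
hyperparams = ["learning_rate", "gamma", "tau", "buffer_size"]
task_a = [0.41, 0.35, 0.12, 0.08]
task_b = [0.38, 0.30, 0.10, 0.11]

rho = spearman(task_a, task_b)
print(round(rho, 2))  # prints 0.8
```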

What would settle it

Finding a new robotic task or environment in which the SHAP-recommended configurations produce larger generalization gaps than randomly chosen or default configurations would show the patterns do not reliably support better selection.

Figures

Figures reproduced from arXiv: 2605.02867 by Cong Yang, Lingxiao Kong, Oya Deniz Beyan, Zeyd Boukhers.

Figure 1
Figure 1. Our work aims to analyze and guide cross-environment generalization of RL models through bidirectional transfer experiments: MuJoCo ↔ PyBullet.
Figure 2
Figure 2. SHAP-based configuration framework: (1) sample algorithm and hyperparameter configurations, (2) train RL models in source environment, (3) evaluate generalization gap across environments, (4) train surrogate SHAP explainer, and (5) analyze impact patterns and select optimal configurations.
Figure 3
Figure 3. Main impact patterns across all algorithms and hyperparameters; lower (leftward) SHAP values indicate better generalizability.
Figure 4
Figure 4. Feature interactions between hyperparameters in all four algorithms, where darker blue indicates stronger beneficial interactions.
Figure 5
Figure 5. Feature interaction between learning rate and gamma in DDPG algorithm.
Figure 6
Figure 6. SHAP dependence of learning rate and gamma across tasks and environments.
Figure 7
Figure 7. Best and worst configurations identified by SHAP analysis.
read the original abstract

Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection. To address this limitation, we propose an explainable framework that evaluates RL performance across robotic environments using SHapley Additive exPlanations (SHAP) to quantify configuration impacts. We establish a theoretical foundation connecting Shapley values to generalizability, empirically analyze configuration impact patterns, and introduce SHAP-guided configuration selection to enhance generalization. Our results reveal distinct patterns across algorithms and hyperparameters, with consistent configuration impacts across diverse tasks and environments. By applying these insights to configuration selection, we achieve improved RL generalizability and provide actionable guidance for practitioners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes an explainable framework applying SHAP to quantify the impacts of RL algorithms and hyperparameters on performance across robotic environments. It claims to establish a theoretical connection between Shapley values and generalizability, empirically identify consistent configuration impact patterns across tasks, and demonstrate that SHAP-guided configuration selection improves RL generalization while providing practitioner guidance.

Significance. If the observed patterns prove stable and the selection method transfers without per-domain re-analysis, the work could supply a systematic, interpretable alternative to ad-hoc tuning for reducing generalization gaps in robotic RL, with direct practical value for deployment.

major comments (3)
  1. [Abstract] The central claim that SHAP-guided selection 'achieve[s] improved RL generalizability' rests on the unshown assertion of stable, transferable impact patterns; the manuscript must supply the experimental design (environments, algorithms, hyperparameter ranges, generalizability metric, and cross-environment validation protocol) to substantiate this.
  2. [Abstract] The stated 'theoretical foundation connecting Shapley values to generalizability' is load-bearing for the framework but is not derived or axiomatized here; if generalizability is ultimately measured by the same fitted performance quantities used to compute the SHAP values, the reasoning risks circularity and requires explicit non-circular justification in the main text.
  3. [Abstract] The skeptic's concern is material: without evidence that SHAP-derived recommendations remain effective on held-out target domains (rather than only on the studied environments), the transfer claim cannot be accepted as load-bearing support for the selection method.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the specific revisions we will implement to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that SHAP-guided selection 'achieve[s] improved RL generalizability' rests on the unshown assertion of stable, transferable impact patterns; the manuscript must supply the experimental design (environments, algorithms, hyperparameter ranges, generalizability metric, and cross-environment validation protocol) to substantiate this.

    Authors: We agree that the abstract is overly concise and does not adequately detail the experimental setup supporting the central claim. In the revised manuscript, we will expand the abstract to explicitly summarize the experimental design, including the specific robotic environments (MuJoCo locomotion tasks), RL algorithms evaluated, hyperparameter ranges, the generalizability metric (performance on unseen environments), and the cross-environment validation protocol (e.g., leave-one-out across environments). These elements are already described in Sections 3 and 4 but will be condensed into the abstract for substantiation. revision: yes

  2. Referee: [Abstract] The stated 'theoretical foundation connecting Shapley values to generalizability' is load-bearing for the framework but is not derived or axiomatized here; if generalizability is ultimately measured by the same fitted performance quantities used to compute the SHAP values, the reasoning risks circularity and requires explicit non-circular justification in the main text.

    Authors: We appreciate this observation on the theoretical component. We will introduce a dedicated subsection deriving the connection between Shapley values and generalizability. To address circularity concerns, the revision will explicitly separate the computation: SHAP values are derived from performance in source environments, while generalizability is evaluated on distinct target environments. This provides a non-circular justification, supported by formal reasoning that the impact patterns inform selection for transfer rather than merely reflecting fitted performance. revision: yes

  3. Referee: [Abstract] The skeptic's concern is material: without evidence that SHAP-derived recommendations remain effective on held-out target domains (rather than only on the studied environments), the transfer claim cannot be accepted as load-bearing support for the selection method.

    Authors: We recognize the validity of requiring direct evidence on held-out domains. While our current cross-environment validation provides supporting patterns, we will add new experiments using completely held-out target robotic environments excluded from the SHAP analysis phase. The revised results will report the generalization performance of SHAP-guided selections on these unseen domains, thereby strengthening the transfer claim with explicit empirical validation. revision: yes
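The held-out protocol the rebuttal promises can be sketched in a few lines: configurations are ranked on source environments only, and the winner is scored on an environment that never entered the selection. The environment names and gap values below are hypothetical:

```python
# Hypothetical per-environment generalization gaps (lower is better) for
# three candidate configurations.
gaps = {
    "cfg_low_gamma":  {"HalfCheetah": 12.0, "Hopper": 15.0, "Walker2d": 14.0},
    "cfg_default":    {"HalfCheetah": 30.0, "Hopper": 28.0, "Walker2d": 33.0},
    "cfg_high_gamma": {"HalfCheetah": 45.0, "Hopper": 41.0, "Walker2d": 47.0},
}

def leave_one_out(gaps, held_out):
    """Rank configurations on source environments only, then report the
    winner's gap on the held-out environment, so the evaluation never
    touches the data that drove the selection."""
    def source_score(cfg):
        src = [g for env, g in gaps[cfg].items() if env != held_out]
        return sum(src) / len(src)
    pick = min(gaps, key=source_score)
    return pick, gaps[pick][held_out]

for env in ["HalfCheetah", "Hopper", "Walker2d"]:
    print(env, leave_one_out(gaps, env))
```

Repeating this for every environment in turn is the leave-one-out variant; a stronger version, as the rebuttal proposes, would hold out environments excluded from the SHAP analysis phase entirely.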

Circularity Check

0 steps flagged

No circularity: empirical SHAP analysis and configuration selection remain independent of input definitions.

full rationale

The paper's core chain consists of (1) running RL agents across robotic environments with varied algorithms/hyperparameters, (2) computing SHAP values on the resulting performance metrics to attribute impacts, (3) observing empirical patterns of consistency, and (4) using those patterns for downstream configuration selection. None of these steps reduce by construction to the inputs: SHAP attributions are computed from held-out performance data rather than being redefined as generalizability, the consistency claim is an empirical observation rather than a fitted prediction, and no self-citation or uniqueness theorem is invoked to force the framework. The theoretical link between Shapley values and generalizability is presented as a foundation for interpretation but does not substitute for the measured outcomes. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly relies on standard RL assumptions (Markov property, reward definitions) and the correctness of the SHAP implementation, none of which are audited here.

pith-pipeline@v0.9.0 · 5453 in / 1078 out tokens · 26973 ms · 2026-05-08T19:26:10.816250+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 17 canonical work pages · 1 internal anchor

  1. Adler, P., Falk, C., Friedler, S.A., Nix, T., Rybeck, G., Scheidegger, C., Smith, B., Venkatasubramanian, S.: Auditing black-box models for indirect influence. Knowl. Inf. Syst. 54(1), 95–122 (2018). https://doi.org/10.1007/s10115-017-1116-3
  2. Beechey, D., Smith, T.M.S., Simsek, Ö.: Explaining reinforcement learning with Shapley values. In: ICML. Proceedings of Machine Learning Research, vol. 202, pp. 2003–2014. PMLR (2023). https://proceedings.mlr.press/v202/beechey23a.html
  3. Bian, K., Priyadarshi, R.: Machine learning optimization techniques: a survey, classification, challenges, and future research issues. Archives of Computational Methods in Engineering 31(7), 4209–4233 (2024)
  4. Che, Z., Purushotham, S., Khemani, R.G., Liu, Y.: Interpretable deep models for ICU outcome prediction. In: AMIA. AMIA (2016). https://knowledge.amia.org/amia-63300-1.3360278/t004-1.3364525/f004-1.3364526/2500209-1.3364981/2493688-1.3364976
  5. Cobbe, K., Klimov, O., Hesse, C., Kim, T., Schulman, J.: Quantifying generalization in reinforcement learning. In: ICML. Proceedings of Machine Learning Research, vol. 97, pp. 1282–1289. PMLR (2019). http://proceedings.mlr.press/v97/cobbe19a.html
  6. Coumans, E., Bai, Y.: PyBullet Quickstart Guide. https://docs.google.com/document/u/1/d (2021)
  7. Ellenberger, B.: PyBullet Gymperium. https://github.com/benelot/pybullet-gym (2018–2019)
  8. Engelhardt, R.C., Lange, M., Wiskott, L., Konen, W.: Exploring the reliability of SHAP values in reinforcement learning. In: xAI (3). Communications in Computer and Information Science, vol. 2155, pp. 165–184. Springer (2024). https://doi.org/10.1007/978-3-031-63800-8_9
  9. Höfer, S., Bekris, K.E., Handa, A., Gamboa, J.C., Mozifian, M., Golemo, F., Atkeson, C.G., Fox, D., Goldberg, K., Leonard, J., Liu, C.K., Peters, J., Song, S., Welinder, P., White, M.: Sim2real in robotics and automation: applications and challenges. IEEE Trans. Autom. Sci. Eng. 18(2), 398–400 (2021). https://doi.org/10.1109/TASE.2021.3064065
  10. Katlav, M., Tabar, M.E., Turk, K.: AI-guided design framework for bond behavior of steel-concrete in steel reinforced concrete composites: from dataset cleaning to feature engineering. Materials Today Communications 42, 111286 (2025)
  11. Kaup, M., Wolff, C., Hwang, H., Mayer, J., Bruni, E.: A review of nine physics engines for reinforcement learning research. CoRR abs/2407.08590 (2024). https://doi.org/10.48550/arXiv.2407.08590
  12. Kong, L., Ramdan, Q., Zoubia, O., Polash, J.H., Elwes, M., Gurabi, M.A., Jin, L., Kutafina, E., Matzutt, R., Wang, Y., Xu, J., Beyan, O.D., Yang, C., Boukhers, Z.: Reinforcement learning for large language model fine-tuning: a systematic literature review (2025)
  13. Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., Hutter, M.: Learning agile and dynamic motor skills for legged robots. Sci. Robotics 4(26) (2019). https://doi.org/10.1126/scirobotics.aau5872
  14. Lundberg, S.M., Lee, S.: A unified approach to interpreting model predictions. In: NIPS, pp. 4765–4774 (2017)
  15. Luo, F., Xu, T., Lai, H., Chen, X., Zhang, W., Yu, Y.: A survey on model-based reinforcement learning. Sci. China Inf. Sci. 67(2) (2024). https://doi.org/10.1007/s11432-022-3696-5
  16. Milani, S., Topin, N., Veloso, M., Fang, F.: Explainable reinforcement learning: a survey and comparative review. ACM Comput. Surv. 56(7), 168:1–168:36 (2024). https://doi.org/10.1145/3616864
  17. Onyekpe, U., Lu, Y., Apostolopoulou, E., Palade, V., Eyo, E.U., Kanarachos, S.: Explainable machine learning for autonomous vehicle positioning using SHAP. In: Explainable AI: Foundations, Methodologies and Applications, pp. 157–183. Springer (2022)
  18. Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., Song, D.: Assessing generalization in deep reinforcement learning. CoRR abs/1810.12282 (2018). http://arxiv.org/abs/1810.12282
  19. Pitkevich, A., Makarov, I.: A survey on sim-to-real transfer methods for robotic manipulation. In: SISY, pp. 259–266. IEEE (2024). https://doi.org/10.1109/SISY62279.2024.10737545
  20. Raffin, A.: RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo (2020)
  21. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stable-Baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22, 268:1–268:8 (2021). https://jmlr.org/papers/v22/20-1364.html
  22. Remman, S.B., Lekkas, A.M.: Robotic lever manipulation using hindsight experience replay and Shapley additive explanations. In: ECC, pp. 586–593. IEEE (2021). https://doi.org/10.23919/ECC54610.2021.9654850
  23. Smith, L.N.: A disciplined approach to neural network hyper-parameters: part 1 - learning rate, batch size, momentum, and weight decay. CoRR abs/1803.09820 (2018). http://arxiv.org/abs/1803.09820
  24. Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: IROS, pp. 5026–5033. IEEE (2012). https://doi.org/10.1109/IROS.2012.6386109
  25. Towers, M., Kwiatkowski, A., Terry, J.K., Balis, J.U., Cola, G.D., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J.J., Tan, H., Younis, O.G.: Gymnasium: a standard interface for reinforcement learning environments. CoRR abs/2407.17032 (2024). https://doi.org/10.48550/arXiv.2407.17032
  26. Wagenmaker, A., Huang, K., Ke, L., Jamieson, K., Gupta, A.: Overcoming the sim-to-real gap: leveraging simulation to learn to explore for real-world RL. In: NeurIPS (2024). http://papers.nips.cc/paper_files/paper/2024/hash/8fa068ffe59817175d176bd75641fe16-Abstract-Conference.html
  27. Yang, X., Ji, Z., Wu, J., Lai, Y.: An open-source multi-goal reinforcement learning environment for robotic manipulation with PyBullet. In: TAROS. Lecture Notes in Computer Science, vol. 13054, pp. 14–24. Springer (2021). https://doi.org/10.1007/978-3-030-89177-0_2
  28. Zhang, J., Bao, B., Wang, C., Zhu, F.: Shapley value-driven multimodal deep reinforcement learning for complex decision-making. Neural Networks 191, 107650 (2025). https://doi.org/10.1016/j.neunet.2025.107650
  29. Zhang, K., Zhang, J.J., Xu, P., Gao, T., Gao, D.W.: Explainable AI in deep reinforcement learning models for power system emergency control. IEEE Trans. Comput. Soc. Syst. 9(2), 419–427 (2022). https://doi.org/10.1109/TCSS.2021.3096824
  30. Zhao, W., Queralta, J.P., Westerlund, T.: Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In: SSCI, pp. 737–. IEEE (2020). https://doi.org/10.1109/SSCI47803.2020.9308468