pith. machine review for the scientific record.

arxiv: 2605.02867 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · cs.RO

Recognition: 3 Lean theorem links

Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO
keywords reinforcement learning · SHAP · generalizability · robotics · hyperparameters · algorithm selection · explainable AI · configuration optimization

The pith

SHAP analysis of RL algorithms and hyperparameters reveals consistent patterns that improve generalization across robotic environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning agents in robotics often fail to generalize because performance depends heavily on the chosen algorithm and its hyperparameters. The paper applies SHAP to measure exactly how much each configuration choice contributes to the gap between training and new environments. It shows that these contributions form stable patterns that hold across different tasks, allowing the authors to select better configurations in advance. This leads to agents that perform more reliably when deployed in unseen robotic settings without exhaustive re-testing. The approach turns explainability into a practical tool for reducing sensitivity to configuration decisions.
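As a concrete illustration of that workflow, the sketch below runs the whole loop on a toy problem: sample configurations, score each with a stand-in generalization gap, and pick the configuration a crude per-value surrogate favors. Everything here (the configuration space, the gap function, the mean-difference surrogate) is invented for illustration; it is not the paper's code, and a real study would train agents and use a SHAP explainer instead of per-value means.

```python
import random

# Hypothetical configuration space; names mirror common RL hyperparameters.
SPACE = {
    "algorithm": ["PPO", "SAC", "DDPG", "TD3"],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "gamma": [0.90, 0.95, 0.99],
}

def sample_config(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def generalization_gap(cfg):
    """Stand-in for (source return - target return); a real study trains in
    the source environment and evaluates in a distinct target environment."""
    gap = 100.0
    gap -= 30.0 * (cfg["gamma"] < 0.99)          # toy effect: lower gamma transfers better
    gap -= 20.0 * (cfg["learning_rate"] > 1e-4)  # toy effect: larger steps transfer better
    gap += 10.0 * (cfg["algorithm"] == "SAC")    # toy penalty for one algorithm
    return gap

rng = random.Random(0)
runs = [(c, generalization_gap(c)) for c in (sample_config(rng) for _ in range(50))]

# Crude surrogate: mean observed gap per configuration value, the kind of
# additive signal a SHAP explainer would refine and rank.
def mean_gap(key, value):
    gaps = [g for c, g in runs if c[key] == value]
    return sum(gaps) / len(gaps)

# Pick, per dimension, the value with the lowest mean gap.
best = {k: min(vs, key=lambda v, k=k: mean_gap(k, v)) for k, vs in SPACE.items()}
print(best)
```

With the toy effects above, the surrogate recovers the planted preferences (low gamma, higher learning rate) without exhaustively testing every combination, which is the practical point of the framework.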

Core claim

The authors establish a theoretical link between Shapley values and generalizability, then use SHAP to empirically decompose the effects of algorithms and hyperparameters on RL performance, identify consistent impact patterns across tasks and environments, and demonstrate that selecting configurations according to these patterns yields improved generalization in robotic domains.

What carries the argument

SHAP-guided configuration selection, which quantifies the additive contribution of each algorithm and hyperparameter to generalization performance via Shapley values and applies the resulting rankings to choose robust settings.
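The additive decomposition SHAP approximates is the exact Shapley value. A self-contained toy, with an invented two-choice "performance" function standing in for measured RL returns, shows how an interaction effect gets split between configuration choices:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution
    value(S | {i}) - value(S), averaged over all coalitions S with the
    standard combinatorial weights."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Invented "performance" of a subset of configuration choices, with an
# explicit interaction term of the kind the paper's interaction plots show.
def perf(subset):
    score = 0.0
    if "low_gamma" in subset:
        score += 3.0
    if "high_lr" in subset:
        score += 1.0
    if {"low_gamma", "high_lr"} <= subset:
        score += 2.0  # joint bonus only when both choices are made
    return score

phi = shapley_values(["low_gamma", "high_lr"], perf)
print(phi)  # prints {'low_gamma': 4.0, 'high_lr': 2.0}
```

The interaction bonus of 2.0 is split evenly between the two choices, and the values sum to the full performance, which is the efficiency property the additive decomposition relies on.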

If this is right

  • SHAP-derived rankings can be used to pre-select algorithms and hyperparameters that reduce the generalization gap in new robotic environments.
  • Impact patterns remain consistent enough across diverse tasks to transfer configuration guidance without per-task re-analysis.
  • Practitioners receive concrete, data-driven rules for choosing among common RL algorithms and their hyperparameters instead of relying on trial-and-error.
  • The framework supplies both empirical evidence and a theoretical basis for treating configuration choice as an explainable component of RL generalizability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same SHAP decomposition could be applied to RL domains outside robotics, such as navigation or manipulation in simulated warehouses, to test whether the consistency of patterns generalizes.
  • If the patterns prove stable, the method could be turned into an automated recommender that suggests configurations for a new environment after only a small number of runs.
  • Combining the SHAP approach with other attribution techniques might isolate whether the stability arises from algorithm properties or from the structure of the robotic state spaces.

Load-bearing premise

The impact patterns derived from SHAP remain stable enough across varied tasks and environments that they can guide configuration choices without needing fresh validation in the target setting.
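One way to probe this premise is to check whether per-task attribution rankings agree. The sketch below computes a Spearman rank correlation between hypothetical mean-|SHAP| values on two tasks; the numbers are invented, and the premise would only survive if such correlations stayed high across many task pairs:

```python
def spearman(xs, ys):
    """Spearman rank correlation for lists of distinct values (no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical mean |SHAP| per hyperparameter on two tasks.
hyperparams = ["learning_rate", "gamma", "tau", "buffer_size"]
task_a = [0.41, 0.35, 0.12, 0.08]
task_b = [0.38, 0.30, 0.10, 0.11]

rho = spearman(task_a, task_b)
print(round(rho, 2))  # prints 0.8
```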

What would settle it

Finding a new robotic task or environment in which the SHAP-recommended configurations produce larger generalization gaps than randomly chosen or default configurations would show the patterns do not reliably support better selection.

Figures

Figures reproduced from arXiv: 2605.02867 by Cong Yang, Lingxiao Kong, Oya Deniz Beyan, Zeyd Boukhers.

Figure 1
Figure 1. Our work aims to analyze and guide cross-environment generalization of RL models through bidirectional transfer experiments: MuJoCo ↔ PyBullet.
Figure 2
Figure 2. SHAP-based configuration framework: (1) sample algorithm and hyperparameter configurations, (2) train RL models in source environment, (3) evaluate generalization gap across environments, (4) train surrogate SHAP explainer, and (5) analyze impact patterns and select optimal configurations.
Figure 3
Figure 3. Main impact patterns across all algorithms and hyperparameters; lower (leftward) SHAP values indicate better generalizability.
Figure 4
Figure 4. Feature interactions between hyperparameters in all four algorithms, where darker blue indicates stronger beneficial interactions.
Figure 5
Figure 5. Feature interaction between learning rate and gamma in DDPG algorithm.
Figure 6
Figure 6. SHAP dependence of learning rate and gamma across tasks and environments.
Figure 7
Figure 7. Best and worst configurations identified by SHAP analysis.
read the original abstract

Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection. To address this limitation, we propose an explainable framework that evaluates RL performance across robotic environments using SHapley Additive exPlanations (SHAP) to quantify configuration impacts. We establish a theoretical foundation connecting Shapley values to generalizability, empirically analyze configuration impact patterns, and introduce SHAP-guided configuration selection to enhance generalization. Our results reveal distinct patterns across algorithms and hyperparameters, with consistent configuration impacts across diverse tasks and environments. By applying these insights to configuration selection, we achieve improved RL generalizability and provide actionable guidance for practitioners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes an explainable framework applying SHAP to quantify the impacts of RL algorithms and hyperparameters on performance across robotic environments. It claims to establish a theoretical connection between Shapley values and generalizability, empirically identify consistent configuration impact patterns across tasks, and demonstrate that SHAP-guided configuration selection improves RL generalization while providing practitioner guidance.

Significance. If the observed patterns prove stable and the selection method transfers without per-domain re-analysis, the work could supply a systematic, interpretable alternative to ad-hoc tuning for reducing generalization gaps in robotic RL, with direct practical value for deployment.

major comments (3)
  1. [Abstract] The central claim that SHAP-guided selection 'achieve[s] improved RL generalizability' rests on the unshown assertion of stable, transferable impact patterns; the manuscript must supply the experimental design (environments, algorithms, hyperparameter ranges, generalizability metric, and cross-environment validation protocol) to substantiate this.
  2. [Abstract] The stated 'theoretical foundation connecting Shapley values to generalizability' is load-bearing for the framework but is not derived or axiomatized here; if generalizability is ultimately measured by the same fitted performance quantities used to compute the SHAP values, the reasoning risks circularity and requires explicit non-circular justification in the main text.
  3. [Abstract] The skeptic's concern is material: without evidence that SHAP-derived recommendations remain effective on held-out target domains (rather than only on the studied environments), the transfer claim cannot be accepted as load-bearing support for the selection method.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the specific revisions we will implement to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that SHAP-guided selection 'achieve[s] improved RL generalizability' rests on the unshown assertion of stable, transferable impact patterns; the manuscript must supply the experimental design (environments, algorithms, hyperparameter ranges, generalizability metric, and cross-environment validation protocol) to substantiate this.

    Authors: We agree that the abstract is overly concise and does not adequately detail the experimental setup supporting the central claim. In the revised manuscript, we will expand the abstract to explicitly summarize the experimental design, including the specific robotic environments (MuJoCo locomotion tasks), RL algorithms evaluated, hyperparameter ranges, the generalizability metric (performance on unseen environments), and the cross-environment validation protocol (e.g., leave-one-out across environments). These elements are already described in Sections 3 and 4 but will be condensed into the abstract for substantiation. revision: yes

  2. Referee: [Abstract] The stated 'theoretical foundation connecting Shapley values to generalizability' is load-bearing for the framework but is not derived or axiomatized here; if generalizability is ultimately measured by the same fitted performance quantities used to compute the SHAP values, the reasoning risks circularity and requires explicit non-circular justification in the main text.

    Authors: We appreciate this observation on the theoretical component. We will introduce a dedicated subsection deriving the connection between Shapley values and generalizability. To address circularity concerns, the revision will explicitly separate the computation: SHAP values are derived from performance in source environments, while generalizability is evaluated on distinct target environments. This provides a non-circular justification, supported by formal reasoning that the impact patterns inform selection for transfer rather than merely reflecting fitted performance. revision: yes

  3. Referee: [Abstract] The skeptic's concern is material: without evidence that SHAP-derived recommendations remain effective on held-out target domains (rather than only on the studied environments), the transfer claim cannot be accepted as load-bearing support for the selection method.

    Authors: We recognize the validity of requiring direct evidence on held-out domains. While our current cross-environment validation provides supporting patterns, we will add new experiments using completely held-out target robotic environments excluded from the SHAP analysis phase. The revised results will report the generalization performance of SHAP-guided selections on these unseen domains, thereby strengthening the transfer claim with explicit empirical validation. revision: yes
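The held-out protocol the rebuttal promises can be sketched in a few lines: configurations are ranked on source environments only, and the winner is scored on an environment that never entered the selection. The environment names and gap values below are hypothetical:

```python
# Hypothetical per-environment generalization gaps (lower is better) for
# three candidate configurations.
gaps = {
    "cfg_low_gamma":  {"HalfCheetah": 12.0, "Hopper": 15.0, "Walker2d": 14.0},
    "cfg_default":    {"HalfCheetah": 30.0, "Hopper": 28.0, "Walker2d": 33.0},
    "cfg_high_gamma": {"HalfCheetah": 45.0, "Hopper": 41.0, "Walker2d": 47.0},
}

def leave_one_out(gaps, held_out):
    """Rank configurations on source environments only, then report the
    winner's gap on the held-out environment, so the evaluation never
    touches the data that drove the selection."""
    def source_score(cfg):
        src = [g for env, g in gaps[cfg].items() if env != held_out]
        return sum(src) / len(src)
    pick = min(gaps, key=source_score)
    return pick, gaps[pick][held_out]

for env in ["HalfCheetah", "Hopper", "Walker2d"]:
    print(env, leave_one_out(gaps, env))
```

Repeating this for every environment in turn is the leave-one-out variant; a stronger version, as the rebuttal proposes, would hold out environments excluded from the SHAP analysis phase entirely.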

Circularity Check

0 steps flagged

No circularity: empirical SHAP analysis and configuration selection remain independent of input definitions.

full rationale

The paper's core chain consists of (1) running RL agents across robotic environments with varied algorithms/hyperparameters, (2) computing SHAP values on the resulting performance metrics to attribute impacts, (3) observing empirical patterns of consistency, and (4) using those patterns for downstream configuration selection. None of these steps reduce by construction to the inputs: SHAP attributions are computed from held-out performance data rather than being redefined as generalizability, the consistency claim is an empirical observation rather than a fitted prediction, and no self-citation or uniqueness theorem is invoked to force the framework. The theoretical link between Shapley values and generalizability is presented as a foundation for interpretation but does not substitute for the measured outcomes. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly relies on standard RL assumptions (Markov property, reward definitions) and the correctness of the SHAP implementation, none of which are audited here.

pith-pipeline@v0.9.0 · 5453 in / 1078 out tokens · 26973 ms · 2026-05-08T19:26:10.816250+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 17 canonical work pages · 1 internal anchor

  1. Adler, P., Falk, C., Friedler, S.A., Nix, T., Rybeck, G., Scheidegger, C., Smith, B., Venkatasubramanian, S.: Auditing black-box models for indirect influence. Knowl. Inf. Syst. 54(1), 95–122 (2018). https://doi.org/10.1007/s10115-017-1116-3
  2. Beechey, D., Smith, T.M.S., Simsek, Ö.: Explaining reinforcement learning with Shapley values. In: ICML. Proceedings of Machine Learning Research, vol. 202, pp. 2003–2014. PMLR (2023). https://proceedings.mlr.press/v202/beechey23a.html
  3. Bian, K., Priyadarshi, R.: Machine learning optimization techniques: a survey, classification, challenges, and future research issues. Archives of Computational Methods in Engineering 31(7), 4209–4233 (2024)
  4. Che, Z., Purushotham, S., Khemani, R.G., Liu, Y.: Interpretable deep models for ICU outcome prediction. In: AMIA. AMIA (2016). https://knowledge.amia.org/amia-63300-1.3360278/t004-1.3364525/f004-1.3364526/2500209-1.3364981/2493688-1.3364976
  5. Cobbe, K., Klimov, O., Hesse, C., Kim, T., Schulman, J.: Quantifying generalization in reinforcement learning. In: ICML. Proceedings of Machine Learning Research, vol. 97, pp. 1282–1289. PMLR (2019). http://proceedings.mlr.press/v97/cobbe19a.html
  6. Coumans, E., Bai, Y.: PyBullet Quickstart Guide. https://docs.google.com/document/u/1/d (2021)
  7. Ellenberger, B.: PyBullet Gymperium. https://github.com/benelot/pybullet-gym (2018–2019)
  8. Engelhardt, R.C., Lange, M., Wiskott, L., Konen, W.: Exploring the reliability of SHAP values in reinforcement learning. In: xAI (3). Communications in Computer and Information Science, vol. 2155, pp. 165–184. Springer (2024). https://doi.org/10.1007/978-3-031-63800-8_9
  9. Höfer, S., Bekris, K.E., Handa, A., Gamboa, J.C., Mozifian, M., Golemo, F., Atkeson, C.G., Fox, D., Goldberg, K., Leonard, J., Liu, C.K., Peters, J., Song, S., Welinder, P., White, M.: Sim2real in robotics and automation: applications and challenges. IEEE Trans. Autom. Sci. Eng. 18(2), 398–400 (2021). https://doi.org/10.1109/TASE.2021.3064065
  10. Katlav, M., Tabar, M.E., Turk, K.: AI-guided design framework for bond behavior of steel-concrete in steel reinforced concrete composites: from dataset cleaning to feature engineering. Materials Today Communications 42, 111286 (2025)
  11. Kaup, M., Wolff, C., Hwang, H., Mayer, J., Bruni, E.: A review of nine physics engines for reinforcement learning research. CoRR abs/2407.08590 (2024). https://doi.org/10.48550/arXiv.2407.08590
  12. Kong, L., Ramdan, Q., Zoubia, O., Polash, J.H., Elwes, M., Gurabi, M.A., Jin, L., Kutafina, E., Matzutt, R., Wang, Y., Xu, J., Beyan, O.D., Yang, C., Boukhers, Z.: Reinforcement learning for large language model fine-tuning: a systematic literature review (2025)
  13. Lee, J., Dosovitskiy, A., Bellicoso, D., Tsounis, V., Koltun, V., Hutter, M.: Learning agile and dynamic motor skills for legged robots. Sci. Robotics 4(26) (2019). https://doi.org/10.1126/scirobotics.aau5872
  14. Lundberg, S.M., Lee, S.: A unified approach to interpreting model predictions. In: NIPS, pp. 4765–4774 (2017)
  15. Luo, F., Xu, T., Lai, H., Chen, X., Zhang, W., Yu, Y.: A survey on model-based reinforcement learning. Sci. China Inf. Sci. 67(2) (2024). https://doi.org/10.1007/s11432-022-3696-5
  16. Milani, S., Topin, N., Veloso, M., Fang, F.: Explainable reinforcement learning: a survey and comparative review. ACM Comput. Surv. 56(7), 168:1–168:36 (2024). https://doi.org/10.1145/3616864
  17. Onyekpe, U., Lu, Y., Apostolopoulou, E., Palade, V., Eyo, E.U., Kanarachos, S.: Explainable machine learning for autonomous vehicle positioning using SHAP. In: Explainable AI: Foundations, Methodologies and Applications, pp. 157–183. Springer (2022)
  18. Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., Song, D.: Assessing generalization in deep reinforcement learning. CoRR abs/1810.12282 (2018). http://arxiv.org/abs/1810.12282
  19. Pitkevich, A., Makarov, I.: A survey on sim-to-real transfer methods for robotic manipulation. In: SISY, pp. 259–266. IEEE (2024). https://doi.org/10.1109/SISY62279.2024.10737545
  20. Raffin, A.: RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo (2020)
  21. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stable-Baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22, 268:1–268:8 (2021). https://jmlr.org/papers/v22/20-1364.html
  22. Remman, S.B., Lekkas, A.M.: Robotic lever manipulation using hindsight experience replay and Shapley additive explanations. In: ECC, pp. 586–593. IEEE (2021). https://doi.org/10.23919/ECC54610.2021.9654850
  23. Smith, L.N.: A disciplined approach to neural network hyper-parameters: part 1 - learning rate, batch size, momentum, and weight decay. CoRR abs/1803.09820 (2018). http://arxiv.org/abs/1803.09820
  24. Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: IROS, pp. 5026–5033. IEEE (2012). https://doi.org/10.1109/IROS.2012.6386109
  25. Towers, M., Kwiatkowski, A., Terry, J.K., Balis, J.U., Cola, G.D., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J.J., Tan, H., Younis, O.G.: Gymnasium: a standard interface for reinforcement learning environments. CoRR abs/2407.17032 (2024). https://doi.org/10.48550/arXiv.2407.17032
  26. Wagenmaker, A., Huang, K., Ke, L., Jamieson, K., Gupta, A.: Overcoming the sim-to-real gap: leveraging simulation to learn to explore for real-world RL. In: NeurIPS (2024). http://papers.nips.cc/paper_files/paper/2024/hash/8fa068ffe59817175d176bd75641fe16-Abstract-Conference.html
  27. Yang, X., Ji, Z., Wu, J., Lai, Y.: An open-source multi-goal reinforcement learning environment for robotic manipulation with PyBullet. In: TAROS. Lecture Notes in Computer Science, vol. 13054, pp. 14–24. Springer (2021). https://doi.org/10.1007/978-3-030-89177-0_2
  28. Zhang, J., Bao, B., Wang, C., Zhu, F.: Shapley value-driven multimodal deep reinforcement learning for complex decision-making. Neural Networks 191, 107650 (2025). https://doi.org/10.1016/j.neunet.2025.107650
  29. Zhang, K., Zhang, J.J., Xu, P., Gao, T., Gao, D.W.: Explainable AI in deep reinforcement learning models for power system emergency control. IEEE Trans. Comput. Soc. Syst. 9(2), 419–427 (2022). https://doi.org/10.1109/TCSS.2021.3096824
  30. Zhao, W., Queralta, J.P., Westerlund, T.: Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In: SSCI, pp. 737–. IEEE (2020). https://doi.org/10.1109/SSCI47803.2020.9308468