Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

Shuai Zhen; Yanhua Yu; Yifan Zhang; Yuling Wang

arxiv: 2605.23415 · v1 · pith:3R4VRVQLnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-25 04:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learning · reflection symmetry · continuous control · sample efficiency · symmetry regularization · PPO · SAC


The pith

Reflex exploits reflection symmetry in state-based RL to improve sample efficiency and performance on continuous control tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Reflex as a way to incorporate reflection symmetry into reinforcement learning for state-based continuous control. It defines axial and bilateral reflection symmetries as state-action transformations that preserve MDP dynamics and shows how to use symmetry regularization to shape optimal value functions and policies. Reflex is integrated with PPO and SAC and tested on OpenAI Gym and DeepMind Control benchmarks, where it outperforms standard baselines in both final performance and learning speed. A sympathetic reader would care because reflection symmetry is a common structure in many control problems, and exploiting it could reduce the number of samples needed to learn good policies without changing the underlying algorithm structure.

Core claim

By formalizing reflection symmetries and enforcing symmetry preservation through regularization on value functions and policies, Reflex produces policies that respect the symmetry of the task and achieves better sample efficiency and higher returns than symmetry-agnostic baselines on the evaluated benchmarks.

What carries the argument

Symmetry regularization mechanisms that penalize deviations from reflection symmetry in the learned policy and value function.
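
A minimal sketch of such a penalty for a deterministic policy, written with NumPy. It measures the squared gap between the action chosen in the reflected state and the reflection of the action chosen in the original state. The reflection matrices `G_S` and `G_A`, the toy `policy`, and the additive combination in `regularized_loss` are assumptions made for illustration; the paper's actual loss terms may differ.

```python
import numpy as np

# Illustrative reflection for a cart-pole-like state (theta, omega, v):
# every component flips sign, and so does the scalar force. These matrices
# are assumptions for the sketch, not taken from the paper.
G_S = -np.eye(3)          # acts on states
G_A = -np.eye(1)          # acts on actions


def policy(params, states):
    """Hypothetical deterministic policy: tanh(states @ W + b)."""
    W, b = params
    return np.tanh(states @ W + b)


def symmetry_loss(params, states, g_s=G_S, g_a=G_A):
    """Mean squared gap between pi(g s) and g pi(s) over a batch."""
    reflected_states = states @ g_s.T
    actions_on_reflected = policy(params, reflected_states)   # pi(g s)
    reflected_actions = policy(params, states) @ g_a.T        # g pi(s)
    gap = actions_on_reflected - reflected_actions
    return float(np.mean(np.sum(gap ** 2, axis=1)))


def regularized_loss(base_loss, sym_loss, w=0.1):
    """Assumed additive combination: base objective plus w times the penalty."""
    return base_loss + w * sym_loss


rng = np.random.default_rng(0)
states = rng.normal(size=(64, 3))
W = rng.normal(size=(3, 1))

equivariant = (W, np.zeros(1))        # odd function of s, so the penalty vanishes
biased = (W, np.array([0.5]))         # the bias term breaks the symmetry

loss_equivariant = symmetry_loss(equivariant, states)
loss_biased = symmetry_loss(biased, states)
```

The bias-free policy is an odd function of the state, so its penalty is zero up to rounding, while the biased policy is penalized. The weight `w` plays the role of the coefficient described in the Figure 3 excerpt, which controls the strength of the symmetry loss relative to the base objective.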

If this is right

Reflex can be added to both on-policy and off-policy algorithms without changing their core update rules.
Optimal policies under the symmetry-preserving regularization respect the reflection transformations of the state space.
Empirical gains appear on standard continuous-control suites when the symmetry is present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization idea might extend to other discrete symmetries such as 180-degree rotations if the task geometry permits.
If the symmetry is only approximate, a tunable regularization weight could control how strictly the policy is forced to obey it.
The approach could be combined with existing data-augmentation techniques that also exploit symmetry to further reduce sample needs.
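
The data-augmentation idea in the last point can be sketched as follows: each sampled transition is paired with its reflected copy before the update. This is an illustration of the general technique, not the paper's method; the function name and the toy reflection matrices are hypothetical, and the sketch is only valid when the task really is invariant under the chosen reflection.

```python
import numpy as np


def augment_with_reflection(states, actions, rewards, next_states, g_s, g_a):
    """Return the batch concatenated with its reflected copy.

    Valid only if the task is invariant under (g_s, g_a), i.e. rewards are
    unchanged and reflected transitions are as likely as the originals.
    """
    return (
        np.concatenate([states, states @ g_s.T]),
        np.concatenate([actions, actions @ g_a.T]),
        np.concatenate([rewards, rewards]),          # r(gs, ga) = r(s, a)
        np.concatenate([next_states, next_states @ g_s.T]),
    )


# Toy batch: 2-D state (position, velocity), 1-D force; reflection flips all signs.
g_s, g_a = -np.eye(2), -np.eye(1)
s = np.array([[0.5, -1.0], [2.0, 0.3]])
a = np.array([[1.0], [-0.2]])
r = np.array([0.1, 0.7])
s_next = np.array([[0.45, -0.9], [2.03, 0.28]])

aug_s, aug_a, aug_r, aug_s_next = augment_with_reflection(s, a, r, s_next, g_s, g_a)
```

Augmentation doubles the effective batch at no extra environment cost, whereas a regularization penalty acts on the network outputs directly; the two could in principle be used together.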

Load-bearing premise

The benchmark tasks contain exploitable reflection symmetry that can be expressed as transformations preserving the MDP dynamics.

What would settle it

Running Reflex on control tasks deliberately constructed without reflection symmetry and observing whether it still improves or instead matches or underperforms the baselines.
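
Before running such an experiment, one can check numerically whether a candidate task satisfies the invariance conditions at all. The sketch below does this for a hypothetical point-mass system with deterministic dynamics, where the transition condition reduces to f(gs, ga) = g f(s, a). The dynamics, reward, and `wind` parameter are invented for illustration and do not correspond to any benchmark in the paper.

```python
import numpy as np

G_S = -np.eye(2)   # (x, v) -> (-x, -v); illustrative choice
G_A = -np.eye(1)   # force a -> -a


def step(states, actions, dt=0.05, wind=0.0):
    """Euler step of a point mass; wind != 0 adds a constant one-sided push."""
    x, v = states[:, 0], states[:, 1]
    force = actions[:, 0] + wind
    return np.stack([x + dt * v, v + dt * force], axis=1)


def reward(states, actions):
    """Quadratic cost on position and effort, even in both arguments."""
    return -states[:, 0] ** 2 - 0.01 * actions[:, 0] ** 2


def invariance_gaps(step_fn, n=1000, seed=0):
    """Largest violation of f(gs, ga) = g f(s, a) and r(gs, ga) = r(s, a)."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-1.0, 1.0, size=(n, 2))
    a = rng.uniform(-1.0, 1.0, size=(n, 1))
    gs, ga = s @ G_S.T, a @ G_A.T
    dynamics_gap = np.max(np.abs(step_fn(gs, ga) - step_fn(s, a) @ G_S.T))
    reward_gap = np.max(np.abs(reward(gs, ga) - reward(s, a)))
    return float(dynamics_gap), float(reward_gap)


symmetric_gaps = invariance_gaps(lambda s, a: step(s, a, wind=0.0))
windy_gaps = invariance_gaps(lambda s, a: step(s, a, wind=0.5))
```

With no wind both gaps vanish up to rounding. With a constant wind the dynamics gap equals 2 · dt · wind, so the reflection is no longer a symmetry of the task, which is the kind of environment the proposed test would need.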

Figures

Figures reproduced from arXiv: 2605.23415 by Shuai Zhen, Yanhua Yu, Yifan Zhang, Yuling Wang.

**Figure 1.** Left: A simple illustration of axial reflection. The cart moves along the x-axis, with state $(\theta, \omega, v)$ (pole angle, angular velocity, and cart velocity) and action $a$ (force to the cart). The reflection matrix to $\ell_u$ is $R = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}$. After axial reflection, the augmented state becomes $\tilde{s} =$ …

**Figure 2.** Learning curves of Reflex compared with baselines, where w is set to 0.1 in Reflex-PPO. Due to space limitations, we report results only on tasks with bilateral reflection symmetry; complete results are provided in Appendix D.3. … Suite (Tunyasuvunakool et al., 2020), all implemented with the MuJoCo physics simulator (Todorov et al., 2012). We select environments exhibiting either axial or bilateral reflect…

**Figure 3.** Ablation study results on Reflex-PPO; complete results are provided in Appendix 12. 6.4 Comparison with Different Settings. Reflex-PPO: For Reflex-PPO, we further evaluate the effect of the symmetry regularization weight w in the PPO experiments. The coefficient w controls the relative strength of the symmetry loss with respect to the standard PPO objective. We tested multiple values of w ∈ {0.05, 0.1, 0.2…

**Figure 6.** Analysis of batch size effect on Reflex-SAC. 6.5 Applicability on TD3 Algorithm. To demonstrate the generality of Reflex, we further apply the proposed symmetry regularization term to TD3 (Fujimoto et al., 2018), another widely used off-policy RL algorithm. We keep the philosophy consistent with Eqs. (11) and (12). As shown in …

**Figure 7.** Applicability of Reflex on TD3 algorithm. 7 Conclusion. In this work, we studied how reflection symmetry can be systematically exploited in state-based continuous control. We first formalized axial and bilateral reflection within the framework of group-invariant MDPs, and established that both optimal value functions and optimal policies admit equivariant solutions. Building on theoretical analysis, we prop…

**Figure 8.** Illustrations of representative tasks used in our experiments. (a) The cart-pole illustration applies to the CartpoleBalance, CartpoleThreePoles, and InvertedDoublePendulum tasks, which differ only in pole configuration and environment source. (b) The walker illustration represents both the Walker2d and WalkerRun tasks. (c) The quadruped illustration corresponds to the Ant tasks.

**Figure 9.** Learning curves of PPO-based methods on all evaluated tasks. [Plot residue from the following figure: x-axis Timesteps (×1e4), y-axis Episode reward; legend SAC, +RAS, +GN, SAC-c, +Reflex; panels (a) CartpoleBalance, (b) CartpoleThreePoles, (c) InvertedDoublePendulum, …]

**Figure 10.** Learning curves of SAC-based methods on all evaluated tasks.

**Figure 11.** Sensitivity analysis of PPO-based methods on all evaluated tasks. D.5 Complete Ablation Study Results on Reflex-PPO. [Plot residue: x-axis Timesteps (×1e4), y-axis Episode reward; legend PPO, +Reflex_w=oA, +Reflex_w=oC, +Reflex_w=oD, +Reflex_all; panels (a) CartpoleBalance, (b) CartpoleThreePoles, (c) InvertedDoub…]

**Figure 12.** Sensitivity analysis of PPO-based methods on all evaluated tasks.

Original abstract

Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction have primarily focused on image-based RL and rotational symmetry such as $\mathrm{SO(2)}$, leaving state-based RL and reflection symmetry largely underexplored. In this work, we focus on state-based continuous control tasks and exploit reflection symmetry by introducing Reflex, a paradigm that seamlessly integrates with both on-policy and off-policy RL algorithms. We formalize two types of reflection (axial reflection and bilateral reflection) and characterize their corresponding transformations. Building on a theoretical analysis of symmetry-preserving optimal value functions and policies, Reflex integrates reflection symmetry into policy learning through principled symmetry regularization mechanisms. We integrate Reflex with PPO and SAC, and evaluate it on a suite of OpenAI Gym and DeepMind Control benchmarks, demonstrating superior performance over standard baselines while improving sample efficiency. Our code is available at https://github.com/TonyStark042/Reflex.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Reflex formalizes axial and bilateral reflection symmetries for state-based continuous control and adds a regularization term to both PPO and SAC.


The main takeaway is that this paper takes reflection symmetry in state-based RL seriously and turns it into a practical regularization trick that works with existing algorithms. They define the two reflection types as state-action transformations, argue that optimal value functions and policies can preserve them, and then add a symmetry penalty to the usual objectives. The integration covers both on-policy and off-policy methods, which is useful. Code is released, so the claims can be checked directly.

That combination of formalization plus dual-algorithm implementation is the clearest new piece. Prior symmetry work has mostly stayed with image observations and rotations, so the shift to state vectors and reflections fills a gap they correctly flag. The theory section supports the regularization without circularity, and the stress-test found no load-bearing assumption that would invalidate the construction.

The empirical side rests on standard Gym and DMC tasks where the symmetries are present by design. Gains in sample efficiency are reported, though the size of those gains will depend on how closely a given task matches the assumed reflection. If the symmetry is only approximate, the term could add variance instead of helping. Readers will want to see ablations on how sensitive performance is to the regularization weight.

This is aimed at people already working on sample-efficient continuous control or symmetry exploitation in RL. It is an incremental but clean extension rather than a broad new direction. The formalization and the on/off-policy coverage are solid enough that it deserves a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper introduces Reflex, a framework for exploiting reflection symmetry (axial and bilateral) in state-based continuous control reinforcement learning. It formalizes these symmetries as state-action transformations, provides a theoretical argument that optimal value functions and policies can be symmetry-preserving, integrates a symmetry regularization term into both PPO and SAC, and reports superior performance and improved sample efficiency on OpenAI Gym and DeepMind Control benchmarks.

Significance. If the empirical gains hold under rigorous evaluation, the work extends group-invariant MDP methods from image-based rotational symmetry to state-based reflection symmetry, offering a principled way to improve sample efficiency in continuous control without altering the underlying MDP. Code availability aids reproducibility and allows direct verification of the regularization implementation.

major comments (2)

[Abstract] Abstract: the central claim of superior performance and improved sample efficiency is asserted without any quantitative results, error bars, or description of regularization implementation details, preventing evaluation of whether the symmetry regularization actually drives measurable gains on the cited benchmarks.
[Theoretical analysis] Theoretical analysis section: the argument that optimal value functions and policies are symmetry-preserving relies on the formalization of axial/bilateral transformations; it is unclear whether this holds exactly for the continuous state-action spaces in the chosen Gym/DMC tasks or requires additional assumptions on the dynamics.

minor comments (2)

[Method] The description of how the symmetry regularization is added to the PPO and SAC objectives should include the precise loss term and hyperparameter schedule.
[Experiments] Benchmark selection: clarify which specific tasks (e.g., Hopper, Walker) are assumed to possess exploitable axial or bilateral reflection symmetry and provide a brief justification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will make to the manuscript.


Referee: [Abstract] Abstract: the central claim of superior performance and improved sample efficiency is asserted without any quantitative results, error bars, or description of regularization implementation details, preventing evaluation of whether the symmetry regularization actually drives measurable gains on the cited benchmarks.

Authors: We agree that the abstract would be strengthened by including a concise indication of the empirical gains. In the revised manuscript we will add one sentence to the abstract summarizing the average performance improvement and sample-efficiency gains (with reference to the error bars reported in the experimental tables) while remaining within length limits. The regularization implementation details are already provided in Section 4.2 and Algorithm 1; we will ensure the abstract points readers to these sections. revision: yes
Referee: [Theoretical analysis] Theoretical analysis section: the argument that optimal value functions and policies are symmetry-preserving relies on the formalization of axial/bilateral transformations; it is unclear whether this holds exactly for the continuous state-action spaces in the chosen Gym/DMC tasks or requires additional assumptions on the dynamics.

Authors: The proof in Section 3.3 assumes that both the reward function and the transition dynamics are invariant under the defined reflection transformations. This invariance holds exactly for the standard Gym and DMC environments because their underlying physics engines (MuJoCo) treat left/right and forward/backward directions symmetrically in the absence of external asymmetric forces. We will add an explicit statement of this assumption together with a short paragraph discussing the conditions under which the symmetry is preserved in continuous state-action spaces. revision: yes
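
The invariance claim in this exchange can be illustrated on a tiny tabular problem. The sketch below runs value iteration on a hypothetical five-cell track that is symmetric about its centre and checks that the resulting Q-values are unchanged when states and actions are reflected together. The environment, reward, and discount are invented for illustration; this is not a reproduction of the paper's proof or of any benchmark.

```python
import numpy as np

positions = np.array([-2, -1, 0, 1, 2])   # toy 1-D track, symmetric about 0
moves = np.array([-1, 1])                  # action 0 = left, action 1 = right
gamma = 0.9


def next_state(i, j):
    """Index of the position reached from state i with action j (walls clip)."""
    return int(np.clip(positions[i] + moves[j], -2, 2)) + 2


def reward(i, j):
    """Negative distance from the centre after the move; even under reflection."""
    return -float(abs(positions[next_state(i, j)]))


def value_iteration(n_sweeps=500):
    """Synchronous Bellman optimality backups on the full Q-table."""
    q = np.zeros((len(positions), len(moves)))
    for _ in range(n_sweeps):
        v = q.max(axis=1)
        q = np.array([[reward(i, j) + gamma * v[next_state(i, j)]
                       for j in range(len(moves))]
                      for i in range(len(positions))])
    return q


q_star = value_iteration()
# Reflection maps position p to -p and left to right, i.e. it reverses both axes.
q_reflected = q_star[::-1, ::-1]
# Boolean mask of optimal actions per state.
optimal = np.isclose(q_star, q_star.max(axis=1, keepdims=True))
```

Here `q_star` matches `q_reflected`, and the set of optimal actions in a reflected state is the reflection of the optimal set in the original state. At the centre cell both moves are optimal, so a deterministic choice there cannot be equivariant, whereas a uniform distribution over the optimal set is; that is the construction visible in the proof fragment extracted as item [14] of the reference list.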

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces formalizations of axial and bilateral reflection symmetries as state-action transformations, presents a theoretical analysis showing that optimal value functions and policies can be symmetry-preserving, and incorporates a symmetry regularization term into PPO and SAC. These elements constitute new methodological contributions that do not reduce to self-citations, fitted inputs renamed as predictions, or definitional equivalences. The empirical results on OpenAI Gym and DeepMind Control benchmarks serve as independent validation outside any internal construction, rendering the derivation chain self-contained with no load-bearing steps that collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5711 in / 966 out tokens · 18779 ms · 2026-05-25T04:55:01.467278+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Foundation/AlexanderDuality.lean · reality_from_one_distinction, alexander_duality_circle_linking · unclear

We formalize two types of reflection (axial reflection and bilateral reflection)... G-invariant MDP... r(s,a)=r(gs,ga), P(s′|s,a)=P(gs′|gs,ga). Lemma 4.3 (Equivariance of the Bellman Operator)... Theorem 4.5 (Equivariance of Optimal Policies)

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel, Jcost · unclear

$L^{\mathrm{sym}}_{\pi}(\theta) = \mathbb{E}\left[\lVert \pi_\theta(g s_t) - g(\pi_\theta(s_t)) \rVert_2^2\right]$... symmetry regularization... improves sample efficiency on Gym/DMC
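
For readability, the two quoted passages can be typeset as follows. This only restates the conditions and the loss as they appear in the excerpts; the final line adds the standard observation that a reflection is its own inverse, which is an assumption about how the group element $g$ is meant here rather than a statement taken from the paper.

```latex
% G-invariant MDP: reward and transition kernel are unchanged under g
r(s, a) = r(gs, ga), \qquad
P(s' \mid s, a) = P(gs' \mid gs, ga)
\quad \text{for all } g \in G .

% Policy symmetry regularizer quoted in the second passage
L^{\mathrm{sym}}_{\pi}(\theta)
  = \mathbb{E}\!\left[ \left\lVert \pi_\theta(g s_t) - g\,\pi_\theta(s_t) \right\rVert_2^2 \right] .

% For a reflection, g \circ g = \mathrm{id}, hence g^{-1} = g
% and the group is G = \{ \mathrm{id},\, g \} .
```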

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 6 internal anchors

[1] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540.

[2] Corrado, N. E., Qu, Y., Balis, J. U., Labiosa, A., and Hanna, J. P. Guided data augmentation for offline reinforcement learning and imitation learning. arXiv preprint arXiv:2310.18247.

[3] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.

[4] Hu, J., Jiang, Y., and Weng, P. Revisiting data augmentation in deep reinforcement learning. arXiv preprint arXiv:2402.12181.

[5] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[6] Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. Advances in Neural Information Processing Systems, 33:19884–19895, 2020a. Laskin, M., Srinivas, A., and Abbeel, P. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, p...

[7] Mondal, A. K., Nair, P., and Siddiqi, K. Group equivariant deep reinforcement learning. arXiv preprint arXiv:2007.03437.

[8] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

[9] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[10] Su, Z., Huang, X., Ordoñez-Apraez, D., Li, Y., Li, Z., Liao, Q., Turrisi, G., Pontil, M., Semini, C., Wu, Y., et al. Leveraging symmetry in RL-based legged locomotion control. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6899–6906. IEEE.

[11] Tangri, A., Biza, O., Wang, D., Klee, D., Howell, O., and Platt, R. Equivariant offline reinforcement learning. arXiv preprint arXiv:2406.13961.

[12] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE.

[13] Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Mastering visual continuous control: Improved data-augmented reinforcement learning. In Deep RL Workshop NeurIPS 2021, 2021a. Yarats, D., Kostrikov, I., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Represen...

[14] (Proof fragment extracted as a reference.) Since $g$ is a bijection, the cardinality is preserved: $|A^*(gs)| = |gA^*(s)| = |A^*(s)|$. Thus $\pi^*(a \mid gs) = \frac{1}{|A^*(gs)|} = \frac{1}{|A^*(s)|} = \pi^*(g^{-1}a \mid s)$ (Eq. 46). If $a \notin A^*(gs)$, then $g^{-1}a \notin A^*(s)$, and both sides of the equation equal zero. To verify optimality, note that $V^{\pi^*}(s) = \sum_{a \in A^*(s)} \frac{1}{|A^*(s)|} Q^*(s, a) = \sum_{a \in A^*(s)} \frac{1}{|A^*(s)|} V^*(s) = V^*(s)$ (Eq. 47), where we used the fact that ...

[15] (Hyperparameter table extracted as a reference.) Actor learning rate: 3×10⁻⁴ (1×10⁻³†); critic learning rate: 1×10⁻³; batch size: 256 (512†); replay buffer size: 10⁶; hidden layer size: 256; discount (γ): 0.99; warmup steps: 5000; target smoothing coefficient (τ): 0.005; target update interval: 1 (2†); policy update interval: 2; alpha (α): 0.2; autotune alpha: true ...
