pith. machine review for the scientific record.

arXiv: 1602.07714 · v2 · submitted 2016-02-24 · cs.LG · cs.AI · cs.NE · stat.ML

Recognition: unknown

Learning values across many orders of magnitude

Arthur Guez, David Silver, Hado van Hasselt, Matteo Hessel, Volodymyr Mnih

classification cs.LG cs.AI cs.NE stat.ML
keywords learning across behavior clipped different function games magnitude

Most learning algorithms are not invariant to the scale of the function that is being approximated. We propose to adaptively normalize the targets used in learning. This is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were all clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using the adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.
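
The contrast the abstract draws between clipping rewards and adaptively normalizing targets can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a simple exponential moving average of the target mean and second moment, and the names `RunningTargetNormalizer`, `clip_reward`, `step_size`, and `min_scale` are hypothetical, introduced only for illustration.

```python
import math


class RunningTargetNormalizer:
    """Hypothetical helper: tracks a running mean and scale of targets.

    Assumption: statistics are exponential moving averages with a fixed
    step size. This is an illustrative choice, not taken from the paper.
    """

    def __init__(self, step_size=0.1, min_scale=1e-4):
        self.step_size = step_size
        self.min_scale = min_scale
        self.mean = 0.0           # running estimate of E[target]
        self.second_moment = 1.0  # running estimate of E[target^2]

    @property
    def scale(self):
        # Standard deviation from the two moments, floored to avoid
        # dividing by zero when all targets seen so far are equal.
        variance = self.second_moment - self.mean ** 2
        return max(math.sqrt(max(variance, 0.0)), self.min_scale)

    def update(self, target):
        b = self.step_size
        self.mean = (1 - b) * self.mean + b * target
        self.second_moment = (1 - b) * self.second_moment + b * target ** 2

    def normalize(self, target):
        # Map a raw target into the normalized space the learner sees.
        return (target - self.mean) / self.scale

    def denormalize(self, value):
        # Map a normalized prediction back to the original scale.
        return value * self.scale + self.mean


def clip_reward(reward, low=-1.0, high=1.0):
    """The fixed-range clipping heuristic that the abstract argues against."""
    return max(low, min(high, reward))


if __name__ == "__main__":
    normalizer = RunningTargetNormalizer()
    for target in [1.0, 10.0, 100.0, 1000.0]:
        normalizer.update(target)
        print(target, clip_reward(target), normalizer.normalize(target))
```

Clipping maps rewards of 10 and 1000 to the same value, which is how it can change the behavior being learned. The normalizer instead rescales targets as their magnitude changes and keeps the mapping invertible, so a prediction can be converted back to the original scale.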

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Density estimation using Real NVP

    cs.LG 2016-05 accept novelty 8.0

    Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.

  2. TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

    cs.AI 2026-05 unverdicted novelty 5.0

    TOPPO reformulates PPO with critic balancing to address gradient ill-conditioning in multi-task RL and reports stronger mean and tail performance than SAC baselines on Meta-World+ using fewer parameters and steps.