An Information-Geometric Approach to Artificial Curiosity

Alexander Nedergaard; Pablo A. Morales

arxiv: 2504.06355 · v2 · submitted 2025-04-08 · 💻 cs.LG

Pith reviewed 2026-05-22 19:45 UTC · model grok-4.3

classification 💻 cs.LG

keywords: artificial curiosity, intrinsic rewards, information geometry, reinforcement learning, exploration, occupancy measure, count-based exploration

The pith

Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that artificial curiosity in sparse-reward reinforcement learning can be placed on a firmer footing by applying principles from information geometry. It shows that the dual requirements of information monotonicity and invariance under agent-environment interactions force intrinsic rewards to take the form of strictly concave functions of the reciprocal occupancy measure. Requiring in addition a coherent way to balance exploration against exploitation further restricts the rewards to a one-parameter family obtained by geodesic interpolation on the occupancy manifold. A reader should care because this derivation replaces heuristic choices of curiosity bonuses with a small set of candidates that recover familiar methods as special cases.

Core claim

Leveraging information monotonicity and invariance under the agent-environment interaction, the authors show that intrinsic rewards are uniquely constrained to strictly concave functions of the reciprocal occupancy. When these rewards must also support a principled exploration-exploitation trade-off via information geodesic interpolation on the occupancy manifold, the candidates reduce to a one-parameter family. Special values of the parameter recover count-based exploration and maximum-entropy exploration.
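
To make the claimed form concrete, here is a minimal sketch of rewards written as a concave function of the reciprocal occupancy 1/p. The power family `(x**k - 1) / k` and its parameter `k` are illustrative assumptions, not the paper's parametrization; how `k` would map onto the paper's scalar parameter is not stated in this review. The sketch only checks two facts that hold independently of the paper: the expected value of log(1/p) under p is the Shannon entropy, and sqrt(1/p) is an affine rescaling of the familiar 1/sqrt(N) count bonus.

```python
import math

def occupancy_from_counts(counts):
    """Normalize visit counts into an empirical occupancy distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def power_reward(p, k):
    """Hypothetical concave family f_k(x) = (x**k - 1) / k evaluated at x = 1/p.

    For k < 1 the function is strictly concave in x.
    k -> 0 gives log(x); k = 0.5 gives 2 * (sqrt(x) - 1).
    """
    x = 1.0 / p
    if abs(k) < 1e-12:
        return math.log(x)
    return (x ** k - 1.0) / k

counts = [50, 30, 15, 5]          # visit counts N(s) for four states
total = sum(counts)
p = occupancy_from_counts(counts)

log_rewards = [power_reward(pi, 0.0) for pi in p]   # log(1/p)
sqrt_rewards = [power_reward(pi, 0.5) for pi in p]  # 2 * (sqrt(1/p) - 1)

# The expected log(1/p) reward under p is the Shannon entropy of p.
entropy = -sum(pi * math.log(pi) for pi in p)
expected_log_reward = sum(pi * r for pi, r in zip(p, log_rewards))

# The classic count bonus 1/sqrt(N) differs from sqrt(1/p) by the factor sqrt(total).
count_bonus = [1.0 / math.sqrt(c) for c in counts]

for s, (pi, lr, sr) in enumerate(zip(p, log_rewards, sqrt_rewards)):
    print(f"state {s}: p={pi:.2f}  log reward={lr:.3f}  sqrt reward={sr:.3f}")
```

In both cases rarely visited states receive the larger reward, which is the qualitative behaviour any function of this shape shares.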

What carries the argument

Information monotonicity together with invariance under agent-environment interaction, which pins intrinsic rewards to strictly concave functions of the reciprocal occupancy on the occupancy manifold.

If this is right

All valid intrinsic rewards for artificial curiosity share the mathematical form of a strictly concave function applied to the reciprocal occupancy.
Exploration and exploitation trade off by moving along an information geodesic controlled by a single scalar parameter.
Count-based exploration corresponds to one specific value of the scalar parameter.
Maximum-entropy exploration corresponds to another specific value of the scalar parameter.
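
The geodesic interpolation mentioned above can be sketched for a finite state space. The formula follows the α-geodesic fragment quoted in the reference graph (attributed there to Ay et al., 2017, Equation 2.59), with the normalization factor computed explicitly. The function name `alpha_geodesic`, the restriction to strictly positive discrete distributions, and the treatment of α = 1 as the geometric-mean limit are assumptions of this sketch.

```python
import math

def alpha_geodesic(p, q, t, alpha):
    """Point at time t on the alpha-geodesic between discrete distributions p and q.

    Uses gamma(t) proportional to ((1 - t) * p**e + t * q**e) ** (1 / e)
    with e = (1 - alpha) / 2, then renormalizes. Requires p, q > 0 elementwise.
    """
    e = (1.0 - alpha) / 2.0
    if abs(e) < 1e-12:
        # alpha = 1: the power mean degenerates to a geometric mean.
        raw = [math.exp((1 - t) * math.log(pi) + t * math.log(qi))
               for pi, qi in zip(p, q)]
    else:
        raw = [((1 - t) * pi ** e + t * qi ** e) ** (1.0 / e)
               for pi, qi in zip(p, q)]
    z = sum(raw)
    return [r / z for r in raw]

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

for alpha in (-1.0, 0.0, 1.0):
    midpoint = alpha_geodesic(p, q, 0.5, alpha)
    print(alpha, [round(x, 4) for x in midpoint])
```

With α = −1 the exponent is 1 and the path is the ordinary mixture, which matches the "flat geometry" label in the figure caption; other values bend the path between the same two endpoints.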

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rewards that explicitly depend on a chosen state encoding would violate the invariance and could produce inconsistent exploration across equivalent representations of the same environment.
Different choices of strictly concave function within the allowed family could be tested empirically to discover new exploration bonuses with desirable properties.
The same geometric construction might be applied in continuous state spaces once suitable occupancy measures and geodesics are defined.
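
The first extension can be illustrated with the simplest kind of re-representation, a relabeling of states. This is a sketch under assumptions: a bijective relabeling is only one special case of the invariance the paper discusses, and `encoding_reward` is a deliberately naive hypothetical bonus invented here for contrast.

```python
import math
from collections import Counter

def occupancy(trajectory):
    """Empirical occupancy of each visited state."""
    counts = Counter(trajectory)
    n = len(trajectory)
    return {s: c / n for s, c in counts.items()}

def occupancy_reward(trajectory):
    """Reward that depends on a state only through its occupancy: log(1/p)."""
    p = occupancy(trajectory)
    return {s: math.log(1.0 / p[s]) for s in p}

def encoding_reward(trajectory):
    """Hypothetical encoding-dependent bonus: the numeric state label itself."""
    return {s: float(s) for s in set(trajectory)}

trajectory = [0, 0, 0, 1, 1, 2]
relabel = {0: 2, 1: 0, 2: 1}                 # same environment, new state names
relabeled = [relabel[s] for s in trajectory]

r_old, r_new = occupancy_reward(trajectory), occupancy_reward(relabeled)
invariant = all(math.isclose(r_old[s], r_new[relabel[s]]) for s in r_old)

e_old, e_new = encoding_reward(trajectory), encoding_reward(relabeled)
encoding_invariant = all(math.isclose(e_old[s], e_new[relabel[s]]) for s in e_old)

print("occupancy-based reward invariant:", invariant)
print("encoding-based reward invariant:", encoding_invariant)
```

The occupancy-based reward assigns each physical state the same value before and after relabeling, while the label-based bonus does not.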

Load-bearing premise

Intrinsic rewards are required to be representation-agnostic and to depend only on the agent's information about the environment.

What would settle it

Constructing or exhibiting an intrinsic reward function that is not strictly concave in the reciprocal occupancy yet still obeys both information monotonicity and invariance under agent-environment interaction would disprove the claimed uniqueness.
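
A numerical search cannot establish or refute the uniqueness claim, but it can cheaply screen a candidate function for strict concavity before any further analysis. The sketch below is an assumption-laden helper, not something from the paper: the name `violates_strict_concavity`, the test interval [1, 100] for the reciprocal occupancy, and the midpoint test are all choices made here. It says nothing about whether a candidate satisfies the monotonicity or invariance axioms.

```python
import math
import random

def violates_strict_concavity(f, trials=2000, lo=1.0, hi=100.0, seed=0):
    """Search for a pair (x, y) where the midpoint test for strict concavity fails.

    Strict concavity requires f((x + y) / 2) > (f(x) + f(y)) / 2 for x != y.
    Returns a witness pair if one is found, otherwise None.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if abs(x - y) < 1.0:
            continue  # skip near-identical points to stay clear of rounding noise
        mid = f(0.5 * (x + y))
        chord = 0.5 * (f(x) + f(y))
        if mid <= chord + 1e-12:
            return (x, y)
    return None

candidates = {
    "log(x)": math.log,
    "sqrt(x)": math.sqrt,
    "x (linear)": lambda x: x,
    "x**2 (convex)": lambda x: x * x,
}
for name, f in candidates.items():
    witness = violates_strict_concavity(f)
    print(name, "passes" if witness is None else f"fails at {witness}")
```

A candidate that fails here is not of the claimed form; a candidate that passes has only survived a finite random search.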

Figures

Figures reproduced from arXiv: 2504.06355 by Alexander Nedergaard, Pablo A. Morales.

**Figure 1.** Artificial curiosity with α-information rewards on the curved occupancy manifold. (Top) The Amari-Čencov tensor constant α ∈ ℝ encodes the occupancy manifold curvature (red: spherical, blue: flat, green: hyperbolic). Count-based exploration corresponds to the Riemannian geometry with α = 0, and maximum entropy exploration to the flat geometry with α = −1 (Theorem 3.3). (Bottom) The intrinsic rewards sca…

Original abstract

Learning in environments with sparse rewards remains a fundamental challenge in reinforcement learning. Artificial curiosity addresses this limitation through intrinsic rewards to guide exploration, however, the precise formulation of these rewards has remained elusive. Ideally, such rewards should depend on the agent's information about the environment, remaining agnostic to its representation -- an invariance central to information geometry. Leveraging this, we show that information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy. Requiring these rewards to yield a principled exploration-exploitation trade-off, via information geodesic interpolation on the occupancy manifold, effectively limits the candidates to those determined by a scalar parameter. Remarkably, special values of this parameter are found to correspond to count-based and maximum entropy exploration. This framework provides important constraints to the engineering of intrinsic rewards while integrating foundational exploration methods into a single, cohesive model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims a geometric derivation pins intrinsic rewards to strictly concave functions of reciprocal occupancy and then parametrizes them to recover known methods, but the invariance step looks under-specified.

The punchline is that this paper tries to derive a specific form for artificial curiosity rewards from information geometry principles, claiming uniqueness from monotonicity and invariance, then parametrizes it to recover known methods.

What stands out as new is the combination of information monotonicity with an invariance under agent-environment interaction to constrain intrinsic rewards to strictly concave functions of the reciprocal occupancy. The geodesic interpolation step to get a scalar-parameter family is a nice way to organize things and link back to count-based and max-entropy exploration. The paper does a decent job of framing the problem in geometric terms and showing how different exploration strategies fit into one model. That could be useful for thinking about how to design intrinsic rewards more systematically instead of by trial and error.

The main soft spot is the invariance assumption. The stress-test note points out that it's not clear how the agent-environment interaction map is formalized as a functional equation, which leaves the uniqueness claim a bit loose. Without seeing the explicit steps in the derivation, it's difficult to judge if the math really pins it down or if there are implicit assumptions about the occupancy manifold that do some of the work. The fact that the parameter just tunes between existing methods also means it's more of a retrospective unification than a forward-looking prediction tool.

This kind of paper is for researchers in reinforcement learning who focus on exploration and intrinsic motivation, especially those who like information-theoretic or geometric approaches. A reader looking for new empirical results or fully worked proofs might be disappointed, but someone wanting conceptual constraints on reward design could find it worthwhile. It has enough of a specific claim to deserve a serious referee who can check the derivation details.

I would recommend putting it through peer review rather than desk rejecting it, mainly because the geometric angle is not routine in this area and the authors seem to be engaging honestly with the literature.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an information-geometric framework for artificial curiosity in sparse-reward RL. It claims that information monotonicity together with invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy. Requiring a principled exploration-exploitation trade-off via information-geodesic interpolation on the occupancy manifold then reduces the family to a one-parameter model whose special values recover count-based and maximum-entropy exploration.

Significance. If the uniqueness result can be placed on a fully rigorous footing, the work would supply a principled geometric unification of several existing exploration heuristics and useful constraints on the design of intrinsic rewards. The recovery of known methods as special cases of the scalar parameter is a constructive feature, though the framework largely selects among previously published strategies rather than generating new falsifiable predictions.

major comments (3)

Abstract: the central uniqueness claim states that 'information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy,' yet no explicit functional equation or axiom set is supplied that defines how the interaction map acts on the information measure or reward functional. Without this, it is impossible to confirm that the constraint excludes non-concave or non-reciprocal candidates.
Derivation of the one-parameter family (via geodesic interpolation): the reduction to a scalar parameter inherits the same ambiguity; the manuscript must state any regularity conditions imposed on the occupancy manifold and show that the geodesic step does not introduce hidden assumptions that weaken the uniqueness result.
Representation-agnostic premise (abstract, paragraph 3): the assumption that intrinsic rewards must depend solely on the agent's information and remain representation-agnostic is taken as a direct consequence of information geometry, but the precise invariance property is not formalized before the derivation; this is load-bearing for the uniqueness conclusion.

minor comments (2)

Abstract: the term 'reciprocal occupancy' is used without a brief inline definition or reference to its standard definition in the literature; adding one sentence would improve accessibility.
The manuscript introduces the 'occupancy manifold' as a central object; a short paragraph clarifying its construction from the occupancy measure and its relation to existing information-geometric structures would aid readers.
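
For orientation, the two objects the minor comments ask about can be written out as follows. This is a reading pieced together from the equation fragments quoted in the reference graph, not a definition taken from the paper's main text: the symbol $r_{\mathrm{int}}$ and the name $f$ are notational assumptions, and the normalization of $p_\pi$ is left open because the fragments do not settle it.

```latex
% Occupancy density p_\pi: the density on the state space S against which
% the expected return of policy \pi is an integral of the reward r.
R(\pi) = \int_S p_\pi(s)\, r(s)\, dV(s)

% Reciprocal occupancy 1 / p_\pi(s), and an intrinsic reward of the claimed form.
r_{\mathrm{int}}(s) = f\!\left(\frac{1}{p_\pi(s)}\right),
\qquad f \ \text{strictly concave}
```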

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments correctly identify areas where additional formalization will strengthen the presentation. We address each major comment below and will revise the manuscript accordingly to make the axiomatic foundations explicit while preserving the core information-geometric results.

Referee: Abstract: the central uniqueness claim states that 'information monotonicity and invariance under the agent-environment interaction uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy,' yet no explicit functional equation or axiom set is supplied that defines how the interaction map acts on the information measure or reward functional. Without this, it is impossible to confirm that the constraint excludes non-concave or non-reciprocal candidates.

Authors: The abstract necessarily condenses the result. In the body (Section 3), information monotonicity is defined via the data-processing inequality on the chosen divergence, and invariance under the agent-environment interaction is stated as invariance of the reward functional under the push-forward of the occupancy measure. To address the concern directly, we will insert a new subsection (2.3) that states the two axioms as explicit functional equations before deriving the uniqueness theorem. This will make the exclusion of non-concave candidates fully verifiable from the axioms alone. revision: yes
Referee: Derivation of the one-parameter family (via geodesic interpolation): the reduction to a scalar parameter inherits the same ambiguity; the manuscript must state any regularity conditions imposed on the occupancy manifold and show that the geodesic step does not introduce hidden assumptions that weaken the uniqueness result.

Authors: The occupancy manifold is the simplex equipped with the Fisher-Rao metric; geodesics are the standard information-geometric interpolations between occupancy measures. Regularity assumptions are smoothness of the occupancy functions and a compact state space ensuring geodesic existence. The interpolation step is applied only after the concavity constraint has been established and does not relax it. We will add an explicit paragraph in Section 4 listing these conditions and a short verification that the interpolated family remains strictly concave for all admissible parameter values. revision: yes
Referee: Representation-agnostic premise (abstract, paragraph 3): the assumption that intrinsic rewards must depend solely on the agent's information and remain representation-agnostic is taken as a direct consequence of information geometry, but the precise invariance property is not formalized before the derivation; this is load-bearing for the uniqueness conclusion.

Authors: The representation-agnostic property follows from the invariance of the information measure under sufficient statistics, which is a standard axiom in information geometry. While motivated in the introduction, we agree that a self-contained statement of this invariance axiom should precede the main derivation. We will restructure Section 2 to list all three axioms (monotonicity, interaction invariance, and representation invariance) before the uniqueness theorem, thereby clarifying the logical order. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on external information-geometry axioms.

The paper invokes information monotonicity and invariance under agent-environment interaction as central properties of information geometry to constrain intrinsic rewards to strictly concave functions of reciprocal occupancy. These axioms are presented as independent inputs rather than defined in terms of the target reward form. The subsequent geodesic interpolation step introduces a scalar parameter whose special values recover count-based and maximum-entropy methods; this is an integration of existing strategies, not a statistical fit or self-definition that forces the result by construction. No load-bearing self-citations, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation are evident. The framework remains self-contained against the stated monotonicity and invariance premises.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The claim rests on two domain assumptions from information geometry and introduces one free scalar parameter plus the occupancy manifold as a geometric construct.

free parameters (1)

scalar parameter
Single adjustable value that selects the specific concave function within the derived family; special values recover count-based and maximum-entropy exploration.

axioms (2)

domain assumption Information monotonicity
More information about the environment must not decrease the intrinsic reward (abstract).
domain assumption Invariance under agent-environment interaction
Intrinsic reward must remain unchanged under re-representations of the same information (abstract).

invented entities (1)

occupancy manifold no independent evidence
purpose: Geometric space on which information geodesic interpolation is performed to enforce exploration-exploitation trade-off.
Introduced to carry out the interpolation step that reduces the reward family to a scalar parameter.
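
The trade-off that the interpolation is meant to enforce can be made tangible in the simplest setting. The sketch below assumes a finite state space, treats every distribution over states as achievable (that is, it ignores the environment dynamics), and takes the divergence to be the Kullback-Leibler divergence to the uniform distribution, which differs from negative Shannon entropy by a constant and so corresponds to the maximum-entropy case. The objective mirrors the shape of the regularized return quoted in the reference graph, with the horizon factor dropped; the function names and the weight `beta` are assumptions of this sketch.

```python
import math
import random

def kl_to_uniform(p):
    """KL(p || u) for u uniform on len(p) states."""
    n = len(p)
    return sum(pi * math.log(pi * n) for pi in p if pi > 0)

def objective(p, r, beta):
    """Expected extrinsic reward minus beta times the divergence from uniform."""
    return sum(pi * ri for pi, ri in zip(p, r)) - beta * kl_to_uniform(p)

def gibbs(r, beta):
    """Closed-form maximizer of the objective: p proportional to exp(r / beta)."""
    m = max(r)  # subtract the max for numerical stability
    w = [math.exp((ri - m) / beta) for ri in r]
    z = sum(w)
    return [wi / z for wi in w]

def random_distribution(n, rng):
    w = [rng.random() + 1e-9 for _ in range(n)]
    z = sum(w)
    return [wi / z for wi in w]

rng = random.Random(0)
r = [1.0, 0.0, 0.0, 0.0]   # a single rewarding state
beta = 0.5
p_star = gibbs(r, beta)
best = objective(p_star, r, beta)

# No randomly drawn distribution should beat the closed-form optimum.
never_beaten = all(
    objective(random_distribution(len(r), rng), r, beta) <= best + 1e-12
    for _ in range(1000)
)

for b in (0.05, 0.5, 100.0):
    print(b, [round(x, 3) for x in gibbs(r, b)])
```

Small `beta` concentrates the occupancy on the rewarding state (exploitation), while large `beta` pushes it toward uniform coverage (exploration).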

pith-pipeline@v0.9.0 · 5668 in / 1371 out tokens · 125159 ms · 2026-05-22T19:45:11.349727+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (echoes)
Paper passage: uniquely constrains intrinsic rewards to strictly concave functions of the reciprocal occupancy

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective / generatorOfLawsOfLogic (echoes)
Paper passage: invariance under the agent-environment interaction and congruent Markov morphisms

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 4 internal anchors

[1]

A., Rosas, F

Aguilera, M., Morales, P. A., Rosas, F. E., and Shimazaki, H. Explosive neural networks via higher-order interactions in curved statistical manifolds. arXiv preprint arXiv:2408.02326,

work page arXiv
[2]

Exploration by Random Network Distillation

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

New foundations for the theory of quadratic forms with infinitely many variables (in german)

Hellinger, E. New foundations for the theory of quadratic forms with infinitely many variables (in German). Journal für die reine und angewandte Mathematik, 1909(136):210–271,

work page 1909
[4]

URL https://doi.org/10.1515/crll.1909

doi: 10.1515/crll.1909.136.210. URL https://doi.org/10.1515/crll.1909.136.210. Hoeffding, W. Probability inequalities for sums of bounded random variables

work page doi:10.1515/crll.1909.136.210 1909
[6]

URL https://arxiv.org/abs/2103.04551. Meyn, S. P. and Tweedie, R. L. Markov chains and stochastic stability. Springer Science & Business Media,

work page arXiv
[7]

URL https://doi.org/ 10.1103%2Fphysrevresearch.3.033216

doi: 10.1103/physrevresearch.3.033216. URL https://doi.org/10.1103%2Fphysrevresearch.3.033216. Morales, P. A., Korbel, J., and Rosas, F. E. Geometric structures induced by deformations of the Legendre transform. Entropy, 25(4):678, 2023a. Morales, P. A., Korbel, J., and Rosas, F. E. Thermodynamics of exponential Kolmogorov–Nagumo averages. New Journ...

work page 2020
[8]

and Cook, M

Nedergaard, A. and Cook, M. k-means maximum entropy exploration. arXiv preprint arXiv:2205.15623,

work page arXiv
[10]

Dota 2 with Large Scale Deep Reinforcement Learning

URL http://arxiv.org/abs/1912.06680. Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction,

work page internal anchor Pith review Pith/arXiv arXiv 1912
[11]

Trust Region Policy Optimization

Schulman, J. Trust region policy optimization. arXiv preprint arXiv:1502.05477,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Proximal Policy Optimization Algorithms

URL http://arxiv.org/abs/1707.06347. Schultz, W., Dayan, P., and Montague, P. R. A neural substrate of prediction and reward. Science, 275(5306):1593–1599,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

To ensure that ˜M forms a Markov chain given that M does, we need to handle a topological technicality: ˜B, the Borel σ-algebra of ˜S, must be countably generated

else, (7) where c(s) gives the counter and B ∈ ˜B. To ensure that ˜M forms a Markov chain given that M does, we need to handle a topological technicality: ˜B, the Borel σ-algebra of ˜S, must be countably generated. Let I have any topology and let ˜S have the product topology. If M formed a Markov chain, S must have been second-countable. Since I is second...

work page 2012
[15]

13 An Information-Geometric Approach to Artificial Curiosity Proof

Z S pπ(s)r(s)dV (s) = R(π). 13 An Information-Geometric Approach to Artificial Curiosity Proof. The proof is due to (Bojun, 2020). Define the state-action value Q(s, a) := Z S n−k nX i=k r(si)dδ(s, a, sk)dM(sk, sk+1) · · · dM(sn−1, sn) (19) where the notation R S n−k :=Qn−kR S had been adopted. Q(s, a) satisfies, R(π) = Z S r(s) + Z A Q(s, a)dπ(s, a) dµ(s...

work page 2020
[16]

(25b) Proposition 2.4

Z S pπ(s)r(s)dV (s). (25b) Proposition 2.4. Any divergence that is a strictly monotonic function of a geodetic divergence is geodetic. Proof. Let ¯D be a geodetic divergence and D = F ( ¯D) with F strictly monotonic function. Using the shorthand Dp := D(p∥·) and letting γ denote any path with arbitrary endpoints p and q, ¯Dp(q) = ¯Dp(q) − ¯Dp(p)|{z} =0 = ...

work page 2014
[17]

The unique invariance under the agent-environment interaction whenp = pπ follows from the unique invariance ofpπ under the agent-environment interaction by Theorem 2.1

Z S pπ(s)¯r(s)dV (s) = Z S n+1 nX i=0 ¯r(si)dµ(s0) nY i=1 dM(si−1, si) (30) where the equality holds by Lemma 2.2. The unique invariance under the agent-environment interaction whenp = pπ follows from the unique invariance ofpπ under the agent-environment interaction by Theorem 2.1. For the data processing inequality, we may consider ¯f [pπ(s)] = f pπ(s)−...

work page 2017
[18]

Interpreting 1 β(n+1) as a Lagrange multiplier, we have pα,β = arg min p∈{p∈P:R(p)=cα,β } Dα(p∥u) = arg min p∈H(cα,β) Dα(p∥u)

Z S pπ(s)Iα(s, pπ)dV (s) (55a) = R(π) − β(n + 1)Dα(p∥u) (55b) Then, the optima pα,β = arg max p∈P Rα,β(p) (56a) = arg max p∈P {R(p) − β(n + 1)Dα(p∥u)} (56b) = arg max p∈P −Dα(p∥u) + 1 β(n + 1)R(p) (56c) = arg min p∈P Dα(p∥u) − 1 β(n + 1)R(p) (56d) = arg min p∈P Dα(p∥u) − 1 β(n + 1)[R(p) − c] (56e) 18 An Information-Geometric Approach to Artificial Curiosi...

work page 2016
[19]

Rα,β is α-concave if Rα,β ◦ γ is concave for any α-geodesic

Z S pπ(s) {r(s) + βIα(s; pπ)} dV (s), (76) we assume the reward function is bounded. Rα,β is α-concave if Rα,β ◦ γ is concave for any α-geodesic. The α-geodesic γ : [0, 1] → P connecting p ∈ P and q ∈ P is given by γp,q(t) = n (1 − t)p 1−α 2 + tq 1−α 2 o 2 1−α ξ(t) (77) where ξ(t) ensures normalization (Ay et al., 2017, Equation 2.59). Assume first that α...

work page 2017
