Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

Dohyeong Kim; Junseok Kim; Mineui Hong; Songhwai Oh

arxiv: 2605.20609 · v1 · pith:P7KHAYIDnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 06:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords offline reinforcement learning, goal-conditioned RL, compositional generalization, analogy transduction, latent representations, task planning, generalization to unseen goals

The pith

A context-invariant latent analogy representation enables synthesizing optimal plans for unseen context-goal combinations in offline goal-conditioned reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes analogy transduction as synthesizing new plans by composing task-endogenous analogies with arbitrary given contexts. It introduces a novel representation that encodes only the changes occurring under optimal task execution. This representation is designed to stay invariant across contextual variations while remaining sufficient to produce optimal goal-reaching behavior. The method specifically targets generalization to unseen analogy-context pairs, overcoming limits of prior trajectory-stitching approaches in offline GCRL.

Core claim

Grounded in theory, the analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching, which allows a new offline GCRL approach to perform analogy transduction to unseen combinations and substantially outperform prior methods on manipulation tasks.

What carries the argument

Analogy transduction via a latent representation of task-endogenous changes that composes with new contexts to generate plans.
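
As a rough illustration of the idea, the sketch below builds a toy version of the temporal distance difference field described in the Figure 2 caption, x -> d(x, g) - d(x, s), on a hypothetical drawer-and-window world. The state layout, the names `endogenous_distance` and `analogy`, and the choice of the drawer as the only task-endogenous component are assumptions made for illustration; in this toy the invariance holds by construction, whereas the paper's claim concerns a learned representation.

```python
from collections import deque

# Hypothetical toy world: the drawer position (0 = closed ... 3 = open) is the
# task-endogenous part; the window flag is treated as task-exogenous context.
DRAWER_POSITIONS = range(4)


def endogenous_distance(start, goal):
    """Fewest steps to move the drawer from start to goal (BFS on a chain)."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        pos, steps = frontier.popleft()
        if pos == goal:
            return steps
        for nxt in (pos - 1, pos + 1):
            if nxt in DRAWER_POSITIONS and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + 1))
    raise ValueError("goal unreachable")


def analogy(s, g):
    """Difference field x -> d(x, g) - d(x, s) over drawer probes only."""
    (s_drawer, _), (g_drawer, _) = s, g
    return tuple(
        endogenous_distance(x, g_drawer) - endogenous_distance(x, s_drawer)
        for x in DRAWER_POSITIONS
    )


# "Open the drawer" observed under two different window contexts.
open_with_window_closed = analogy((0, "closed"), (3, "closed"))
open_with_window_open = analogy((0, "open"), (3, "open"))
print(open_with_window_closed)                           # (3, 1, -1, -3)
print(open_with_window_closed == open_with_window_open)  # True
```

Because the field depends only on the drawer component, the same tuple describes "open the drawer" in either window context, which is the property that would let it be reused in a context absent from the data.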

If this is right

Offline agents can reach unseen goals under novel contextual variations by composing from existing data.
Generalization becomes possible for analogy-context pairs absent from the training trajectories.
Behavior composition no longer requires temporally contiguous segments from the same context.
Performance on goal-conditioned tasks improves over methods without explicit analogy transduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same invariance principle might transfer to other planning or control domains with factored task and context elements.
If the representation generalizes reliably, it could lower sample requirements for learning generalist agents in robotics.
Further work could test whether the approach scales to higher-dimensional state spaces or longer task horizons.

Load-bearing premise

The learned latent analogy representation is invariant to contextual variations and sufficient to produce optimal goal-reaching behavior when composed with new contexts.

What would settle it

A test where policies built from the learned analogies composed with held-out contexts fail to reach goals at optimal performance levels in the OGBench environments.
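
One way such a held-out test could be set up is sketched below. The record layout, the field values, and the helper names `split_by_pair` and `seen_separately` are hypothetical and do not reflect the OGBench data format; the sketch only shows how context-task pairs might be withheld while checking that each withheld context and task still appears in training in some other combination.

```python
# Hypothetical episode records: (context, task, episode_id). Values are
# illustrative and not taken from the OGBench datasets.
episodes = [
    (("window closed", "drawer closed"), "open drawer", 0),
    (("window open", "drawer closed"), "open drawer", 1),
    (("window closed", "drawer closed"), "open window", 2),
    (("window open", "drawer open"), "close window", 3),
]

# Context-task pairs withheld from training.
held_out_pairs = {(("window closed", "drawer closed"), "open drawer")}


def split_by_pair(records, held_out):
    """Route episode ids to the evaluation set if their pair is held out."""
    train, evaluation = [], []
    for context, task, episode_id in records:
        bucket = evaluation if (context, task) in held_out else train
        bucket.append(episode_id)
    return train, evaluation


def seen_separately(records, held_out):
    """True if every held-out context and task still occurs in training."""
    train = [(c, t) for c, t, _ in records if (c, t) not in held_out]
    contexts = {c for c, _ in train}
    tasks = {t for _, t in train}
    return all(c in contexts and t in tasks for c, t in held_out)


train_ids, eval_ids = split_by_pair(episodes, held_out_pairs)
print(train_ids, eval_ids)                        # [1, 2, 3] [0]
print(seen_separately(episodes, held_out_pairs))  # True
```

The second check matters for the argument: a failure on a pair whose context or task never appears in training would say little about composition.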

Figures

Figures reproduced from arXiv: 2605.20609 by Dohyeong Kim, Junseok Kim, Mineui Hong, Songhwai Oh.

**Figure 1.** Analogy transduction. Analogy transduction synthesizes new plans by composing task-endogenous analogies into the current context; for instance, an analogy a captures drawer opening from trajectories under diverse contexts, enabling the agent to open the drawer when the window is closed and unlocked, a context that may be absent from the data. Such reuse and recombination of p…

**Figure 2.** Temporal distance geometry and temporal distance difference field. Temporal distance geometry is the quasimetric space $(\mathcal{Z}^{en}, d^*)$ induced by the optimal temporal distance over task-endogenous states, and is invariant to variations in task-exogenous contexts. Here, latent states whose task-endogenous components involve the drawer and the robot arm form a shared geometry across different tasks (e.g., openin…

**Figure 3.** Example of direct OOC case study. We remove all direct drawer-opening trajectories when the drawer and window are closed and both are unlocked. The agent can achieve direct success by extrapolating to this OOC context–task pair, or detour success via an in-distribution sequence: lock the window, open the drawer, then unlock the window.

**Figure 4.** t-SNE visualization of nearest analogies. Each point represents $\alpha^\vee(s, g)$ for a state–goal pair from the scene-play-v0 dataset; we visualize three nearest-neighbor analogy pairs.

**Figure 5.** Environments. (Top row) From left to right: scene, cube-single, cube-double, cube-triple, puzzle-3x3, puzzle-4x4, puzzle-4x5, and puzzle-4x6. (Bottom row) From left to right: ant, antmaze-medium, antmaze-large, antmaze-giant, humanoid, humanoidmaze-medium, humanoidmaze-large, and humanoidmaze-giant.

**Figure 6.** Examples of the removed context–task pairs in scene-play-v0.
◦ Pair 1: context: window closed, window unlocked, drawer closed / task: open drawer
◦ Pair 2: context: drawer closed, drawer locked, window open / task: close window
◦ Pair 3: context: window open, window locked, drawer open, cube not in drawer / task: put the cube into the drawer
The 15 timesteps preceding each task completion event were remove…

**Figure 7.** Examples of the removed context–task pairs in puzzle-4x4-play-v0.
◦ Pair 1: context: button5 = 0, button1 = 1, button9 = 0, button4 = 1, button6 = 0 / task: press button5
◦ Pair 2: context: button2 = 1, button1 = 0, button3 = 1, button6 = 0 / task: press button2
◦ Pair 3: context: button15 = 0, button11 = 1, button14 = 1 / task: press button15
◦ Pair 4: context: button10 = 1, button6 = 0, button14 = 1, but…

**Figure 8.** Qualitative visualization of dual analogies. For each OOC query pair, we visualize the query and its top-10 nearest analogies.

**Figure 9.** Qualitative visualization of dual analogies. For each OOC query pair, we visualize the query and its top-10 nearest analogies.

**Figure 10.** Ablation results on subgoal steps k. Left: step-wise success rate curves. Right: final performance aggregated over the last three evaluation steps.

**Figure 11.** Ablation results on transductive feature dimension b. Left: step-wise success rate curves. Right: final performance aggregated over the last three evaluation steps.

Original abstract

Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: https://rllab-snu.github.io/projects/CTA/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes analogy transduction for offline goal-conditioned RL as composing task-endogenous analogies with contexts to synthesize plans for unseen goals under novel variations. It introduces a latent analogy representation claimed to capture changes under optimal execution, remain invariant to context, and suffice for optimal goal-reaching policies. A transduction mechanism is proposed to handle unseen analogy-context pairs, with empirical evaluation on OGBench manipulation environments showing outperformance over prior non-transduction methods.

Significance. If the invariance and sufficiency properties hold, the work offers a principled alternative to trajectory stitching for compositional generalization in GCRL, potentially enabling more flexible generalist agents from limited offline data. The grounding in theory and focus on unseen pairs address a practical obstacle, and the OGBench results provide initial evidence of effectiveness in manipulation tasks.

major comments (2)

[Abstract and §3] Abstract and §3 (Theory of Analogy Representation): The central claim that the latent analogy z_a 'remains invariant to contextual variations' and 'is sufficient for optimal goal reaching' when composed with new contexts is load-bearing but lacks a direct verification mechanism. The offline objective must be shown to enforce strict separation (e.g., via an explicit invariance loss or mutual information bound) rather than relying on empirical success rates; without this, leakage of context into z_a would invalidate compositional generalization to unseen pairs even if OGBench metrics improve.
[§4] §4 (Transduction Mechanism): The approach for enabling transduction beyond seen analogy-context pairs is presented as solving a practical obstacle, but the manuscript does not specify how the learned latent space guarantees recovery of an optimal policy for g when z_a ⊕ c_new is used. A concrete test (e.g., policy optimality gap or value function comparison on held-out pairs) is needed to confirm sufficiency, as the current formulation risks reducing to standard goal-conditioned fitting.
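
The kind of leakage check requested in the first major comment could be approximated as follows. This is a minimal sketch, assuming the latent analogy has already been discretized into integer codes (for example by clustering, which is not shown) and that context labels are available; the function name `mutual_information` and the sample data are illustrative, and the plug-in estimator used here is known to be biased on small samples.

```python
from collections import Counter
from math import log


def mutual_information(codes, contexts):
    """Plug-in estimate of I(code; context) in nats from paired discrete samples."""
    if len(codes) != len(contexts) or not codes:
        raise ValueError("need two equally long, non-empty sequences")
    n = len(codes)
    joint = Counter(zip(codes, contexts))
    code_counts = Counter(codes)
    context_counts = Counter(contexts)
    # Sum of p(z, c) * log(p(z, c) / (p(z) * p(c))) over observed pairs.
    return sum(
        (count / n) * log(count * n / (code_counts[z] * context_counts[c]))
        for (z, c), count in joint.items()
    )


contexts = ["a", "b", "a", "b"]
invariant_codes = [0, 0, 1, 1]  # code carries no information about context
leaky_codes = [0, 1, 0, 1]      # code is a relabeling of the context

print(mutual_information(invariant_codes, contexts))  # 0.0
print(mutual_information(leaky_codes, contexts))      # log(2), about 0.693
```

A value near zero on held-out data would be consistent with invariance but would not prove it; a clearly positive value would indicate that context is leaking into the code.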

minor comments (2)

[Figure 1] Figure 1 or equivalent diagram: The visualization of analogy-context composition would benefit from explicit notation for the latent variables z_a and c to clarify the transduction step.
[Related Work] Related work section: The distinction from prior trajectory-stitching methods in GCRL could be sharpened with a direct comparison table of generalization mechanisms.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our theoretical claims and empirical validation. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Theory of Analogy Representation): The central claim that the latent analogy z_a 'remains invariant to contextual variations' and 'is sufficient for optimal goal reaching' when composed with new contexts is load-bearing but lacks a direct verification mechanism. The offline objective must be shown to enforce strict separation (e.g., via an explicit invariance loss or mutual information bound) rather than relying on empirical success rates; without this, leakage of context into z_a would invalidate compositional generalization to unseen pairs even if OGBench metrics improve.

Authors: We agree that an explicit verification mechanism strengthens the central claims. Section 3 derives invariance and sufficiency from the properties of optimal policies and value functions under the analogy representation, with the offline objective structured to promote separation via the analogy extraction and composition losses. To address the concern directly, the revised manuscript adds an appendix with quantitative verification: we report mutual information estimates between the learned z_a and context variables on held-out data, showing low dependence consistent with invariance. We also include an ablation that removes the context-separation terms from the objective and demonstrates degraded compositional performance, supporting that the objective enforces the required properties rather than relying solely on downstream success rates.

revision: yes

Referee: [§4] §4 (Transduction Mechanism): The approach for enabling transduction beyond seen analogy-context pairs is presented as solving a practical obstacle, but the manuscript does not specify how the learned latent space guarantees recovery of an optimal policy for g when z_a ⊕ c_new is used. A concrete test (e.g., policy optimality gap or value function comparison on held-out pairs) is needed to confirm sufficiency, as the current formulation risks reducing to standard goal-conditioned fitting.

Authors: The sufficiency of z_a ⊕ c_new for recovering the optimal policy follows from the invariance and sufficiency properties proven in §3, which ensure that the analogy encodes only the task-endogenous changes independent of context. The transduction mechanism in §4 is designed to generalize the composition operator to unseen pairs while preserving these properties. To provide the requested concrete test, the revised manuscript adds experiments on held-out analogy-context pairs in the OGBench environments. We compare the success rates and trajectory quality of the composed policies against a non-transductive goal-conditioned baseline trained directly on the same data, showing consistent improvements that cannot be explained by standard fitting alone. We have also expanded the discussion in §4.2 to explicitly link the composition step to the theoretical guarantees.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on presented theory and empirical validation rather than self-referential reduction

full rationale

The paper develops a theory for analogy representations in offline GCRL within the manuscript itself, defining the latent analogy z_a to capture task-endogenous changes while remaining invariant to context c and sufficient for goal-reaching when composed. This is not a reduction by construction to fitted inputs or prior self-citations; the invariance and sufficiency are posited as properties of the novel representation and then tested via a transduction mechanism on OGBench. No equations equate the claimed generalization directly to the offline objective or rename a fitted parameter as a prediction. The central derivation chain remains self-contained with independent theoretical content and external empirical benchmarks, consistent with a low circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the introduced formalization of analogy transduction and the three stated properties of the novel analogy representation; these are new constructs introduced by the paper rather than derived from prior literature.

axioms (1)

domain assumption: The analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching.
Directly stated in the abstract as the grounding for the proposed representation.

invented entities (1)

latent analogy representation (no independent evidence)
purpose: To enable synthesis of new plans by composing task-endogenous analogies with arbitrary contexts for compositional generalization.
New representation proposed and tailored specifically for analogy transduction in this work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: the temporal distance difference field is a task-endogenous analogy... sufficient for optimal goal-reaching

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 6 internal anchors

[1]

Layer Normalization

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

[2]

Gaussian Error Linear Units (GELUs)

Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415,

[3]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, May

[4]

Efficient Estimation of Word Representations in Vector Space

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, Jan

[5]

Playing Atari with Deep Reinforcement Learning

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, Dec

[6]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, Oct

[7]

Wang, T. and Isola, P. Improved representation of asymmetrical distances with interval quasimetric embeddings. arXiv preprint arXiv:2211.15120, Nov. 2022a.

Wang, T. and Isola, P. On the learning and learnability of quasimetrics. In Proc. of the International Conference on Learning Representations (ICLR), Virtual conference, Apr. 2022b.

Wang, T., Torralba...

[8]

A. Extended Related Work

Compositional generalization in sequential decision making. In sequential decision making, compositional generalization is most commonly studied through trajectory stitching, which synthesizes new trajectories by connecting segments from different demonstrations...

[9]

and sequence-modeling approaches (Janner et al., 2022; Kim et al., 2024; Li et al., 2024; Luo et al., 2025). In this paper, we study analogy transduction as a new axis of compositional generalization, where task-endogenous analogies are transplanted across contexts beyond trajectory stitching.

Metric learning for sequential decision making. Metric learning...

[10]

behaviorally equivalent

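Hence, $d^*(x, g) = D^*_{en}(\rho_{\bar z}(x), z_2)$ and $d^*(x, s) = D^*_{en}(\rho_{\bar z}(x), z_1)$. Therefore, $\alpha(s, g)(x) = D^*_{en}(\rho_{\bar z}(x), z_2) - D^*_{en}(\rho_{\bar z}(x), z_1)$. Now define $\delta : \mathcal{Z}^{en} \times \mathcal{Z}^{en} \to (\mathcal{S} \to \mathbb{R})$ by $\delta(z_1, z_2)(x) := D^*_{en}(\rho_{(z_1, z_2)}(x), z_2) - D^*_{en}(\rho_{(z_1, z_2)}(x), z_1)$. Then, for every $(s, g) \in B_{\bar z}$, $\alpha(s, g)(x) = \delta(z^{en}_{s|g}, z^{en}_{g|s})(x)$ for all relevant $x \in \mathcal{S}$. Hence, $\alpha(s, g) = \delta(z^{en}_{s|g}, z^{en}_{g|s})$, which...

[11]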

for all possible actions. Definition C.1 (Bisimulation Relations (Givan et al., 2003)). Given an MDP $\mathcal{M}$, an equivalence relation $B$ over the state space $\mathcal{S}$ is a bisimulation relation if, for all states $s_i, s_j \in \mathcal{S}$ that are equivalent under $B$ (denoted $s_i \equiv_B s_j$), the following conditions hold: $R(s_i, a) = R(s_j, a) \;\; \forall a \in \mathcal{A}$, $P(G \mid s_i, a) = P(G \mid s_j, a) \;\; \forall a \in \mathcal{A}, \forall G \in \mathcal{S}_B$, (17) where...

[12]

behavioral similarity

is defined with a pseudometric space $(\mathcal{S}, d_{bisim})$ where the distance function $d_{bisim} : \mathcal{S} \times \mathcal{S} \to \mathbb{R}_{\ge 0}$ on $\mathcal{S}$ refers to the "behavioral similarity" between two states. Our work is motivated by the goal-conditioned bisimulation (GCB) metric (Hansen-Estruch et al., 2022): $d^\pi_{bisim}((s_i, g_i), (s_j, g_j)) = |R(s_i, \pi(s_i, g_i), g_i) - R(s_j, \pi(s_j, g_j), g_j)| + \gamma W_1(d^\pi_{bisim})(P$...

[13]

is a BCMP that additionally assumes a product structure on the latent space and a corresponding decoupling of initialization and dynamics. Formally, the latent state space decomposes as $\mathcal{Z} = \mathcal{Z}^{en} \times \mathcal{Z}^{ex}$ with $z = (z^{en}, z^{ex})$, and there exist initial distributions $\mu^{en} \in \Delta(\mathcal{Z}^{en})$, $\mu^{ex} \in \Delta(\mathcal{Z}^{ex})$ and latent transition dynamics $P^{en} : \mathcal{Z}^{en} \times \mathcal{A} \to \Delta(\mathcal{Z}^{en})$, $P^{ex} : \mathcal{Z}^{ex} \to \Delta(\mathcal{Z}^{ex}$...

[14]

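as $\mathcal{Z} = \mathcal{Z}^{en} \times \mathcal{Z}^{ex}$, and accordingly $\bar{\mathcal{Z}} = (\mathcal{Z}^{en} \times \mathcal{Z}^{ex}) \times (\mathcal{Z}^{en} \times \mathcal{Z}^{ex})$. Under this factorization, for each $(s, g) \in \mathrm{supp}(f^e)$ we can uniquely write $z_{s|g} = f^\ell_g(s) = (\nu_g(s), \xi_g(s))$ and $z_{g|s} = f^\ell_s(g) = (\nu_s(g), \xi_s(g))$, where $\nu_g, \nu_s, \xi_g, \xi_s$ are deterministic maps defined on the relevant domains induced by $\mathrm{supp}(f^e)$. For brevity, we define $z^{en}_{s|g} := \nu_g(s)$, $z^{ex}_{s|g} := \xi_g$...

[15]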

on the summed objective $\min_{\phi, \varphi, Q} \mathcal{L}_{analogy}(\phi, \varphi, Q) := \mathcal{L}(\phi, \varphi) + \mathcal{L}(Q)$ (37). After training, the dual analogy is extracted as the displacement in the learned goal embedding space, $\alpha^\vee(s, g) := \varphi(g) - \varphi(s) \in \mathbb{R}^d$ (38), so that for any probe state $x$, $\tilde{V}(x, g) - \tilde{V}(x, s) = \phi(x)^\top \alpha^\vee(s, g)$.

E.2. Details of the CTA

Analogy compression for practical deployment. The dual analogy α...
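
The identity stated after Eq. (38) can be checked numerically with a small sketch. It assumes the approximate value takes the inner-product form V(x, g) = ϕ(x)ᵀφ(g), which is consistent with the identity but is not spelled out in the excerpt; the embedding tables and the names `value` and `dual_analogy` are hypothetical stand-ins for the learned networks.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 3-dimensional embeddings standing in for the learned probe
# network (phi) and goal network (varphi).
probe_embedding = {"x": [1.0, 2.0, -1.0]}
goal_embedding = {"s": [0.5, 0.0, 1.0], "g": [2.0, 1.0, 0.0]}


def value(x, target):
    """Assumed bilinear value: V(x, target) = phi(x)^T varphi(target)."""
    return dot(probe_embedding[x], goal_embedding[target])


def dual_analogy(s, g):
    """Displacement varphi(g) - varphi(s) in goal-embedding space, as in Eq. (38)."""
    return [b - a for a, b in zip(goal_embedding[s], goal_embedding[g])]


alpha = dual_analogy("s", "g")
lhs = value("x", "g") - value("x", "s")  # V(x, g) - V(x, s)
rhs = dot(probe_embedding["x"], alpha)   # phi(x)^T alpha(s, g)
print(alpha)     # [1.5, 1.0, -1.0]
print(lhs, rhs)  # 4.5 4.5
```

Under this assumed form the identity is just linearity of the inner product in its second argument, so the displacement vector carries everything needed to recover the value difference at any probe state.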

[16]

on the summed objective, $\min_{\Omega_1, \Omega_2, \omega_{h1}, \omega_{h2}, \omega_{\ell 1}, \omega_{\ell 2}, \eta} \mathcal{L}_{CTA} := \mathcal{L}(\Omega_1, \Omega_2, \eta) - \mathcal{L}(\omega_{h1}, \omega_{h2}) - \mathcal{L}(\omega_{\ell 1}, \omega_{\ell 2})$, (45) where the negative signs reflect that the actor objectives are maximized and $\eta$ is updated only through the value objective $\mathcal{L}(\Omega_1, \Omega_2, \eta)$.

Bilinear architecture of the value and policy functions. Applying bilinear transduction requires departing from a m...

[17]

Lights Out

benchmark manipulation suite, which consists of the following three environments: cube, scene, and puzzle. These tasks are built on MuJoCo with a 6-DoF UR5e robot arm, and are explicitly designed to probe object manipulation, sequential (long-horizon) reasoning, and combinatorial generalization—making them a natural testbed for compositional generalizatio...

[18]

and refer readers to Park et al. (2026) for details. For Table 1 and Table 6, whenever results for a given environment are reported in OGBench (Park et al.,

[19]

or the dual goal representation paper (Park et al., 2026), we use those reported numbers; all remaining results are obtained from our own experiments. In particular, we implement GCIQL∨ in the same manner by replacing the TD update with the IQL update while keeping the representation module identical to GCIVL∨ in the original implementation of Park et al....

[20]

(see Table 7). In particular, Park et al. (2026) argue that representation-conditioned formulations cannot directly exploit early fusion of visual state and goal, since the goal must be processed separately before conditioning the policy. This architectural constraint effectively enforces a late-fusion design, which is often weaker than early fusion in vi...

[21]

Layer normalization (Ba et al., 2016): True
Discount factor γ: 0.99
Target network update rate τ: 0.005
Dual representation expectile ι: 0.7
IQL expectile κ: 0.7
Low-level AWR temperature β_ℓ: 3.0
High-level AWR temperature β_h: 3.0
Subgoal steps k: 10 (scene), 30 (cube), 20 (puzzle), 25 (maze)
Analogy projection η representation dimension: 32
Dual representation dimension d: 256
Visua...

[22]

Layer normalization (Ba et al., 2016): True
Discount factor γ: 0.99
Target network update rate τ: 0.005
Dual representation expectile ι: 0.7
IQL expectile κ: 0.7
Low-level AWR temperature β_ℓ: 3.0
High-level AWR temperature β_h: 3.0
Subgoal steps k: 10 (scene), 30 (cube), 20 (puzzle), 25 (maze)
Goal representation η dimension: 32
Dual representation dimension d: 256
Visual encoder: impal...
