Reward-Conditioned Reinforcement Learning

Marek Cygan; Michal Nauman; Pieter Abbeel

arxiv: 2603.05066 · v3 · pith:2YNZJQXKnew · submitted 2026-03-05 · 💻 cs.LG

Pith reviewed 2026-05-21 11:42 UTC · model grok-4.3

keywords: reinforcement learning · reward conditioning · off-policy learning · multi-objective RL · sample efficiency · zero-shot adaptation · counterfactual rewards

The pith

RCRL trains one off-policy agent under a nominal reward while conditioning the policy on multiple reward parameterizations and recomputing counterfactual rewards from shared replay data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-task reinforcement learning typically fixes one reward function, which restricts robustness when rewards are misspecified and slows adaptation to new preferences. RCRL collects all experience under one nominal objective yet conditions the policy on different reward parameterizations, allowing it to learn from multiple objectives by recalculating rewards directly from the existing replay buffer. This removes the need for separate data collection per objective and links single-task training to multi-objective methods. The result is higher sample efficiency on the original task plus fast adaptation and zero-shot behavior changes when the conditioning signal is adjusted at test time.

Core claim

Reward-Conditioned Reinforcement Learning conditions agents on reward parameterizations while collecting experience under a single nominal objective; by recomputing counterfactual rewards from shared replay data, the method exposes the agent to multiple reward objectives without additional environment interaction.

What carries the argument

Reward-conditioned policy trained off-policy with counterfactual rewards recomputed from transitions stored under the nominal objective.
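The relabeling step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the reward is a weighted sum of known terms computed from (state, action, next state), and the names `reward_terms`, `reward`, and `relabel_batch` are hypothetical.

```python
def reward_terms(state, action, next_state):
    """Hypothetical reward components: progress toward the origin and action cost."""
    progress = abs(state) - abs(next_state)
    action_cost = -(action ** 2)
    return (progress, action_cost)


def reward(psi, state, action, next_state):
    """Reward under parameterization psi, assumed here to be linear in the terms."""
    terms = reward_terms(state, action, next_state)
    return sum(w * c for w, c in zip(psi, terms))


def relabel_batch(batch, psis):
    """Pair each stored transition with a parameterization and recompute its reward.

    The transitions were collected under the nominal reward; only the reward
    label and the conditioning input psi change, so no new environment steps
    are needed.
    """
    relabeled = []
    for (s, a, s_next), psi in zip(batch, psis):
        relabeled.append((s, psi, a, reward(psi, s, a, s_next), s_next))
    return relabeled


# Replay buffer of (state, action, next_state) tuples from the nominal run.
replay = [(2.0, -1.0, 1.0), (1.0, -0.5, 0.5)]
psi_nominal = (1.0, 0.1)  # illustrative weights on (progress, action cost)
psi_alt = (1.0, 0.4)      # a counterfactual parameterization
relabeled = relabel_batch(replay, [psi_nominal, psi_alt])
```

Each relabeled tuple carries its own `psi`, so a conditioned critic and actor can train on a mix of parameterizations drawn from the same stored transitions.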

If this is right

Sample efficiency improves under the original nominal reward parameterization.
Adaptation to new reward parameterizations is efficient, because the agent was already exposed to other parameterizations during training without additional environment interaction.
Zero-shot behavioral adjustment is possible at deployment simply by changing the reward conditioning input.
The approach applies across single-task, multi-task, and vision-based benchmarks without altering the core single-task training loop.
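Zero-shot adjustment amounts to calling the same policy with a different conditioning input. The sketch below uses a hand-derived toy policy in place of a trained network: it assumes a hypothetical one-step problem with s' = s + a and reward -ψ₀·s'² - ψ₁·a², for which the optimal action is known in closed form. None of this comes from the paper; it only shows how behavior can shift with ψ while the policy itself stays fixed.

```python
def conditioned_policy(state, psi):
    """Toy stand-in for a trained reward-conditioned actor.

    Assumes a hypothetical one-step problem with s' = s + a and reward
    r_psi = -psi[0] * s'**2 - psi[1] * a**2. Setting the derivative with
    respect to a to zero gives a = -psi[0] * s / (psi[0] + psi[1]).
    """
    w_track, w_effort = psi
    return -w_track * state / (w_track + w_effort)


state = 2.0
balanced = conditioned_policy(state, (1.0, 1.0))  # equal weight on tracking and effort
cautious = conditioned_policy(state, (1.0, 3.0))  # effort penalized more heavily
```

With the heavier effort penalty the policy takes a smaller corrective action from the same state, and no parameters were updated between the two calls.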

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

In deployed systems, changing user preferences could be accommodated by swapping the conditioning signal rather than retraining.
The same replay buffer could support conditioning on variables other than reward, such as task goals or risk levels.
Robustness to reward misspecification may increase because the policy has already experienced a range of parameterizations during training.

Load-bearing premise

Counterfactual rewards recomputed from nominal transitions remain sufficiently unbiased and informative for the conditioned policy to learn useful behavior under other reward parameterizations.

What would settle it

An experiment in which adapting the conditioned policy to new reward parameterizations requires at least as much new environment interaction as training separate agents from scratch to reach comparable performance.

Original abstract

Single-task RL agents are typically trained under a fixed reward function, which limits their robustness to reward misspecification and their ability to adapt to changing preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), an off-policy method that conditions agents on reward parameterizations while collecting experience under a single nominal objective. By recomputing counterfactual rewards from shared replay data, RCRL exposes the agent to multiple reward objectives without additional environment interaction, connecting single-task RL with ideas from multi-objective and multi-task learning. Across single-task, multi-task, and vision-based benchmarks, RCRL improves sample efficiency under the nominal reward parameterization, enables efficient adaptation to new parameterizations, and supports zero-shot behavioral adjustment at deployment. Our results show that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

RCRL adds reward-parameter conditioning to an off-policy actor-critic and reuses one replay buffer via counterfactual relabeling, which is a clean practical move, but the coverage of nominal trajectories for divergent rewards is the real question mark.

The main point to take away is that RCRL conditions an off-policy agent on reward parameters and relabels a shared replay buffer to expose it to multiple objectives without extra data collection. This is meant to improve efficiency and allow post-training adjustments.

What the paper does well is keep things simple by sticking to single-task data collection while adding the conditioning mechanism. The off-policy actor-critic loop with counterfactual rewards is a direct way to link single-task RL to multi-objective flexibility. They report gains on several benchmark families, including vision-based ones, which suggests the method scales at least to some degree. The citation pattern looks reasonable, focusing on relevant prior work in reward-conditioned and multi-task RL without obvious omissions in the abstract.

The soft spots center on the data coverage issue. All experience is gathered under one nominal reward, so for reward parameterizations that require very different policies, the replay data may not include the necessary state-action pairs. Learning for those cases could then rely on extrapolation, potentially leading to biased estimates or ineffective adaptation. The weakest assumption is that the nominal trajectories provide enough signal for divergent rewards, and without detailed ablations on how far the parameterizations can diverge, it's unclear how general the zero-shot adjustment really is.

This paper is for RL folks working on agents that need to handle uncertain or changing rewards, as in real-world settings where objectives aren't fixed. A reader looking for practical ways to add steerability to standard RL pipelines would get some value from the method and the benchmark results. It deserves a serious referee because the core technique is well-defined and the empirical claims are specific enough to evaluate. I'd recommend putting it through peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Reward-Conditioned Reinforcement Learning (RCRL), an off-policy algorithm that trains a single policy conditioned on reward parameterizations. Experience is collected exclusively under one nominal reward function; counterfactual rewards for alternative parameterizations are then recomputed from the same replay buffer. The central claims are that this yields improved sample efficiency on the nominal task, enables efficient adaptation to new reward parameterizations, and supports zero-shot behavioral adjustment at deployment, demonstrated across single-task, multi-task, and vision-based benchmarks.

Significance. If the empirical results are robust, RCRL offers a practical bridge between single-task RL and multi-objective/multi-task settings by avoiding additional environment interaction for each new reward parameterization. The approach is explicitly defined by a training procedure and evaluated on standard external benchmarks rather than relying on circular or fitted quantities. The main strength lies in the reported gains in sample efficiency and adaptation without sacrificing single-task training simplicity; however, the significance is tempered by the need to confirm that relabeling does not introduce coverage-induced bias on divergent objectives.

major comments (2)

§4.3 (zero-shot adjustment experiments): the reported success on opposing navigation goals and conflicting multi-objective weights does not include quantitative coverage diagnostics (e.g., state-action visitation overlap or effective support size between nominal and target policies). Without these metrics or an ablation that deliberately increases divergence, it remains unclear whether the observed adaptation stems from genuine extrapolation or from test cases where nominal trajectories already overlap substantially with high-value regions under the new parameterization.
§3.2 (off-policy update with relabeled rewards): the method relies on standard off-policy correction (e.g., importance sampling or clipped ratios) when training on counterfactual rewards, yet no analysis is provided of how the relabeling affects the effective behavior policy or the magnitude of distribution shift. If the nominal policy induces a narrow state distribution, the conditioned value estimates for distant parameterizations may suffer from extrapolation error that is not captured by the current benchmark results.
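The coverage diagnostics requested in these comments could take several forms. Two common choices are sketched here as assumptions, since neither the paper nor the report specifies them: a histogram overlap coefficient over discretized state-action bins, and the Kish effective sample size of a set of importance weights. The function names and toy data are hypothetical.

```python
from collections import Counter


def visitation_overlap(nominal, target):
    """Overlap coefficient sum_x min(p(x), q(x)) over discretized state-action bins.

    Returns 1.0 for identical empirical distributions and 0.0 for disjoint ones.
    """
    p, q = Counter(nominal), Counter(target)
    n_p, n_q = len(nominal), len(target)
    shared = p.keys() & q.keys()
    return sum(min(p[x] / n_p, q[x] / n_q) for x in shared)


def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Toy visitation samples as (state_bin, action_bin) pairs.
nominal = [("s0", "a0"), ("s0", "a0"), ("s1", "a0"), ("s1", "a1")]
target = [("s1", "a1"), ("s1", "a1"), ("s2", "a0"), ("s2", "a1")]

overlap = visitation_overlap(nominal, target)
ess_uniform = effective_sample_size([1.0, 1.0, 1.0, 1.0])
ess_peaked = effective_sample_size([4.0, 0.0, 0.0, 0.0])
```

A low overlap, or an effective sample size far below the batch size, would indicate that value estimates for the target parameterization rest on few nominal transitions.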

minor comments (2)

Figure 3 and Table 2: axis labels and legend entries for the different reward parameterizations are not fully consistent with the notation introduced in §2.1; this makes it harder to map the plotted curves to the exact parameter values used in the adaptation experiments.
§5 (related work): the discussion of connections to multi-task RL and reward shaping could usefully cite the specific prior work on reward relabeling in offline RL (the relevant citations currently appear only in passing).

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, indicating where revisions will be made to strengthen the empirical support for RCRL's claims.

Referee: §4.3 (zero-shot adjustment experiments): the reported success on opposing navigation goals and conflicting multi-objective weights does not include quantitative coverage diagnostics (e.g., state-action visitation overlap or effective support size between nominal and target policies). Without these metrics or an ablation that deliberately increases divergence, it remains unclear whether the observed adaptation stems from genuine extrapolation or from test cases where nominal trajectories already overlap substantially with high-value regions under the new parameterization.

Authors: We agree that quantitative coverage diagnostics would strengthen the evidence for extrapolation in the zero-shot adjustment experiments. In the revised manuscript we will add state-action visitation overlap metrics and effective support size comparisons between nominal and target policies. We will also include an ablation that deliberately increases divergence between the nominal and target reward parameterizations to isolate whether adaptation arises from genuine extrapolation.
revision: yes

Referee: §3.2 (off-policy update with relabeled rewards): the method relies on standard off-policy correction (e.g., importance sampling or clipped ratios) when training on counterfactual rewards, yet no analysis is provided of how the relabeling affects the effective behavior policy or the magnitude of distribution shift. If the nominal policy induces a narrow state distribution, the conditioned value estimates for distant parameterizations may suffer from extrapolation error that is not captured by the current benchmark results.

Authors: We acknowledge that an explicit analysis of how reward relabeling affects the effective behavior policy and the magnitude of distribution shift is currently absent. In the revision we will add a discussion of the effective behavior policy under relabeling together with quantitative measurements of distribution shift across the reported benchmarks. While the vision-based results already provide indirect evidence of robustness, we will make this analysis direct and include it in §3.2.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via explicit procedure and external benchmarks

The paper introduces RCRL as an off-policy algorithm that collects trajectories under one nominal reward and relabels them with counterfactual rewards for other parameterizations. This is presented as a direct training procedure rather than a derived claim that reduces to its own fitted outputs. Central results are empirical improvements on single-task, multi-task, and vision benchmarks, which are external to the method definition. No equation or step is shown to be equivalent to its inputs by construction, and no load-bearing premise rests solely on a self-citation chain that itself lacks independent verification. The method remains falsifiable through standard RL evaluation protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard RL assumptions plus the domain assumption that reward functions can be parameterized in a way that allows accurate counterfactual evaluation from nominal trajectories. No new free parameters or invented entities are introduced beyond ordinary RL hyperparameters.

axioms (1)

Domain assumption: Reward functions admit a parameterization such that counterfactual rewards can be recomputed exactly from state-action-next-state tuples collected under a different parameterization.
Invoked when the method relabels replay data with new reward values without additional interaction.

pith-pipeline@v0.9.0 · 5669 in / 1301 out tokens · 33191 ms · 2026-05-21T11:42:12.951226+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: RCRL conditions the agent on reward parameterizations ψ∈Ψ and learns multiple reward objectives from a shared replay data entirely off-policy... ψ=ψ⋆⊙Δ with Δ sampled from stratified log-uniform [0.25,4.0]

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Both the actor and critic are conditioned on this parameterization... z=[s,ψ]
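The two quoted passages describe how parameterizations are sampled and fed to the networks. The sketch below is one possible reading, with loud assumptions: the passage does not say how the stratification is done, so the code uses equal-width strata in log space with one draw per stratum, and it treats each component of Δ independently. The helper names are hypothetical.

```python
import math
import random


def stratified_log_uniform(n, low=0.25, high=4.0, rng=random):
    """Draw n values, one from each of n equal-width strata in log space.

    The stratification scheme is an assumption; the source only says
    "stratified log-uniform [0.25, 4.0]".
    """
    log_low, log_high = math.log(low), math.log(high)
    width = (log_high - log_low) / n
    draws = [math.exp(log_low + (i + rng.random()) * width) for i in range(n)]
    rng.shuffle(draws)  # decouple stratum index from batch position
    return draws


def sample_psis(psi_star, n, rng=random):
    """Build n parameterizations psi = psi_star * delta (elementwise)."""
    columns = [stratified_log_uniform(n, rng=rng) for _ in psi_star]
    return [tuple(p * col[k] for p, col in zip(psi_star, columns)) for k in range(n)]


def conditioned_input(state, psi):
    """Concatenate state and parameterization into the network input z = [s, psi]."""
    return list(state) + list(psi)


rng = random.Random(0)
psis = sample_psis((1.0, 0.1), n=8, rng=rng)  # psi_star values are illustrative
z = conditioned_input([0.3, -0.2], psis[0])
```

Sampling in log space makes a scale-down by a factor of 4 as likely as a scale-up by the same factor, which matches the symmetric range [0.25, 4.0] around the nominal ψ⋆.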

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.