pith. machine review for the scientific record.

arxiv: 2605.01862 · v2 · submitted 2026-05-03 · 💻 cs.LG

Recognition: unknown

QHyer: Q-conditioned Hybrid Attention-Mamba Transformer for Offline Goal-conditioned RL

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords offline goal-conditioned reinforcement learning · hybrid attention-mamba · Q-value estimator · history-dependent data · Markovian and non-Markovian trajectories · behavior stitching · sparse rewards · sequence models for RL

The pith

QHyer replaces return-to-go conditioning with a flow-parameterized goal-reaching Q-estimator and adds a gated hybrid attention-mamba backbone to learn goal-reaching policies from static datasets that mix Markovian and non-Markovian history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve offline goal-conditioned reinforcement learning when real-world datasets contain partially observable, history-dependent trajectories that mix short-term Markovian dynamics with longer non-Markovian dependencies. Pure attention models struggle with efficiency and local structure, while existing hybrids use fixed windows that fail to adapt to varying dependency lengths; return-to-go signals also lose discriminative power under sparse rewards. QHyer addresses both issues by conditioning on a state-aware Q-estimator that supports stitching sub-trajectories across demonstrations and by using a gated hybrid backbone that compresses history content-adaptively while keeping local Markovian information intact. A sympathetic reader would care because this combination promises to turn more realistic static datasets into working goal-reaching policies without requiring new online interaction or dense rewards.

Core claim

QHyer replaces return-to-go with a flow-parameterized, state-conditioned goal-reaching Q-estimator that supplies discriminative guidance for stitching goal-reaching behaviors from diverse demonstrations under sparse rewards, and it introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics, allowing the model to handle temporally heterogeneous data more effectively than prior pure-attention or fixed-window hybrids.

What carries the argument

The gated hybrid attention-mamba backbone paired with a flow-parameterized, state-conditioned goal-reaching Q-estimator, which together enable adaptive compression of mixed history and effective behavior stitching.
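
To make the load-bearing mechanism concrete, here is a minimal sketch (not the authors' released code) of how a gated hybrid block can mix an attention path with a Mamba-style SSM path through a learned, per-token gate. The `attn` and `ssm` submodules, the layer sizes, and the residual/MLP layout are assumptions; the paper's exact block design is only shown schematically in Figure 4.

```python
# Minimal sketch of a gated hybrid attention-SSM block (our interpretation, not the paper's code).
import torch
import torch.nn as nn

class GatedHybridBlock(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module, ssm: nn.Module):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.ssm_norm = nn.LayerNorm(d_model)
        self.attn = attn                      # long-range path, e.g. causal self-attention (assumed)
        self.ssm = ssm                        # local/recurrent path, e.g. a Mamba block (assumed)
        self.gate = nn.Linear(d_model, 2)     # per-token weights over the two paths
        self.mlp = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); both sub-paths must map (B, T, D) -> (B, T, D)
        a = self.attn(self.attn_norm(x))            # long-range context
        m = self.ssm(self.ssm_norm(x))              # compressed local history
        w = torch.softmax(self.gate(x), dim=-1)     # (B, T, 2): content-adaptive mixing weights
        x = x + w[..., 0:1] * a + w[..., 1:2] * m   # residual, gated fusion of the two paths
        return x + self.mlp(x)
```

Because the gate is computed from the token content itself, the effective memory at each position can lean on the SSM path for short Markovian stretches and on attention for long-range dependencies, which is the behavior Figure 3 visualizes.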

If this is right

  • Goal-reaching policies become learnable from static datasets that combine Markovian local dynamics with longer non-Markovian dependencies.
  • Behavior stitching across demonstrations works under sparse rewards because the Q-estimator remains discriminative where return-to-go does not.
  • History compression adapts its effective memory length to the data rather than using a fixed window that truncates context.
  • The same architecture delivers improved results on both non-Markovian and Markovian datasets without separate handling for each case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The Q-estimator approach may extend to other offline RL problems where return-to-go signals become uninformative due to sparse or delayed rewards.
  • Content-adaptive compression could apply to sequence modeling tasks outside goal-conditioned RL whenever local and global temporal structure coexist.
  • If the Q-estimator proves robust, it reduces reliance on reward engineering when curating offline datasets for goal-reaching tasks.

Load-bearing premise

The flow-parameterized Q-estimator must accurately estimate goal-reaching values from static data even when rewards are sparse, and the hybrid backbone must compress history in a content-adaptive way without losing essential local Markovian structure.
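
One plausible reading of the flow-parameterized estimator, consistent with Figures 8, 16, and 17, is sketched below: a conditional normalizing flow models the density of reachable future/goal states given the (encoded) state and action, and the goal-reaching Q-value is read off as the goal's log-density under that flow. The `ConditionalFlow`-style interface (`log_prob(x, context)`) and the exact mapping from density to Q are assumptions, not the paper's verified construction.

```python
# Hedged sketch: flow-parameterized goal-reaching Q-estimate as a conditional log-density.
import torch
import torch.nn as nn

class FlowGoalQ(nn.Module):
    def __init__(self, flow: nn.Module):
        super().__init__()
        # `flow` is assumed to expose log_prob(x, context) -> (batch,) log-density;
        # any conditional normalizing-flow implementation with that interface would fit.
        self.flow = flow

    def forward(self, state: torch.Tensor, action: torch.Tensor,
                goal: torch.Tensor) -> torch.Tensor:
        context = torch.cat([state, action], dim=-1)
        # Interpreting the goal-reaching Q-value as the (log-)likelihood that the goal
        # lies in the discounted future-state distribution induced by (state, action).
        return self.flow.log_prob(goal, context)
```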

What would settle it

Training QHyer on the same non-Markovian and Markovian benchmark datasets and finding that its success rate or normalized score does not exceed that of the Decision Transformer or LSDT baselines would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.01862 by Donglin Wang, Jincheng Wang, Xing Lei, Xuetao Zhang.

Figure 1
Figure 1: Overview of the QHyer architecture: (1) an NFs Q-value estimator that replaces RTG conditioning, (2) a Hybrid Attention-Mamba Block, (3) concatenated state-goal tokenization for effective goal information propagation, and (4) reinforced learning with expectile regression.
Figure 2
Figure 2: RTG vs. NFs-based Q-value conditioning on the D4RL non-Markovian AntMaze-medium dataset. (a) Trajectories: successful (blue) and failed (purple). (b) RTG conditioning: successful trajectories show a color gradient while failed trajectories are uniformly gray (no signal), yielding poor coverage. (c) NFs-based Q_β conditioning: all trajectories colored by Q-value, achieving significantly better coverage. (d) High-…
Figure 3
Figure 3: Content-adaptive Δt on cube-single. Left: Δt distribution. Center: mean Δt across sequence positions. Right: learned attention/Mamba gate weights. Statistics over 50 batches of batch size 256.
Figure 4
Figure 4: Hybrid Attention-Mamba Block, enabling complementary specialization across both play and noisy datasets.
Figure 5
Figure 5: Ablation study on various Q-value estimators and the impact of not estimating the Q-value on QHyer. Q: How does the Q-value estimator affect performance?
Figure 6
Figure 6: Ablation study on SSM variants for temporal modeling. '-u', '-m', and '-d' denote umaze, medium, and diverse, respectively. Q: Does the architecture alone improve performance?
Figure 7
Figure 7: Ablation study on the expectile parameter τ for Q-value prediction. Q: How should the expectile parameter τ be selected?
Figure 8
Figure 8: Schematic of the estimation pipeline: a normalizing flow estimates the goal-reaching probability/Q-function from the offline dataset, after which the maximum in-distribution Q-value is estimated to obtain the optimal action; the complete algorithm is summarized in Algorithm 1.
Figure 9
Figure 9: GCRL example non-Markovian datasets from OGBench. Each trajectory is limited to travel at most 4 blocks for dataset type stitch, while at inference the distance between start and goal can be up to 30 in the Giant maze.
Figure 10
Figure 10: GCRL example non-Markovian datasets from D4RL (Fu et al., 2020): the AntMaze-v2 datasets involve controlling an 8-DoF quadruped to navigate toward a specified goal state. This benchmark requires value propagation to effectively stitch together sub-optimal trajectories from the collected data.
Figure 11
Figure 11: State-goal tokenization strategy. (A) Offline data represented as a graph, where nodes denote states and edges represent transitions. Different trajectories (orange/yellow) may share common states but target different goals. (B) State-goal concatenation: each state s is concatenated with goal g to form a unified token [s; g], enabling the model to directly attend to goal-relevant state features…
Figure 12
Figure 12: Ablation study on state-goal tokenization strategies. Panel B contrasts DT's standard tokenization with QHyer's approach. In vanilla DT, states and goals may be processed separately or with weak coupling; QHyer instead concatenates [s; g] at each timestep, ensuring that goal information is directly available when computing attention over state features. This design maintains the sequence length at 3T…
Figure 13
Figure 13: Ablation study on regression functions for maximum Q-value learning. These results validate the design choice and explain why QHyer achieves effective trajectory stitching: the concatenated state-goal representation provides the necessary goal-aware context for identifying and combining high-value segments from different trajectories.
Figure 14
Figure 14: D4RL AntMaze-medium environment with trajectories from different behavioral policies, used for a qualitative comparison of trajectory stitching.
Figure 15
Figure 15: Qualitative comparison of trajectory stitching capabilities on the D4RL AntMaze-Medium task. Different colors represent trajectory segments from different data-collection policies in the offline dataset. Black segments indicate OOD states where the agent passes through walls. (a) DT fails to reach the goal due to ineffective RTG conditioning. (b) LSDT moves correctly but stops early. (c) IQL successfully reac…
Figure 16
Figure 16: 5 × 5 GridWorld environment, used to compute the true discounted future state distribution and to evaluate the estimation error of CVAE (Sohn et al., 2015), C-learning (Eysenbach et al., 2020), and CRL (Eysenbach et al., 2022) against the true future state density.
Figure 17
Figure 17: Effectiveness of density estimation using NFs. Left: CVAE, C-learning, CRL, and NFs are evaluated for predicting the future state distribution in the on-policy setting; NFs show the lowest estimation error among all methods, while CVAE exhibits the poorest estimation accuracy.
Figure 18
Figure 18: Visualization comparing predicted maximum Q-values from expectile regression with different τ values against ground-truth maximum Q-values (Q*) in the GridWorld environment. As τ increases from 0.5 to 0.99, the predictions converge toward the diagonal line (perfect prediction), validating Theorem 3.1.
Figure 19
Figure 19: Curves of R² and MAE as a function of the expectile parameter τ. R² increases monotonically from 0.781 (τ = 0.5) to 0.995 (τ = 0.99), while MAE decreases correspondingly. This empirical trend confirms that expectile regression with high τ effectively approximates the in-distribution optimal Q-value Q*, consistent with the theoretical bound in Theorem 3.1.
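
Figures 7, 18, and 19 describe selecting the expectile parameter τ and show that high-τ expectile regression approximates the in-distribution maximum Q-value (the property the paper attributes to Theorem 3.1). A minimal sketch of the standard expectile loss this presumably refers to is given below; the function and variable names and the default τ are ours, not the paper's.

```python
# Standard asymmetric (expectile) squared loss: under-estimates are weighted by tau,
# over-estimates by (1 - tau), so as tau -> 1 the regressed value approaches the
# in-distribution maximum of the target Q-values.
import torch

def expectile_loss(pred: torch.Tensor, target: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    diff = target - pred                               # positive when the prediction is too low
    weight = torch.abs(tau - (diff < 0).float())       # tau if diff >= 0, (1 - tau) otherwise
    return (weight * diff.pow(2)).mean()
```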
read the original abstract

Offline goal-conditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian structure that violates standard RL assumptions. History-aware sequence models such as the Decision Transformer (DT) are a natural fit for long-term dependency modeling, yet pure attention is inefficient and brittle when handling local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local dependency modeling, the fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, often truncating long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance signal for stitching goal-reaching behaviors from diverse demonstrations. To address these issues, we propose QHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that QHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness for diverse scenarios.
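
As a hedged illustration of the conditioning scheme the abstract describes (our reading of Figures 11 and 12, not the released code), the sketch below assembles a DT-style input where the RTG token is replaced by the estimated Q-value and each state token is the concatenation [s; g], giving a length-3T sequence. All module names and tensor dimensions are assumed.

```python
# Sketch of Q-conditioned, state-goal-concatenated tokenization (assumed layout: Q, [s; g], a per step).
import torch
import torch.nn as nn

class QConditionedTokenizer(nn.Module):
    """Embeds (Q_t, [s_t; g], a_t) triples and interleaves them into a length-3T sequence."""

    def __init__(self, d_state: int, d_goal: int, d_action: int, d_model: int):
        super().__init__()
        self.embed_q = nn.Linear(1, d_model)
        self.embed_sg = nn.Linear(d_state + d_goal, d_model)
        self.embed_a = nn.Linear(d_action, d_model)

    def forward(self, states, goals, actions, q_values):
        # states: (B, T, d_state); goals: (B, T, d_goal); actions: (B, T, d_action); q_values: (B, T, 1)
        sg = torch.cat([states, goals], dim=-1)                       # [s; g] per timestep
        toks = torch.stack(
            [self.embed_q(q_values), self.embed_sg(sg), self.embed_a(actions)], dim=2
        )                                                             # (B, T, 3, d_model)
        B, T, _, D = toks.shape
        return toks.reshape(B, 3 * T, D)                              # (Q_1, [s_1; g], a_1, Q_2, ...)
```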

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces QHyer for offline goal-conditioned RL on datasets that mix Markovian and non-Markovian structure. It replaces return-to-go conditioning with a flow-parameterized, state-conditioned goal-reaching Q-estimator intended to enable stitching of goal-reaching behaviors from diverse demonstrations under sparse rewards, and proposes a gated hybrid attention-mamba backbone that performs content-adaptive history compression while preserving local dynamics. The central empirical claim is state-of-the-art performance on both non-Markovian and Markovian datasets.

Significance. If the results and ablations hold, the work would provide a concrete advance in offline GCRL by addressing the non-discriminative nature of RTG under sparsity and the fixed-window limitations of prior hybrid sequence models. The combination of flow-based Q-estimation with adaptive compression could improve sample efficiency in temporally heterogeneous settings.

major comments (1)
  1. [§3] The Q-estimator is specified as state-conditioned only. In explicitly history-dependent non-Markovian regimes this creates a potential mismatch, because the stitching signal must discriminate among sub-trajectories that share the same current state but differ in relevant history; it is not shown that the estimator receives the same compressed history representation produced by the gated hybrid backbone. If this assumption does not hold, the claimed improvement on non-Markovian datasets cannot be attributed to the proposed RTG replacement.
minor comments (2)
  1. [Abstract] The abstract asserts SOTA performance without naming the datasets, baselines, or metrics; the experimental section should make these explicit and include ablations isolating the Q-estimator versus the backbone.
  2. [§3] Notation for the flow-parameterized Q-estimator and the gating mechanism would benefit from a single consolidated equation block early in §3 to aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comment raises an important point about the integration between the Q-estimator and the hybrid backbone in non-Markovian regimes. We address this below and outline the changes we will make.

read point-by-point responses
  1. Referee: [§3] The Q-estimator is specified as state-conditioned only. In explicitly history-dependent non-Markovian regimes this creates a potential mismatch, because the stitching signal must discriminate among sub-trajectories that share the same current state but differ in relevant history; it is not shown that the estimator receives the same compressed history representation produced by the gated hybrid backbone. If this assumption does not hold, the claimed improvement on non-Markovian datasets cannot be attributed to the proposed RTG replacement.

    Authors: We appreciate this observation. In the QHyer architecture the gated hybrid attention-mamba backbone first processes the full observation sequence and produces a content-adaptive compressed representation. The flow-parameterized Q-estimator is then conditioned on this backbone output (rather than on the raw current state alone), so that the resulting Q-values incorporate the relevant history compression needed for stitching in non-Markovian settings. While the manuscript describes the estimator as “state-conditioned” for conciseness, the conditioning is performed on the backbone-encoded state. We acknowledge that this data-flow was not stated with sufficient explicitness in §3. In the revised manuscript we will add a clarifying paragraph, update the architecture diagram, and include a short ablation confirming that removing the backbone input to the Q-estimator degrades performance on the non-Markovian datasets. revision: yes
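
If the rebuttal's description is accurate, the intended data flow can be summarized by the short sketch below (our paraphrase of the stated design, with hypothetical names): the hybrid backbone encodes the trajectory first, and the flow-based Q-estimator conditions on the backbone's history-aware representation rather than on the raw current observation.

```python
# Sketch of the data flow described in the rebuttal: Q is conditioned on the backbone-encoded state.
import torch
import torch.nn as nn

def q_from_backbone(backbone: nn.Module, q_estimator: nn.Module,
                    trajectory_tokens: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    h = backbone(trajectory_tokens)     # (B, L, d_model): content-adaptive history compression
    h_t = h[:, -1]                      # history-aware representation of the current state
    return q_estimator(h_t, goal)       # Q conditioned on the encoded state, not the raw one
```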

Circularity Check

0 steps flagged

No circularity: method proposal relies on external experiments without self-referential derivations

full rationale

The provided manuscript text contains no equations, derivations, or mathematical reductions that equate any claimed result or prediction to its own inputs by construction. The core proposal replaces RTG with a flow-parameterized Q-estimator and introduces a gated hybrid backbone, but these are presented as architectural choices validated by experiments on Markovian and non-Markovian datasets rather than derived from self-citations or fitted parameters renamed as predictions. No load-bearing self-citation chains, uniqueness theorems from the same authors, or ansatzes smuggled via prior work are invoked to force the central claims. The SOTA performance assertions rest on empirical benchmarks external to any internal fitting loop, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the effectiveness of the new Q-estimator for stitching and the adaptive compression property of the hybrid backbone; beyond the entries below, no explicit free parameters, axioms, or invented entities are quantified.

free parameters (1)
  • flow parameters
    Used to parameterize the state-conditioned goal-reaching Q-estimator as described in the abstract.
axioms (1)
  • domain assumption: Real-world offline datasets exhibit a mix of Markovian and non-Markovian structure that violates standard RL assumptions.
    Stated as motivation for the work in the abstract.
invented entities (1)
  • gated Hybrid Attention-Mamba backbone (no independent evidence)
    purpose: Performs content-adaptive history compression while preserving local dynamics
    New architectural component introduced to address fixed-window limitations of prior hybrids.

pith-pipeline@v0.9.0 · 5555 in / 1350 out tokens · 48038 ms · 2026-05-11T01:13:51.828022+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 18 internal anchors

  1. [1] Variational option discovery algorithms. arXiv:1807.10299.
  2. [2] Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning.
  3. [3] Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems.
  4. [4] Adam: A Method for Stochastic Optimization.
  5. [5] On the estimation of production frontiers: maximum likelihood estimation of the parameters of a discontinuous density function. International Economic Review, 1976.
  6. [6] Approximately Optimal Approximate Reinforcement Learning. Morgan Kaufmann Publishers Inc.
  7. [7] Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills.
  8. [8] Goal-Conditioned Reinforcement Learning with Imagined Subgoals.
  9. [9] Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards …
  10. [10] Learning successor states and goal-dependent values: A mathematical viewpoint. arXiv:2101.07123.
  11. [11] Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv:2202.11566.
  12. [12] Offline RL without off-policy evaluation. Advances in Neural Information Processing Systems.
  13. [13] Accelerating Goal-Conditioned Reinforcement Learning Algorithms and Research. The Thirteenth International Conference on Learning Representations.
  14. [14] Intrinsically motivated learning of hierarchical collections of skills. Proceedings of the 3rd International Conference on Development and Learning, 2004.
  15. [15] Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems.
  16. [16] Hindsight experience replay. Advances in Neural Information Processing Systems.
  17. [17] Grounding language to autonomously-acquired skills via goal generation. arXiv:2006.07185.
  18. [18] Rewriting history with inverse RL: Hindsight inference for policy improvement. Advances in Neural Information Processing Systems.
  19. [19] Exploration by Random Network Distillation. arXiv:1810.12894.
  20. [20] When does return-conditioned supervised learning work for offline reinforcement learning? Advances in Neural Information Processing Systems.
  21. [21] Importance weighted autoencoders. arXiv:1509.00519.
  22. [22] Dota 2 with Large Scale Deep Reinforcement Learning. arXiv:1912.06680.
  23. [23] Second order optimality conditions in the smooth case and applications in optimal control. ESAIM: Control, Optimisation and Calculus of Variations, 2007.
  24. [24] Agent57: Outperforming the Atari human benchmark. International Conference on Machine Learning, 2020.
  25. [25] Decision Transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems.
  26. [26] BATS: Best Action Trajectory Stitching. 2022.
  27. [27] PlanGAN: Model-based planning with sparse rewards and multiple goals. Advances in Neural Information Processing Systems.
  28. [28] ACTRCE: Augmenting Experience via Teacher's Advice for Multi-Goal Reinforcement Learning. arXiv:1902.04546.
  29. [29] Actionable models: Unsupervised offline reinforcement learning of robotic skills. arXiv:2104.07749.
  30. [30] Goal-conditioned reinforcement learning with imagined subgoals. International Conference on Machine Learning, 2021.
  31. [31] Language as a cognitive tool to imagine goals in curiosity-driven exploration. Advances in Neural Information Processing Systems.
  32. [32] On the statistical benefits of temporal difference learning. International Conference on Machine Learning, 2023.
  33. [33] Explore, discover and learn: Unsupervised discovery of state-covering skills. International Conference on Machine Learning, 2020.
  34. [34] Continuous control with deep reinforcement learning.
  35. [35] Goal-conditioned imitation learning. Advances in Neural Information Processing Systems.
  36. [36] Improved Reinforcement Learning in Asymmetric Real-time Strategy Games via Strategy Diversity: A Case Study for Hunting-of-the-Plark Game. International Journal of Serious Games.
  37. [37] Deep reinforcement learning for a humanoid robot soccer player. Journal of Intelligent & Robotic Systems, 2021.
  38. [38] DRL: Deep Reinforcement Learning for Intelligent Robot Control: Concept, Literature, and Future. arXiv:2105.13806.
  39. [39] Divide-and-Conquer Monte Carlo Tree Search for Goal-Directed Planning.
  40. [40] Adversarial intrinsic motivation for reinforcement learning. Advances in Neural Information Processing Systems.
  41. [41] Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation. 2020.
  42. [42] Generalized hindsight for reinforcement learning. Advances in Neural Information Processing Systems.
  43. [43] C-learning: Learning to achieve goals via recursive classification. arXiv:2011.08909.
  44. [44] Imitating past successes can be very suboptimal. Advances in Neural Information Processing Systems.
  45. [45] Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems.
  46. [46] Diversity is all you need: Learning skills without a reward function. arXiv:1802.06070.
  47. [47] RvS: What is essential for offline RL via supervised learning? arXiv:2112.10751.
  48. [48] First return, then explore. Nature, 2021.
  49. [49] Generalized decision transformer for offline hindsight information matching. arXiv:2111.10364.
  50. [50] D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219.
  51. [51] Deep reinforcement learning that matters. Proceedings of the AAAI Conference on Artificial Intelligence.
  52. [52] Generalization in Reinforcement Learning by Soft Data Augmentation. 2021.
  53. [53] Soft hindsight experience replay. arXiv:2002.02089.
  54. [54] Discrete Compositional Representations as an Abstraction for Goal Conditioned Reinforcement Learning. Advances in Neural Information Processing Systems.
  55. [55] Curriculum-guided hindsight experience replay. Advances in Neural Information Processing Systems.
  56. [56] Off-policy deep reinforcement learning without exploration. International Conference on Machine Learning, 2019.
  57. [57] Addressing function approximation error in actor-critic methods. International Conference on Machine Learning, 2018.
  58. [58] Multi-goal Reinforcement Learning via Exploring Successor Matching. 2022 IEEE Conference on Games (CoG).
  59. [59] Stochastic neural networks for hierarchical reinforcement learning. arXiv:1704.03012.
  60. [60] A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems.
  61. [61] Learning to Reach Goals via Iterated Supervised Learning. International Conference on Learning Representations.
  62. [62] Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning. Conference on Robot Learning, 2020.
  63. [63] Variational intrinsic control. arXiv:1611.07507.
  64. [64] Raj Ghugare, Matthieu Geist, Glen Berseth, and Benjamin Eysenbach. Closing the Gap between …
  65. [65] Weighted entropy. Reports on Mathematical Physics, 1971.
  66. [66] Deep learning. 2016.
  67. [67] Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. 2017 IEEE International Conference on Robotics and Automation (ICRA).
  68. [68] Fast task inference with variational intrinsic successor features. arXiv:1906.05030.
  69. [69] Distance Weighted Supervised Learning for Offline Interaction Data. arXiv:2304.13774.
  70. [70] Dynamical distance learning for semi-supervised and unsupervised skill discovery. arXiv:1907.08225.
  71. [71] Soft Actor-Critic Algorithms and Applications.
  72. [72] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
  73. [73] Goal-Conditioned Supervised Learning with Sub-Goal Prediction. arXiv:2305.10171.
  74. [74] Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. 2021.
  75. [75] Learning to achieve goals. IJCAI, 1993.
  76. [76] Kingma and Welling. An introduction to variational autoencoders. arXiv:1906.02691.
  77. [77] Reward-conditioned policies. arXiv:1912.13465.
  78. [78] Adam: A method for stochastic optimization. arXiv:1412.6980.
  79. [79] CCIL: Continuity-based Data Augmentation for Corrective Imitation Learning. arXiv:2310.12972.
  80. [80] Wright, Robert; Loscalzo, Steven; Dexter, Philip; Yu, Lei. Exploiting Multi-step Sample Trajectories for Approximate Value Iteration. Machine Learning and Knowledge Discovery in Databases, 2013.
Showing first 80 references.