QHyer: Q-conditioned Hybrid Attention-Mamba Transformer for Offline Goal-conditioned RL
Pith reviewed 2026-05-11 01:13 UTC · model grok-4.3
The pith
QHyer replaces return-to-go conditioning with a flow-parameterized goal-reaching Q-estimator and adds a gated hybrid Attention-Mamba backbone to learn goal-reaching policies from static datasets that mix Markovian and non-Markovian temporal structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QHyer replaces return-to-go with a flow-parameterized, state-conditioned goal-reaching Q-estimator that supplies discriminative guidance for stitching goal-reaching behaviors from diverse demonstrations under sparse rewards. It also introduces a gated hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics, allowing the model to handle temporally heterogeneous data more effectively than prior pure-attention or fixed-window hybrid architectures.
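To make the conditioning swap concrete, here is a minimal sketch, not the authors' implementation: all module names and dimensions are illustrative assumptions, and a plain Transformer encoder stands in for the hybrid backbone. The point it shows is the interface change only, a per-step guidance token computed as a learned Q(s, g) estimate in place of the return-to-go scalar.

```python
# Sketch of Q-token conditioning in a Decision-Transformer-style policy.
# Assumptions (not from the paper): module names, sizes, and the plain
# TransformerEncoder backbone; causal masking is omitted for brevity.
import torch
import torch.nn as nn

class QConditionedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, goal_dim, d_model=128):
        super().__init__()
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(action_dim, d_model)
        # Guidance head: a scalar Q(s, g) per timestep replaces the RTG scalar.
        self.q_estimator = nn.Sequential(
            nn.Linear(state_dim + goal_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, 1),
        )
        self.embed_q = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.predict_action = nn.Linear(d_model, action_dim)

    def forward(self, states, actions, goal):
        # states: (B, T, state_dim); actions: (B, T, action_dim); goal: (B, goal_dim)
        B, T, _ = states.shape
        g = goal.unsqueeze(1).expand(B, T, -1)
        q = self.q_estimator(torch.cat([states, g], dim=-1))   # (B, T, 1)
        # Interleave (q_t, s_t, a_t) tokens per timestep, as DT does with RTG.
        tokens = torch.stack(
            [self.embed_q(q), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)
        h = self.backbone(tokens)
        # Predict each action from the hidden state of its state token.
        return self.predict_action(h[:, 1::3])
```

In this sketch the guidance token is computed from (state, goal) alone, so no hand-picked target return is needed at evaluation time, which is the practical difference from RTG conditioning.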
What carries the argument
The gated hybrid Attention-Mamba backbone paired with a flow-parameterized, state-conditioned goal-reaching Q-estimator, which together enable adaptive compression of mixed history and effective behavior stitching.
If this is right
- Goal-reaching policies become learnable from static datasets that combine Markovian local dynamics with longer non-Markovian dependencies.
- Behavior stitching across demonstrations works under sparse rewards because the Q-estimator remains discriminative where return-to-go does not.
- History compression adapts its effective memory length to the data rather than using a fixed window that truncates context (see the gating sketch after this list).
- The same architecture delivers improved results on both non-Markovian and Markovian datasets without separate handling for each case.
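A minimal sketch of the gating idea referenced above, assuming a per-token sigmoid gate that mixes an attention branch with a recurrent state-space branch. A GRU stands in for the Mamba SSM so the example runs without extra dependencies; the paper's exact gating form is not specified in the excerpt.

```python
# Sketch of a gated attention/SSM hybrid block. Assumptions (not from the
# paper): the gate form, the residual wiring, and the GRU stand-in for Mamba.
import torch
import torch.nn as nn

class GatedHybridBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for Mamba
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (B, T, d_model)
        a, _ = self.attn(x, x, x, need_weights=False)  # global, content-addressed
        s, _ = self.ssm(x)                             # compressed recurrent history
        g = torch.sigmoid(self.gate(torch.cat([a, s], dim=-1)))  # per-token gate
        return self.norm(x + g * a + (1.0 - g) * s)    # content-adaptive mix
```

The gate is computed per token from the content of both branches, which is what would let effective memory length adapt to the data instead of being fixed by a window size.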
Where Pith is reading between the lines
- The Q-estimator approach may extend to other offline RL problems where return-to-go signals become uninformative due to sparse or delayed rewards.
- Content-adaptive compression could apply to sequence modeling tasks outside goal-conditioned RL whenever local and global temporal structure coexist.
- If the Q-estimator proves robust, it reduces reliance on reward engineering when curating offline datasets for goal-reaching tasks.
Load-bearing premise
The flow-parameterized Q-estimator must accurately estimate goal-reaching values from static data even when rewards are sparse, and the hybrid backbone must compress history in a content-adaptive way without losing essential local Markovian structure.
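The excerpt does not define the flow parameterization, so the following is only a hedged sketch under one plausible reading: a conditional flow-matching model over scalar goal-reaching returns, with the Q-estimate read out as the mean of samples drawn from the learned flow. Every name and the target construction below are illustrative assumptions.

```python
# Sketch: conditional flow matching over returns y, conditioned on (s, g).
# Assumptions (not from the paper): scalar return targets, linear probability
# path, Euler sampling, and mean readout as the Q-estimate.
import torch
import torch.nn as nn

class FlowQ(nn.Module):
    """Vector field v(y_t, t | s, g) for flow matching on returns."""
    def __init__(self, state_dim, goal_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, y_t, t, state, goal):
        return self.net(torch.cat([y_t, t, state, goal], dim=-1))

def flow_matching_loss(model, state, goal, y1):
    # y1: observed goal-reaching return for (state, goal), shape (B, 1)
    y0 = torch.randn_like(y1)              # base noise sample
    t = torch.rand(y1.shape[0], 1)         # random interpolation time
    y_t = (1 - t) * y0 + t * y1            # linear probability path
    target = y1 - y0                       # conditional velocity target
    return ((model(y_t, t, state, goal) - target) ** 2).mean()

@torch.no_grad()
def q_estimate(model, state, goal, n_steps=16, n_samples=64):
    # Integrate dy/dt = v(y, t | s, g) from noise; average samples as Q(s, g).
    B = state.shape[0]
    s = state.repeat_interleave(n_samples, 0)
    g = goal.repeat_interleave(n_samples, 0)
    y = torch.randn(B * n_samples, 1)
    for k in range(n_steps):               # Euler integration of the flow
        t = torch.full((B * n_samples, 1), k / n_steps)
        y = y + model(y, t, s, g) / n_steps
    return y.view(B, n_samples, 1).mean(dim=1)
```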
What would settle it
Training QHyer on the same non-Markovian and Markovian benchmark datasets and finding that its success rate or normalized score does not exceed that of Decision Transformer or LSDT baselines would falsify the central performance claim.
Original abstract
Offline goal-conditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian structure that violates standard RL assumptions. History-aware sequence models such as Decision Transformer (DT) are a natural fit for long-term dependency modeling, yet pure attention is inefficient and brittle when handling local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local dependency modeling, their fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, often truncating long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance signal for stitching goal-reaching behaviors from diverse demonstrations. To address these issues, we propose QHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that QHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness for diverse scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces QHyer for offline goal-conditioned RL on datasets that mix Markovian and non-Markovian structure. It replaces return-to-go conditioning with a flow-parameterized, state-conditioned goal-reaching Q-estimator intended to enable stitching of goal-reaching behaviors from diverse demonstrations under sparse rewards, and proposes a gated hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. The central empirical claim is state-of-the-art performance on both non-Markovian and Markovian datasets.
Significance. If the results and ablations hold, the work would provide a concrete advance in offline GCRL by addressing the non-discriminative nature of RTG under sparsity and the fixed-window limitations of prior hybrid sequence models. The combination of flow-based Q-estimation with adaptive compression could improve sample efficiency in temporally heterogeneous settings.
major comments (1)
- [§3] The Q-estimator is specified as state-conditioned only. In explicitly history-dependent non-Markovian regimes this creates a potential mismatch, because the stitching signal must discriminate among sub-trajectories that share the same current state but differ in relevant history; it is not shown that the estimator receives the same compressed history representation produced by the gated hybrid backbone. If this assumption does not hold, the claimed improvement on non-Markovian datasets cannot be attributed to the proposed RTG replacement.
minor comments (2)
- [Abstract] The abstract asserts SOTA performance without naming the datasets, baselines, or metrics; the experimental section should make these explicit and include ablations isolating the Q-estimator versus the backbone.
- [§3] Notation for the flow-parameterized Q-estimator and the gating mechanism would benefit from a single consolidated equation block early in §3 to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The major comment raises an important point about the integration between the Q-estimator and the hybrid backbone in non-Markovian regimes. We address this below and outline the changes we will make.
Point-by-point responses
Referee: [§3] The Q-estimator is specified as state-conditioned only. In explicitly history-dependent non-Markovian regimes this creates a potential mismatch, because the stitching signal must discriminate among sub-trajectories that share the same current state but differ in relevant history; it is not shown that the estimator receives the same compressed history representation produced by the gated hybrid backbone. If this assumption does not hold, the claimed improvement on non-Markovian datasets cannot be attributed to the proposed RTG replacement.
Authors: We appreciate this observation. In the QHyer architecture the gated hybrid Attention-Mamba backbone first processes the full observation sequence and produces a content-adaptive compressed representation. The flow-parameterized Q-estimator is then conditioned on this backbone output (rather than on the raw current state alone), so that the resulting Q-values incorporate the relevant history compression needed for stitching in non-Markovian settings. While the manuscript describes the estimator as “state-conditioned” for conciseness, the conditioning is performed on the backbone-encoded state. We acknowledge that this data flow was not stated with sufficient explicitness in §3. In the revised manuscript we will add a clarifying paragraph, update the architecture diagram, and include a short ablation confirming that removing the backbone input to the Q-estimator degrades performance on the non-Markovian datasets.
Revision: yes
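A minimal sketch of the data flow the rebuttal describes, assuming generic backbone and Q-estimator modules (placeholders, not the authors' code): the backbone encodes the full history, and the Q-estimator consumes the backbone-encoded current state rather than the raw observation.

```python
# Sketch of history-conditioned Q-estimation via the backbone encoding.
# Assumptions (not from the paper): the backbone returns per-step features
# (B, T, d) and the last step is taken as the encoded current state.
import torch
import torch.nn as nn

def history_conditioned_q(backbone: nn.Module,
                          q_estimator: nn.Module,
                          observations: torch.Tensor,   # (B, T, obs_dim)
                          goal: torch.Tensor):          # (B, goal_dim)
    h = backbone(observations)   # (B, T, d): content-adaptive history compression
    z_t = h[:, -1]               # backbone-encoded current state, not raw s_t
    return q_estimator(torch.cat([z_t, goal], dim=-1))  # history-aware Q(h_t, g)
```

Under this wiring, two sub-trajectories that share the same raw state but differ in history yield different z_t, so the Q-values can in principle discriminate between them, which is exactly the referee's concern.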
Circularity Check
No circularity: method proposal relies on external experiments without self-referential derivations
full rationale
The provided manuscript text contains no equations, derivations, or mathematical reductions that equate any claimed result or prediction to its own inputs by construction. The core proposal replaces RTG with a flow-parameterized Q-estimator and introduces a gated hybrid backbone, but these are presented as architectural choices validated by experiments on Markovian and non-Markovian datasets rather than derived from self-citations or fitted parameters renamed as predictions. No load-bearing self-citation chains, uniqueness theorems from the same authors, or ansatzes smuggled via prior work are invoked to force the central claims. The SOTA performance assertions rest on empirical benchmarks external to any internal fitting loop, rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- flow parameters
axioms (1)
- domain assumption: Real-world offline datasets exhibit a mix of Markovian and non-Markovian structure that violates standard RL assumptions
invented entities (1)
- gated Hybrid Attention-Mamba backbone (no independent evidence)
Reference graph
Works this paper leans on
- [1] Variational option discovery algorithms. arXiv:1807.10299.
- [2] Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning.
- [3] Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems.
- [4] Adam: A Method for Stochastic Optimization.
- [5] On the estimation of production frontiers: maximum likelihood estimation of the parameters of a discontinuous density function. International Economic Review, 1976.
- [6] Approximately Optimal Approximate Reinforcement Learning. Morgan Kaufmann Publishers Inc.
- [7] Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills.
- [8] Goal-Conditioned Reinforcement Learning with Imagined Subgoals.
- [9] Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards AI.
- [10] Learning successor states and goal-dependent values: A mathematical viewpoint. arXiv:2101.07123.
- [11] Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv:2202.11566.
- [12] Offline RL without off-policy evaluation. Advances in Neural Information Processing Systems.
- [13] Accelerating Goal-Conditioned Reinforcement Learning Algorithms and Research. The Thirteenth International Conference on Learning Representations.
- [14] Intrinsically motivated learning of hierarchical collections of skills. Proceedings of the 3rd International Conference on Development and Learning, 2004.
- [15] Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems.
- [16] Hindsight experience replay. Advances in Neural Information Processing Systems.
- [17] Grounding language to autonomously-acquired skills via goal generation. arXiv:2006.07185.
- [18] Rewriting history with inverse RL: Hindsight inference for policy improvement. Advances in Neural Information Processing Systems.
- [19] Exploration by Random Network Distillation. arXiv:1810.12894.
- [20] When does return-conditioned supervised learning work for offline reinforcement learning? Advances in Neural Information Processing Systems.
- [21] Importance weighted autoencoders. arXiv:1509.00519.
- [22] Dota 2 with Large Scale Deep Reinforcement Learning. arXiv:1912.06680.
- [23] Second order optimality conditions in the smooth case and applications in optimal control. ESAIM: Control, Optimisation and Calculus of Variations, 2007.
- [24] Agent57: Outperforming the Atari human benchmark. International Conference on Machine Learning, 2020.
- [25] Decision Transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems.
- [26]
- [27] PlanGAN: Model-based planning with sparse rewards and multiple goals. Advances in Neural Information Processing Systems.
- [28] ACTRCE: Augmenting Experience via Teacher's Advice for Multi-Goal Reinforcement Learning. arXiv:1902.04546.
- [29] Actionable models: Unsupervised offline reinforcement learning of robotic skills. arXiv:2104.07749.
- [30] Goal-conditioned reinforcement learning with imagined subgoals. International Conference on Machine Learning, 2021.
- [31] Language as a cognitive tool to imagine goals in curiosity-driven exploration. Advances in Neural Information Processing Systems.
- [32] On the statistical benefits of temporal difference learning. International Conference on Machine Learning, 2023.
- [33] Explore, discover and learn: Unsupervised discovery of state-covering skills. International Conference on Machine Learning, 2020.
- [34] Continuous Control with Deep Reinforcement Learning.
- [35] Goal-conditioned imitation learning. Advances in Neural Information Processing Systems.
- [36] Improved Reinforcement Learning in Asymmetric Real-time Strategy Games via Strategy Diversity: A Case Study for Hunting-of-the-Plark Game. International Journal of Serious Games.
- [37] Deep reinforcement learning for a humanoid robot soccer player. Journal of Intelligent & Robotic Systems, 2021.
- [38] DRL: Deep Reinforcement Learning for Intelligent Robot Control--Concept, Literature, and Future. arXiv:2105.13806.
- [39] Divide-and-Conquer Monte Carlo Tree Search for Goal-Directed Planning.
- [40] Adversarial intrinsic motivation for reinforcement learning. Advances in Neural Information Processing Systems.
- [41] Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation. 2020.
- [42] Generalized hindsight for reinforcement learning. Advances in Neural Information Processing Systems.
- [43] C-learning: Learning to achieve goals via recursive classification. arXiv:2011.08909.
- [44] Imitating past successes can be very suboptimal. Advances in Neural Information Processing Systems.
- [45] Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems.
- [46] Diversity is all you need: Learning skills without a reward function. arXiv:1802.06070.
- [47] RvS: What is essential for offline RL via supervised learning? arXiv:2112.10751.
- [48] First return, then explore. Nature, 2021.
- [49] Generalized decision transformer for offline hindsight information matching. arXiv:2111.10364.
- [50] D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219.
- [51] Deep reinforcement learning that matters. Proceedings of the AAAI Conference on Artificial Intelligence.
- [52] Generalization in Reinforcement Learning by Soft Data Augmentation. 2021.
- [53] Soft hindsight experience replay. arXiv:2002.02089.
- [54] Discrete Compositional Representations as an Abstraction for Goal Conditioned Reinforcement Learning. Advances in Neural Information Processing Systems.
- [55] Curriculum-guided hindsight experience replay. Advances in Neural Information Processing Systems.
- [56] Off-policy deep reinforcement learning without exploration. International Conference on Machine Learning, 2019.
- [57] Addressing function approximation error in actor-critic methods. International Conference on Machine Learning, 2018.
- [58] Multi-goal Reinforcement Learning via Exploring Successor Matching. IEEE Conference on Games (CoG), 2022.
- [59] Stochastic neural networks for hierarchical reinforcement learning. arXiv:1704.03012.
- [60] A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems.
- [61] Learning to Reach Goals via Iterated Supervised Learning. International Conference on Learning Representations.
- [62] Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning. Conference on Robot Learning, 2020.
- [63]
- [64] Ghugare, Raj and Geist, Matthieu and Berseth, Glen and Eysenbach, Benjamin. Closing the Gap between TD Learning and Supervised Learning: A Generalisation Point of View.
- [65] Weighted entropy. Reports on Mathematical Physics, 1971.
- [66]
- [67] Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. IEEE International Conference on Robotics and Automation (ICRA), 2017.
- [68] Fast task inference with variational intrinsic successor features. arXiv:1906.05030.
- [69] Distance Weighted Supervised Learning for Offline Interaction Data. arXiv:2304.13774.
- [70] Dynamical distance learning for semi-supervised and unsupervised skill discovery. arXiv:1907.08225.
- [71] Soft Actor-Critic Algorithms and Applications.
- [72] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
- [73] Goal-Conditioned Supervised Learning with Sub-Goal Prediction. arXiv:2305.10171.
- [74] Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. 2021.
- [75] Learning to achieve goals. IJCAI, 1993.
- [76] An introduction to variational autoencoders. arXiv:1906.02691.
- [77] Reward-conditioned policies. arXiv:1912.13465.
- [78] Adam: A method for stochastic optimization. arXiv:1412.6980.
- [79] CCIL: Continuity-based Data Augmentation for Corrective Imitation Learning. arXiv:2310.12972.
- [80] Wright, Robert and Loscalzo, Steven and Dexter, Philip and Yu, Lei. Exploiting Multi-step Sample Trajectories for Approximate Value Iteration. Machine Learning and Knowledge Discovery in Databases, 2013.