Decision Transformer: Reinforcement Learning via Sequence Modeling

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin · 2021 · cs.LG · arXiv 2106.01345

Canonical reference. 71% of citing Pith papers cite this work as background.

25 Pith papers citing it

abstract

We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
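The conditioning described in the abstract can be illustrated with a small sketch. It assumes the "desired return" at each step is the undiscounted sum of remaining rewards (a return-to-go) and that each timestep contributes a (return, state, action) triple to the sequence; the function names `returns_to_go`, `interleave`, and `next_target` are hypothetical and are not taken from the authors' code.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted sum of rewards from each step to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry accumulates all rewards from step t onward.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[Any], actions: Sequence[Any]
) -> List[Tuple[str, Any]]:
    """Flatten a trajectory into (return, state, action) tokens, one triple per step.

    Assumed layout for illustration: a causally masked model trained on this
    sequence would predict each action from the tokens that precede it.
    """
    tokens: List[Tuple[str, Any]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens.extend([("return", r), ("state", s), ("action", a)])
    return tokens


def next_target(target: float, reward: float) -> float:
    """At rollout time, reduce the remaining target return by the reward just observed."""
    return target - reward
```

For example, `returns_to_go([1.0, 0.0, 2.0])` yields `[3.0, 2.0, 2.0]`, so the first action in that episode is conditioned on a desired return of 3.0. At evaluation time a user-chosen target would stand in for the first value and be updated with `next_target` after every step.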

citation-role summary

background 7

citation-polarity summary

background 5 · unclear 2

representative citing papers

Offline Reinforcement Learning with Implicit Q-Learning

cs.LG · 2021-10-12 · unverdicted · novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

cs.NI · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

A graph transformer with RL stabilizations is the first to exceed benchmarks for dynamic RMSA, supporting up to 13% more traffic load on networks up to 143 nodes.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Gradient Boosting within a Single Attention Layer

cs.LG · 2026-04-03 · conditional · novelty 7.0

Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over standard attention.

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

cs.RO · 2023-10-16 · conditional · novelty 7.0

SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

Unified Motion-Action Modeling for Heterogeneous Robot Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion environments.

DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions

cs.LG · 2025-09-23 · unverdicted · novelty 6.0

DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

cs.RO · 2025-05-24 · conditional · novelty 6.0

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

cs.RO · 2021-08-06 · accept · novelty 6.0

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.

Revealing Safety-Critical Scenarios for UTM via Transformer

cs.AI · 2026-06-30 · unverdicted · novelty 5.0

Transformer RL with a Policy Model and Action Sampler finds UTM safety vulnerabilities 8x more efficiently than expert testing in 700-hour simulations.

Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models

cs.LG · 2026-06-29 · unverdicted · novelty 5.0

Higher conservatism in offline DPO training of Qwen3-14B monotonically increases reward-hacking damage (Goodhart gap AUGC) during online adaptation on GSM8K.

From Bootstrapping to Sequence Modeling: A Unified Generative Framework for Personalized Landing-Page Modeling

cs.IR · 2026-06-26 · unverdicted · novelty 5.0

GLAN replaces CQL bootstrapping with Decision Transformer sequence modeling for PLPM, using global inter-day (L-RTG) and local session (HRM) modules to achieve +0.158% DAU and +0.108% LT gains in Kuaishou online tests.

Reinforcement Learning Foundation Models Should Already Be A Thing

cs.LG · 2026-06-17 · unverdicted · novelty 5.0

A Graph Attention Network pretrained solely on synthetic MDPs solves held-out tabular RL benchmarks in context, outperforming UCB-VI and Q-learning online while matching VI-LCB offline.

Belief-Aware Scheduling for Predictive Wildfire Hazard Mapping under Sparse-Window Telemetry

cs.ET · 2026-06-05 · unverdicted · novelty 5.0

The paper shows that deriving a structured belief from the prediction operator's needs and using it in non-myopic scheduling yields up to 28% better predictive loss than activity-paced baselines on a physics-calibrated synthetic wildfire environment.

ASH: Agents that Self-Hone via Embodied Learning

cs.AI · 2026-05-14 · unverdicted · novelty 5.0 · 2 refs

ASH learns long-horizon embodied policies from unlabeled internet video via a self-improvement loop that trains an IDM on its own trajectories and extracts supervision plus key-moment memory from video.

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.

Galactica: A Large Language Model for Science

cs.CL · 2022-11-16 · unverdicted · novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

cs.LG · 2026-06-15 · unverdicted · novelty 4.0

A unified framework integrates particle filtering for explicit geological uncertainty representation with value-based reinforcement learning policies for sequential geosteering decisions under uncertainty.

citing papers explorer

Showing 25 of 25 citing papers.

Offline Reinforcement Learning with Implicit Q-Learning
cs.LG · 2021-10-12 · unverdicted · none · ref 2 · internal anchor
IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks
cs.NI · 2026-05-03 · unverdicted · none · ref 27 · 2 links · internal anchor
A graph transformer with RL stabilizations is the first to exceed benchmarks for dynamic RMSA, supporting up to 13% more traffic load on networks up to 143 nodes.

Latent State Design for World Models under Sufficiency Constraints
cs.AI · 2026-05-03 · unverdicted · none · ref 10 · internal anchor
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Gradient Boosting within a Single Attention Layer
cs.LG · 2026-04-03 · conditional · none · ref 1 · internal anchor
Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over standard attention.

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
cs.RO · 2023-10-16 · conditional · none · ref 10 · internal anchor
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

Unified Motion-Action Modeling for Heterogeneous Robot Learning
cs.RO · 2026-06-15 · unverdicted · none · ref 26 · internal anchor
UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG · 2026-05-04 · unverdicted · none · ref 43 · internal anchor
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
cs.AI · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
cs.LG · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion environments.

DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions
cs.LG · 2025-09-23 · unverdicted · none · ref 3 · internal anchor
DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
cs.AI · 2025-07-01 · conditional · none · ref 99 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
cs.RO · 2025-05-24 · conditional · none · ref 11 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

A Roadmap to Pluralistic Alignment
cs.AI · 2024-02-07 · unverdicted · none · ref 296 · internal anchor
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

A General Language Assistant as a Laboratory for Alignment
cs.CL · 2021-12-01 · conditional · none · ref 216 · internal anchor
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
cs.RO · 2021-08-06 · accept · none · ref 50 · internal anchor
A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.

Revealing Safety-Critical Scenarios for UTM via Transformer
cs.AI · 2026-06-30 · unverdicted · none · ref 5 · internal anchor
Transformer RL with a Policy Model and Action Sampler finds UTM safety vulnerabilities 8x more efficiently than expert testing in 700-hour simulations.

Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models
cs.LG · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
Higher conservatism in offline DPO training of Qwen3-14B monotonically increases reward-hacking damage (Goodhart gap AUGC) during online adaptation on GSM8K.

From Bootstrapping to Sequence Modeling: A Unified Generative Framework for Personalized Landing-Page Modeling
cs.IR · 2026-06-26 · unverdicted · none · ref 5 · internal anchor
GLAN replaces CQL bootstrapping with Decision Transformer sequence modeling for PLPM, using global inter-day (L-RTG) and local session (HRM) modules to achieve +0.158% DAU and +0.108% LT gains in Kuaishou online tests.

Reinforcement Learning Foundation Models Should Already Be A Thing
cs.LG · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
A Graph Attention Network pretrained solely on synthetic MDPs solves held-out tabular RL benchmarks in context, outperforming UCB-VI and Q-learning online while matching VI-LCB offline.

Belief-Aware Scheduling for Predictive Wildfire Hazard Mapping under Sparse-Window Telemetry
cs.ET · 2026-06-05 · unverdicted · none · ref 10 · internal anchor
The paper shows that deriving a structured belief from the prediction operator's needs and using it in non-myopic scheduling yields up to 28% better predictive loss than activity-paced baselines on a physics-calibrated synthetic wildfire environment.

ASH: Agents that Self-Hone via Embodied Learning
cs.AI · 2026-05-14 · unverdicted · none · ref 30 · 2 links · internal anchor
ASH learns long-horizon embodied policies from unlabeled internet video via a self-improvement loop that trains an IDM on its own trajectories and extracts supervision plus key-moment memory from video.

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
cs.AI · 2026-05-11 · unverdicted · none · ref 24 · 2 links · internal anchor
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.

Galactica: A Large Language Model for Science
cs.CL · 2022-11-16 · unverdicted · none · ref 11 · internal anchor
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization
cs.LG · 2026-06-15 · unverdicted · none · ref 20 · internal anchor
A unified framework integrates particle filtering for explicit geological uncertainty representation with value-based reinforcement learning policies for sequential geosteering decisions under uncertainty.

Built Environment Reasoning from Remote Sensing Imagery Using Large Vision--Language Models
cs.CL · 2026-05-08 · unverdicted · none · ref 71 · internal anchor
Large vision-language models applied to multi-scale remote sensing imagery can generate recommendations on built environment design, constructability, land use, and risks for smart city decision-making.
