4 representative Pith papers (2026) citing this work by Sutton:
- Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
  A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing the truncation bias introduced by finite horizons (a toy sketch of such a backup follows this list).
- A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
  A MetaRL policy pre-trained on goals-based wealth management (GBWM) problems delivers near-optimal dynamic strategies in 0.01 s, achieving 97.8% of the DP-optimal utility, and handles larger problems where exact dynamic programming fails (a DP baseline sketch follows this list).
- PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
  An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces (a trace-update sketch follows this list).
- Revisiting Adam for Streaming Reinforcement Learning
  C51 matches StreamQ in streaming RL on 55 Atari games, while a new Adaptive Q(λ) algorithm based on bounded derivatives and variance-adjusted updates reaches nearly double the human baseline (an update-rule sketch follows this list).
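
The first entry names a liveness-based Bellman operator without defining it here. As a rough illustration only, the sketch below applies a discounted, pessimistically biased backup in which a hypothetical `liveness` progress signal plays the role of the reward; the function, its arguments, and the fixed `pessimism` penalty are assumptions, not the paper's operator.

```python
def liveness_ope_backup(V, transitions, liveness, gamma=0.99, pessimism=0.1):
    """One conservative, discounted Bellman backup for offline policy
    evaluation, using a task-progression ("liveness") signal as the
    per-step reward. Hypothetical sketch; names and the pessimism rule
    are illustrative assumptions.

    V           : dict state -> current value estimate
    transitions : list of (s, s_next, done) logged under the target policy
    liveness    : dict state -> progress signal in [0, 1]
    """
    V_new = dict(V)
    for s, s_next, done in transitions:
        bootstrap = 0.0 if done else V.get(s_next, 0.0)
        # Discounting bounds the contribution of the truncated tail,
        # which is how a discounted formulation reduces finite-horizon
        # truncation bias relative to an undiscounted finite-horizon sum.
        target = liveness.get(s_next, 0.0) + gamma * bootstrap
        # Crude stand-in for a conservative operator: bias estimates down.
        V_new[s] = target - pessimism
    return V_new
```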
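
For the GBWM entry, the relevant comparison point is a backward-induction dynamic-programming baseline. The toy sketch below (hypothetical `gbwm_dp`, with an assumed lognormal wealth process, a goal-indicator utility, and a Monte Carlo transition estimate) shows why DP is the reference: its cost grows with horizon, grid size, portfolio count, and samples per backup, which is the scaling wall the pre-trained MetaRL policy reportedly avoids.

```python
import numpy as np

def gbwm_dp(wealth_grid, portfolios, goal, T, n_mc=1000, seed=0):
    """Backward-induction DP for a toy goals-based wealth problem:
    at each step pick the (mu, sigma) portfolio that maximizes the
    probability of finishing at or above `goal`. Grid, dynamics, and
    utility are illustrative assumptions, not the paper's setup."""
    rng = np.random.default_rng(seed)
    V = (wealth_grid >= goal).astype(float)            # terminal utility
    policy = np.zeros((T, len(wealth_grid)), dtype=int)
    for t in reversed(range(T)):
        V_next = np.empty_like(V)
        for i, w in enumerate(wealth_grid):
            best_val, best_a = -np.inf, 0
            for a, (mu, sigma) in enumerate(portfolios):
                # Monte Carlo estimate of E[V(w')] under lognormal growth,
                # snapped back onto the wealth grid.
                w_next = w * np.exp(rng.normal(mu, sigma, n_mc))
                idx = np.clip(np.searchsorted(wealth_grid, w_next),
                              0, len(wealth_grid) - 1)
                val = V[idx].mean()
                if val > best_val:
                    best_val, best_a = val, a
            V_next[i], policy[t, i] = best_val, best_a
        V = V_next
    return V, policy

# Example: 20-step horizon, conservative vs. aggressive portfolio.
grid = np.linspace(50.0, 400.0, 80)
V0, pol = gbwm_dp(grid, [(0.02, 0.05), (0.06, 0.20)], goal=200.0, T=20)
```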
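
The PYTHALAB-MERA summary credits validation outcomes to memory records via eligibility traces. The sketch below is one generic accumulating-trace update for that idea; the dictionaries, the score update, and the +/-1 reward are assumptions rather than the system's actual bookkeeping.

```python
def update_record_scores(scores, traces, used_records, passed,
                         alpha=0.1, lam=0.9):
    """Propagate a pass/fail validation outcome back to memory records
    via accumulating eligibility traces. Generic sketch, not the
    system's actual rule.

    scores       : dict record_id -> usefulness score used for selection
    traces       : dict record_id -> eligibility trace
    used_records : record ids injected into this coding attempt
    passed       : True iff the strict (fail-fast) validation succeeded
    """
    reward = 1.0 if passed else -1.0
    for rid in traces:                       # decay every existing trace
        traces[rid] *= lam
    for rid in used_records:                 # bump traces of records just used
        traces[rid] = traces.get(rid, 0.0) + 1.0
    # Credit each record in proportion to its trace, so records used
    # recently and often absorb most of the outcome.
    for rid, e in traces.items():
        scores[rid] = scores.get(rid, 0.0) + alpha * reward * e
    return scores, traces
```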
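
For the streaming-RL entry, "bounded derivatives and variance-adjusted updates" reads like an Adam-style second-moment normalization of the TD error combined with a hard clip. The tabular step below is one hypothetical rendering of that combination, not the published Adaptive Q(λ); in particular it keeps traces on exploratory actions, unlike Watkins's Q(λ).

```python
import numpy as np

def adaptive_q_lambda_step(Q, e, v, s, a, r, s_next, done,
                           alpha=0.5, gamma=0.99, lam=0.9,
                           beta=0.999, clip=1.0, eps=1e-8):
    """One tabular Q(lambda) step with a variance-adjusted, bounded update.
    Hypothetical reading; the published algorithm's exact rule may differ.

    Q, e : [n_states, n_actions] arrays of values and eligibility traces
    v    : running second moment of TD errors (Adam-style normalizer)
    """
    td = r + (0.0 if done else gamma * Q[s_next].max()) - Q[s, a]
    v = beta * v + (1.0 - beta) * td * td                 # track TD-error power
    step = np.clip(td / (np.sqrt(v) + eps), -clip, clip)  # bounded, normalized
    # Simplification: traces are not cut on exploratory actions here.
    e *= gamma * lam                                      # decay all traces
    e[s, a] += 1.0                                        # accumulate current one
    Q += alpha * step * e                                 # backup along traces
    return Q, e, v
```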