Recognition: 2 theorem links
· Lean Theorem
Playing Atari with Deep Reinforcement Learning
Pith reviewed 2026-05-11 07:54 UTC · model grok-4.3
The pith
A convolutional neural network learns control policies for Atari games directly from raw pixel inputs using reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
What carries the argument
A convolutional neural network trained with Q-learning that maps raw pixel inputs to action-value estimates.
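The shape arithmetic behind this machinery can be checked directly. A minimal sketch in Python, assuming the layer sizes reported for the network (four stacked 84×84 frames; 16 8×8 filters at stride 4; 32 4×4 filters at stride 2; a 256-unit hidden layer); the function names are illustrative:

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

def dqn_shapes(input_size: int = 84, n_actions: int = 6) -> list:
    """Trace tensor shapes through the Atari Q-network as described."""
    s1 = conv_out(input_size, kernel=8, stride=4)   # first conv layer
    s2 = conv_out(s1, kernel=4, stride=2)           # second conv layer
    flat = 32 * s2 * s2                             # flattened conv features
    return [
        ("input", (4, input_size, input_size)),     # 4 stacked grayscale frames
        ("conv1", (16, s1, s1)),
        ("conv2", (32, s2, s2)),
        ("flatten", (flat,)),
        ("fc", (256,)),
        ("q_values", (n_actions,)),                 # one Q-value per legal action
    ]
```

For an 84×84 input this yields 20×20 and 9×9 feature maps and a 2592-dimensional flattened vector feeding the 256-unit layer; the single output head emitting one Q-value per action is what lets one forward pass score every action at once.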
If this is right
- Single fixed architecture succeeds across games with varying dynamics and rewards.
- Outperforms previous methods on six of seven tested Atari games.
- Surpasses human expert performance on three games.
- Learns directly from high-dimensional sensory input without domain knowledge.
Where Pith is reading between the lines
- Such models could potentially be adapted to other visual control tasks like robotics.
- Scaling this approach might enable agents that handle more complex environments.
- This suggests deep RL can reduce the need for manual feature engineering in game AI.
Load-bearing premise
The assumption that one unchanging convolutional network and Q-learning setup can produce effective policies for games with substantially different reward structures and visual dynamics.
What would settle it
Retraining the described network on the seven Atari games and checking whether it reproduces the reported scores; failing to outperform prior methods on the six games where outperformance is claimed would falsify the central result.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network trained with a variant of Q-learning whose input is raw pixels and whose output is a value function; experience replay and target networks are used to stabilize training. The same fixed architecture and algorithm (no per-game adjustments) are applied to seven Atari 2600 games from the Arcade Learning Environment, outperforming all previous approaches on six games and surpassing human expert performance on three.
Significance. If the empirical results hold, the work is significant because it shows that deep neural networks can be combined with reinforcement learning to solve control tasks from raw high-dimensional inputs without domain-specific features or tuning. The stabilization techniques (experience replay and periodic target network updates) directly address known divergence problems in deep Q-learning, and the consistent results across diverse games with a single method provide evidence of generality. The detailed description of the architecture, update rule, and use of standard benchmarks (Arcade Learning Environment) supports reproducibility of the central empirical claims.
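The stabilization mechanics highlighted here can be sketched in a few lines. A toy replay-buffer Q-learning loop in Python, using a tabular Q-function on a contrived random environment; the tiny state space and all constants are illustrative, not the paper's implementation:

```python
import random
from collections import deque

random.seed(0)
N_STATES, N_ACTIONS = 4, 2
GAMMA, ALPHA = 0.99, 0.1

q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]        # online Q-table
replay = deque(maxlen=1000)                             # experience replay buffer

def step(state, action):
    """Toy environment: random next state, reward 1 only on entering state 3."""
    next_state = random.randrange(N_STATES)
    return next_state, 1.0 if next_state == 3 else 0.0

state = 0
for t in range(5000):
    action = random.randrange(N_ACTIONS)                # pure exploration
    next_state, reward = step(state, action)
    replay.append((state, action, reward, next_state))  # store transition
    batch = random.sample(replay, min(32, len(replay))) # sample i.i.d. minibatch
    for s, a, r, s2 in batch:
        target = r + GAMMA * max(q[s2])                 # Bellman target
        q[s][a] += ALPHA * (target - q[s][a])           # move estimate toward it
    state = next_state
```

Sampling minibatches uniformly from the buffer, rather than learning from the latest transition only, is the decorrelation step the report credits with taming divergence; each experience is also reused in many updates, improving data efficiency.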
minor comments (3)
- [Section 4] Section 4 (Deep Q-Learning): the loss function and target computation are described in prose; adding an explicit equation for the target value y_j (incorporating the target network) would improve clarity and make the stabilization mechanism easier to follow.
- [Table 1] Table 1 and Section 5 (Experiments): average scores are reported, but the number of evaluation episodes per game and any measure of variability (e.g., standard deviation across runs) are not stated; including these would strengthen assessment of the outperformance claims.
- [Section 5] Figure 2 (or equivalent training curves): if full learning curves are present only in supplementary material, a brief reference in the main text would help readers understand the stability achieved by the proposed variant.
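For reference, the target the first comment asks to see written out takes the standard form below (this is a reconstruction following the referee's description of stabilized deep Q-learning, with θ⁻ denoting the periodically frozen target-network parameters, not the paper's exact typesetting):

```latex
y_j =
\begin{cases}
r_j & \text{if the episode terminates at step } j+1,\\[2pt]
r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^{-}) & \text{otherwise,}
\end{cases}
\qquad
L(\theta) = \mathbb{E}_{(\phi_j, a_j, r_j, \phi_{j+1}) \sim \mathcal{D}}
\Big[\big(y_j - Q(\phi_j, a_j; \theta)\big)^{2}\Big].
```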
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of the contributions, and recommendation to accept. We are pleased that the significance of combining deep networks with reinforcement learning for high-dimensional control tasks was recognized.
Circularity Check
No significant circularity
full rationale
The paper's core contribution is an empirical demonstration: a fixed CNN architecture plus stabilized Q-learning (experience replay + target network) is trained end-to-end on raw pixels from the external Arcade Learning Environment and evaluated on held-out game episodes. Performance numbers are measured outcomes on public benchmarks, not quantities defined or fitted to themselves. The update rules follow the standard Bellman equation with two well-motivated stabilizations; neither the architecture nor the algorithm is derived from the reported scores. No self-citation chain, self-definitional loop, or fitted-input-renamed-as-prediction appears in the derivation or results section. The method is externally falsifiable on the same benchmarks.
Axiom & Free-Parameter Ledger
free parameters (4)
- learning rate
- discount factor gamma
- replay buffer size and sampling
- target network update frequency
axioms (2)
- domain assumption The environment satisfies the Markov property with respect to the observed pixel frames.
- domain assumption Gradient descent on the Q-network loss converges to a useful policy under the chosen hyperparameters.
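Each free parameter in the ledger enters the training loop at a specific point. A minimal configuration sketch in Python with deliberately illustrative placeholder values (these are not the paper's settings):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DQNConfig:
    learning_rate: float = 1e-4        # step size of the Q-network update
    gamma: float = 0.99                # discount factor in the Bellman target
    replay_capacity: int = 100_000     # max transitions kept for i.i.d. sampling
    batch_size: int = 32               # transitions sampled per update
    target_update_every: int = 1_000   # steps between target-network syncs

def should_sync_target(step: int, cfg: DQNConfig) -> bool:
    """The target network is frozen, then copied from the online net every N steps."""
    return step % cfg.target_update_every == 0
```

The second axiom in the ledger amounts to claiming that one such setting works across all seven games simultaneously, which is exactly what the fixed-architecture experiments test.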
Forward citations
Cited by 60 Pith papers
-
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
ASH: Agents that Self-Hone via Embodied Learning
ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
-
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
-
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
-
On-line Learning in Tree MDPs by Treating Policies as Bandit Arms
Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on t...
-
Replay-buffer engineering for noise-robust quantum circuit optimization
Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compila...
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
Bounded Ratio Reinforcement Learning
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
-
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
-
Soft Actor-Critic Algorithms and Applications
SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
-
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
-
Continuous control with deep reinforcement learning
DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competiti...
-
CA2: Code-Aware Agent for Automated Game Testing
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
-
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
-
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
-
DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games
Team-symmetric games always have team-symmetric Nash equilibria solvable via linear complementarity problems, and the DelAC actor-critic MARL algorithm outperforms existing methods in simulations.
-
Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning
Plan2Cleanse frames RL backdoor detection as a Monte Carlo planning problem to achieve over 61 percentage point gains in trigger detection and improved win rates in competitive environments.
-
Learning the Preferences of a Learning Agent
Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
-
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models
Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.
-
Quantile Geometry Regularization for Distributional Reinforcement Learning
RQIQN introduces a Wasserstein DRO-based correction to Bellman quantile targets that enlarges distributional spread without altering risk-neutral averages.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
-
Towards Real-time Control of a CartPole System on a Quantum Computer
A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, in...
-
AutoREC: A software platform for developing reinforcement learning agents for equivalent circuit model generation from electrochemical impedance spectroscopy data
AutoREC uses a Double Deep Q-Network agent to generate equivalent circuit models from EIS data, reporting over 99.6% success on synthetic sets and generalization to experimental battery, corrosion, and catalysis data.
-
Improving Zero-Shot Offline RL via Behavioral Task Sampling
Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.
-
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
-
From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing
PtoP uses SVGD to create diverse, failure-inducing seeds for ADS testing, boosting violation rates by up to 27.68% and diversity by 9.6% over baselines.
-
Scalable Neighborhood-Based Multi-Agent Actor-Critic
MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
-
GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning
GRAIL autonomously grounds relational concepts in NeSy-RL by using LLM weak supervision followed by interaction-based refinement, matching or exceeding manually defined concepts on Atari games.
-
Soft-Quantum Algorithms
Directly training soft-unitary matrices with a unitarity regularization term and converting them to circuits via alignment enables faster training and lower loss than gate-based optimization on small quantum classific...
-
Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.
-
Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion en...
-
Behavior Regularized Offline Reinforcement Learning
Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
-
Towards A Rigorous Science of Interpretable Machine Learning
The authors define interpretability for machine learning, specify when it is required, and propose a taxonomy for its rigorous evaluation while identifying open research questions.
-
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
-
Active Sensing with Meta-Reinforcement Learning for Emitter Localization from RF Observations
A meta-reinforcement learning agent achieves 80.1% success in localizing RF emitters by sequentially sensing the environment with a 2x2 patch antenna in Sionna ray-tracing simulations.
-
Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning
Higher-resolution observations with global-average-pooling encoders improve RL performance and generalization by enabling more localized visual attention, yielding up to 28% gains over standard Impala encoders.
-
PG-LRF: Physiology-Guided Latent Rectified Flow for Electro-Hemodynamic PPG-to-ECG Generation
PG-LRF generates signal-faithful and physiologically plausible ECGs from PPG inputs by structuring a latent space with an electro-hemodynamic simulator and enforcing consistency in a rectified flow model.
-
Soft Deterministic Policy Gradient with Gaussian Smoothing
Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discr...
-
E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation
E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality i...
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
-
A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
A survey of MARL with GNN-based communication that proposes a generalized process to organize and clarify existing methods.
-
Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems
Koopman-learned linear dynamics enable an online actor-critic RL method that improves sample efficiency and closed-loop performance on nonlinear robotic systems compared with model-free and other model-based baselines.
-
Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach
A hybrid multi-agent DRL framework with attention and meta-optimization jointly tunes beamforming, power, RIS configuration, and positions to achieve higher energy efficiency in aerial MF-RIS and fluid-antenna full-du...
-
Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.
-
Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning
BRAL-T uses TrustSet-guided reinforcement learning for batch active learning and reports state-of-the-art results on 10 image classification benchmarks plus 2 fine-tuning tasks.
-
Semantic-Aware UAV Command and Control for Efficient IoT Data Collection
A DDQN policy for UAVs using semantic latent representations from DeepJSCC outperforms greedy and traveling salesman baselines in simulated device coverage and image reconstruction quality.
-
Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models
JCQL uses an SLM-trained KBC model as an action in an LLM agent for KBQA to reduce hallucinations, then fine-tunes the KBC model with KBQA reasoning paths, outperforming baselines on two benchmarks.
-
Hierarchical Reasoning Model
HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...
-
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.
-
Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.
-
Fuzzy Encoding-Decoding to Improve Spiking Q-Learning Performance in Autonomous Driving
A fuzzy encoder-decoder architecture reduces information loss in spiking Q-learning and narrows the performance gap with conventional multi-modal networks on HighwayEnv driving tasks.
Reference graph
Works this paper leans on
-
[1]
Residual algorithms: Reinforcement learning with function approximation
Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pages 30–37. Morgan Kaufmann, 1995
work page 1995
-
[2]
Sketch-based linear value function approximation
Marc Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems 25, pages 2222–2230, 2012
work page 2012
-
[3]
The arcade learning environment: An evaluation platform for general agents
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013
work page 2013
-
[4]
Investigating contingency awareness using atari 2600 games
Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using atari 2600 games. In AAAI, 2012
work page 2012
-
[5]
Bayesian learning of recursively factored environments
Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML 2013), pages 1211–1219, 2013
work page 2013
-
[6]
Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition
George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, January 2012
work page 2012
-
[7]
Speech recognition with deep recurrent neural networks
Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, 2013
work page 2013
-
[8]
A neuro-evolution approach to general atari game playing
Matthew Hausknecht, Risto Miikkulainen, and Peter Stone. A neuro-evolution approach to general atari game playing. 2013
work page 2013
-
[9]
Actor-critic reinforcement learning with energy-based policies
Nicolas Heess, David Silver, and Yee Whye Teh. Actor-critic reinforcement learning with energy-based policies. In European Workshop on Reinforcement Learning, page 43, 2012
work page 2012
-
[10]
What is the best multi-stage architecture for object recognition?
Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 2146–2153. IEEE, 2009
work page 2009
-
[11]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012
work page 2012
-
[12]
Deep auto-encoder neural networks in reinforcement learning
Sascha Lange and Martin Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8. IEEE, 2010
work page 2010
-
[13]
Reinforcement learning for robots using neural networks
Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993
work page 1993
-
[14]
Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Hamid Maei, Csaba Szepesvari, Shalabh Bhatnagar, Doina Precup, David Silver, and Rich Sutton. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In Advances in Neural Information Processing Systems 22, pages 1204–1212, 2009
work page 2009
-
[15]
Toward off-policy learning control with function approximation
Hamid Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719–726, 2010
work page 2010
-
[16]
Machine Learning for Aerial Image Labeling
Volodymyr Mnih. Machine Learning for Aerial Image Labeling. PhD thesis, University of Toronto, 2013
work page 2013
-
[17]
Prioritized sweeping: Reinforcement learning with less data and less real time
Andrew Moore and Chris Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103–130, 1993
work page 1993
-
[18]
Rectified linear units improve restricted boltzmann machines
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 807–814, 2010
work page 2010
-
[19]
Why did TD-Gammon work?
Jordan B. Pollack and Alan D. Blair. Why did TD-Gammon work? In Advances in Neural Information Processing Systems 9, pages 10–16, 1996
work page 1996
-
[20]
Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method
Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer, 2005
work page 2005
-
[21]
Reinforcement learning with factored states and actions
Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5:1063–1088, 2004
work page 2004
-
[22]
Pedestrian detection with unsupervised multi-stage feature learning
Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR 2013). IEEE, 2013
work page 2013
-
[23]
Reinforcement Learning: An Introduction
Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998
work page 1998
-
[24]
Temporal difference learning and td-gammon
Gerald Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995
work page 1995
-
[25]
An analysis of temporal-difference learning with function approximation
John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997
work page 1997
-
[26]
Q-learning
Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992
work page 1992