Recognition: 2 theorem links
· Lean Theorem
Playing Atari with Deep Reinforcement Learning
Pith reviewed 2026-05-11 07:54 UTC · model grok-4.3
The pith
A convolutional neural network learns control policies for Atari games directly from raw pixel inputs using reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
What carries the argument
A convolutional neural network trained with Q-learning that maps raw pixel inputs to action-value estimates.
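The shape arithmetic behind this machinery can be checked directly. A minimal sketch in Python, assuming the layer sizes reported for the network (four stacked 84×84 frames; 16 8×8 filters at stride 4; 32 4×4 filters at stride 2; a 256-unit hidden layer); the function names are illustrative:

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

def dqn_shapes(input_size: int = 84, n_actions: int = 6) -> list:
    """Trace tensor shapes through the Atari Q-network as described."""
    s1 = conv_out(input_size, kernel=8, stride=4)   # first conv layer
    s2 = conv_out(s1, kernel=4, stride=2)           # second conv layer
    flat = 32 * s2 * s2                             # flattened conv features
    return [
        ("input", (4, input_size, input_size)),     # 4 stacked grayscale frames
        ("conv1", (16, s1, s1)),
        ("conv2", (32, s2, s2)),
        ("flatten", (flat,)),
        ("fc", (256,)),
        ("q_values", (n_actions,)),                 # one Q-value per legal action
    ]
```

For an 84×84 input this yields 20×20 and 9×9 feature maps and a 2592-dimensional flattened vector feeding the 256-unit layer; the single output head emitting one Q-value per action is what lets one forward pass score every action at once.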
If this is right
- Single fixed architecture succeeds across games with varying dynamics and rewards.
- Outperforms previous methods on six of seven tested Atari games.
- Surpasses human expert performance on three games.
- Learns directly from high-dimensional sensory input without domain knowledge.
Where Pith is reading between the lines
- Such models could potentially be adapted to other visual control tasks like robotics.
- Scaling this approach might enable agents that handle more complex environments.
- This suggests deep RL can reduce the need for manual feature engineering in game AI.
Load-bearing premise
The assumption that one unchanging convolutional network and Q-learning setup can produce effective policies for games with substantially different reward structures and visual dynamics.
What would settle it
Retraining the described network on the seven Atari games and checking whether it reproduces the reported scores; failing to outperform prior methods on the six games where outperformance is claimed would falsify the central result.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network trained with a variant of Q-learning whose input is raw pixels and whose output is a value function; experience replay and target networks are used to stabilize training. The same fixed architecture and algorithm (no per-game adjustments) are applied to seven Atari 2600 games from the Arcade Learning Environment, outperforming all previous approaches on six games and surpassing human expert performance on three.
Significance. If the empirical results hold, the work is significant because it shows that deep neural networks can be combined with reinforcement learning to solve control tasks from raw high-dimensional inputs without domain-specific features or tuning. The stabilization techniques (experience replay and periodic target network updates) directly address known divergence problems in deep Q-learning, and the consistent results across diverse games with a single method provide evidence of generality. The detailed description of the architecture, update rule, and use of standard benchmarks (Arcade Learning Environment) supports reproducibility of the central empirical claims.
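The stabilization mechanics highlighted here can be sketched in a few lines. A toy replay-buffer Q-learning loop in Python, using a tabular Q-function on a contrived random environment; the tiny state space and all constants are illustrative, not the paper's implementation:

```python
import random
from collections import deque

random.seed(0)
N_STATES, N_ACTIONS = 4, 2
GAMMA, ALPHA = 0.99, 0.1

q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]        # online Q-table
replay = deque(maxlen=1000)                             # experience replay buffer

def step(state, action):
    """Toy environment: random next state, reward 1 only on entering state 3."""
    next_state = random.randrange(N_STATES)
    return next_state, 1.0 if next_state == 3 else 0.0

state = 0
for t in range(5000):
    action = random.randrange(N_ACTIONS)                # pure exploration
    next_state, reward = step(state, action)
    replay.append((state, action, reward, next_state))  # store transition
    batch = random.sample(replay, min(32, len(replay))) # sample i.i.d. minibatch
    for s, a, r, s2 in batch:
        target = r + GAMMA * max(q[s2])                 # Bellman target
        q[s][a] += ALPHA * (target - q[s][a])           # move estimate toward it
    state = next_state
```

Sampling minibatches uniformly from the buffer, rather than learning from the latest transition only, is the decorrelation step the report credits with taming divergence; each experience is also reused in many updates, improving data efficiency.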
minor comments (3)
- [Section 4] Section 4 (Deep Q-Learning): the loss function and target computation are described in prose; adding an explicit equation for the target value y_j (incorporating the target network) would improve clarity and make the stabilization mechanism easier to follow.
- [Table 1] Table 1 and Section 5 (Experiments): average scores are reported, but the number of evaluation episodes per game and any measure of variability (e.g., standard deviation across runs) are not stated; including these would strengthen assessment of the outperformance claims.
- [Section 5] Figure 2 (or equivalent training curves): if full learning curves are present only in supplementary material, a brief reference in the main text would help readers understand the stability achieved by the proposed variant.
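For reference, the target the first comment asks to see written out takes the standard form below (this is a reconstruction following the referee's description of stabilized deep Q-learning, with θ⁻ denoting the periodically frozen target-network parameters, not the paper's exact typesetting):

```latex
y_j =
\begin{cases}
r_j & \text{if the episode terminates at step } j+1,\\[2pt]
r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^{-}) & \text{otherwise,}
\end{cases}
\qquad
L(\theta) = \mathbb{E}_{(\phi_j, a_j, r_j, \phi_{j+1}) \sim \mathcal{D}}
\Big[\big(y_j - Q(\phi_j, a_j; \theta)\big)^{2}\Big].
```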
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of the contributions, and recommendation to accept. We are pleased that the significance of combining deep networks with reinforcement learning for high-dimensional control tasks was recognized.
Circularity Check
No significant circularity
full rationale
The paper's core contribution is an empirical demonstration: a fixed CNN architecture plus stabilized Q-learning (experience replay + target network) is trained end-to-end on raw pixels from the external Arcade Learning Environment and evaluated on held-out game episodes. Performance numbers are measured outcomes on public benchmarks, not quantities defined or fitted to themselves. The update rules follow the standard Bellman equation with two well-motivated stabilizations; neither the architecture nor the algorithm is derived from the reported scores. No self-citation chain, self-definitional loop, or fitted-input-renamed-as-prediction appears in the derivation or results section. The method is externally falsifiable on the same benchmarks.
Axiom & Free-Parameter Ledger
free parameters (4)
- learning rate
- discount factor gamma
- replay buffer size and sampling
- target network update frequency
axioms (2)
- domain assumption The environment satisfies the Markov property with respect to the observed pixel frames.
- domain assumption Gradient descent on the Q-network loss converges to a useful policy under the chosen hyperparameters.
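Each free parameter in the ledger enters the training loop at a specific point. A minimal configuration sketch in Python with deliberately illustrative placeholder values (these are not the paper's settings):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DQNConfig:
    learning_rate: float = 1e-4        # step size of the Q-network update
    gamma: float = 0.99                # discount factor in the Bellman target
    replay_capacity: int = 100_000     # max transitions kept for i.i.d. sampling
    batch_size: int = 32               # transitions sampled per update
    target_update_every: int = 1_000   # steps between target-network syncs

def should_sync_target(step: int, cfg: DQNConfig) -> bool:
    """The target network is frozen, then copied from the online net every N steps."""
    return step % cfg.target_update_every == 0
```

The second axiom in the ledger amounts to claiming that one such setting works across all seven games simultaneously, which is exactly what the fixed-architecture experiments test.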
Forward citations
Cited by 60 Pith papers
-
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
ASH: Agents that Self-Hone via Embodied Learning
ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
-
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
-
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
-
On-line Learning in Tree MDPs by Treating Policies as Bandit Arms
Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on t...
-
Replay-buffer engineering for noise-robust quantum circuit optimization
Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compila...
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
Bounded Ratio Reinforcement Learning
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
-
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
-
Soft Actor-Critic Algorithms and Applications
SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
-
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
-
Continuous control with deep reinforcement learning
DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competiti...
-
CA2: Code-Aware Agent for Automated Game Testing
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
-
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
-
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
-
DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games
Team-symmetric games always have team-symmetric Nash equilibria solvable via linear complementarity problems, and the DelAC actor-critic MARL algorithm outperforms existing methods in simulations.
-
Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning
Plan2Cleanse frames RL backdoor detection as a Monte Carlo planning problem to achieve over 61 percentage point gains in trigger detection and improved win rates in competitive environments.
-
Learning the Preferences of a Learning Agent
Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
-
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models
Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.
-
Quantile Geometry Regularization for Distributional Reinforcement Learning
RQIQN introduces a Wasserstein DRO-based correction to Bellman quantile targets that enlarges distributional spread without altering risk-neutral averages.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
-
Towards Real-time Control of a CartPole System on a Quantum Computer
A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, in...
-
AutoREC: A software platform for developing reinforcement learning agents for equivalent circuit model generation from electrochemical impedance spectroscopy data
AutoREC uses a Double Deep Q-Network agent to generate equivalent circuit models from EIS data, reporting over 99.6% success on synthetic sets and generalization to experimental battery, corrosion, and catalysis data.
-
Improving Zero-Shot Offline RL via Behavioral Task Sampling
Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.
-
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
-
From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing
PtoP uses SVGD to create diverse, failure-inducing seeds for ADS testing, boosting violation rates by up to 27.68% and diversity by 9.6% over baselines.
-
Scalable Neighborhood-Based Multi-Agent Actor-Critic
MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
-
GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning
GRAIL autonomously grounds relational concepts in NeSy-RL by using LLM weak supervision followed by interaction-based refinement, matching or exceeding manually defined concepts on Atari games.
-
Soft-Quantum Algorithms
Directly training soft-unitary matrices with a unitarity regularization term and converting them to circuits via alignment enables faster training and lower loss than gate-based optimization on small quantum classific...
-
Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.
-
Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
ARL lifts states into signature-augmented manifolds and employs self-consistent proxies of future path-laws to enable deterministic expected-return evaluation while preserving contraction mappings in jump-diffusion en...
-
Behavior Regularized Offline Reinforcement Learning
Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
-
Towards A Rigorous Science of Interpretable Machine Learning
The authors define interpretability for machine learning, specify when it is required, and propose a taxonomy for its rigorous evaluation while identifying open research questions.
-
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
-
Active Sensing with Meta-Reinforcement Learning for Emitter Localization from RF Observations
A meta-reinforcement learning agent achieves 80.1% success in localizing RF emitters by sequentially sensing the environment with a 2x2 patch antenna in Sionna ray-tracing simulations.
-
Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning
Higher-resolution observations with global-average-pooling encoders improve RL performance and generalization by enabling more localized visual attention, yielding up to 28% gains over standard Impala encoders.
-
PG-LRF: Physiology-Guided Latent Rectified Flow for Electro-Hemodynamic PPG-to-ECG Generation
PG-LRF generates signal-faithful and physiologically plausible ECGs from PPG inputs by structuring a latent space with an electro-hemodynamic simulator and enforcing consistency in a rectified flow model.
-
Soft Deterministic Policy Gradient with Gaussian Smoothing
Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discr...
-
E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation
E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality i...
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
-
A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
A survey of MARL with GNN-based communication that proposes a generalized process to organize and clarify existing methods.
-
Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems
Koopman-learned linear dynamics enable an online actor-critic RL method that improves sample efficiency and closed-loop performance on nonlinear robotic systems compared with model-free and other model-based baselines.
-
Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach
A hybrid multi-agent DRL framework with attention and meta-optimization jointly tunes beamforming, power, RIS configuration, and positions to achieve higher energy efficiency in aerial MF-RIS and fluid-antenna full-du...
-
Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.
-
Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning
BRAL-T uses TrustSet-guided reinforcement learning for batch active learning and reports state-of-the-art results on 10 image classification benchmarks plus 2 fine-tuning tasks.
-
Semantic-Aware UAV Command and Control for Efficient IoT Data Collection
A DDQN policy for UAVs using semantic latent representations from DeepJSCC outperforms greedy and traveling salesman baselines in simulated device coverage and image reconstruction quality.
-
Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models
JCQL uses an SLM-trained KBC model as an action in an LLM agent for KBQA to reduce hallucinations, then fine-tunes the KBC model with KBQA reasoning paths, outperforming baselines on two benchmarks.
-
Hierarchical Reasoning Model
HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...
-
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.
-
Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.
-
Fuzzy Encoding-Decoding to Improve Spiking Q-Learning Performance in Autonomous Driving
A fuzzy encoder-decoder architecture reduces information loss in spiking Q-learning and narrows the performance gap with conventional multi-modal networks on HighwayEnv driving tasks.
Reference graph
Works this paper leans on
-
[1]
Residual algorithms: Reinforcement learning with function approximation
Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pages 30–37. Morgan Kaufmann, 1995
work page 1995
-
[2]
Sketch-based linear value function approximation
Marc Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems 25, pages 2222–2230, 2012
work page 2012
-
[3]
The arcade learning environment: An evaluation platform for general agents
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013
work page 2013
-
[4]
Investigating contingency awareness using atari 2600 games
Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using atari 2600 games. In AAAI, 2012
work page 2012
-
[5]
Bayesian learning of recursively factored environments
Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML 2013), pages 1211–1219, 2013
work page 2013
-
[6]
Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition
George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, January 2012
work page 2012
-
[7]
Speech recognition with deep recurrent neural networks
Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, 2013
work page 2013
-
[8]
A neuro-evolution approach to general atari game playing
Matthew Hausknecht, Risto Miikkulainen, and Peter Stone. A neuro-evolution approach to general atari game playing. 2013
work page 2013
-
[9]
Actor-critic reinforcement learning with energy-based policies
Nicolas Heess, David Silver, and Yee Whye Teh. Actor-critic reinforcement learning with energy-based policies. In European Workshop on Reinforcement Learning, page 43, 2012
work page 2012
-
[10]
What is the best multi-stage architecture for object recognition?
Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 2146–2153. IEEE, 2009
work page 2009
-
[11]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012
work page 2012
-
[12]
Deep auto-encoder neural networks in reinforcement learning
Sascha Lange and Martin Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8. IEEE, 2010
work page 2010
-
[13]
Reinforcement learning for robots using neural networks
Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993
work page 1993
-
[14]
Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Hamid Maei, Csaba Szepesvari, Shalabh Bhatnagar, Doina Precup, David Silver, and Rich Sutton. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In Advances in Neural Information Processing Systems 22, pages 1204–1212, 2009
work page 2009
-
[15]
Toward off-policy learning control with function approximation
Hamid Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719–726, 2010
work page 2010
-
[16]
Machine Learning for Aerial Image Labeling
Volodymyr Mnih. Machine Learning for Aerial Image Labeling. PhD thesis, University of Toronto, 2013
work page 2013
-
[17]
Prioritized sweeping: Reinforcement learning with less data and less real time
Andrew Moore and Chris Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103–130, 1993
work page 1993
-
[18]
Rectified linear units improve restricted boltzmann machines
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 807–814, 2010
work page 2010
-
[19]
Why did TD-Gammon work?
Jordan B. Pollack and Alan D. Blair. Why did TD-Gammon work? In Advances in Neural Information Processing Systems 9, pages 10–16, 1996
work page 1996
-
[20]
Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method
Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer, 2005
work page 2005
-
[21]
Reinforcement learning with factored states and actions
Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5:1063–1088, 2004
work page 2004
-
[22]
Pedestrian detection with unsupervised multi-stage feature learning
Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR 2013). IEEE, 2013
work page 2013
-
[23]
Reinforcement Learning: An Introduction
Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998
work page 1998
-
[24]
Temporal difference learning and td-gammon
Gerald Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995
work page 1995
-
[25]
An analysis of temporal-difference learning with function approximation
John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997
work page 1997
-
[26]
Q-learning
Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992
work page 1992