OpenAI Gym

Mixed citation behavior. Most common role is background (45%).

123 Pith papers citing it
Background 45% of classified citations
abstract

OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.
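
The "common interface" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classic Gym convention in which `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`; later releases changed these signatures. `CoinGuessEnv` and `run_episode` are hypothetical names invented for illustration, and the code deliberately does not import the `gym` package.

```python
import random
from typing import Any, Callable, Dict, Tuple


class CoinGuessEnv:
    """Hypothetical toy environment following the classic reset()/step() convention."""

    def __init__(self, episode_length: int = 10, seed: int = 0) -> None:
        self.episode_length = episode_length
        self._rng = random.Random(seed)
        self._t = 0
        self._coin = 0

    def reset(self) -> int:
        """Start a new episode and return the first observation (the current coin)."""
        self._t = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        """Apply an action and return (observation, reward, done, info)."""
        reward = 1.0 if action == self._coin else 0.0  # reward for matching the coin
        self._t += 1
        self._coin = self._rng.randint(0, 1)  # draw the next coin to observe
        done = self._t >= self.episode_length
        return self._coin, reward, done, {}


def run_episode(env: Any, policy: Callable[[int], int]) -> float:
    """Generic agent-environment loop; works for any object with this interface."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward
```

Because `run_episode` touches only `reset()` and `step()`, the same loop can drive any environment that exposes this interface, which is the property that makes results on different benchmark problems comparable.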

citation-role summary

background 10 · dataset 6 · method 2 · baseline 1 · other 1

representative citing papers

What Type of Inference is Active Inference?

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.

Proximal State Nudging: Reducing Skill Atrophy from AI Assistance

cs.RO · 2026-05-19 · unverdicted · novelty 7.0

Proximal State Nudging (PSN) jointly optimizes skill development and task performance in shared autonomy, outperforming baselines in LunarLander simulation and yielding up to 7x larger unassisted skill gains with 50% fewer collisions in human CARLA driving studies.

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best among learned ones in tested scenarios.

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

cs.AI · 2026-05-02 · unverdicted · novelty 7.0

EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.

Adaptive Ensemble Aggregation for Actor-Critics

cs.LG · 2025-07-31 · unverdicted · novelty 7.0

AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

citing papers explorer

Showing 50 of 123 citing papers.

  • The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering cs.SE · 2025-07-20 · conditional · none · ref 9 · internal anchor

    AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.

  • BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation cs.RO · 2024-03-14 · accept · none · ref 30 · internal anchor

    BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.

  • Decision Transformer: Reinforcement Learning via Sequence Modeling cs.LG · 2021-06-02 · accept · none · ref 11 · internal anchor

    Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

  • Low-power analogue neural networks with trainable nonlinear connections for continuous control cs.LG · 2026-06-21 · unverdicted · none · ref 48 · internal anchor

    Placing trainable nonlinear functions on connections in analogue networks enables efficient representation of smooth continuous targets with hardware transfer at projected 30 microwatt power.

  • Expected Free Energy-based Planning as Variational Inference cs.AI · 2026-06-09 · unverdicted · none · ref 42 · internal anchor

    EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.

  • What Type of Inference is Active Inference? cs.AI · 2026-06-03 · unverdicted · none · ref 46 · internal anchor

    EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.

  • Proximal State Nudging: Reducing Skill Atrophy from AI Assistance cs.RO · 2026-05-19 · unverdicted · none · ref 28 · internal anchor

    Proximal State Nudging (PSN) jointly optimizes skill development and task performance in shared autonomy, outperforming baselines in LunarLander simulation and yielding up to 7x larger unassisted skill gains with 50% fewer collisions in human CARLA driving studies.

  • Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation cs.LG · 2026-05-18 · unverdicted · none · ref 130 · internal anchor

    RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

  • NeuroTrain: Surveying Local Learning Rules for Spiking Neural Networks with an Open Benchmarking Framework cs.NE · 2026-05-14 · unverdicted · none · ref 151 · internal anchor

    A taxonomy of SNN training algorithms is presented with the release of NeuroTrain, an open benchmarking framework for reproducible comparisons across datasets and architectures.

  • Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling cs.LG · 2026-05-14 · unverdicted · none · ref 89 · internal anchor

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  • Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry cs.LG · 2026-05-14 · unverdicted · none · ref 14 · internal anchor

    MSRL represents trajectory segments as PSD matrices to prove additive composition properties and bootstrap value functions for better transfer, reaching 0.73 AUC versus 0.57-0.65 baselines.

  • IGT-OMD: Implicit Gradient Transport for Decision-Focused Learning under Delayed Feedback cs.LG · 2026-05-12 · unverdicted · none · ref 37 · internal anchor

    IGT-OMD reduces gradient transport error from quadratic to linear in delay length for delayed bilevel optimization and achieves sublinear regret with adaptive steps.

  • gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods cs.LG · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

    gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best among learned ones in tested scenarios.

  • Revisiting Mixture Policies in Entropy-Regularized Actor-Critic cs.LG · 2026-05-09 · unverdicted · none · ref 7 · internal anchor

    A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

  • Operator-Guided Invariance Learning for Continuous Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 10 · internal anchor

    VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.

  • FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning cs.LG · 2026-05-06 · unverdicted · none · ref 44 · internal anchor

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

  • EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 20 · internal anchor

    EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.

  • Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 2 · internal anchor

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  • Hierarchical Active Inference using Successor Representations cs.LG · 2026-04-17 · unverdicted · none · ref 1 · internal anchor

    A hierarchical active inference framework using successor representations learns abstract states and actions to enable efficient planning on navigation and reinforcement learning tasks.

  • Flow Gym: A framework for the development, benchmarking, training, and deployment of flow-field quantification methods physics.flu-dyn · 2025-12-12 · accept · none · ref 1 · internal anchor

    Flow Gym supplies a JAX-based framework with standardized interfaces, modular components, and utilities to develop, benchmark, train, and deploy flow-field quantification methods such as PIV on both synthetic and experimental data.

  • Adaptive Ensemble Aggregation for Actor-Critics cs.LG · 2025-07-31 · unverdicted · none · ref 7 · internal anchor

    AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.

  • Steering Your Diffusion Policy with Latent Space Reinforcement Learning cs.RO · 2025-06-18 · unverdicted · none · ref 85 · internal anchor

    DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.

  • Group-in-Group Policy Optimization for LLM Agent Training cs.LG · 2025-05-16 · unverdicted · none · ref 45 · internal anchor

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

  • A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 12 · internal anchor

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  • Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models cs.LG · 2020-06-08 · unverdicted · none · ref 3 · internal anchor

    Introduces multistep predecessor models for Dyna planning to mitigate value hallucination by avoiding real-state updates from simulated values.

  • Dota 2 with Large Scale Deep Reinforcement Learning cs.LG · 2019-12-13 · accept · none · ref 61 · internal anchor

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  • Benchmarking Model-Based Reinforcement Learning cs.LG · 2019-07-03 · accept · none · ref 5 · internal anchor

    Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termination dilemma.

  • Learning the Arrow of Time cs.LG · 2019-07-02 · unverdicted · none · ref 21 · internal anchor

    Introduces a learned arrow of time in MDPs that aligns with the Jordan-Kinderlehrer-Otto notion for stochastic processes and enables practical RL utilities like reachability and side-effect detection.

  • Exploring Model-based Planning with Policy Networks cs.LG · 2019-06-20 · unverdicted · none · ref 2 · internal anchor

    POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.

  • Soft Actor-Critic Algorithms and Applications cs.LG · 2018-12-13 · unverdicted · none · ref 2 · internal anchor

    SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

  • Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor cs.LG · 2018-01-04 · accept · none · ref 3 · internal anchor

    Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

  • Proximal Policy Optimization Algorithms cs.LG · 2017-07-20 · accept · none · ref 1 · internal anchor

    A clipped surrogate objective L^CLIP = E[min(r_t A_t, clip(r_t, 1-ε, 1+ε) A_t)] enables multi-epoch minibatch policy updates with TRPO-like stability but first-order optimization.

  • Deep reinforcement learning from human preferences stat.ML · 2017-06-12 · accept · none · ref 4 · internal anchor

    Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.

  • QnRL: Quantum-Native Reinforcement Learning quant-ph · 2026-06-06 · unverdicted · none · ref 43 · internal anchor

    QnRL is a distributional quantum RL framework that distills conditional action policies from moments of quantum generative models in Hilbert space via the QuAK algorithm, reporting higher scores and fewer parameters than baselines.

  • ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders cs.RO · 2026-05-19 · accept · none · ref 4 · 2 links · internal anchor

    ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.

  • DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization cs.LG · 2026-05-18 · unverdicted · none · ref 33 · internal anchor

    DiPRL trains nearly discrete programmatic policies in RL by adding architecture entropy regularization to gradient-based optimization, avoiding performance collapse from post-hoc discretization.

  • Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making cs.LG · 2026-05-15 · unverdicted · none · ref 271 · internal anchor

    Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.

  • R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning cs.LG · 2026-05-13 · unverdicted · none · ref 26 · internal anchor

    R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.

  • CA2: Code-Aware Agent for Automated Game Testing cs.SE · 2026-05-13 · unverdicted · none · ref 8 · internal anchor

    CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.

  • Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 2 · 2 links · internal anchor

    Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.

  • Debiased Model-based Representations for Sample-efficient Continuous Control cs.LG · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

    DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or superior performance on continuous control benchmarks.

  • Policy Gradient Methods for Non-Markovian Reinforcement Learning cs.LG · 2026-05-11 · unverdicted · none · ref 60 · internal anchor

    Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.

  • Actor-Critic Algorithm for Dynamic Expectile and CVaR cs.LG · 2026-05-08 · unverdicted · none · ref 38 · internal anchor

    A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical outperformance in risk-averse domains.

  • BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning cs.AI · 2026-05-07 · unverdicted · none · ref 6 · internal anchor

    BehaviorGuard detects backdoor behaviors in DRL policies via behavioral drift in action distributions and suppresses suspicious actions at runtime, claimed as the first online defense for both single- and multi-agent settings.

  • QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 262 · internal anchor

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

  • Towards Real-time Control of a CartPole System on a Quantum Computer quant-ph · 2026-05-03 · unverdicted · none · ref 39 · internal anchor

    A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, including direct electronics programming to reduce latency.

  • Distributional Reinforcement Learning via the Cramér Distance cs.LG · 2026-04-26 · unverdicted · none · ref 4 · internal anchor

    C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.

  • Scalable Neighborhood-Based Multi-Agent Actor-Critic cs.LG · 2026-04-20 · unverdicted · none · ref 1 · internal anchor

    MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.

  • Distributional Off-Policy Evaluation with Deep Quantile Process Regression stat.ML · 2026-04-20 · unverdicted · none · ref 31 · internal anchor

    DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.

  • Policy-Invisible Violations in LLM-Based Agents cs.AI · 2026-04-14 · unverdicted · none · ref 4 · internal anchor

    LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.