pith. machine review for the scientific record.


OpenAI Gym

42 Pith papers cite this work. Polarity classification is still indexing.

abstract

OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.
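The "common interface" the abstract refers to is the reset/step episode loop that every Gym environment exposes. A minimal sketch of that contract, using a toy stand-in environment rather than Gym itself (the `ToyCartPole` class and its dynamics are illustrative, not Gym's actual implementation; original-era Gym returned the 4-tuple shown here):

```python
import random

class ToyCartPole:
    """Illustrative stand-in exposing the Gym-style interface:
    reset() -> observation, step(action) -> (obs, reward, done, info)."""

    def reset(self):
        self.steps = 0
        self.obs = [random.uniform(-0.05, 0.05) for _ in range(4)]
        return self.obs

    def step(self, action):
        assert action in (0, 1)          # discrete two-action space
        self.steps += 1
        self.obs = [x + random.uniform(-0.01, 0.01) for x in self.obs]
        done = self.steps >= 200         # fixed episode-length cap
        return self.obs, 1.0, done, {}   # reward of 1 per surviving step

def run_episode(env):
    """The standard agent-environment loop shared by all Gym envs."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = random.choice((0, 1))   # random policy for illustration
        obs, reward, done, info = env.step(action)
        total += reward
    return total
```

With a real environment the loop is identical; only the construction changes, e.g. `env = gym.make("CartPole-v0")`.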

hub tools

claims ledger

  • abstract


representative citing papers

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best among learned ones in tested scenarios.

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

cs.AI · 2026-05-02 · unverdicted · novelty 7.0

EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

Soft Actor-Critic Algorithms and Applications

cs.LG · 2018-12-13 · unverdicted · novelty 7.0

SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
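The "constrained temperature tuning" in the summary adjusts the entropy coefficient α so the policy's entropy tracks a fixed target. A minimal sketch of one dual-gradient step, parameterized through log α so α stays positive (the function name and learning rate are illustrative, not SAC's published code):

```python
import math

def temperature_step(log_alpha, log_pi_batch, target_entropy, lr=0.1):
    """One descent step on the alpha loss J(alpha) = E[-alpha*(log_pi + H_target)],
    taken with respect to log_alpha."""
    alpha = math.exp(log_alpha)
    mean_log_pi = sum(log_pi_batch) / len(log_pi_batch)
    # d/d(log_alpha) of -alpha*(mean_log_pi + H_target)
    grad = -alpha * (mean_log_pi + target_entropy)
    return log_alpha - lr * grad
```

If the policy's entropy (−E[log π]) falls below the target, the update raises α, strengthening the entropy bonus; if entropy overshoots, α shrinks.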

Proximal Policy Optimization Algorithms

cs.LG · 2017-07-20 · accept · novelty 7.0

A clipped surrogate objective L^CLIP = E[min(r_t A_t, clip(r_t, 1-ε, 1+ε) A_t)] enables multi-epoch minibatch policy updates with TRPO-like stability but first-order optimization.
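The clipped term in L^CLIP is computed per sample and the expectation is a minibatch mean. A pure-Python sketch of the per-sample objective (variable names are illustrative):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's per-sample objective: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Taking the min keeps the more pessimistic of the two terms, which
    removes any incentive to push the ratio far outside [1-eps, 1+eps]."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds 1+ε; for a negative advantage the min keeps the more negative clipped value, so the penalty is not silently capped.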

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.

Debiased Model-based Representations for Sample-efficient Continuous Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or superior performance on continuous control benchmarks.

Actor-Critic Algorithm for Dynamic Expectile and CVaR

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical outperformance in risk-averse domains.

Learning to Theorize the World from Observation

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

Towards Real-time Control of a CartPole System on a Quantum Computer

quant-ph · 2026-05-03 · unverdicted · novelty 6.0

A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, including direct electronics programming to reduce latency.

Distributional Reinforcement Learning via the Cramér Distance

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.
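For discrete distributions on a shared support, the Cramér distance used above has a closed form as the integrated squared difference of CDFs. A small sketch under that squared-l2 convention (one common definition; this toy function is illustrative, not C-DSAC's implementation):

```python
def cramer_distance(p, q, support):
    """Cramér distance between two pmfs p, q on a shared sorted support:
    the integral of (F_p - F_q)^2, accumulated over the gaps between atoms,
    where the CDFs are piecewise constant."""
    Fp = Fq = total = 0.0
    for i in range(len(support) - 1):
        Fp += p[i]
        Fq += q[i]
        dx = support[i + 1] - support[i]
        total += (Fp - Fq) ** 2 * dx
    return total
```

Unlike the KL divergence, this distance stays finite for distributions with disjoint supports, which is part of its appeal for distributional value learning.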

Scalable Neighborhood-Based Multi-Agent Actor-Critic

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
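The k-nearest-neighbor restriction described above can be sketched as a preprocessing step that builds each critic's fixed-size input from only the closest agents (function and argument names are illustrative, not MADDPG-K's code):

```python
import math

def knn_critic_input(positions, observations, agent, k):
    """Return the agent's own observation concatenated with the observations
    of its k nearest neighbors under Euclidean distance, so the critic's
    input size stays constant regardless of the total number of agents."""
    others = sorted((j for j in range(len(positions)) if j != agent),
                    key=lambda j: math.dist(positions[agent], positions[j]))
    out = list(observations[agent])
    for j in others[:k]:
        out.extend(observations[j])
    return out
```

Because the input length depends only on k, the same critic network can be reused as agents are added or removed.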

citing papers explorer

Showing 42 of 42 citing papers.

  • IGT-OMD: Implicit Gradient Transport for Decision-Focused Learning under Delayed Feedback cs.LG · 2026-05-12 · unverdicted · none · ref 37 · internal anchor

    IGT-OMD reduces gradient transport error from quadratic to linear in delay length for delayed bilevel optimization and achieves sublinear regret with adaptive steps.

  • gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods cs.LG · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

    gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best among learned ones in tested scenarios.

  • Revisiting Mixture Policies in Entropy-Regularized Actor-Critic cs.LG · 2026-05-09 · unverdicted · none · ref 7 · internal anchor

    A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

  • Operator-Guided Invariance Learning for Continuous Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 10 · internal anchor

    VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.

  • FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning cs.LG · 2026-05-06 · unverdicted · none · ref 44 · internal anchor

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

  • EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 20 · internal anchor

    EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.

  • Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 2 · internal anchor

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  • Hierarchical Active Inference using Successor Representations cs.LG · 2026-04-17 · unverdicted · none · ref 1 · internal anchor

    A hierarchical active inference framework using successor representations learns abstract states and actions to enable efficient planning on navigation and reinforcement learning tasks.

  • Group-in-Group Policy Optimization for LLM Agent Training cs.LG · 2025-05-16 · unverdicted · none · ref 45 · internal anchor

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

  • A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 12 · internal anchor

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  • Dota 2 with Large Scale Deep Reinforcement Learning cs.LG · 2019-12-13 · accept · none · ref 61 · internal anchor

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  • Soft Actor-Critic Algorithms and Applications cs.LG · 2018-12-13 · unverdicted · none · ref 2 · internal anchor

    SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

  • Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor cs.LG · 2018-01-04 · accept · none · ref 3 · internal anchor

    Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

  • Proximal Policy Optimization Algorithms cs.LG · 2017-07-20 · accept · none · ref 1 · internal anchor

    A clipped surrogate objective L^CLIP = E[min(r_t A_t, clip(r_t, 1-ε, 1+ε) A_t)] enables multi-epoch minibatch policy updates with TRPO-like stability but first-order optimization.

  • Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

    Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.

  • Debiased Model-based Representations for Sample-efficient Continuous Control cs.LG · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

    DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or superior performance on continuous control benchmarks.

  • Policy Gradient Methods for Non-Markovian Reinforcement Learning cs.LG · 2026-05-11 · unverdicted · none · ref 60 · internal anchor

    Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.

  • Actor-Critic Algorithm for Dynamic Expectile and CVaR cs.LG · 2026-05-08 · unverdicted · none · ref 38 · internal anchor

    A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical outperformance in risk-averse domains.

  • BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning cs.AI · 2026-05-07 · unverdicted · none · ref 6 · internal anchor

    BehaviorGuard detects backdoor behaviors in DRL policies via behavioral drift in action distributions and suppresses suspicious actions at runtime, claimed as the first online defense for both single- and multi-agent settings.

  • Learning to Theorize the World from Observation cs.LG · 2026-05-05 · unverdicted · none · ref 258 · internal anchor

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  • QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 262 · internal anchor

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

  • Towards Real-time Control of a CartPole System on a Quantum Computer quant-ph · 2026-05-03 · unverdicted · none · ref 39 · internal anchor

    A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, including direct electronics programming to reduce latency.

  • Distributional Reinforcement Learning via the Cramér Distance cs.LG · 2026-04-26 · unverdicted · none · ref 4 · internal anchor

    C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.

  • Scalable Neighborhood-Based Multi-Agent Actor-Critic cs.LG · 2026-04-20 · unverdicted · none · ref 1 · internal anchor

    MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.

  • Distributional Off-Policy Evaluation with Deep Quantile Process Regression stat.ML · 2026-04-20 · unverdicted · none · ref 31 · internal anchor

    DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.

  • Policy-Invisible Violations in LLM-Based Agents cs.AI · 2026-04-14 · unverdicted · none · ref 4 · internal anchor

    LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.

  • Infernux: A Python-Native Game Engine with JIT-Accelerated Scripting cs.GR · 2026-04-11 · unverdicted · none · ref 6 · internal anchor

    Infernux is a game engine that uses batch data bridging and Numba JIT to make Python scripting performant within a Vulkan C++ core.

  • Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset eess.SY · 2026-04-07 · unverdicted · none · ref 19 · internal anchor

    OpenCEM is the first open-source digital twin that integrates unstructured contextual information with quantitative microgrid dynamics to enable context-aware energy management.

  • Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 261 · internal anchor

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  • A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 184 · internal anchor

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  • Behavior Regularized Offline Reinforcement Learning cs.LG · 2019-11-26 · unverdicted · none · ref 4 · internal anchor

    Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

  • Towards A Rigorous Science of Interpretable Machine Learning stat.ML · 2017-02-28 · unverdicted · none · ref 6 · internal anchor

    The authors define interpretability for machine learning, specify when it is required, and propose a taxonomy for its rigorous evaluation while identifying open research questions.

  • Insider Attacks in Multi-Agent LLM Consensus Systems cs.MA · 2026-05-08 · unverdicted · none · ref 74 · internal anchor

    A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.

  • Soft Deterministic Policy Gradient with Gaussian Smoothing cs.LG · 2026-05-07 · unverdicted · none · ref 5 · internal anchor

    Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discretized-reward variants.

  • ARMATA: Auto-Regressive Multi-Agent Task Assignment cs.MA · 2026-05-05 · unverdicted · none · ref 45 · internal anchor

    ARMATA is a new end-to-end autoregressive model with multi-stage decoding that unifies allocation and routing for multi-agent systems and reports up to 20% better solutions than OR-Tools, CPLEX, and LKH-3 in seconds instead of hours.

  • Learning to Route Electric Trucks Under Operational Uncertainty eess.SY · 2026-04-29 · unverdicted · none · ref 37 · internal anchor

    A reinforcement learning framework formulated as an event-driven semi-Markov decision process with graph states and action masking outperforms heuristic and optimization baselines for stochastic electric truck routing under charging constraints.

  • Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems cs.RO · 2026-04-21 · unverdicted · none · ref 37 · internal anchor

    Koopman-learned linear dynamics enable an online actor-critic RL method that improves sample efficiency and closed-loop performance on nonlinear robotic systems compared with model-free and other model-based baselines.

  • Kimi K2.5: Visual Agentic Intelligence cs.CL · 2026-02-02 · unverdicted · none · ref 10 · internal anchor

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while reducing latency by up to 4.5×.

  • Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 5 · internal anchor

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  • Gymnasium: A Standard Interface for Reinforcement Learning Environments cs.LG · 2024-07-24 · accept · none · ref 7 · internal anchor

    Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.

  • robosuite: A Modular Simulation Framework and Benchmark for Robot Learning cs.RO · 2020-09-25 · unverdicted · none · ref 1 · internal anchor

    The paper presents robosuite v1.5, a MuJoCo-based modular simulation framework with benchmark environments for reproducible robot learning research.

  • Middle-mile logistics through the lens of goal-conditioned reinforcement learning stat.ML · 2026-05-04 · unverdicted · none · ref 7 · internal anchor

    Middle-mile logistics is cast as a multi-object goal-conditioned MDP and solved by combining graph neural networks with model-free RL via extraction of small feature graphs.