Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.
hub Canonical reference
nature , volume=
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
Geometric Pareto Control embeds Pareto solutions in a Lie group submanifold and navigates via Riemannian gradient flow to achieve 100% feasibility and low suboptimality in control tasks without retraining.
Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.
RL agents in fighting games learn to jointly predict actions and their durations, matching fixed frame-skip performance while favoring repeatable exploitative patterns against scripted bots.
DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark environments.
citing papers explorer
-
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.