{"total":15,"items":[{"citing_arxiv_id":"2605.20999","ref_index":119,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise","primary_cat":"math.PR","submitted_at":"2026-05-20T10:38:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20911","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"For How Long Should We Be Punching? Learning Action Duration in Fighting Games","primary_cat":"cs.AI","submitted_at":"2026-05-20T08:56:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RL agents in fighting games learn to jointly predict actions and their durations, matching fixed frame-skip performance while favoring repeatable exploitative patterns against scripted bots.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to. Keywords: Action duration · Reinforcement learning · Fighting games. 1 Introduction Reinforcement learning (RL) has achieved impressive results in many environments, such as Atari games [5], and complex board games like Go where approaches like AlphaGo [8] and AlphaZero [9] achieved breakthrough results. 
However, fast-paced games remain a difficult challenge. These games require rapid decision-making, continuous adaptation, and careful timing of actions. arXiv:2605.20911v1 [cs.AI] 20 May 2026 2 H. Nguyen et al. Timing is a crucial aspect in fighting games, where subtle differences in ac-"},{"citing_arxiv_id":"2605.19392","ref_index":60,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach","primary_cat":"cs.LG","submitted_at":"2026-05-19T05:38:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14379","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games","primary_cat":"cs.LG","submitted_at":"2026-05-14T05:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14350","ref_index":191,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task 
Sampling","primary_cat":"cs.LG","submitted_at":"2026-05-14T04:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14297","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients","primary_cat":"cs.LG","submitted_at":"2026-05-14T02:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09824","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy","primary_cat":"eess.SY","submitted_at":"2026-05-11T00:01:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Geometric Pareto Control embeds Pareto solutions in a Lie group submanifold and navigates via Riemannian gradient flow to achieve 100% feasibility and low suboptimality in control tasks without retraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"avoiding the clustering artifacts of Monte Carlo sampling that would leave regions of the operational envelope unrepresented 
onM∗. 3.1.2 Pareto-Optimal Solution Generation For each scenariopk, the system is simulated or evaluated to obtain the corresponding true system state sk∈Sand partial observationxk =h(s k)∈X. We then solve the multi-objective optimization problem via weighted scalarization: uk(w) = arg min u∈Ufeassk M∑ i=1 wiJi(sk,u),w∈∆ M−1 (5) sweepingwuniformly across∆ M−1to obtain diverse Pareto-optimal solutions. Remark 4(Supported Pareto Points).Weighted scalarization recoverssupportedPareto-optimal solutions, i.e. points on the convex hull of the Pareto front in objective space. For problems with a nonconvex Pareto front, interior (non-supported) points are not recoverable by any positive weight vectorw∈∆M−1."},{"citing_arxiv_id":"2605.09217","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning the Preferences of a Learning Agent","primary_cat":"cs.AI","submitted_at":"2026-05-09T23:28:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 BOLTZMANNRATIONALLEARNER We will also consider another approach to modeling the learner: assume that, over time, they con- verge to a Boltzmann rational policy (Luce et al., 1959; Ziebart et al., 2010). Stateless.In the stateless case, we model the learner's action selection process asa t ∼ˆpt, where ˆpt(a)∝exp(β ˆRt(a))for a rationality parameterβ∈[0,∞), and where ˆRt models the learner's estimate of the reward function at time stept. To capture the fact that the agent is learning, we assume that ˆRt converges to the true reward functionR ∗ over time. Formally, TX t=1 ∥ ˆRt −R ∗∥∞ ≤f(T)(1) for somef(T) =O(T α)whereα∈(0,1)captures the rate of the agent's learning. 
Stateful.In the stateful case, the learner receivess t at time steptand selects their action asa t ∼ ˆpt(·|st), withˆpt(a|s)∝exp(β ˆQt(s, a))where ˆQt models the learner's estimate of the action-value function at time stept. We model learning by assuming that this estimate ˆQt satisfies: TX τ=1:s τ=s ∥ ˆQτ(s,·)−Q ∗(s,·)∥ ∞ ≤f(N T (s))(2) 3 for everys∈ Sand for somef(N T (s)) =O(N T (s)α)whereα∈(0,1). An important detail here is that Boltzmann rationality is not itself a model of learning or exploration. Instead, it is a model capturing the inability to optimize the ground-truth reward function. Boltzmann rationality could model exploration ifβ→ ∞over time. This is the reason why the above is paired with the assumption that the learner's estimate of the reward/action-value converges over time. 3.2 EVALUATIONMEASURES There are various reasonable performance measures that we could use to evaluate the quality of the predictor's estimatesR1:T = (R1, . . . , RT ), (orQ 1:T = (Q1, . . . , QT )) for the true reward function R∗ (orQ ∗). We will introduce the best-response distance (Section 3.2.1), the KL divergence be- tween Boltzmann rational policies (Section 3.2.2), and norm-based measures (Section 3.2.3). 
These measures range from weak (best-response, which only requires matching the optimal action)"},{"citing_arxiv_id":"2605.08754","ref_index":11,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations","primary_cat":"cs.AI","submitted_at":"2026-05-09T07:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"rival, proximity, and head-on conflict objectives, respectively. To balance these objectives, the critic estimates a value vector for different reward components. 
For thei-th value component, the gradient magnitude of the value loss is com- puted as gi =∥∇ ϕLvalue,i∥1 .(10) The normalized component weight is then obtained by wi = exp(gi)PK j=1 exp(gj) .(11) The total value loss is defined as Lvalue =K KX i=1 wiLvalue,i,(12) 3 Dual-branch Feature Extraction and Fusion plan to plan to plan to plan to plan to plan to HFTR to HFTR to HFTR to HFTR to HFTR to HFTR to p a Policy p a Policy p a Policy p a Policy Concatenate ValueValueValueValue (0) 1h (0) 2h (1) 2h (2) 2h Nhier hier (3 -1) Nh MLP Hierarchical"},{"citing_arxiv_id":"2605.07637","ref_index":34,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding","primary_cat":"cs.AI","submitted_at":"2026-05-08T12:05:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00347","ref_index":61,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-01T02:05:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario 
Land.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18143","ref_index":88,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Distributional Off-Policy Evaluation with Deep Quantile Process Regression","primary_cat":"stat.ML","submitted_at":"2026-04-20T12:07:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 4(48):1875--1897. Shen, G., Jiao, Y., Lin, Y., Horowitz, J. L., and Huang, J. (2024). Nonparametric estimation of non-crossing quantile regression process with deep requ neural networks. Journal of Machine Learning Research, 25(88):1-75. Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040-1053. Subramanian, J., Sinha, A., Seraj, R., and Mahajan, A. (2022). Approximate information state for approximate planning and reinforcement learning in partially observed systems. 
Journal of Machine Learning Research, 23(12):1-83."},{"citing_arxiv_id":"2604.17433","ref_index":63,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-19T13:26:04+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.08935","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations","primary_cat":"cs.AI","submitted_at":"2023-12-14T13:41:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1911.05507","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Compressive Transformers for Long-Range Sequence Modelling","primary_cat":"cs.LG","submitted_at":"2019-11-13T14:36:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}