Venue: International Conference on Machine Learning (ICML)
9 Pith papers cite this work (2026). Polarity classification is still indexing; the nine representative citing papers below are currently unverdicted.
Representative citing papers
-
Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without a full RL solve per iteration, outperforming prior imitation methods by 2.4x in aggregate IQM (interquartile mean) and recovering generalizable rewards.
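A minimal sketch of the alternating scheme as the summary describes it: dual ascent on a linear reward via the feature-expectation gap, with a small local policy step standing in for the trust-region update instead of an inner RL solve to convergence. The toy MDP, features, and step sizes are all illustrative, not TRIRL's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
phi = rng.normal(size=(S, A, 3))             # features; r_theta(s, a) = theta . phi[s, a]
theta = np.zeros(3)

def feature_expectations(pi):
    """Discounted state-action feature expectations under pi (uniform start)."""
    P_pi = np.einsum('sap,sa->sp', P, pi)                 # induced state-to-state kernel
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T,
                        (1 - gamma) * np.ones(S) / S)     # discounted state occupancy
    return np.einsum('s,sa,saf->f', d, pi, phi)

pi_E = np.tile([0.9, 0.1], (S, 1))           # stand-in "expert" policy
mu_E = feature_expectations(pi_E)
logits = np.zeros((S, A))

for _ in range(200):
    pi = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    # Dual ascent on the reward: the gradient is the feature-expectation gap.
    theta += 0.1 * (mu_E - feature_expectations(pi))
    # Small local policy improvement step (a stand-in for the trust-region
    # update; crucially, no full RL solve per iteration).
    r = np.einsum('saf,f->sa', phi, theta)
    logits += 0.5 * (r - (pi * r).sum(1, keepdims=True))
```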
-
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
The paper establishes the first Õ(ε^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability, in both tabular and general function-approximation settings.
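For orientation, a sketch of the setup in standard notation (my reconstruction from the summary, not necessarily the paper's exact formulation):

```latex
% Offline data (x_i, a_i, r_i) is collected under a behavior policy mu;
% \bar\pi is a reference policy and \beta > 0 a regularization weight.
\[
  J_\beta(\pi) \;=\; \mathbb{E}_{x}\!\left[\,\mathbb{E}_{a\sim\pi(\cdot\mid x)}[r(x,a)]
  \;-\; \beta\,\mathrm{KL}\!\left(\bar\pi(\cdot\mid x)\,\middle\|\,\pi(\cdot\mid x)\right)\right]
\]
% The "forward" direction puts the reference policy in the first argument.
% Single-policy concentrability requires coverage of one comparator only:
\[
  C^{*} \;=\; \sup_{x,a}\, \frac{\pi^{*}(a\mid x)}{\mu(a\mid x)} \;<\; \infty .
\]
```

The claimed Õ(ε^{-1}) sample complexity is a "fast rate" relative to the Õ(ε^{-2}) behavior typical of unregularized offline objectives.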
-
Approximation-Free Differentiable Oblique Decision Trees
DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.
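To make "oblique" concrete, here is hard depth-2 oblique-tree inference, where each internal node tests a full hyperplane rather than a single feature threshold; DTSemNet's exact invertible neural encoding and gradient path are not reproduced here.

```python
import numpy as np

def oblique_tree_predict(x, W, b, leaves):
    """W, b: params for internal nodes 0..2 (root plus two children); leaves: 4 outputs."""
    node = 0
    for _ in range(2):                       # depth-2 tree in heap order: node -> 2*node+1, 2*node+2
        go_right = W[node] @ x + b[node] > 0.0   # oblique split: full hyperplane test
        node = 2 * node + 1 + int(go_right)
    return leaves[node - 3]                  # leaves occupy heap indices 3..6

W = np.array([[1.0, -1.0], [0.5, 2.0], [-1.0, 0.3]])
b = np.zeros(3)
print(oblique_tree_predict(np.array([0.2, 0.7]), W, b, np.array([0, 1, 2, 3])))  # -> 1
```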
-
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
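A hedged sketch of what an advantage-weighted discrete flow-matching loss could look like, assuming a denoiser that predicts per-position token distributions and AWR-style exponential weights; names and shapes are illustrative, not DRIFT's implementation.

```python
import torch
import torch.nn.functional as F

def awdfm_loss(model, x1, t, x_t, advantage, beta=1.0):
    """x1: clean discrete targets (B, L); x_t: corrupted tokens at time t (B, L)."""
    logits = model(x_t, t)                           # (B, L, V) predicted denoising dists
    ce = F.cross_entropy(logits.flatten(0, 1), x1.flatten(), reduction='none')
    ce = ce.view(x1.shape).mean(dim=1)               # per-sequence flow-matching loss
    w = torch.exp(advantage / beta).clamp(max=20.0)  # advantage weighting, AWR-style
    return (w.detach() * ce).mean()
```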
-
H\"older Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
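The Hölder (power) mean itself is standard: M_p(v) = (mean(v^p))^(1/p), with p = 1 the arithmetic mean, p → 0 the geometric mean, and p → -∞ the minimum, which is what makes annealing p a single knob on token aggregation. A small numeric check:

```python
import numpy as np

def holder_mean(v, p, eps=1e-8):
    v = np.asarray(v, dtype=float) + eps   # assumes nonnegative inputs
    if abs(p) < 1e-6:                      # p ~ 0: geometric-mean limit
        return np.exp(np.log(v).mean())
    return np.mean(v ** p) ** (1.0 / p)

tok = [0.2, 0.5, 0.9]
print(holder_mean(tok, 1.0))    # ~0.533, arithmetic mean
print(holder_mean(tok, -8.0))   # ~0.23, approaching min(tok) = 0.2
```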
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
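The following is only a guess at the shape of the idea, weighting per-token policy-gradient terms by a detached entropy-based factor; OPEFO's actual entropy-flow estimator is not reproduced here.

```python
import torch

def entropy_weighted_pg_loss(logits, actions, advantages):
    """logits: (B, T, V); actions, advantages: (B, T). Illustrative only."""
    logp = torch.log_softmax(logits, dim=-1)
    p = logp.exp()
    H = -(p * logp).sum(-1)                        # per-token policy entropy (B, T)
    w = (H / (H.mean() + 1e-8)).detach()           # crude per-token contribution proxy
    chosen = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(w * advantages * chosen).mean()       # rescaled policy-gradient loss
```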
-
Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy
Geometric Pareto Control embeds Pareto solutions in a Lie group submanifold and navigates via Riemannian gradient flow to achieve 100% feasibility and low suboptimality in control tasks without retraining.
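A generic Riemannian gradient step on the rotation group SO(3), the kind of Lie-group update the summary alludes to (illustrative; not the paper's energy function or homotopy):

```python
import numpy as np
from scipy.linalg import expm

def riemannian_step(R, euclid_grad, lr=0.1):
    """Project the ambient gradient to the tangent space at R, then retract."""
    A = R.T @ euclid_grad
    skew = 0.5 * (A - A.T)          # tangent vectors at the identity are skew-symmetric
    return R @ expm(-lr * skew)     # exponential-map retraction keeps R on SO(3)

R = np.eye(3)                       # start at the identity rotation
R = riemannian_step(R, np.random.default_rng(0).normal(size=(3, 3)))
print(np.allclose(R.T @ R, np.eye(3)))  # True: still orthogonal after the step
```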
-
QHyer: Q-conditioned Hybrid Attention-Mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid Attention-Mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
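A sketch of a gated hybrid block in the spirit of the summary, fusing a causal self-attention branch with a recurrent sequence branch through a learned gate; a GRU stands in for the Mamba SSM to keep the sketch dependency-free, so this is not QHyer's architecture.

```python
import torch
import torch.nn as nn

class GatedHybridBlock(nn.Module):
    def __init__(self, d):                           # d must be divisible by num_heads
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ssm = nn.GRU(d, d, batch_first=True)    # placeholder for a Mamba block
        self.gate = nn.Linear(2 * d, d)

    def forward(self, x):                            # x: (B, T, d)
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask)    # causal self-attention branch
        s, _ = self.ssm(x)                           # sequence-model branch
        g = torch.sigmoid(self.gate(torch.cat([a, s], dim=-1)))
        return x + g * a + (1 - g) * s               # residual gated fusion

y = GatedHybridBlock(32)(torch.randn(2, 16, 32))     # (2, 16, 32)
```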
-
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
Recurrent RL policies can have their hidden states aligned with Pontryagin Maximum Principle (PMP) co-states through a derived loss, yielding robust performance on partially observable control tasks.
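For reference, the standard Pontryagin conditions the co-states come from (the paper's alignment loss itself is not reproduced):

```latex
% Optimal control problem:  min_u \int \ell(x,u)\,dt  subject to  \dot x = f(x,u).
\[
  H(x,u,\lambda) \;=\; \ell(x,u) + \lambda^{\top} f(x,u), \qquad
  \dot\lambda \;=\; -\,\frac{\partial H}{\partial x}, \qquad
  u^{*} \;=\; \arg\min_{u}\, H(x,u,\lambda).
\]
```

Along an optimal trajectory the co-state coincides with the gradient of the optimal cost-to-go, a natural summary for a hidden state to carry under partial observability.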