arXiv preprint arXiv:2503.07453

[FMR25] Dylan J · 2025 · arXiv 2503.07453

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

The Power of Test-Time Training for Approximate Sampling

cs.DS · 2026-06-09 · unverdicted · novelty 7.0

Establishes a quadratic lower bound on the query complexity of sampling from large classes of distributions given approximate density oracles, answers an open question on the optimality of random walks, and shows that the lower bound can be circumvented for bounded classes, as an abstraction of test-time training (TTT).

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first $\tilde{O}(\epsilon^{-1})$ upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Experiments indicate RL applied early in pre-training often matches full SFT-then-RL performance, targeted data composition outweighs scale for RL success, and averaging RL and SFT objectives outperforms sequential or single methods.

citing papers explorer

Showing 3 of 3 citing papers after filters.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon
cs.LG · 2026-06-29 · unverdicted · none · ref 16

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
cs.LG · 2026-05-09 · unverdicted · none · ref 92

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training
cs.LG · 2026-06-02 · unverdicted · none · ref 12
