A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.
ACM Sigart Bulletin , volume=
7 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
LeJEPA derives an optimal isotropic Gaussian target for embeddings and enforces it via sketched regularization to deliver scalable, heuristics-free self-supervised pretraining with 79% ImageNet linear accuracy on ViT-H/14.
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
citing papers explorer
-
Imperfect World Models are Exploitable
A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.