pith. machine review for the scientific record.

arxiv: 2604.24661 · v3 · submitted 2026-04-27 · 💻 cs.RO

Recognition: no theorem link

Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3

classification 💻 cs.RO
keywords visual control · robust perception · mixture of experts · observation adaptation · reinforcement learning · image degradation · sim-to-real · dynamic perturbations

The pith

A plug-and-play adapter with mixture-of-experts restoration and foreground masking recovers 95.3 percent of clean visual control performance under dynamic perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual control agents deployed in the real world encounter time-varying corruptions from weather, sensor faults, and background changes that standard restoration methods fail to handle without harming task performance. The paper argues that pixel-faithful reconstruction embeds corruption details into the latent state, contaminating policy learning. It introduces the Visual Degraded Control Suite to benchmark Markov-switching degradations and develops ACO-MoE, an offline-pretrained adapter that uses expert routing and simulation-derived foreground masks to focus on clean foreground content. This yields substantial gains on multiple benchmarks while generalizing to unseen corruption types.

Core claim

From an information-bottleneck view, the work establishes that restoration-based representations force encoding of nuisance corruption information, and that instead anchoring to the clean foreground via masks avoids this while preserving task-critical content. The proposed ACO-MoE adapter implements this by combining a routed bank of restoration experts with a foreground-mask branch, pretrained solely on synthetic rendered data with automatic degradation pairs and masks, then deployed at inference on corrupted RGB alone without any labels or references.
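A generic information-bottleneck reading of this contrast, stated here in standard notation rather than the paper's own symbols (z the policy latent, x the corrupted observation, f the clean foreground, c the corruption process, and β an assumed trade-off weight):

\[
\text{reconstruction:}\quad \max_{z}\; I(z; x) \;\;\Rightarrow\;\; I(z; c) \text{ remains large, since } c \text{ shapes } x,
\]
\[
\text{foreground anchoring:}\quad \max_{z}\; I(z; f) - \beta\, I(z; x) \;\;\Rightarrow\;\; I(z; c) \to 0 \text{ while task-relevant } I(z; f) \text{ is retained.}
\]

Whether the paper's analysis takes exactly this form is an assumption; the sketch only illustrates why moving the target from the corrupted frame to the clean foreground removes the incentive to encode corruption-specific detail.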

What carries the argument

ACO-MoE, an agent-centric observation adapter that routes inputs through a mixture of restoration experts conditioned on a foreground mask branch to produce task-preserving cleaned observations.
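A minimal PyTorch sketch of such an adapter, written purely for illustration: the residual experts, softmax gating over a pooled feature, and mask-weighted blending are assumptions, not the paper's reported ACO-MoE design, and the offline training losses on synthetic pairs and masks are omitted.

import torch
import torch.nn as nn

class RestorationExpert(nn.Module):
    """One restoration expert: a small conv net that predicts a residual correction."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )
    def forward(self, x):
        return x + self.net(x)  # residual restoration

class ObservationAdapter(nn.Module):
    """Frozen plug-and-play adapter: gated expert bank plus a foreground-mask head."""
    def __init__(self, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([RestorationExpert() for _ in range(num_experts)])
        # Gating network: pools the corrupted frame into per-expert weights.
        self.gate = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_experts),
        )
        # Mask head: per-pixel foreground probability.
        self.mask_head = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, corrupted):  # corrupted: (B, 3, H, W), RGB only at inference
        weights = torch.softmax(self.gate(corrupted), dim=-1)                # (B, E)
        restored = torch.stack([e(corrupted) for e in self.experts], dim=1)  # (B, E, 3, H, W)
        mixed = (weights[:, :, None, None, None] * restored).sum(dim=1)      # (B, 3, H, W)
        mask = self.mask_head(corrupted)                                     # (B, 1, H, W)
        return mask * mixed, mask  # emphasize restored foreground, suppress background

adapter = ObservationAdapter().eval()  # frozen at deployment
with torch.no_grad():
    cleaned_obs, fg_mask = adapter(torch.rand(1, 3, 84, 84))

The sketch keeps the interface the pith emphasizes: at inference the adapter sees only the corrupted RGB frame, with no corruption labels, clean references, or ground-truth masks.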

Load-bearing premise

That the foreground masks derived from simulation accurately capture the task-relevant information and that the synthetic degradation model sufficiently represents real-world non-stationary corruptions for the adapter to transfer effectively.

What would settle it

Demonstrating that control performance with ACO-MoE falls to or below baseline levels when evaluated on real-world robot footage with actual dynamic perturbations not replicable in the synthetic benchmark.

read the original abstract

Real-world visual systems face time-varying perturbations, including weather, sensor noise, compression artifacts, and background distractions. Existing image restoration methods are typically designed for fixed corruption types and optimized for pixel-level fidelity, leaving open two questions: how restoration behaves under non-stationary corruption switching, and whether pixel-level fidelity preserves the task-relevant information needed by downstream models. To study this setting, we introduce the Visual Degraded Control Suite (VDCS), a benchmark that injects Markov-switching physical degradations into rendered scenes. We further identify a fundamental failure mode of reconstruction-based representations: faithfully reconstructing corrupted observations forces the latent state to encode corruption-specific nuisance information, thereby contaminating downstream models. From an information-bottleneck perspective, anchoring the representation to the clean foreground eliminates this contamination. Motivated by this analysis, we propose Agent-Centric Observations with Mixture-of-Experts (ACO-MoE), a frozen, plug-and-play observation adapter that combines a routed bank of restoration experts with a foreground-mask branch. ACO-MoE is pretrained entirely offline on synthetic rendered data with automatically generated degradation pairs and simulation-derived foreground masks, requiring no manual annotation. At inference time, it takes only corrupted RGB as input without corruption labels, clean reference frames, or foreground masks. Across VDCS, DMC-GB, and RoboSuite, ACO-MoE consistently improves downstream control with both model-free and model-based backbones, recovering 95.3% of clean-input performance under challenging Markov-switching corruptions. It also generalizes zero-shot to unseen visual perturbations excluded from adapter pretraining.
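The Markov-switching protocol the abstract describes can be illustrated with a short sketch; the corruption set, the stay probability, and per-frame switching below are assumptions chosen for illustration, not the actual VDCS configuration.

import numpy as np

CORRUPTIONS = ["rain", "gaussian_noise", "motion_blur", "jpeg", "occlusion"]  # assumed set

def switch(state: int, stay_prob: float, rng: np.random.Generator) -> int:
    """Markov transition: keep the current corruption with probability stay_prob, else jump uniformly."""
    if rng.random() < stay_prob:
        return state
    others = [i for i in range(len(CORRUPTIONS)) if i != state]
    return int(rng.choice(others))

def corrupt_episode(frames, stay_prob=0.9, seed=0):
    """Yield (frame, active corruption) pairs for one episode of a rendered scene."""
    rng = np.random.default_rng(seed)
    state = int(rng.integers(len(CORRUPTIONS)))
    for frame in frames:
        yield frame, CORRUPTIONS[state]  # a corruption renderer would be applied here
        state = switch(state, stay_prob, rng)

# Example: label a 5-frame episode with the corruption active at each timestep.
for t, (_, name) in enumerate(corrupt_episode([None] * 5)):
    print(t, name)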

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Visual Degraded Control Suite (VDCS) benchmark for non-stationary visual degradations and proposes ACO-MoE, a frozen plug-and-play observation adapter that combines a routed mixture-of-experts restoration bank with a simulation-derived foreground-mask branch. Pretrained offline on synthetic rendered pairs, ACO-MoE is claimed to eliminate nuisance contamination in latent representations per an information-bottleneck analysis, recovering 95.3% of clean-input performance across VDCS, DMC-GB, and RoboSuite for both model-free and model-based controllers while generalizing zero-shot to unseen perturbations.

Significance. If the results and the foreground-anchoring justification hold, the work offers a practical, label-free adapter for robust visual control under dynamic real-world corruptions, potentially reducing the need for policy retraining or online adaptation in robotics applications.

major comments (3)
  1. [§3 (Information-Bottleneck Analysis)] The central motivation that anchoring latents to clean foreground masks eliminates nuisance contamination without discarding task-critical information is load-bearing for the performance claims, yet the analysis does not include a quantitative bound or ablation demonstrating that policy-relevant cues (e.g., peripheral dynamics or shadows in RoboSuite manipulation) are retained; if masks remove such context, downstream controllers would lose performance even with restored nuisances.
  2. [§5 (Experiments)] The reported 95.3% recovery and zero-shot generalization are the primary empirical support, but the results section provides insufficient detail on run counts, error bars, statistical tests, and component ablations (e.g., MoE routing vs. single expert, mask branch vs. full-image input); without these, it is impossible to verify that improvements are not due to the synthetic pretraining distribution or baseline weaknesses.
  3. [§4.2 (ACO-MoE Architecture)] The transfer assumption that offline synthetic Markov-switching degradations plus simulation masks will handle real non-stationary corruptions at inference is central, but no analysis or cross-domain experiment quantifies the domain gap between rendered degradations and actual sensor/weather effects, risking overstatement of robustness.
minor comments (2)
  1. [Notation] The notation for the information-bottleneck objective and expert routing could be made more explicit with a single equation block defining all mutual-information terms and gating weights.
  2. [Figures] Figure captions for mask visualizations should include quantitative metrics (e.g., IoU with clean foreground) to allow readers to assess information preservation directly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions.

read point-by-point responses
  1. Referee: [§3 (Information-Bottleneck Analysis)] The central motivation that anchoring latents to clean foreground masks eliminates nuisance contamination without discarding task-critical information is load-bearing for the performance claims, yet the analysis does not include a quantitative bound or ablation demonstrating that policy-relevant cues (e.g., peripheral dynamics or shadows in RoboSuite manipulation) are retained; if masks remove such context, downstream controllers would lose performance even with restored nuisances.

    Authors: We agree that a more explicit demonstration of retained task-relevant information would strengthen the information-bottleneck argument in Section 3. The current analysis shows that foreground anchoring reduces mutual information with nuisance factors, while the empirical recovery of 95.3% clean performance across environments (including RoboSuite) indicates that critical cues such as peripheral dynamics are preserved in practice. To directly address the concern, we will add a targeted ablation in the revised manuscript that isolates the mask branch's effect on control performance in tasks with prominent peripheral elements, quantifying any information loss. revision: yes

  2. Referee: [§5 (Experiments)] The reported 95.3% recovery and zero-shot generalization are the primary empirical support, but the results section provides insufficient detail on run counts, error bars, statistical tests, and component ablations (e.g., MoE routing vs. single expert, mask branch vs. full-image input); without these, it is impossible to verify that improvements are not due to the synthetic pretraining distribution or baseline weaknesses.

    Authors: We acknowledge that the experimental reporting in Section 5 lacks sufficient statistical rigor and component-level ablations. In the revised manuscript we will report the exact number of independent runs (5 seeds per setting), include error bars on all performance plots, add statistical significance tests comparing ACO-MoE against baselines, and expand the ablation study to explicitly compare full ACO-MoE against a single-expert restoration variant and a mask-free full-image input variant. These additions will allow readers to verify that gains arise from the routed experts and foreground anchoring rather than pretraining artifacts; a minimal sketch of this reporting protocol appears after the response list. revision: yes

  3. Referee: [§4.2 (ACO-MoE Architecture)] The transfer assumption that offline synthetic Markov-switching degradations plus simulation masks will handle real non-stationary corruptions at inference is central, but no analysis or cross-domain experiment quantifies the domain gap between rendered degradations and actual sensor/weather effects, risking overstatement of robustness.

    Authors: The referee correctly notes that our evaluation remains within synthetic domains and does not quantify the synthetic-to-real domain gap. While zero-shot generalization to unseen synthetic perturbations provides evidence of robustness inside the simulated distribution, we do not claim direct equivalence to real sensor or weather effects. In the revised manuscript we will add an explicit limitations paragraph in Section 4.2 and the conclusion discussing this gap and suggesting future real-robot validation protocols. revision: partial
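A minimal sketch of the reporting protocol promised in response 2, assuming per-seed episode returns are available: the five-seed count comes from the rebuttal, while the Welch t-test and the recovery definition (mean corrupted return divided by mean clean return) are illustrative choices, and the numbers below are hypothetical, not the paper's results.

import numpy as np
from scipy import stats

def summarize(returns_per_seed):
    """Mean and standard deviation across independent seeds (for error bars)."""
    r = np.asarray(returns_per_seed, dtype=float)
    return r.mean(), r.std(ddof=1)

def recovery_percent(corrupted_returns, clean_returns):
    """Share of clean-input performance retained under corruption."""
    return 100.0 * np.mean(corrupted_returns) / np.mean(clean_returns)

# Hypothetical per-seed returns (5 seeds each), for illustration only.
clean    = [850, 861, 842, 855, 849]
aco_moe  = [810, 822, 805, 815, 809]
baseline = [640, 655, 612, 630, 648]

mean_a, std_a = summarize(aco_moe)
mean_b, std_b = summarize(baseline)
t_stat, p_val = stats.ttest_ind(aco_moe, baseline, equal_var=False)  # Welch's t-test

print(f"ACO-MoE:  {mean_a:.1f} ± {std_a:.1f}, recovery {recovery_percent(aco_moe, clean):.1f}%")
print(f"baseline: {mean_b:.1f} ± {std_b:.1f}, p = {p_val:.4f}")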

Circularity Check

0 steps flagged

No circularity in derivation chain; performance claims are empirical

full rationale

The paper's chain begins with an information-bottleneck analysis identifying a failure mode in reconstruction-based latents, then motivates the ACO-MoE architecture (frozen adapter with routed experts and foreground-mask branch) as a plug-and-play solution pretrained offline on synthetic degradation pairs and simulation-derived masks. No equations, derivations, or fitted parameters are presented that would make the reported 95.3% recovery or zero-shot generalization true by construction. The central claims rest on downstream empirical evaluations across VDCS, DMC-GB, and RoboSuite with model-free and model-based controllers, without self-citations serving as load-bearing uniqueness theorems or ansatzes. The method is self-contained as an empirical architecture whose validity is tested externally on benchmarks rather than derived tautologically from its own definitions or the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that the information-bottleneck view correctly diagnoses the contamination problem and that foreground anchoring solves it; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Anchoring the representation to the clean foreground eliminates contamination from corruption-specific nuisance information.
    Invoked in the abstract as the information-bottleneck motivation for the foreground-mask branch.

pith-pipeline@v0.9.0 · 5607 in / 1227 out tokens · 36176 ms · 2026-05-11T00:43:02.634294+00:00 · methodology

