Maximizing Information Gain in Partially Observable Environments via Prediction Reward

Frans Oliehoek; Martha White; Shimon Whiteson; Sungsu Lim; Yash Satsangi

arxiv: 2005.04912 · v1 · pith:DRSICIUBnew · submitted 2020-05-11 · 💻 cs.AI · cs.LG

Yash Satsangi, Sungsu Lim, Shimon Whiteson, Frans Oliehoek, Martha White

keywords: agent, reward, prediction, belief, deep, function, insight, maximizing

Information gathering in a partially observable environment can be formulated as a reinforcement learning (RL) problem in which the reward depends on the agent's uncertainty. For example, the reward can be the negative entropy of the agent's belief over an unknown (or hidden) variable. Typically, the rewards of an RL agent are defined as a function of state-action pairs and not as a function of the agent's belief; this hinders the direct application of deep RL methods to such tasks. This paper tackles the challenge of using belief-based rewards for a deep RL agent by offering a simple insight: maximizing any convex function of the agent's belief can be approximated by instead maximizing a prediction reward, that is, a reward based on prediction accuracy. In particular, we derive the exact error between negative entropy and the expected prediction reward. This insight provides theoretical motivation for several fields that use prediction rewards (namely visual attention, question answering systems, and intrinsic motivation) and highlights their connection to the usually distinct fields of active perception, active sensing, and sensor placement. Based on this insight, we present deep anticipatory networks (DANs), which enable an agent to take actions that reduce its uncertainty without performing explicit belief inference. We present two applications of DANs: building a sensor selection system for tracking people in a shopping mall, and learning discrete models of attention on Fashion MNIST and MNIST digit classification.
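To make the relationship between the two reward types concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a discrete belief over the hidden variable and a 0/1 prediction reward (1 for a correct guess of the hidden variable, 0 otherwise); under that assumption the expected reward of the best guess is the largest belief entry. The function names and the example beliefs are hypothetical, and the paper's exact error expression is not reproduced here.

```python
import numpy as np


def negative_entropy(belief):
    """Belief-based reward: sum_x b(x) * log b(x), with 0 * log 0 taken as 0."""
    b = np.asarray(belief, dtype=float)
    nonzero = b[b > 0]
    return float(np.sum(nonzero * np.log(nonzero)))


def expected_prediction_reward(belief):
    """Expected 0/1 reward when the agent guesses the most likely value.

    Assumption for illustration: reward is 1 for a correct guess, else 0,
    so the expectation under the belief equals max_x b(x).
    """
    b = np.asarray(belief, dtype=float)
    return float(np.max(b))


def simulated_prediction_reward(belief, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same quantity, without using max_x b(x)."""
    b = np.asarray(belief, dtype=float)
    rng = np.random.default_rng(seed)
    hidden = rng.choice(len(b), size=n_samples, p=b)  # sample the hidden variable
    guess = int(np.argmax(b))                         # the agent's prediction
    return float(np.mean(hidden == guess))


# Hypothetical beliefs over a hidden variable with four possible values.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]
certain = [1.0, 0.0, 0.0, 0.0]

for name, b in [("uniform", uniform), ("peaked", peaked), ("certain", certain)]:
    print(name, negative_entropy(b), expected_prediction_reward(b))
```

Both quantities are convex in the belief and both increase as the belief concentrates on a single value, so an agent rewarded for correct predictions is also pushed toward low-entropy beliefs. The prediction reward, however, only requires comparing a guess with the true value and never requires the belief itself to be computed.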

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...