ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Elisa Ricci; Massimiliano Mancini; Stéphane Lathuilière; Subhankar Roy; Thomas De Min

arxiv: 2603.19466 · v2 · pith:X4XNA6AMnew · submitted 2026-03-19 · 💻 cs.CV

Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini

keywords: proactiveness, proactivebench, introduce, learning, mllms, models, multimodal, occluded

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents
cs.CL · 2026-05 · unverdicted · novelty 6.0

ProAct uses idle compute to anticipate user needs via dialogue history and memory, achieving 14.8% fewer turns, 11.7% less user effort, and 28.1% fewer hallucinations than reactive baselines on the new ProActEval benchmark.