MERRIN benchmark shows AI agents average only 22.3% accuracy on multimodal evidence retrieval and multi-hop reasoning over noisy conflicting web sources, with the best reaching 40.1%.
Keep all tabs open throughout your search so that you can accurately record all resources and queries at the end.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments