The MERRIN benchmark shows that AI agents average only 22.3% accuracy on multimodal evidence retrieval and multi-hop reasoning over noisy, conflicting web sources, with the best agent reaching 40.1%.
After finding the answer (or giving up), go through your browser history/tabs and count the total number of search queries you made.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments