Pith · machine review for the scientific record

arXiv: 2506.06211 · v2 · submitted 2025-06-06 · cs.CL · cs.AI · cs.CV


PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Adithya Balachandran, Alexander Naehu, Brendon Jiang, Chanakya Ekbote, Hengzhi Li, Justin Zhang, Megan Tjandrasuwita, Paul Pu Liang, Rebecca Chang, Regan Song, Steven-Shine Chen, Wei Dai

keywords: reasoning · puzzleworld · open-ended · accuracy · analysis · models · multimodal · puzzle
Abstract

Puzzlehunts are a genre of complex, multi-step puzzles that lack well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, and investigative problem-solving. Despite progress in foundation models, their performance in such open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4% final-answer accuracy; the best model solves only 18% of puzzles and reaches 40% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack the sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work, sorted by Pith novelty score.

  1. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.

  2. CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

    q-bio.NC 2026-04 unverdicted novelty 6.0

    CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.