Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game

Michael Katz , Harsha Kokel , Sarath Sreedharan

Authors on Pith no claims yet

classification 💻 cs.AI

keywords planningexistinginstancebenchmarkbenchmarkschallengingcountdowngame

read the original abstract

There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundational models and agents. However, the existing planning benchmarks remain woefully inadequate to truly measure their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or end up leveraging existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. From a world-model perspective, each instance induces a fully specified transition model (dynamics) over states and actions, enabling evaluation of planning with verifiable outcomes. We discuss how this problem meets many of the desiderata associated with an ideal benchmark for planning capabilities evaluation. Specifically, the domain allows for an intuitive, natural language description for each problem instance, it is computationally challenging (NP-complete), and the instance space is rich enough that we do not have to worry about memorization. We perform an extensive theoretical analysis, establishing the computational complexity result and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM-assisted planning methods on instances generated using our procedure. Our results show that, unlike other domains like 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM-based approaches.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym
cs.AI 2026-04 unverdicted novelty 7.0

Spatial-Gym benchmark shows the best tested model solves only 16% of pathfinding tasks versus 98% for humans, with step-by-step and backtracking formats producing mixed effects across model strengths.