ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this, we present ChessArena, a chess-based testbed for evaluating LLMs. Chess demands strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework where LLMs play against each other under four play modes. We evaluate 13 LLMs across over 800 games, testing basic understanding, move selection, and puzzle solving. Results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some lose to random play. We also present a strong baseline: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
Forward citations
Cited by 4 Pith papers
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
  LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
  LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
  LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
  LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.