Systematic testing of prompt engineering for LLM equational reasoning finds a performance ceiling of 60-79% accuracy that extensive engineering cannot exceed, driven by undecidability and model capacity limits.
Stage 1 judge for the mathematics distillation challenge: Equational theories. https://github.com/SAIRcompetition/equational-theories-stage1-judge
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning