NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models
Pith reviewed 2026-05-11 01:08 UTC · model grok-4.3
The pith
A benchmark of Ghanaian science riddles reveals large language models underperform the best high school students.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NSMQ Riddles is a benchmark of riddle questions from the fifth round of the NSMQ, each containing at least three clues that start vague and increase in specificity, with answers that are numbers, words, or short phrases. State-of-the-art LLMs, tested in both high and low reasoning settings, performed worse than the best student contestants on this dataset.
What carries the argument
The NSMQ Riddles benchmark: a dataset of 1.8K progressive-clue riddles from Ghana's annual high school science quiz, whose short answers support automatic evaluation.
If this is right
- Current LLMs have clear limitations when required to integrate successive clues in science and mathematics problems.
- Benchmarks drawn from Global South education systems can expose gaps not visible in Western multiple-choice tests.
- Short-answer riddle formats allow direct, automatic comparison of model outputs to human contestant performance.
- Future model development can target progressive-reasoning skills using this type of clue structure.
Where Pith is reading between the lines
- Similar riddle-style datasets from other regions could help test whether performance gaps are specific to this format or more general.
- Models might improve by training on examples that reward answering from partial information rather than full context.
- The benchmark format could be adapted to measure how well LLMs handle ambiguity in real classroom or competition settings.
Load-bearing premise
That the riddles from the NSMQ competition provide a valid and unbiased measure of scientific and mathematical reasoning capabilities that can be directly compared between LLMs and human students.
What would settle it
A new large language model achieving higher accuracy on the full set of NSMQ Riddles than the top student teams from the original competitions would falsify the performance gap claim.
Original abstract
Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana's National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs' capabilities for science and mathematics education.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NSMQ Riddles, a benchmark of 1.8K scientific and mathematical riddles drawn from round 5 of Ghana's National Science and Maths Quiz (NSMQ) over 11 years. Each riddle supplies at least three sequential clues whose answers are short phrases or numbers. The authors evaluate closed models (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) under high- and low-reasoning prompting regimes and conclude that the dataset remains challenging for current LLMs, which underperform the best student contestants.
Significance. If the performance comparison can be placed on a sound footing, the work supplies a publicly useful, non-Western, open-ended riddle benchmark that stresses incremental clue integration and early-answer incentives. The dataset itself is a clear contribution toward broader geographic coverage of science-reasoning evaluations.
major comments (2)
- [Abstract] Abstract and Evaluation section: the headline claim that 'state-of-the-art LLMs... performed worse than the best student contestants' is unsupported by any reported accuracy, points-per-riddle, or statistical comparison. No table or figure supplies LLM scores on the 1.8K riddles or the corresponding student baseline on the identical set.
- [Evaluation] Evaluation protocol (high/low reasoning settings): the manuscript does not demonstrate that static prompt templates replicate NSMQ round-5 conditions, in which students receive clues sequentially, may buzz after any clue for higher points, and compete under live time pressure in teams of two. Without this equivalence, the numerical comparison to student performance cannot be treated as load-bearing evidence.
minor comments (2)
- [Abstract] Model nomenclature (GPT-5.4, Claude Opus 4.6) appears inconsistent with currently released versions; clarify exact checkpoints or release dates used.
- [Dataset construction] The abstract states that answers are 'usually a number, word, or short phrase, allowing for automatic evaluation,' yet no details are given on the exact matching procedure or handling of partial credit.
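The referee's point about the matching procedure matters because short-answer grading is sensitive to normalization choices. The paper does not specify its procedure, so the following is a hypothetical sketch of one plausible scheme: case-folding, accent and article stripping, plus exact-fraction comparison for numeric answers.

```python
import re
import unicodedata
from fractions import Fraction

def normalize(ans: str) -> str:
    """Lowercase, strip accents, English articles, punctuation, extra spaces."""
    ans = unicodedata.normalize("NFKD", ans)
    ans = "".join(c for c in ans if not unicodedata.combining(c))
    ans = ans.lower()
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    ans = re.sub(r"[^a-z0-9./\- ]", " ", ans)
    return " ".join(ans.split())

def numeric_equal(a: str, b: str) -> bool:
    """Compare numeric answers as exact fractions when both strings parse."""
    try:
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return False

def match(prediction: str, gold: str) -> bool:
    """Automatic short-answer check: normalized string or numeric equality."""
    p, g = normalize(prediction), normalize(gold)
    return p == g or numeric_equal(p, g)

print(match("The Mitochondrion", "mitochondrion"))  # True
print(match("0.5", "1/2"))                          # True
```

Even this minimal scheme makes the partial-credit question concrete: "0.5" and "1/2" match, but a grader must still decide whether, say, "mitochondria" should count against "mitochondrion".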
Simulated Author's Rebuttal
We are grateful to the referee for their constructive feedback, which highlights important issues regarding evidence and protocol equivalence in our manuscript. We address each major comment below and commit to revisions that strengthen the presentation without overstating our results.
Point-by-point responses
-
Referee: [Abstract] Abstract and Evaluation section: the headline claim that 'state-of-the-art LLMs... performed worse than the best student contestants' is unsupported by any reported accuracy, points-per-riddle, or statistical comparison. No table or figure supplies LLM scores on the 1.8K riddles or the corresponding student baseline on the identical set.
Authors: We acknowledge that the manuscript does not currently include explicit numerical LLM accuracies, points-per-riddle metrics, or a direct comparison table against student baselines on the exact 1.8K riddles. The abstract statement was based on our internal model evaluations showing lower success rates than the high overall performance typically achieved by top NSMQ teams. We will add a dedicated table in the Evaluation section reporting per-model accuracy and average points under both prompting regimes. For the student side, we will incorporate available aggregate NSMQ statistics and explicitly qualify the comparison to reflect any gaps in identical-set granularity. If precise per-riddle student data cannot be sourced, we will revise the abstract to remove or soften the direct claim. revision: yes
-
Referee: [Evaluation] Evaluation protocol (high/low reasoning settings): the manuscript does not demonstrate that static prompt templates replicate NSMQ round-5 conditions, in which students receive clues sequentially, may buzz after any clue for higher points, and compete under live time pressure in teams of two. Without this equivalence, the numerical comparison to student performance cannot be treated as load-bearing evidence.
Authors: We agree that our static high- and low-reasoning prompts do not replicate the sequential clue delivery, early-buzzing incentive structure, or real-time team competition of NSMQ round 5. The high-reasoning setting supplies all clues at once to evaluate information integration, which is a related but distinct task from the live format. We will revise the Evaluation section to describe the protocol limitations clearly, discuss why the current setup still demonstrates the benchmark's difficulty, and caution against treating the student comparison as a direct equivalence. The abstract will be updated to reflect this nuance. revision: yes
- Granular per-riddle student performance data (accuracy or points) from the original NSMQ competitions on the exact 1.8K riddles may not be publicly available or recorded, preventing a fully identical-set baseline.
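The protocol mismatch discussed above can be made concrete. A sequential-clue harness would reveal clues one at a time and award more points for earlier correct answers; the abstract quoted in the Lean-theorem excerpts notes that answering on the first clue fetches 5 points, but the full declining schedule below is an assumption, as is the `answer_fn` interface. A hypothetical sketch:

```python
from typing import Callable, Sequence

def score_riddle(
    clues: Sequence[str],
    gold: str,
    answer_fn: Callable[[str], str],
    points: Sequence[int] = (5, 4, 3),  # 5 on clue 1 per NSMQ rules; later values assumed
) -> int:
    """Reveal clues one at a time and stop at the first correct answer.

    `answer_fn` maps the clues seen so far (joined) to a guess, standing in
    for either a buzzing student team or an LLM queried after each clue.
    """
    seen = []
    for i, clue in enumerate(clues):
        seen.append(clue)
        guess = answer_fn(" ".join(seen))
        if guess.strip().lower() == gold.strip().lower():
            return points[i] if i < len(points) else points[-1]
    return 0  # no clue yielded the answer

# Toy solver that only recognizes the most specific clue:
def toy_model(context: str) -> str:
    return "photosynthesis" if "chlorophyll" in context else "unknown"

clues = ["I am a process.", "I occur in plants.", "I require chlorophyll."]
print(score_riddle(clues, "photosynthesis", toy_model))  # 3
```

A static all-clues-at-once prompt, by contrast, collapses this to a single query with no early-answer incentive, which is exactly why points-per-riddle under the two protocols are not directly comparable.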
Circularity Check
No circularity: external benchmark with independent empirical evaluation
Full rationale
The paper collects an external dataset of 1.8K riddles from Ghana's NSMQ competition (an independent live student contest) and applies standard LLM prompting to obtain performance numbers. The claim that SOTA models underperform top students rests on direct comparison to reported human results from that external source, with no equations, fitted parameters, self-referential predictions, or load-bearing self-citations that reduce the result to the paper's own inputs. The evaluation protocol is a straightforward application of models to the new benchmark rather than any derivation that loops back on itself.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · "We present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana’s National Science and Maths Quiz (NSMQ) competition to evaluate LLMs... evaluated state-of-the-art models... performed worse than the best student contestants."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · relevance unclear · "The NSMQ is an annual live TV competition... round 5 — Riddles... clues start vague and get more specific... answering on the 1st clue fetches 5 points..."
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [2] Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., Vechev, M.: MathArena: Evaluating LLMs on uncontaminated math competitions. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2025)
- [3] Boateng, G., Kumbol, V., Kaufmann, E.E.: Can an AI win Ghana’s National Science and Maths Quiz? An AI grand challenge for education. In: Workshop on Practical Machine Learning for Developing Countries (PML4DC) at ICLR 2023 (2023)
- [4] Boateng, G., Mensah, J.A., Yeboah, K.T., Edor, W., Mensah-Onumah, A.K., Ibrahim, N.D., Yeboah, N.S.: Towards an AI to win Ghana’s National Science and Maths Quiz. In: Deep Learning Indaba 2023 (2023)
- [5] Boateng, G., Mensah, J.A., Yeboah, K.T., Edor, W., Mensah-Onumah, A.K., Ibrahim, N.D., Yeboah, N.S.: Brilla AI: AI contestant for the National Science and Maths Quiz. In: International Conference on Artificial Intelligence in Education, pp. 214–227. Springer (2024)
- [6] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)
- [7] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
- [8] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. In: International Conference on Learning Representations (2021)
- [9] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the MATH dataset. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)
- [10] Kembhavi, A., Seo, M., Schwenk, D., Choi, J., Farhadi, A., Hajishirzi, H.: Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4999–5007 (2017)
- [11] LangChain. https://www.langchain.com/
- [12] Lin, B.Y., Wu, Z., Yang, Y., Lee, D.H., Ren, X.: RiddleSense: Reasoning about riddle questions featuring linguistic creativity and commonsense knowledge. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 1504–1515 (2021)
- [13] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35, 2507–2521 (2022)
- [14] Mathpix. https://mathpix.com/
- [15] Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a suit of armor conduct electricity? A new dataset for open book question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391 (2018)
- [16] National Science and Maths Quiz. https://nsmq.com.gh/
- [17] Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: GPQA: A graduate-level Google-proof Q&A benchmark. In: First Conference on Language Modeling (2024)
- [18] Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A.R., Zhang, S., Sun, Y., Wang, W.: SciBench: Evaluating college-level scientific problem-solving abilities of large language models. In: International Conference on Machine Learning, pp. 50622–50649. PMLR (2024)
- [19] Zhang, W., Aljunied, M., Gao, C., Chia, Y.K., Bing, L.: M3Exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems 36 (2024)
- [20] Zhang, Y., Wan, X.: BirdQA: A bilingual dataset for question answering on tricky riddles. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 11748–11756 (2022)