pith. machine review for the scientific record.

arxiv: 2605.07051 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links

NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

Andrew Mensa-Onumah, George Boateng, Jonathan Mensah, Kevin Yeboah, Naafi Ibrahim, Nana Yeboah, Patrick Agyeman-Budu, Philemon Badu, Samuel John, Victor Wumbor-Apin Kumbol, William Edor

Pith reviewed 2026-05-11 01:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords benchmark · large language models · scientific riddles · mathematical reasoning · NSMQ · Global South · science education · progressive clues

The pith

A benchmark of Ghanaian science riddles reveals large language models underperform the best high school students.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NSMQ Riddles, a collection of 1.8K scientific and mathematical riddle questions drawn from 11 years of Ghana's National Science and Maths Quiz competition. It establishes that these riddles, which feature progressive clues and short-form answers, form a challenging test for LLMs on topics in biology, chemistry, physics, and mathematics. Evaluations of multiple closed and open state-of-the-art models demonstrate lower performance than the strongest student teams from the original contests. This matters because most existing science benchmarks come from Western sources and rely on multiple-choice formats that are trivial to evaluate, whereas these riddles demand free-form short answers while still permitting automatic scoring.

Core claim

NSMQ Riddles is a benchmark of riddle questions from the fifth round of the NSMQ, each containing at least three clues that start vague and increase in specificity, with answers that are numbers, words, or short phrases. State-of-the-art LLMs, tested in both high and low reasoning settings, performed worse than the best student contestants on this dataset.
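
As a concrete illustration of the format described above, the sketch below shows how one riddle record might be represented. The schema, field names, and example clues are hypothetical, not taken from the released dataset.

```python
from dataclasses import dataclass


@dataclass
class Riddle:
    """One NSMQ round-5 riddle: ordered clues, vague to specific, plus a short answer."""
    subject: str      # e.g. "physics", "chemistry", "biology", "mathematics"
    year: int         # competition year the riddle was drawn from
    clues: list[str]  # at least three clues, ordered from vague to specific
    answer: str       # a number, word, or short phrase


# Hypothetical example (not from the dataset), showing the progressive-clue shape.
example = Riddle(
    subject="physics",
    year=2019,
    clues=[
        "I am a quantity that every moving body possesses.",
        "I am conserved in an isolated system.",
        "I am the product of mass and velocity.",
    ],
    answer="momentum",
)
assert len(example.clues) >= 3
```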

What carries the argument

The NSMQ Riddles benchmark, a dataset of 1.8K progressive-clue riddles from Ghana's annual high school science quiz that supports automatic evaluation through short answers.
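
The paper does not spell out its matching procedure (see the referee's second minor comment below), so the following is only a plausible sketch of how short answers could be scored automatically: normalize both strings and require exact agreement, with a small tolerance when both answers parse as numbers. Every detail here is an assumption, not the authors' method.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation (except . and -) and articles, collapse whitespace."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s.\-]", "", text)       # drop punctuation except . and -
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return re.sub(r"\s+", " ", text).strip().rstrip(".")


def answers_match(predicted: str, gold: str, rel_tol: float = 1e-3) -> bool:
    """Assumed scoring rule: exact string match after normalization,
    or near-equality when both sides parse as numbers."""
    p, g = normalize(predicted), normalize(gold)
    if p == g:
        return True
    try:
        return abs(float(p) - float(g)) <= rel_tol * max(1.0, abs(float(g)))
    except ValueError:
        return False


# Example: "The Momentum." and "momentum" count as a match under this rule.
assert answers_match("The Momentum.", "momentum")
```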

If this is right

  • Current LLMs have clear limitations when required to integrate successive clues in science and mathematics problems.
  • Benchmarks drawn from Global South education systems can expose gaps not visible in Western multiple-choice tests.
  • Short-answer riddle formats allow direct, automatic comparison of model outputs to human contestant performance.
  • Future model development can target progressive-reasoning skills using this type of clue structure (see the sketch after this list).
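
To make the last two points concrete, here is a hedged sketch of a clue-by-clue evaluation loop in the spirit of the live round: clues are revealed one at a time, the model may commit to an answer at any point, and earlier correct answers earn more points. The point schedule and the ask_model callback are illustrative assumptions (the paper's reported settings appear to prompt with clues directly rather than simulate a live buzz), and answers_match is the helper sketched above.

```python
from typing import Callable, Optional


def play_riddle(
    clues: list[str],
    gold_answer: str,
    ask_model: Callable[[str], Optional[str]],    # returns an answer, or None to wait
    points_by_clue: tuple[int, ...] = (5, 4, 3),  # assumed schedule: earlier answers score more
) -> int:
    """Reveal clues one at a time; score the first committed answer and stop."""
    revealed: list[str] = []
    for i, clue in enumerate(clues):
        revealed.append(clue)
        prompt = (
            "Riddle clues so far:\n"
            + "\n".join(revealed)
            + "\nAnswer now, or reply WAIT for the next clue."
        )
        guess = ask_model(prompt)
        if guess is None:  # model chose to wait for more information
            continue
        points = points_by_clue[min(i, len(points_by_clue) - 1)]
        return points if answers_match(guess, gold_answer) else 0
    return 0  # never committed to an answer
```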

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar riddle-style datasets from other regions could help test whether performance gaps are specific to this format or more general.
  • Models might improve by training on examples that reward answering from partial information rather than full context.
  • The benchmark format could be adapted to measure how well LLMs handle ambiguity in real classroom or competition settings.

Load-bearing premise

That the riddles from the NSMQ competition provide a valid and unbiased measure of scientific and mathematical reasoning capabilities that can be directly compared between LLMs and human students.

What would settle it

A new large language model achieving higher accuracy on the full set of NSMQ Riddles than the top student teams from the original competitions would falsify the performance gap claim.

read the original abstract

Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana's National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs' capabilities for science and mathematics education.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces NSMQ Riddles, a benchmark of 1.8K scientific and mathematical riddles drawn from round 5 of Ghana's National Science and Maths Quiz (NSMQ) over 11 years. Each riddle supplies at least three sequential clues whose answers are short phrases or numbers. The authors evaluate closed models (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) under high- and low-reasoning prompting regimes and conclude that the dataset remains challenging for current LLMs, which underperform the best student contestants.

Significance. If the performance comparison can be placed on a sound footing, the work supplies a publicly useful, non-Western, open-ended riddle benchmark that stresses incremental clue integration and early-answer incentives. The dataset itself is a clear contribution toward broader geographic coverage of science-reasoning evaluations.

major comments (2)
  1. [Abstract] Abstract and Evaluation section: the headline claim that 'state-of-the-art LLMs... performed worse than the best student contestants' is unsupported by any reported accuracy, points-per-riddle, or statistical comparison. No table or figure supplies LLM scores on the 1.8K riddles or the corresponding student baseline on the identical set.
  2. [Evaluation] Evaluation protocol (high/low reasoning settings): the manuscript does not demonstrate that static prompt templates replicate NSMQ round-5 conditions, in which students receive clues sequentially, may buzz after any clue for higher points, and compete under live time pressure in teams of two. Without this equivalence, the numerical comparison to student performance cannot be treated as load-bearing evidence.
minor comments (2)
  1. [Abstract] Model nomenclature (GPT-5.4, Claude Opus 4.6) appears inconsistent with currently released versions; clarify exact checkpoints or release dates used.
  2. [Dataset construction] The abstract states that answers are 'usually a number, word, or short phrase, allowing for automatic evaluation,' yet no details are given on the exact matching procedure or handling of partial credit.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for their constructive feedback, which highlights important issues regarding evidence and protocol equivalence in our manuscript. We address each major comment below and commit to revisions that strengthen the presentation without overstating our results.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Evaluation section: the headline claim that 'state-of-the-art LLMs... performed worse than the best student contestants' is unsupported by any reported accuracy, points-per-riddle, or statistical comparison. No table or figure supplies LLM scores on the 1.8K riddles or the corresponding student baseline on the identical set.

    Authors: We acknowledge that the manuscript does not currently include explicit numerical LLM accuracies, points-per-riddle metrics, or a direct comparison table against student baselines on the exact 1.8K riddles. The abstract statement was based on our internal model evaluations showing lower success rates than the high overall performance typically achieved by top NSMQ teams. We will add a dedicated table in the Evaluation section reporting per-model accuracy and average points under both prompting regimes. For the student side, we will incorporate available aggregate NSMQ statistics and explicitly qualify the comparison to reflect any gaps in identical-set granularity. If precise per-riddle student data cannot be sourced, we will revise the abstract to remove or soften the direct claim. revision: yes

  2. Referee: [Evaluation] Evaluation protocol (high/low reasoning settings): the manuscript does not demonstrate that static prompt templates replicate NSMQ round-5 conditions, in which students receive clues sequentially, may buzz after any clue for higher points, and compete under live time pressure in teams of two. Without this equivalence, the numerical comparison to student performance cannot be treated as load-bearing evidence.

    Authors: We agree that our static high- and low-reasoning prompts do not replicate the sequential clue delivery, early-buzzing incentive structure, or real-time team competition of NSMQ round 5. The high-reasoning setting supplies all clues at once to evaluate information integration, which is a related but distinct task from the live format. We will revise the Evaluation section to describe the protocol limitations clearly, discuss why the current setup still demonstrates the benchmark's difficulty, and caution against treating the student comparison as a direct equivalence. The abstract will be updated to reflect this nuance. revision: yes

standing simulated objections not resolved
  • Granular per-riddle student performance data (accuracy or points) from the original NSMQ competitions on the exact 1.8K riddles may not be publicly available or recorded, preventing a fully identical-set baseline.

Circularity Check

0 steps flagged

No circularity: external benchmark with independent empirical evaluation

full rationale

The paper collects an external dataset of 1.8K riddles from Ghana's NSMQ competition (an independent live student contest) and applies standard LLM prompting to obtain performance numbers. The claim that SOTA models underperform top students rests on direct comparison to reported human results from that external source, with no equations, fitted parameters, self-referential predictions, or load-bearing self-citations that reduce the result to the paper's own inputs. The evaluation protocol is a straightforward application of models to the new benchmark rather than any derivation that loops back on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark creation paper. It introduces no free parameters, mathematical axioms, or invented entities; the contribution rests on the collection and use of existing competition riddles for LLM testing.

pith-pipeline@v0.9.0 · 5704 in / 1132 out tokens · 62751 ms · 2026-05-11T01:08:07.313470+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

20 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmark (2025)

    Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., Vechev, M.: Matharena: Evaluating llms on uncontaminated math competitions. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmark (2025)

  3. [3]

    In Workshop on Practical Machine Learning for Developing Countries (PML4DC) at ICLR 2023 (2023)

    Boateng, G., Kumbol, V., Kaufmann, E.E.: Can an ai win ghana’s national science and maths quiz? an ai grand challenge for education. In Workshop on Practical Machine Learning for Developing Countries (PML4DC) at ICLR 2023 (2023)

  4. [4]

    In: Deep Learning Indaba 2023 (2023)

    Boateng, G., Mensah, J.A., Yeboah, K.T., Edor, W., Mensah-Onumah, A.K., Ibrahim, N.D., Yeboah, N.S.: Towards an ai to win ghana’s national science and maths quiz. In: Deep Learning Indaba 2023 (2023)

  5. [5]

    In: International Conference on Artificial Intelligence in Education

Boateng, G., Mensah, J.A., Yeboah, K.T., Edor, W., Mensah-Onumah, A.K., Ibrahim, N.D., Yeboah, N.S.: Brilla ai: Ai contestant for the national science and maths quiz. In: International Conference on Artificial Intelligence in Education. pp. 214–227. Springer (2024)

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018)

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  8. [8]

    In: International Conference on Learning Representations

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. In: International Conference on Learning Representations

  9. [9]

    In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  10. [10]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern recognition

    Kembhavi, A., Seo, M., Schwenk, D., Choi, J., Farhadi, A., Hajishirzi, H.: Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern recognition. pp. 4999–5007 (2017)

  11. [11]

LangChain. https://www.langchain.com/

  12. [12]

    In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

    Lin, B.Y., Wu, Z., Yang, Y., Lee, D.H., Ren, X.: Riddlesense: Reasoning about riddle questions featuring linguistic creativity and commonsense knowledge. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 1504–1515 (2021)

  13. [13]

    Advances in Neural Information Processing Systems 35, 2507–2521 (2022)

    Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35, 2507–2521 (2022)

  14. [14]

Mathpix. https://mathpix.com/

  15. [15]

    In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

    Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a suit of armor conduct electricity? a new dataset for open book question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 2381–2391 (2018)

  16. [16]

National Science and Maths Quiz. https://nsmq.com.gh/

  17. [17]

    In: First Conference on Language Modeling (2024)

    Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: Gpqa: A graduate-level google-proof q&a benchmark. In: First Conference on Language Modeling (2024)

  18. [18]

    In: International Conference on Machine Learning

Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A.R., Zhang, S., Sun, Y., Wang, W.: Scibench: Evaluating college-level scientific problem-solving abilities of large language models. In: International Conference on Machine Learning. pp. 50622–50649. PMLR (2024)

  19. [19]

Advances in Neural Information Processing Systems 36 (2024)

Zhang, W., Aljunied, M., Gao, C., Chia, Y.K., Bing, L.: M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems 36 (2024)

  20. [20]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Zhang, Y., Wan, X.: Birdqa: A bilingual dataset for question answering on tricky riddles. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 11748–11756 (2022)