
arxiv: 2606.06526 · v1 · pith:KJNVQKAXnew · submitted 2026-06-02 · 💻 cs.AI · cs.LG

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Pith reviewed 2026-06-28 09:43 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords CrowdMath · collaborative mathematical reasoning · LLM evaluation · dataset · role classification · next-post prediction · open-problem solving

The pith

The CrowdMath dataset shows that models follow math discussion sequences at 83-88% accuracy, but even the best model identifies post roles at only 0.42 macro-F1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CrowdMath, a dataset of 164 expert-annotated chains drawn from real collaborative research discussions whose results have led to peer-reviewed publications. Each chain records how participants advance an open problem through partial arguments, error spotting, repairs, and incremental synthesis. Two tasks are defined: next-post prediction and classification of each post's functional role, such as partial progress or error identification. Six frontier models reach 83-88% accuracy on next-post prediction, yet the best of them reaches only 0.42 macro-F1 on role classification. The work therefore isolates a specific shortfall in current models' grasp of how collaborative mathematical progress actually unfolds.
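To make the role-classification metric concrete, here is a minimal sketch of how macro-F1 is computed: F1 is calculated per role and then averaged with equal weight, so a rarely used role counts as much as a common one. The labels and predictions are invented toy data, not CrowdMath results, and the function names `per_class_f1` and `macro_f1` are hypothetical.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


# Toy data (invented): a classifier that always predicts the majority role.
gold = ["partial_progress"] * 8 + ["error_identification"] * 2
pred = ["partial_progress"] * 10

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
toy_macro_f1 = macro_f1(gold, pred)
```

On this toy input the always-majority classifier is right on 8 of 10 posts, but its macro-F1 is about 0.44 because it never recovers the rare role. That illustrates why a macro-averaged score is sensitive to the less frequent roles; it says nothing about which roles the benchmarked models actually miss.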

Core claim

CrowdMath supplies 164 progress chains from the MIT PRIMES-AoPS program, each tracing a multi-participant forum thread from open-problem statement to completed proof, with every post labeled by its functional role in the solution process; benchmarking reveals that models can track the local sequence of discussion but cannot yet assign meaningful significance to individual contributions.

What carries the argument

The annotated progress chain, a sequence of forum posts labeled by functional roles including partial progress, proof completion, erroneous reasoning, and error identification.
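One way to picture such a chain is as a small data structure. The sketch below is an assumption for illustration only: the class and field names (`Role`, `Post`, `ProgressChain`, `problem_id`) are hypothetical, the post texts are placeholders, and only the four roles named above are listed even though the dataset's label set may be larger.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Only the four roles named in the text; the full label scheme may differ.
    PARTIAL_PROGRESS = "partial progress"
    PROOF_COMPLETION = "proof completion"
    ERRONEOUS_REASONING = "erroneous reasoning"
    ERROR_IDENTIFICATION = "error identification"


@dataclass
class Post:
    author: str
    text: str
    role: Role


@dataclass
class ProgressChain:
    problem_id: str  # hypothetical identifier for the open problem
    posts: list[Post] = field(default_factory=list)

    def role_sequence(self):
        """Roles in posting order, e.g. as targets for role classification."""
        return [post.role for post in self.posts]

    def is_complete(self):
        """True if some post is labeled as completing the proof."""
        return any(post.role is Role.PROOF_COMPLETION for post in self.posts)


# Placeholder chain; the texts are stand-ins, not real forum posts.
chain = ProgressChain(
    problem_id="example-problem",
    posts=[
        Post("user_a", "<partial argument>", Role.PARTIAL_PROGRESS),
        Post("user_b", "<flawed extension>", Role.ERRONEOUS_REASONING),
        Post("user_c", "<points out the flaw>", Role.ERROR_IDENTIFICATION),
        Post("user_a", "<repaired, full proof>", Role.PROOF_COMPLETION),
    ],
)
```

Keeping the role on each post, rather than on the chain, mirrors the post-level labeling described above and lets both tasks be read off the same object.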

If this is right

  • Models can follow the local flow of mathematical discussion but not the functional significance of contributions.
  • Existing benchmarks on well-specified problems do not measure ability to track open collaborative proofs.
  • The dataset supplies a concrete testbed for evaluating incremental collaborative reasoning.
  • Performance gaps on role classification point to missing capabilities needed for assisting live research threads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training on role-labeled chains could improve models' ability to participate in or summarize ongoing proofs.
  • The same annotation scheme might be applied to other collaborative platforms to test generality.
  • High next-post accuracy combined with low role accuracy suggests models capture surface patterns more readily than underlying proof dynamics.

Load-bearing premise

Expert annotations of post functional roles accurately capture meaningful contributions to the evolving proof, and the chosen tasks serve as valid proxies for collaborative reasoning ability.

What would settle it

A model reaching macro-F1 of 0.7 or higher on post-role classification while retaining 83%+ next-post accuracy would indicate the reported gap has closed.

Figures

Figures reproduced from arXiv: 2606.06526 by Anna Rumshisky, Jesse Geneson, Mikhail Gronas, Pavel Etingof, Sherin Muckatira, Slava Gerovitch.

Figure 1: Example progress chain from CROWDMATH. The chain shows how an open problem discussion develops, including the start, partial progress, proof completion, and a follow-up question–answer exchange. The example is drawn from the metric-dimension and hypercube subgraph thread for Open Problem 2020-6. CROWDMATH differs from prior resources in three key ways. First, it captures real-world research trajectories r…
Figure 2: Distribution of the 164 CROWDMATH progress chains across mathematical topics. The largest areas are pursuit–evasion games on graphs, extremal functions of forbidden sequences, metric dimension of graphs, and zero forcing on graphs. Topics contributing fewer than 2% of chains are grouped as “Other.”
(A) Chains per year (year: chains): 2016: 26, 2017: 59, 2018: 6, 2019: 11, 2020: 21, 2021: 19, 2022: 8, 2023: 2, 2024: 5, 2025: 7, To…
Figure 3: Post count versus text volume for CROWDMATH progress chains. Each point represents one completed result. The x-axis shows the number of posts in the chain, and the y-axis shows the total number of novel words after removing quoted prior-post content. Point color indicates the year of the completed result. posts. The full chain-construction algorithm is given in Appendix A.1, and the post-level label sche…
Figure 4: Row-normalized confusion matrices for Task 1 post-role classification across six frontier models. Each …
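For readers unfamiliar with the term in the Figure 4 caption, row normalization divides each row of a confusion matrix by the number of gold posts carrying that role, so each row shows how posts of one role were distributed across predicted roles. The following is a minimal sketch under that standard definition; the function name and the toy labels are hypothetical and unrelated to the paper's actual matrices.

```python
def row_normalized_confusion(gold, pred, labels):
    """Rows are gold roles, columns are predicted roles; each non-empty row sums to 1."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        counts[index[g]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A role with no gold posts keeps an all-zero row instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy data (invented): two roles, six posts.
labels = ["partial_progress", "error_identification"]
gold = ["partial_progress"] * 4 + ["error_identification"] * 2
pred = [
    "partial_progress", "partial_progress", "partial_progress", "error_identification",
    "partial_progress", "error_identification",
]
matrix = row_normalized_confusion(gold, pred, labels)
```

Because every row is scaled independently, a rare role's errors stay visible even when the raw counts are dominated by a common role.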
Original abstract

Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CrowdMath, a dataset comprising 164 expert-annotated progress chains drawn from the MIT PRIMES-AoPS CrowdMath program (2016-2025). Each chain follows a multi-participant forum discussion from an open problem to a completed proof, with individual posts labeled according to functional roles (e.g., partial progress, erroneous reasoning, error identification, proof completion). The authors define two tasks—next-post prediction and post-role classification—and benchmark six frontier LLMs, reporting 83-88% accuracy on the former and a best macro-F1 of 0.42 on the latter. They conclude that these results expose a gap between LLMs' performance on well-specified mathematical problems and their ability to model collaborative, open-ended mathematical progress.

Significance. If the dataset construction and task validity hold, CrowdMath would provide a valuable resource for studying LLM limitations in collaborative reasoning settings that more closely resemble actual research than standard math benchmarks. The grounding in discussions that have produced peer-reviewed publications is a concrete strength, as is the focus on incremental, multi-agent contributions rather than final answers.

major comments (3)
  1. [§3] §3 (Dataset and Annotation): The manuscript reports expert annotations for functional roles but supplies no information on the annotation protocol, number of annotators, or inter-annotator agreement. Because the role labels are the sole basis for the role-classification task and the central gap claim, the absence of these reliability metrics leaves the evaluation results difficult to interpret.
  2. [§4.2] §4.2 (Role Classification Task): The claim that 0.42 macro-F1 demonstrates inability to track collaborative reasoning rests on the assumption that the role labels cannot be recovered from surface features (post position, length, or lexical patterns). The paper provides no ablation or feature analysis showing that the task requires semantic understanding of the mathematical argument beyond what next-post prediction already captures at 83-88% accuracy.
  3. [§5] §5 (Experiments and Results): The gap conclusion is load-bearing on the interpretation of the role-classification result as a proxy for collaborative understanding. Without evidence that models fail specifically on the mathematical content of the roles (rather than on any multi-class labeling task), the performance difference does not yet establish the claimed distinction from well-specified problem solving.
minor comments (2)
  1. [Abstract] The abstract states that discussions have led to peer-reviewed publications but does not quantify how many of the 164 chains correspond to published results; adding this statistic would strengthen the dataset's claimed connection to real research outcomes.
  2. [§5] A table or figure presenting per-role F1 scores (rather than only macro-F1) would clarify which functional roles drive the low aggregate score.
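The second major comment asks whether role labels can be recovered from surface features alone. As a rough illustration of what such a check could look like, the sketch below fits a baseline that sees only a post's position in its chain (first, middle, or last) and predicts the most frequent role for that position. This is an assumed design, not anything described in the paper; the function names and the toy chains are hypothetical, and training chains are assumed to be non-empty.

```python
from collections import Counter, defaultdict


def position_bucket(index, chain_length):
    """Coarse position feature: 'first', 'last', or 'middle'."""
    if index == 0:
        return "first"
    if index == chain_length - 1:
        return "last"
    return "middle"


def fit_position_baseline(chains):
    """chains: list of role-label lists. Learn the majority role per position bucket."""
    by_bucket = defaultdict(Counter)
    overall = Counter()
    for roles in chains:
        for i, role in enumerate(roles):
            by_bucket[position_bucket(i, len(roles))][role] += 1
            overall[role] += 1
    model = {bucket: counts.most_common(1)[0][0] for bucket, counts in by_bucket.items()}
    # Fallback for buckets never seen in training: the overall majority role.
    model["fallback"] = overall.most_common(1)[0][0]
    return model


def predict_roles(model, chain_length):
    """Predict one role per post using position alone."""
    return [
        model.get(position_bucket(i, chain_length), model["fallback"])
        for i in range(chain_length)
    ]


# Toy training chains (invented), using role names from the text.
train = [
    ["partial_progress", "partial_progress", "proof_completion"],
    ["partial_progress", "erroneous_reasoning", "error_identification", "proof_completion"],
    ["partial_progress", "partial_progress", "partial_progress", "proof_completion"],
]
model = fit_position_baseline(train)
predicted = predict_roles(model, 4)
```

If a baseline of this kind scored close to the reported model results, the labels would be largely recoverable without reading the mathematics; if it scored far below, that would lend support to reading the task as content-dependent. Length and lexical features could be added in the same way.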

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on dataset documentation and task interpretation. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset and Annotation): The manuscript reports expert annotations for functional roles but supplies no information on the annotation protocol, number of annotators, or inter-annotator agreement. Because the role labels are the sole basis for the role-classification task and the central gap claim, the absence of these reliability metrics leaves the evaluation results difficult to interpret.

    Authors: We agree that these details are necessary for assessing label reliability. The revised manuscript will expand §3 with a dedicated subsection on the annotation protocol (including guidelines provided to annotators), the number of expert annotators, and inter-annotator agreement metrics (e.g., Fleiss' kappa). revision: yes

  2. Referee: [§4.2] §4.2 (Role Classification Task): The claim that 0.42 macro-F1 demonstrates inability to track collaborative reasoning rests on the assumption that the role labels cannot be recovered from surface features (post position, length, or lexical patterns). The paper provides no ablation or feature analysis showing that the task requires semantic understanding of the mathematical argument beyond what next-post prediction already captures at 83-88% accuracy.

    Authors: This observation is fair. We will add an ablation analysis to §4.2 comparing LLM performance against simple baselines using only surface features (post position, length, and lexical patterns). This will test whether role classification requires semantic understanding of the mathematical contributions beyond what next-post prediction captures. revision: yes

  3. Referee: [§5] §5 (Experiments and Results): The gap conclusion is load-bearing on the interpretation of the role-classification result as a proxy for collaborative understanding. Without evidence that models fail specifically on the mathematical content of the roles (rather than on any multi-class labeling task), the performance difference does not yet establish the claimed distinction from well-specified problem solving.

    Authors: We acknowledge the need for clearer linkage. In the revision to §5, we will elaborate on how each role label is defined in terms of specific mathematical actions (e.g., error identification requires detecting flaws in prior reasoning steps) and include per-role performance breakdowns to highlight where models struggle with content-dependent aspects. This will better support the distinction from well-specified problem solving. revision: partial
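The first response names Fleiss' kappa as a possible agreement metric. For reference, here is a minimal sketch of the standard formula, assuming every post is labeled by the same number of annotators. The function name and the toy rating tables are hypothetical; no annotation counts from the paper are used.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators who gave item i the category j.

    Assumes every row sums to the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Share of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        return float("nan")             # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Toy tables (invented): 4 posts, 3 annotators, 2 roles.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
mixed = [[2, 1], [1, 2], [3, 0], [0, 3]]
kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

On these toy tables, full agreement gives a kappa of 1 and the mixed table gives 1/3. The fixed-annotator assumption matters: if different posts received different numbers of labels, another statistic such as Krippendorff's alpha would be the more natural choice.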

Circularity Check

0 steps flagged

No circularity: empirical dataset paper with no derivation chain

Full rationale

The paper introduces CrowdMath as a new annotated dataset of forum discussions, defines two evaluation tasks (next-post prediction and role classification), and reports model performance numbers on those tasks. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present. The central claim rests on the empirical observation that models perform differently on the two tasks; this is a direct reporting of benchmark results rather than any reduction of a result to its own inputs by construction. The skeptic concern about proxy validity is a question of task design, not circularity in a derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset introduction and benchmarking paper; the central claim does not rest on any free parameters, mathematical axioms, or invented entities.


