RMA: an Agentic System for Research-Level Mathematical Problems
Pith reviewed 2026-05-25 06:03 UTC · model grok-4.3
The pith
RMA solves eight out of ten research-level mathematical problems by coordinating agents for analysis, literature search, and iterative proof refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. This enables solving eight out of ten research problems on the First Proof benchmark with more logically sound and readable proofs than strong baselines.
What carries the argument
The multi-role, multi-round workflow with shared structured memory coordinating specialized modules for analysis, literature search, comparison, knowledge construction, and verification.
If this is right
- Performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback rather than any single component.
- The framework can address long-horizon reasoning and literature grounding required in research-level mathematics beyond competition problems or formal theorem proving.
- RMA produces proofs rated higher in logical soundness and readability by experts compared to baselines including GPT-5.2R and Aletheia.
Where Pith is reading between the lines
- The shared structured memory may support scaling the approach to problems that require synthesizing information across many papers.
- Applying the same module decomposition to domains like theoretical computer science could test whether the workflow generalizes beyond the tested benchmark.
- Adding formal verification tools to the verifier module might reduce reliance on expert human judgment for soundness.
Load-bearing premise
Expert evaluation of the generated proofs is reliable, unbiased, and consistent across the ten contributed problems, and the First Proof benchmark accurately captures the requirements of genuine research-level mathematics.
What would settle it
Independent expert re-evaluation of the proofs finding that fewer than six of the eight claimed solutions are logically sound and readable, or failure to solve a majority of additional research problems contributed by mathematicians in the same domains.
Figures
read the original abstract
We present $\textbf{Research Math Agents (RMA)}$, an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior studies centered on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse domains. Through comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight out of ten research problems and producing more logically sound and readable proofs. Our comprehensive ablation studies further show that performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than any single component. Our solutions and implementations will be made publicly available upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Research Math Agents (RMA), a multi-agent framework that decomposes research-level proof construction into modules for problem analysis, literature search, fair comparison, knowledge-bank construction, and verification. These are coordinated via initializer, proposer, and verifier agents operating in a multi-round workflow with shared structured memory. The system is evaluated on the newly introduced First Proof benchmark consisting of ten research-level problems contributed by expert mathematicians; the central claim is that RMA solves eight of the ten problems, outperforms baselines including GPT-5.2R and Aletheia, and produces more logically sound and readable proofs according to expert judgment, with ablations attributing gains to module interaction and iterative verifier feedback.
Significance. If the empirical claims were supported by transparent, reproducible evaluation, the work would be significant for shifting automated reasoning research from contest-style or formally verified problems toward open research mathematics that requires literature grounding and long-horizon iteration. The modular agent design and emphasis on iterative refinement represent a concrete architectural proposal that could be tested in other domains.
major comments (3)
- [Abstract and §4] Abstract and §4 (Evaluation): The headline result that RMA solves 8/10 problems and outperforms GPT-5.2R and Aletheia rests entirely on expert judgments of logical soundness and readability, yet the manuscript supplies no description of the evaluation protocol, blinding procedure, rubric, inter-rater agreement statistics, or how the ten contributed problems were confirmed to be genuine open research questions rather than contest-style items. Without these details the central empirical claim cannot be assessed.
- [Abstract and §3] Abstract and §3 (Benchmark): The First Proof benchmark is defined and contributed within the paper itself, creating an unaddressed risk of circularity; no external validation, pre-registration, or independent confirmation that the problems meet the stated criteria of research-level mathematics is reported. This directly affects the reliability of the 8/10 success rate.
- [Abstract] Abstract: Ablation studies are invoked to attribute performance gains to module interaction and verifier feedback, but the manuscript provides neither the ablation table, the exact configurations tested, nor quantitative deltas, rendering the causal claim about component contributions unverifiable from the given text.
minor comments (1)
- [Abstract] The abstract refers to 'comprehensive expert evaluation' and 'comprehensive ablation studies' without cross-references to specific tables or appendices that would allow readers to locate the supporting data.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where greater transparency is needed in the evaluation protocol, benchmark documentation, and ablation results. We address each point below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Evaluation): The headline result that RMA solves 8/10 problems and outperforms GPT-5.2R and Aletheia rests entirely on expert judgments of logical soundness and readability, yet the manuscript supplies no description of the evaluation protocol, blinding procedure, rubric, inter-rater agreement statistics, or how the ten contributed problems were confirmed to be genuine open research questions rather than contest-style items. Without these details the central empirical claim cannot be assessed.
Authors: We agree that the evaluation protocol requires explicit documentation. In the revised manuscript we will add a dedicated subsection in §4 that fully describes the expert evaluation procedure. This will include the scoring rubric for logical soundness and readability, the blinding protocol (experts received anonymized proofs without system identifiers), inter-rater agreement statistics, and the vetting process used by contributing mathematicians to confirm the problems represent open research questions rather than contest-style items. The evaluation guidelines will also be provided as supplementary material. revision: yes
-
Referee: [Abstract and §3] Abstract and §3 (Benchmark): The First Proof benchmark is defined and contributed within the paper itself, creating an unaddressed risk of circularity; no external validation, pre-registration, or independent confirmation that the problems meet the stated criteria of research-level mathematics is reported. This directly affects the reliability of the 8/10 success rate.
Authors: The First Proof benchmark is a new contribution introduced in this paper. Problems were curated in direct consultation with expert mathematicians who confirmed they require literature grounding and long-horizon iteration. Pre-registration was not feasible because the benchmark was developed concurrently with the system. In revision we will expand §3 with additional documentation of the curation criteria and anonymized expert input on problem selection. We will also release the full benchmark publicly upon acceptance to enable independent verification. revision: partial
-
Referee: [Abstract] Abstract: Ablation studies are invoked to attribute performance gains to module interaction and verifier feedback, but the manuscript provides neither the ablation table, the exact configurations tested, nor quantitative deltas, rendering the causal claim about component contributions unverifiable from the given text.
Authors: We acknowledge that the ablation results were summarized rather than presented in full tabular form. The revised manuscript will include a complete ablation table in §4 listing all tested configurations (e.g., without structured memory, without iterative verifier feedback), the exact parameter settings, and quantitative deltas for both success rate and expert scores. This will allow readers to verify the contribution of each component. revision: yes
Circularity Check
No significant circularity detected.
full rationale
The paper presents an empirical agentic framework evaluated on a newly introduced benchmark via expert judgments. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central performance claims (8/10 solved, outperforming baselines) rest on external-style empirical reporting rather than any self-definitional construct, fitted-input-as-prediction, or self-citation load-bearing step that reduces the result to its inputs by construction. No enumerated circularity pattern is exhibited with a quotable reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The logic theory machine-a complex information processing system
A Newell and H Simon. The logic theory machine-a complex information processing system. IRE Transactions on Information Theory, 2(3):61–79, 1956
work page 1956
-
[2]
Measuring mathematical problem solving with the math dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021
work page 2021
-
[3]
Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024
Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024
work page 2024
-
[4]
Llemma: An Open Language Model For Mathematics
Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics.arXiv preprint arXiv:2310.10631, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
The lean 4 theorem prover and programming language
Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InAutomated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, page 625–635, Berlin, Heidelberg,
work page 2021
-
[6]
End-to-end differentiable proving.Advances in neural information processing systems, 30, 2017
Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving.Advances in neural information processing systems, 30, 2017
work page 2017
-
[7]
Generative Language Modeling for Automated Theorem Proving
Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[8]
Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022
work page 2022
-
[9]
Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023
work page 2023
-
[10]
Team AlphaProof and Team AlphaGeometry. Ai achieves silver-medal standard solving interna- tional 178 mathematical olympiad problems.DeepMind blog, 179:45, 2024
work page 2024
-
[11]
Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...
work page 2025
-
[12]
Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.arXiv preprint arXiv:2405.14333, 2024
-
[13]
Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis H...
work page 2026
-
[14]
autoresearch.https://github.com/karpathy/autoresearch
Andrej Karpathy. autoresearch.https://github.com/karpathy/autoresearch
-
[15]
Max Zimmer, Nico Pelleriti, Christophe Roux, and Sebastian Pokutta. The agentic researcher: A practical guide to ai-assisted research in mathematics and machine learning, 2026
work page 2026
-
[16]
First proof.arXiv preprint arXiv:2602.05192, 2026
Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, et al. First proof.arXiv preprint arXiv:2602.05192, 2026. 10
-
[17]
Tony Feng, Junehyuk Jung, Sang hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V . Le, and Thang Luong. Aletheia tackles firstproof autonomously, 2026
work page 2026
-
[18]
OpenAI. First proof submissions. https://openai.com/index/ first-proof-submissions/, 2025. Accessed: 2026-04-18
work page 2025
-
[19]
Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools
Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28489–28503, 2025
work page 2025
-
[20]
Agentic strategy design for math proofs
Dietmar Wolz. Agentic strategy design for math proofs. https://althofer.de/agentic_ strategy_design_for_math_proofs.pdf, 2026. Accessed February 12, 2026
work page 2026
-
[21]
Aflow: Automating agentic workflow generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[22]
CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
Bochen Han and Songmao Zhang. Exploring advanced llm multi-agent systems based on blackboard architecture.arXiv preprint arXiv:2507.01701, 2025
-
[24]
Qihuang Zhong, Kang Wang, Ziyang Xu, Liang Ding, Juhua Liu, and Bo Du. Achieving> 97% on gsm8k: Deeply understanding the problems makes llms better solvers for math word problems.Frontiers of Computer Science, 20(1):1–3, 2026
work page 2026
-
[25]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024
work page 2024
-
[26]
OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/, 2025. Research leads: Isa Fulford, Zhiqing Sun
work page 2025
-
[27]
Google DeepMind. Gemini deep research. https://gemini.google/overview/ deep-research/, 2025. AI research assistant system
work page 2025
-
[28]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023
work page 2023
-
[29]
Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11316–11360, 2024
work page 2024
-
[30]
Mathematical capabilities of chatgpt
Simon Frieder, Luca Pinchetti, Chevalier Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Advances in neural information processing systems, 36:27699–27744, 2023
work page 2023
-
[31]
Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, and Dirk Englund. Ax-prover: A deep reasoning agentic framework for theorem proving in mathematics and quantum physics, 2025
work page 2025
-
[32]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 11
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[33]
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? InProceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 2080–2094, 2021
work page 2021
-
[34]
A diverse corpus for evaluating and developing english math word problem solvers
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers. InProceedings of the 58th annual meeting of the Association for Computational Linguistics, pages 975–984, 2020
work page 2020
-
[35]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...
work page 2024
-
[37]
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai.arXiv preprint arXiv:2411.04872, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Learning Explanatory Rules from Noisy Data
Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data.CoRR, abs/1711.04574, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[40]
Lemmahead: Rag assisted proof generation using large language models.ArXiv, abs/2501.15797, 2025
Tianbo Yang, Mingqi Yang, Hongyi Zhao, and Tianshuo Yang. Lemmahead: Rag assisted proof generation using large language models.ArXiv, abs/2501.15797, 2025
-
[41]
Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, et al. Real-prover: Retrieval augmented lean prover for mathematical reasoning.arXiv preprint arXiv:2505.20613, 2025
-
[42]
Proof artifact co-training for theorem proving with language models
Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. InInternational Conference on Learning Representations, 2022
work page 2022
-
[43]
Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022
work page 2022
-
[44]
Draft, sketch, and prove: Guiding formal theorem provers with informal proofs
Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[45]
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024
work page 2024
-
[46]
Mario: Math reasoning with code interpreter output-a reproducible pipeline
Minpeng Liao, Chengxi Li, Wei Luo, Wu Jing, and Kai Fan. Mario: Math reasoning with code interpreter output-a reproducible pipeline. InFindings of the Association for Computational Linguistics: ACL 2024, pages 905–924, 2024
work page 2024
-
[47]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 12
work page 2022
-
[48]
Self-consistency improves chain of thought reasoning in language models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[49]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023
work page 2023
-
[50]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompt- ing: Disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research, 2023
work page 2023
-
[51]
Pal: Program-aided language models
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. InInternational conference on machine learning, pages 10764–10799. PMLR, 2023
work page 2023
-
[52]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023
work page 2023
-
[53]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022
work page 2022
-
[54]
Autogen: Enabling next-gen llm applications via multi-agent conversations
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024
work page 2024
-
[55]
Openhands: An open platform for ai software developers as generalist agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[56]
Kuo Zhou and Lu Zhang. Step-wise formal verification for llm-based mathematical problem solving.arXiv preprint arXiv:2505.20869, 2025
-
[57]
Codex | ai coding partner from openai
OpenAI. Codex | ai coding partner from openai. https://openai.com/codex/. Accessed: 2026-04-26
work page 2026
-
[58]
Claude code.https://code.claude.com/docs/en/verview
Anthropic. Claude code.https://code.claude.com/docs/en/verview
-
[59]
Build, debug & deploy with ai.https://geminicli.com/
Google. Build, debug & deploy with ai.https://geminicli.com/. Accessed: 2026-04-26
work page 2026
-
[60]
arxiv e-print archive.https://arxiv.org
Cornell University. arxiv e-print archive.https://arxiv.org. Accessed: 2026
work page 2026
-
[61]
American Mathematical Society. Mathscinet. https://mathscinet.ams.org. Accessed: 2026
work page 2026
-
[62]
An overview of zbmath open digital library
Madhurima Deb, Isabel Beckenbach, Matteo Petrera, Dariush Ehsani, Marcel Fuhrmann, Yun Hao, Olaf Teschke, and Moritz Schubotz. An overview of zbmath open digital library. In Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries, JCDL ’24, page 1–5. ACM, December 2024
work page 2024
-
[63]
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory.arXiv preprint arXiv:2509.25140, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [64]
-
[65]
Cambridge university press, 2004
Stephen Boyd and Lieven Vandenberghe.Convex optimization. Cambridge university press, 2004
work page 2004
-
[66]
Cambridge university press, 2017
Michael Mitzenmacher and Eli Upfal.Probability and computing: Randomization and proba- bilistic techniques in algorithms and data analysis. Cambridge university press, 2017. 13
work page 2017
-
[67]
Towards mitigating llm hallucination via self reflection
Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm hallucination via self reflection. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, 2023
work page 2023
-
[68]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 14 Table 4: The 10 problems of the First Proof benchmark. # Area T...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.