Agentic Frameworks for Reasoning Tasks: An Empirical Study
Pith reviewed 2026-05-10 08:20 UTC · model grok-4.3
The pith
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This study provides the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks and shows that framework selection should prioritize orchestration quality, especially memory control, failure handling, and cost management.
Load-bearing premise
That the 22 frameworks selected from 1,200 GitHub repositories are representative of the field and that the unified evaluation setting provides a fair, unbiased comparison without hidden differences in implementation or prompting.
original abstract
Recent advances in agentic frameworks have enabled AI agents to perform complex reasoning and decision-making. However, evidence comparing their reasoning performance, efficiency, and practical suitability remains limited. To address this gap, we empirically evaluate 22 widely used agentic frameworks across three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from 1,200 GitHub repositories collected between January 2023 and July 2025 and organized into a taxonomy based on architectural design. We evaluated them under a unified setting, measuring reasoning accuracy, execution time, computational cost, and cross-benchmark consistency. Our results show that 19 of the 22 frameworks completed all three benchmarks. Among these, 12 showed stable performance, with mean accuracy of 74.6-75.9%, execution time of 4-6 seconds per task, and cost of 0.14-0.18 cents per task. Poorer results were mainly caused by orchestration problems rather than reasoning limits. For example, Camel failed to complete BBH after 11 days because of uncontrolled context growth, while Upsonic consumed USD 1,434 in one day because repeated extraction failures triggered costly retries. AutoGen and Mastra also exhausted API quotas through iterative interactions that increased prompt length without improving results. We also found a sharp drop in mathematical reasoning. Mean accuracy on GSM8K was 44.35%, compared with 89.80% on BBH and 89.56% on ARC. Overall, this study provides the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks and shows that framework selection should prioritize orchestration quality, especially memory control, failure handling, and cost management.
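To make the unified setting concrete: a minimal sketch of a per-task measurement loop of the kind the abstract describes, assuming each framework is wrapped in a hypothetical framework_run(prompt) adapter that returns an answer and a token count. The pricing constant and task format are illustrative assumptions, not the authors' code.

    import time

    def evaluate(framework_run, tasks, cost_per_1k_tokens=0.002):
        """Measure accuracy, latency, and cost for one framework on one benchmark."""
        correct, total_seconds, total_tokens = 0, 0.0, 0
        for task in tasks:  # task: {"prompt": str, "answer": str}
            start = time.monotonic()
            answer, tokens_used = framework_run(task["prompt"])
            total_seconds += time.monotonic() - start
            total_tokens += tokens_used
            correct += int(answer.strip() == task["answer"].strip())
        n = len(tasks)
        return {
            "accuracy": correct / n,
            "seconds_per_task": total_seconds / n,
            "cents_per_task": 100.0 * cost_per_1k_tokens * (total_tokens / 1000) / n,
        }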
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study evaluating 22 agentic frameworks, selected from 1,200 GitHub repositories, on three reasoning benchmarks: BBH, GSM8K, and ARC. Under a unified setting, it measures accuracy, execution time, cost, and consistency. Results indicate that 19 frameworks completed the benchmarks, with 12 exhibiting stable performance (mean accuracy 74.6-75.9%, time 4-6s, cost 0.14-0.18 cents per task). Failures are attributed to orchestration issues such as uncontrolled context growth (Camel) and excessive retry costs (Upsonic). A performance drop is noted on GSM8K (44.35% vs. ~89% on others). The paper concludes that framework selection for reasoning-intensive software engineering tasks should prioritize orchestration quality, memory control, failure handling, and cost management.
Significance. If the empirical results hold, the work provides a valuable large-scale comparison with specific, actionable insights into the practical limitations of agentic frameworks, including quantifiable failure modes and cost overruns. The unified evaluation setting and concrete examples (e.g., Camel's context growth, Upsonic's retry costs) strengthen the contribution, and the controlled-setting metrics are a strength for empirical agent research.
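The failure modes catalogued above suggest what "orchestration quality" means operationally. A minimal sketch, not the paper's code, of the three controls its conclusion names: memory control (the Camel failure mode), failure handling, and cost management (the Upsonic failure mode). The llm_call adapter, budget constants, and answer format are hypothetical.

    MAX_CONTEXT_CHARS = 32_000   # memory control: bound the rolling context
    MAX_RETRIES = 3              # failure handling: never retry unboundedly
    MAX_SPEND_USD = 5.00         # cost management: hard budget per task

    def guarded_agent_loop(llm_call, task_prompt):
        history, spend = [task_prompt], 0.0
        for attempt in range(MAX_RETRIES):
            # Keep only the most recent context instead of appending forever
            # (uncontrolled growth is what stalled Camel on BBH).
            context = "\n".join(history)[-MAX_CONTEXT_CHARS:]
            reply, cost = llm_call(context)  # adapter reports per-call USD cost
            spend += cost
            # Abort before retries compound into runaway spend
            # (repeated extraction retries are what cost Upsonic USD 1,434).
            if spend > MAX_SPEND_USD:
                raise RuntimeError(f"budget exceeded after {attempt + 1} calls")
            answer = extract_answer(reply)
            if answer is not None:
                return answer
            history.append(reply)  # retry with the failure visible in context
        return None  # give up explicitly rather than looping forever

    def extract_answer(reply):
        # Hypothetical extractor: expects a final line like "ANSWER: <value>".
        for line in reply.splitlines():
            if line.startswith("ANSWER:"):
                return line.removeprefix("ANSWER:").strip()
        return None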
major comments (2)
- [Abstract] The assertion that the study is 'the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks' and the recommendation that 'framework selection should prioritize orchestration quality... for reasoning-intensive software engineering tasks' are not supported by the evidence. The evaluation is limited to the general reasoning benchmarks BBH, GSM8K, and ARC, with no tasks involving code, debugging, repositories, or engineering workflows; this scope mismatch leaves the SE-specific conclusions unsubstantiated.
- [Results] The performance metrics for the 12 stable frameworks (mean accuracy 74.6-75.9%, execution time 4-6 seconds, cost 0.14-0.18 cents per task) are reported as ranges without error bars, standard deviations, the number of trials per task, or statistical significance tests. This omission undermines the reliability of cross-framework comparisons and the claim of 'stable performance'.
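A minimal sketch of the reporting this comment asks for, assuming per-task binary correctness scores had been logged for each framework over the same task set (a hypothetical data shape, not the paper's released artifacts):

    import random
    import statistics

    def summarize(scores):
        """Mean accuracy and standard deviation over per-task 0/1 scores."""
        return statistics.fmean(scores), statistics.stdev(scores)

    def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, seed=0):
        """95% bootstrap CI for the paired accuracy gap between two frameworks."""
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        means = sorted(
            statistics.fmean(rng.choices(diffs, k=len(diffs)))
            for _ in range(n_boot)
        )
        return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

If the interval excludes zero, the accuracy gap between two frameworks is unlikely to be resampling noise; reporting it would substantiate the 'stable performance' claim.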
minor comments (2)
- [Abstract] The collection period 'January 2023 and July 2025' may postdate the manuscript's initial submission; the cutoff should be clarified or corrected.
- The taxonomy organizing the 22 frameworks by architectural design is mentioned but not illustrated or tabulated; adding a summary table or figure would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below. We have made revisions to the abstract and results section to address the concerns raised.
point-by-point responses
- Referee: [Abstract] The assertion that the study is 'the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks' and the recommendation that 'framework selection should prioritize orchestration quality... for reasoning-intensive software engineering tasks' are not supported by the evidence. The evaluation is limited to the general reasoning benchmarks BBH, GSM8K, and ARC, with no tasks involving code, debugging, repositories, or engineering workflows; this scope mismatch leaves the SE-specific conclusions unsubstantiated.
Authors: We agree that the abstract's language linking the study directly to software engineering tasks is not fully supported by the benchmarks used, which are general reasoning tasks. Our intent was to highlight potential relevance to SE, since reasoning is central to such tasks, but we acknowledge the scope mismatch. In the revised manuscript, we have updated the abstract to state that this is the first large-scale empirical comparison of agentic frameworks on reasoning benchmarks, removed the SE-specific recommendation, and instead suggest that the findings on orchestration and cost management may be relevant to software engineering applications. This aligns the claims with the evidence presented. Revision: yes.
- Referee: [Results] The performance metrics for the 12 stable frameworks (mean accuracy 74.6-75.9%, execution time 4-6 seconds, cost 0.14-0.18 cents per task) are reported as ranges without error bars, standard deviations, the number of trials per task, or statistical significance tests. This omission undermines the reliability of cross-framework comparisons and the claim of 'stable performance'.
Authors: The ranges represent the variation observed across the 12 stable frameworks rather than statistical summaries within individual frameworks. Each framework was evaluated once per benchmark under the unified setting to ensure comparability, which limited our ability to compute within-framework standard deviations or perform extensive statistical tests. We have revised the results section to state explicitly that these are observed ranges across frameworks, clarified the single-run nature of the evaluation, and added a discussion of this as a limitation. Where possible, we have included consistency metrics across the three benchmarks as a proxy for stability. We believe this addresses the concern while reflecting the practical constraints of evaluating 22 frameworks. Revision: yes.
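To make the cross-benchmark consistency proxy mentioned in the response concrete: a minimal sketch that scores each framework by the spread of its accuracy over the three benchmarks. The numbers below are illustrative placeholders, not the paper's data.

    import statistics

    accuracies = {  # framework -> per-benchmark accuracy (placeholder values)
        "framework_a": {"BBH": 0.90, "GSM8K": 0.45, "ARC": 0.89},
        "framework_b": {"BBH": 0.88, "GSM8K": 0.47, "ARC": 0.90},
    }

    for name, per_bench in accuracies.items():
        values = list(per_bench.values())
        spread = statistics.pstdev(values)  # lower spread = more consistent
        print(f"{name}: mean={statistics.fmean(values):.3f}, spread={spread:.3f}")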
Circularity Check
No circularity: pure empirical evaluation on external benchmarks
full rationale
The paper conducts an empirical study selecting 22 frameworks from 1,200 GitHub repositories and evaluating them on the public benchmarks BBH, GSM8K, and ARC under a unified setting. No derivations, equations, fitted parameters, or self-citations are used to generate predictions that reduce to the inputs by construction. Results on accuracy, time, cost, and failures (e.g., context growth in Camel) are direct measurements. The abstract's phrasing linking the findings to 'reasoning-intensive software engineering tasks' while using general reasoning benchmarks is a potential overgeneralization, but not a circular reduction under the enumerated patterns. The study evaluates against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: BBH, GSM8K, and ARC validly measure reasoning performance in agentic frameworks