Agentic Frameworks for Reasoning Tasks: An Empirical Study
Pith reviewed 2026-05-10 08:20 UTC · model grok-4.3
The pith
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This study provides the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks and shows that framework selection should prioritize orchestration quality, especially memory control, failure handling, and cost management.
Load-bearing premise
That the 22 frameworks selected from 1,200 GitHub repositories are representative of the field and that the unified evaluation setting provides a fair, unbiased comparison without hidden differences in implementation or prompting.
original abstract
Recent advances in agentic frameworks have enabled AI agents to perform complex reasoning and decision-making. However, evidence comparing their reasoning performance, efficiency, and practical suitability remains limited. To address this gap, we empirically evaluate 22 widely used agentic frameworks across three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from 1,200 GitHub repositories collected between January 2023 and July 2025 and organized into a taxonomy based on architectural design. We evaluated them under a unified setting, measuring reasoning accuracy, execution time, computational cost, and cross-benchmark consistency. Our results show that 19 of the 22 frameworks completed all three benchmarks. Among these, 12 showed stable performance, with mean accuracy of 74.6-75.9%, execution time of 4-6 seconds per task, and cost of 0.14-0.18 cents per task. Poorer results were mainly caused by orchestration problems rather than reasoning limits. For example, Camel failed to complete BBH after 11 days because of uncontrolled context growth, while Upsonic consumed USD 1,434 in one day because repeated extraction failures triggered costly retries. AutoGen and Mastra also exhausted API quotas through iterative interactions that increased prompt length without improving results. We also found a sharp drop in mathematical reasoning. Mean accuracy on GSM8K was 44.35%, compared with 89.80% on BBH and 89.56% on ARC. Overall, this study provides the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks and shows that framework selection should prioritize orchestration quality, especially memory control, failure handling, and cost management.
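To make the unified setting concrete: a minimal sketch of a per-task measurement loop of the kind the abstract describes, assuming each framework is wrapped in a hypothetical framework_run(prompt) adapter that returns an answer and a token count. The pricing constant and task format are illustrative assumptions, not the authors' code.

    import time

    def evaluate(framework_run, tasks, cost_per_1k_tokens=0.002):
        """Measure accuracy, latency, and cost for one framework on one benchmark."""
        correct, total_seconds, total_tokens = 0, 0.0, 0
        for task in tasks:  # task: {"prompt": str, "answer": str}
            start = time.monotonic()
            answer, tokens_used = framework_run(task["prompt"])
            total_seconds += time.monotonic() - start
            total_tokens += tokens_used
            correct += int(answer.strip() == task["answer"].strip())
        n = len(tasks)
        return {
            "accuracy": correct / n,
            "seconds_per_task": total_seconds / n,
            "cents_per_task": 100.0 * cost_per_1k_tokens * (total_tokens / 1000) / n,
        }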
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study evaluating 22 agentic frameworks, selected from 1,200 GitHub repositories, on three reasoning benchmarks: BBH, GSM8K, and ARC. Under a unified setting, it measures accuracy, execution time, cost, and consistency. Results indicate that 19 frameworks completed the benchmarks, with 12 exhibiting stable performance (mean accuracy 74.6-75.9%, time 4-6s, cost 0.14-0.18 cents per task). Failures are attributed to orchestration issues such as uncontrolled context growth (Camel) and excessive retry costs (Upsonic). A performance drop is noted on GSM8K (44.35% vs. ~89% on others). The paper concludes that framework selection for reasoning-intensive software engineering tasks should prioritize orchestration quality, memory control, failure handling, and cost management.
Significance. If the empirical results hold, the work provides a valuable large-scale comparison with specific, actionable insights into the practical limitations of agentic frameworks, including quantifiable failure modes and cost overruns. The unified evaluation setting and concrete examples (e.g., Camel's context growth, Upsonic's retry costs) strengthen the contribution, and the controlled-setting metrics are a strength for empirical agent research.
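The failure modes catalogued above suggest what "orchestration quality" means operationally. A minimal sketch, not the paper's code, of the three controls its conclusion names: memory control (the Camel failure mode), failure handling, and cost management (the Upsonic failure mode). The llm_call adapter, budget constants, and answer format are hypothetical.

    MAX_CONTEXT_CHARS = 32_000   # memory control: bound the rolling context
    MAX_RETRIES = 3              # failure handling: never retry unboundedly
    MAX_SPEND_USD = 5.00         # cost management: hard budget per task

    def guarded_agent_loop(llm_call, task_prompt):
        history, spend = [task_prompt], 0.0
        for attempt in range(MAX_RETRIES):
            # Keep only the most recent context instead of appending forever
            # (uncontrolled growth is what stalled Camel on BBH).
            context = "\n".join(history)[-MAX_CONTEXT_CHARS:]
            reply, cost = llm_call(context)  # adapter reports per-call USD cost
            spend += cost
            # Abort before retries compound into runaway spend
            # (repeated extraction retries are what cost Upsonic USD 1,434).
            if spend > MAX_SPEND_USD:
                raise RuntimeError(f"budget exceeded after {attempt + 1} calls")
            answer = extract_answer(reply)
            if answer is not None:
                return answer
            history.append(reply)  # retry with the failure visible in context
        return None  # give up explicitly rather than looping forever

    def extract_answer(reply):
        # Hypothetical extractor: expects a final line like "ANSWER: <value>".
        for line in reply.splitlines():
            if line.startswith("ANSWER:"):
                return line.removeprefix("ANSWER:").strip()
        return None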
major comments (2)
- [Abstract] The assertion that the study is 'the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks' and the recommendation that 'framework selection should prioritize orchestration quality... for reasoning-intensive software engineering tasks' are not supported by the evidence. The evaluation is limited to the general reasoning benchmarks BBH, GSM8K, and ARC, with no tasks involving code, debugging, repositories, or engineering workflows; this scope mismatch leaves the SE-specific conclusions unsubstantiated.
- [Results] The performance metrics for the 12 stable frameworks (mean accuracy 74.6-75.9%, execution time 4-6 seconds, cost 0.14-0.18 cents per task) are reported as ranges without error bars, standard deviations, the number of trials per task, or statistical significance tests. This omission undermines the reliability of cross-framework comparisons and the claim of 'stable performance'.
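A minimal sketch of the reporting this comment asks for, assuming per-task binary correctness scores had been logged for each framework over the same task set (a hypothetical data shape, not the paper's released artifacts):

    import random
    import statistics

    def summarize(scores):
        """Mean accuracy and standard deviation over per-task 0/1 scores."""
        return statistics.fmean(scores), statistics.stdev(scores)

    def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, seed=0):
        """95% bootstrap CI for the paired accuracy gap between two frameworks."""
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        means = sorted(
            statistics.fmean(rng.choices(diffs, k=len(diffs)))
            for _ in range(n_boot)
        )
        return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

If the interval excludes zero, the accuracy gap between two frameworks is unlikely to be resampling noise; reporting it would substantiate the 'stable performance' claim.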
minor comments (2)
- [Abstract] The collection period 'January 2023 and July 2025' may postdate the manuscript's initial submission; the cutoff should be clarified or corrected.
- The taxonomy organizing the 22 frameworks by architectural design is mentioned but not illustrated or tabulated; adding a summary table or figure would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below. We have made revisions to the abstract and results section to address the concerns raised.
point-by-point responses
- Referee: [Abstract] The assertion that the study is 'the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks' and the recommendation that 'framework selection should prioritize orchestration quality... for reasoning-intensive software engineering tasks' are not supported by the evidence. The evaluation is limited to the general reasoning benchmarks BBH, GSM8K, and ARC, with no tasks involving code, debugging, repositories, or engineering workflows; this scope mismatch leaves the SE-specific conclusions unsubstantiated.
Authors: We agree that the abstract's language linking the study directly to software engineering tasks is not fully supported by the benchmarks used, which are general reasoning tasks. Our intent was to highlight potential relevance to SE, since reasoning is central to such tasks, but we acknowledge the scope mismatch. In the revised manuscript, we have updated the abstract to state that this is the first large-scale empirical comparison of agentic frameworks on reasoning benchmarks, removed the SE-specific recommendation, and instead suggest that the findings on orchestration and cost management may be relevant to software engineering applications. This aligns the claims with the evidence presented. Revision: yes.
- Referee: [Results] The performance metrics for the 12 stable frameworks (mean accuracy 74.6-75.9%, execution time 4-6 seconds, cost 0.14-0.18 cents per task) are reported as ranges without error bars, standard deviations, the number of trials per task, or statistical significance tests. This omission undermines the reliability of cross-framework comparisons and the claim of 'stable performance'.
Authors: The ranges represent the variation observed across the 12 stable frameworks rather than statistical summaries within individual frameworks. Each framework was evaluated once per benchmark under the unified setting to ensure comparability, which limited our ability to compute within-framework standard deviations or perform extensive statistical tests. We have revised the results section to state explicitly that these are observed ranges across frameworks, clarified the single-run nature of the evaluation, and added a discussion of this as a limitation. Where possible, we have included consistency metrics across the three benchmarks as a proxy for stability. We believe this addresses the concern while reflecting the practical constraints of evaluating 22 frameworks. Revision: yes.
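To make the cross-benchmark consistency proxy mentioned in the response concrete: a minimal sketch that scores each framework by the spread of its accuracy over the three benchmarks. The numbers below are illustrative placeholders, not the paper's data.

    import statistics

    accuracies = {  # framework -> per-benchmark accuracy (placeholder values)
        "framework_a": {"BBH": 0.90, "GSM8K": 0.45, "ARC": 0.89},
        "framework_b": {"BBH": 0.88, "GSM8K": 0.47, "ARC": 0.90},
    }

    for name, per_bench in accuracies.items():
        values = list(per_bench.values())
        spread = statistics.pstdev(values)  # lower spread = more consistent
        print(f"{name}: mean={statistics.fmean(values):.3f}, spread={spread:.3f}")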
Circularity Check
No circularity: pure empirical evaluation on external benchmarks
full rationale
The paper conducts an empirical study selecting 22 frameworks from 1,200 GitHub repositories and evaluating them on the public benchmarks BBH, GSM8K, and ARC under a unified setting. No derivations, equations, fitted parameters, or self-citations are used to generate predictions that reduce to the inputs by construction. Results on accuracy, time, cost, and failures (e.g., context growth in Camel) are direct measurements. The abstract's phrasing linking the findings to 'reasoning-intensive software engineering tasks' while using general reasoning benchmarks is a potential overgeneralization, but not a circular reduction under the enumerated patterns. The study evaluates against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: BBH, GSM8K, and ARC validly measure reasoning performance in agentic frameworks