
arxiv: 2412.09819 · v1 · submitted 2024-12-13 · 💻 cs.LG · cs.SY · eess.SY

FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks

Pith reviewed 2026-05-23 06:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY
keywords FDM-Bench · Large Language Models · Additive Manufacturing · G-code anomaly detection · Benchmark evaluation · Fused Deposition Modeling · LLM performance · 3D printing defects

The pith

FDM-Bench tests large language models on fused deposition modeling tasks and finds closed-source models stronger at G-code anomaly detection while Llama-3.1-405B leads slightly on user queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FDM-Bench, a dataset of user queries spanning experience levels and G-code samples with varied anomalies, to measure how well LLMs can support FDM design, planning, and defect resolution. It runs four models through these tasks and has FDM experts score the outputs in detail. Closed-source models generally handle anomaly detection better, yet the largest open-source model shows a modest edge when answering user questions. A sympathetic reader would care because FDM remains hard for non-experts due to parameter complexity and print defects, and reliable LLM assistance could lower that barrier if the benchmark results hold.

Core claim

FDM-Bench supplies user queries across experience levels together with G-code samples that contain a range of anomalies; when GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B, and Llama-3.1-405B are evaluated on these items, expert panel review shows closed-source models generally outperform open-source models in anomaly detection while Llama-3.1-405B holds a slight advantage on user-query responses.

What carries the argument

FDM-Bench, the dataset of experience-stratified user queries and anomalous G-code samples that serves as the evaluation substrate for LLM performance on FDM tasks.

If this is right

  • LLMs can now be compared systematically on their ability to assist with FDM parameter setting and defect diagnosis.
  • Closed-source models appear preferable for automated G-code inspection tasks.
  • Larger open-source models may be competitive when the goal is conversational support for users of varying skill.
  • FDM-Bench supplies a reusable test set that can track future model improvements in additive manufacturing.
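
For the G-code task, a systematic comparison of this kind could be tabulated as a confusion matrix per model, in the spirit of what the Figure 2 caption describes. The following is a minimal sketch, not the paper's evaluation code: the class abbreviations (ND, UE, OE, SP) are taken from the Figure 1 caption, while the function names and the sample labels are hypothetical and exist only to show the bookkeeping.

```python
from collections import Counter

# Class abbreviations as they appear in the Figure 1 caption.
CLASSES = ["ND", "UE", "OE", "SP"]


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Return a nested dict where matrix[true][predicted] is a count."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(true_labels, predicted_labels))
    return {t: {p: counts.get((t, p), 0) for p in classes} for t in classes}


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[c][c] for c in matrix)
    return correct / total if total else 0.0


# Hypothetical labels for illustration only; these are NOT results from the paper.
truth = ["ND", "UE", "OE", "SP", "OE", "UE"]
preds = ["ND", "UE", "OE", "ND", "UE", "UE"]

matrix = confusion_matrix(truth, preds)
print(matrix["OE"])      # how the true-OE samples were classified
print(accuracy(matrix))  # overall accuracy for this toy model
```

Running the same two functions once per model gives directly comparable tables, which is the property that makes a shared benchmark useful.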

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same benchmark format could be adapted to other additive manufacturing methods such as SLA or SLS to test cross-process generalization.
  • If models improve on FDM-Bench, non-experts might begin using 3D printers with less formal training, changing who participates in small-scale manufacturing.
  • Standardized scoring rubrics beyond expert panels could make the benchmark easier to run at scale and reduce subjectivity.

Load-bearing premise

The chosen user queries and G-code anomalies represent the full range of real-world FDM challenges and the expert panel assessments give a reliable, consistent measure of model quality.

What would settle it

A fresh collection of real-world FDM queries and G-code anomalies that produces different performance rankings among the four models, or expert raters whose scores disagree substantially with the original panel.
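
Whether a fresh collection "produces different performance rankings" can be made concrete with a rank-correlation check. The sketch below computes Kendall's tau between two sets of per-model scores. It is an illustration under stated assumptions: the model names are placeholders, the scores are invented, and ties are assumed not to occur. Nothing here comes from the paper's data.

```python
from itertools import combinations


def ranking(scores):
    """Return model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same models (no ties assumed)."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both score sets must cover the same models")
    models = list(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        product = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical scores for placeholder models; NOT values reported in the paper.
original = {"model_A": 0.80, "model_B": 0.75, "model_C": 0.60, "model_D": 0.50}
replication = {"model_A": 0.70, "model_B": 0.78, "model_C": 0.62, "model_D": 0.40}

tau = kendall_tau(original, replication)
print(ranking(original))
print(ranking(replication))
print(round(tau, 3))
```

A tau near 1 would mean the replication preserves the original ordering; values near 0 or below would indicate the rankings did not carry over.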

Figures

Figures reproduced from arXiv: 2412.09819 by Adrian Jackson, Ahmadreza Eslaminia, Avi Stern, Beitong Tian, Chenhui Shao, Hallie Gordon, Klara Nahrstedt, Rajiv Malhotra.

Figure 1: Printed parts illustrating different FDM quality classes: (a) ND, (b) UE, (c) OE, and (d) SP.
  • OE occurs when excess filament is extruded, leading to overlapping print patterns and dimensional inaccuracies. To create over-extruded samples, we increase the extrusion multiplier above the standard setting, with values ranging from 1.3 to 1.6. While the flow rate is the primary parameter influencing OE, we …
Figure 2: Confusion matrices for G-code anomaly detection across four LLM …
Figure 3: Average probability assigned by each LLM model to the correct …
Figure 4: Average scores (1–5 scale) for each LLM model on free-form user …
Figure 5: Accuracy comparison of LLM models in answering multiple-choice …
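
The Figure 1 excerpt says over-extruded samples were made by raising the extrusion multiplier to between 1.3 and 1.6. It does not say whether this was done in the slicer or by editing G-code afterwards. As a rough illustration of the second route only, the sketch below multiplies the E (extruder) word on G0/G1 moves by a chosen factor. The function name and the sample G-code are hypothetical. The sketch also assumes that the extruder position is either relative or reset only to zero (`G92 E0`), and it scales retraction moves along with everything else, which a real workflow might not want.

```python
import re

# Matches the E (extruder) word of a move, e.g. "E0.4321", "E5", "E-1.5".
E_WORD = re.compile(r"E(-?(?:\d+\.?\d*|\.\d+))")
# Matches linear moves G0/G1 (also written G00/G01), but not G10, G11, etc.
MOVE = re.compile(r"\s*G0*[01]\b")


def scale_extrusion(gcode_text, multiplier):
    """Return G-code with every E value on G0/G1 moves multiplied by `multiplier`."""
    out = []
    for line in gcode_text.splitlines():
        # Split off any trailing comment so it is never rewritten.
        code, sep, comment = line.partition(";")
        if MOVE.match(code):
            code = E_WORD.sub(
                lambda m: f"E{float(m.group(1)) * multiplier:.5f}", code
            )
        out.append(code + sep + comment)
    return "\n".join(out)


# Hypothetical four-line G-code fragment, for illustration only.
sample = "\n".join([
    "G92 E0",
    "G1 X10 Y0 E0.5 F1500 ; first move",
    "G1 X10 Y10 E1.0",
    "G0 X0 Y0",
])

scaled = scale_extrusion(sample, 1.3)
print(scaled)
```

Changing the flow setting in the slicer and re-slicing would be the more conventional way to get the same effect. The post-processing form is shown because it makes the relationship between the multiplier and the G-code explicit.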
Original abstract

Fused Deposition Modeling (FDM) is a widely used additive manufacturing (AM) technique valued for its flexibility and cost-efficiency, with applications in a variety of industries including healthcare and aerospace. Recent developments have made affordable FDM machines accessible and encouraged adoption among diverse users. However, the design, planning, and production process in FDM require specialized interdisciplinary knowledge. Managing the complex parameters and resolving print defects in FDM remain challenging. These technical complexities form the most critical barrier preventing individuals without technical backgrounds and even professional engineers without training in other domains from participating in AM design and manufacturing. Large Language Models (LLMs), with their advanced capabilities in text and code processing, offer the potential for addressing these challenges in FDM. However, existing research on LLM applications in this field is limited, typically focusing on specific use cases without providing comprehensive evaluations across multiple models and tasks. To this end, we introduce FDM-Bench, a benchmark dataset designed to evaluate LLMs on FDM-specific tasks. FDM-Bench enables a thorough assessment by including user queries across various experience levels and G-code samples that represent a range of anomalies. We evaluate two closed-source models (GPT-4o and Claude 3.5 Sonnet) and two open-source models (Llama-3.1-70B and Llama-3.1-405B) on FDM-Bench. A panel of FDM experts assess the models' responses to user queries in detail. Results indicate that closed-source models generally outperform open-source models in G-code anomaly detection, whereas Llama-3.1-405B demonstrates a slight advantage over other models in responding to user queries. These findings underscore FDM-Bench's potential as a foundational tool for advancing research on LLM capabilities in FDM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FDM-Bench, a benchmark dataset for evaluating LLMs on FDM-specific tasks in additive manufacturing. It includes user queries across experience levels and G-code samples representing various anomalies. Four models are evaluated (GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B, Llama-3.1-405B), with a panel of FDM experts assessing responses to user queries in detail. The central claims are that closed-source models generally outperform open-source models on G-code anomaly detection, while Llama-3.1-405B shows a slight advantage on user queries, positioning FDM-Bench as a foundational evaluation tool.

Significance. If the benchmark samples prove representative and the expert assessments reliable, the work fills a documented gap in comprehensive, multi-model, multi-task LLM evaluation for AM. It supplies an empirical resource that future studies can build upon for standardized comparisons, with explicit credit due for constructing a new task suite rather than relying on ad-hoc prompts.

major comments (2)
  1. [Evaluation / Results (panel assessment description)] The comparative claims rest on expert-panel judgments of model outputs, yet the manuscript provides no details on panel size, selection criteria, calibration procedure, inter-rater agreement (e.g., Fleiss' kappa), or disagreement-resolution protocol. This information is required to substantiate the headline rankings (closed-source superiority on anomaly detection; Llama-3.1-405B edge on queries).
  2. [Benchmark Construction / Dataset Description] The representativeness claim—that queries span experience levels and G-code anomalies cover a range of real-world defects—is asserted without quantitative justification, sampling methodology, or comparison against documented FDM defect distributions. This directly affects whether the reported model orderings generalize beyond the chosen samples.
minor comments (2)
  1. [Abstract] The abstract would benefit from stating the exact number of user queries and G-code samples evaluated, allowing readers to gauge the scale of the reported comparisons immediately.
  2. [Results] Clarify whether G-code anomaly detection was scored by the same expert panel or by an automated metric; the current separation of tasks in the abstract leaves the evaluation protocol ambiguous.
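
The first major comment asks for inter-rater agreement such as Fleiss' kappa. For readers unfamiliar with the statistic, a minimal sketch of the standard formula follows. The rating table is invented (three hypothetical raters scoring four responses on a 1 to 5 scale) and is unrelated to the paper's panel, whose size is not reported.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same rater count."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(ratings[0])

    # Share of all assignments that fell into each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical counts: rows are responses, columns are scores 1..5, 3 raters each.
table = [
    [0, 0, 1, 2, 0],
    [0, 0, 0, 3, 0],
    [0, 1, 2, 0, 0],
    [0, 0, 0, 1, 2],
]

kappa = fleiss_kappa(table)
print(round(kappa, 4))
```

Fleiss' kappa treats the five scores as unordered categories. For an ordinal 1 to 5 scale, a weighted or ordinal agreement measure may be a better fit, and that choice would be for the authors to justify.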

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify important gaps in methodological transparency that affect the interpretability of our results. We respond to each major comment below.

point-by-point responses
  1. Referee: [Evaluation / Results (panel assessment description)] The comparative claims rest on expert-panel judgments of model outputs, yet the manuscript provides no details on panel size, selection criteria, calibration procedure, inter-rater agreement (e.g., Fleiss' kappa), or disagreement-resolution protocol. This information is required to substantiate the headline rankings (closed-source superiority on anomaly detection; Llama-3.1-405B edge on queries).

    Authors: We agree that these details are required to substantiate the reliability of the expert assessments. The submitted manuscript does not provide them. In the revision we will add a dedicated subsection describing panel size, expert selection criteria and qualifications, any calibration steps, inter-rater agreement statistics (including Fleiss' kappa), and the disagreement-resolution protocol. revision: yes

  2. Referee: [Benchmark Construction / Dataset Description] The representativeness claim—that queries span experience levels and G-code anomalies cover a range of real-world defects—is asserted without quantitative justification, sampling methodology, or comparison against documented FDM defect distributions. This directly affects whether the reported model orderings generalize beyond the chosen samples.

    Authors: We acknowledge that the manuscript asserts representativeness without quantitative justification or explicit sampling methodology. In the revision we will expand the benchmark-construction section to document the selection process for queries and G-code samples and to include any available comparisons against documented FDM defect distributions from the literature; where quantitative comparisons are not feasible we will explicitly state this limitation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation without derivations or fitted inputs

full rationale

The paper introduces FDM-Bench as a new dataset for LLM evaluation on FDM tasks and reports direct empirical results from model inferences plus expert panel judgments. No equations, derivations, parameter fitting, or self-citation chains appear in the abstract or described structure. Claims about model performance rankings rest on the constructed benchmark and external expert assessments rather than reducing to any internal inputs by construction. This is a standard empirical benchmark paper with no load-bearing self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that expert judgment is a valid proxy for response quality and that the chosen tasks adequately sample the FDM problem space; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Expert assessments provide a valid and consistent measure of LLM response quality in technical FDM domains.
    The paper relies on a panel of FDM experts to judge model outputs without reporting validation procedures such as inter-rater reliability.

pith-pipeline@v0.9.0 · 5896 in / 1406 out tokens · 32987 ms · 2026-05-23T06:54:10.370968+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
