FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks
Pith reviewed 2026-05-23 06:54 UTC · model grok-4.3
The pith
FDM-Bench tests large language models on fused deposition modeling tasks and finds closed-source models stronger at G-code anomaly detection while Llama-3.1-405B leads slightly on user queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FDM-Bench supplies user queries across experience levels together with G-code samples that contain a range of anomalies; when GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B, and Llama-3.1-405B are evaluated on these items, expert panel review shows closed-source models generally outperform open-source models in anomaly detection while Llama-3.1-405B holds a slight advantage on user-query responses.
What carries the argument
FDM-Bench, the dataset of experience-stratified user queries and anomalous G-code samples that serves as the evaluation substrate for LLM performance on FDM tasks.
If this is right
- LLMs can now be compared systematically on their ability to assist with FDM parameter setting and defect diagnosis.
- Closed-source models appear preferable for automated G-code inspection tasks.
- Larger open-source models may be competitive when the goal is conversational support for users of varying skill.
- FDM-Bench supplies a reusable test set that can track future model improvements in additive manufacturing.
Where Pith is reading between the lines
- The same benchmark format could be adapted to other additive manufacturing methods such as SLA or SLS to test cross-process generalization.
- If models improve on FDM-Bench, non-experts might begin using 3D printers with less formal training, changing who participates in small-scale manufacturing.
- Standardized scoring rubrics beyond expert panels could make the benchmark easier to run at scale and reduce subjectivity.
Load-bearing premise
The chosen user queries and G-code anomalies represent the full range of real-world FDM challenges and the expert panel assessments give a reliable, consistent measure of model quality.
What would settle it
A fresh collection of real-world FDM queries and G-code anomalies that produces different performance rankings among the four models, or expert raters whose scores disagree substantially with the original panel.
Figures
read the original abstract
Fused Deposition Modeling (FDM) is a widely used additive manufacturing (AM) technique valued for its flexibility and cost-efficiency, with applications in a variety of industries including healthcare and aerospace. Recent developments have made affordable FDM machines accessible and encouraged adoption among diverse users. However, the design, planning, and production process in FDM require specialized interdisciplinary knowledge. Managing the complex parameters and resolving print defects in FDM remain challenging. These technical complexities form the most critical barrier preventing individuals without technical backgrounds and even professional engineers without training in other domains from participating in AM design and manufacturing. Large Language Models (LLMs), with their advanced capabilities in text and code processing, offer the potential for addressing these challenges in FDM. However, existing research on LLM applications in this field is limited, typically focusing on specific use cases without providing comprehensive evaluations across multiple models and tasks. To this end, we introduce FDM-Bench, a benchmark dataset designed to evaluate LLMs on FDM-specific tasks. FDM-Bench enables a thorough assessment by including user queries across various experience levels and G-code samples that represent a range of anomalies. We evaluate two closed-source models (GPT-4o and Claude 3.5 Sonnet) and two open-source models (Llama-3.1-70B and Llama-3.1-405B) on FDM-Bench. A panel of FDM experts assess the models' responses to user queries in detail. Results indicate that closed-source models generally outperform open-source models in G-code anomaly detection, whereas Llama-3.1-405B demonstrates a slight advantage over other models in responding to user queries. These findings underscore FDM-Bench's potential as a foundational tool for advancing research on LLM capabilities in FDM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FDM-Bench, a benchmark dataset for evaluating LLMs on FDM-specific tasks in additive manufacturing. It includes user queries across experience levels and G-code samples representing various anomalies. Four models are evaluated (GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B, Llama-3.1-405B), with a panel of FDM experts assessing responses to user queries in detail. The central claims are that closed-source models generally outperform open-source models on G-code anomaly detection, while Llama-3.1-405B shows a slight advantage on user queries, positioning FDM-Bench as a foundational evaluation tool.
Significance. If the benchmark samples prove representative and the expert assessments reliable, the work fills a documented gap in comprehensive, multi-model, multi-task LLM evaluation for AM. It supplies an empirical resource that future studies can build upon for standardized comparisons, with explicit credit due for constructing a new task suite rather than relying on ad-hoc prompts.
major comments (2)
- [Evaluation / Results (panel assessment description)] The comparative claims rest on expert-panel judgments of model outputs, yet the manuscript provides no details on panel size, selection criteria, calibration procedure, inter-rater agreement (e.g., Fleiss' kappa), or disagreement-resolution protocol. This information is required to substantiate the headline rankings (closed-source superiority on anomaly detection; Llama-3.1-405B edge on queries).
- [Benchmark Construction / Dataset Description] The representativeness claim—that queries span experience levels and G-code anomalies cover a range of real-world defects—is asserted without quantitative justification, sampling methodology, or comparison against documented FDM defect distributions. This directly affects whether the reported model orderings generalize beyond the chosen samples.
minor comments (2)
- [Abstract] The abstract would benefit from stating the exact number of user queries and G-code samples evaluated, allowing readers to gauge the scale of the reported comparisons immediately.
- [Results] Clarify whether G-code anomaly detection was scored by the same expert panel or by an automated metric; the current separation of tasks in the abstract leaves the evaluation protocol ambiguous.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments identify important gaps in methodological transparency that affect the interpretability of our results. We respond to each major comment below.
read point-by-point responses
-
Referee: [Evaluation / Results (panel assessment description)] The comparative claims rest on expert-panel judgments of model outputs, yet the manuscript provides no details on panel size, selection criteria, calibration procedure, inter-rater agreement (e.g., Fleiss' kappa), or disagreement-resolution protocol. This information is required to substantiate the headline rankings (closed-source superiority on anomaly detection; Llama-3.1-405B edge on queries).
Authors: We agree that these details are required to substantiate the reliability of the expert assessments. The submitted manuscript does not provide them. In the revision we will add a dedicated subsection describing panel size, expert selection criteria and qualifications, any calibration steps, inter-rater agreement statistics (including Fleiss' kappa), and the disagreement-resolution protocol. revision: yes
-
Referee: [Benchmark Construction / Dataset Description] The representativeness claim—that queries span experience levels and G-code anomalies cover a range of real-world defects—is asserted without quantitative justification, sampling methodology, or comparison against documented FDM defect distributions. This directly affects whether the reported model orderings generalize beyond the chosen samples.
Authors: We acknowledge that the manuscript asserts representativeness without quantitative justification or explicit sampling methodology. In the revision we will expand the benchmark-construction section to document the selection process for queries and G-code samples and to include any available comparisons against documented FDM defect distributions from the literature; where quantitative comparisons are not feasible we will explicitly state this limitation. revision: yes
Circularity Check
No circularity: empirical benchmark evaluation without derivations or fitted inputs
full rationale
The paper introduces FDM-Bench as a new dataset for LLM evaluation on FDM tasks and reports direct empirical results from model inferences plus expert panel judgments. No equations, derivations, parameter fitting, or self-citation chains appear in the abstract or described structure. Claims about model performance rankings rest on the constructed benchmark and external expert assessments rather than reducing to any internal inputs by construction. This is a standard empirical benchmark paper with no load-bearing self-referential logic.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Expert assessments provide a valid and consistent measure of LLM response quality in technical FDM domains.
Forward citations
Cited by 1 Pith paper
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
Reference graph
Works this paper leans on
-
[1]
Zisopol, D.G., T ˘anase, M. and Portoac ˘a, A.I., 2023. Innovative strate- gies for technical-economical optimization of FDM production. Poly- mers, 15(18), p.3787
work page 2023
-
[2]
Sing, S.L., Tey, C.F., Tan, J.H.K., Huang, S. and Yeong, W.Y ., 2020. 3D printing of metals in rapid prototyping of biomaterials: Techniques in additive manufacturing. In Rapid prototyping of biomaterials (pp. 17-40). Woodhead Publishing
work page 2020
-
[3]
Aabith, S., Caulfield, R., Akhlaghi, O., Papadopoulou, A., Homer- Vanniasinkam, S. and Tiwari, M.K., 2022. 3D direct-write printing of water soluble micromoulds for high-resolution rapid prototyping. Addi- tive Manufacturing, 58, p.103019
work page 2022
-
[4]
Cleeman, J., Bogut, A., Mangrolia, B., Ripberger, A., Maghouli, A., Kate, K. and Malhotra, R., 2022, June. Multiplexed 3D Printing of Thermoplas- tics. In International Manufacturing Science and Engineering Conference (V ol. 85802, p. V001T01A004). American Society of Mechanical Engi- neers
work page 2022
-
[5]
and Fenollosa-Art ´es, F., 2021
Buj-Corral, I., Tejo-Otero, A. and Fenollosa-Art ´es, F., 2021. Use of FDM technology in healthcare applications: recent advances. Fused Deposition Modeling Based 3D Printing, pp.277-297
work page 2021
-
[6]
Cailleaux, S., Sanchez-Ballester, N.M., Gueche, Y .A., Bataille, B. and Soulairol, I., 2021. Fused Deposition Modeling (FDM), the new asset for the production of tailored medicines. Journal of controlled release, 330, pp.821-841
work page 2021
-
[7]
Kalender, M., Kılıc ¸, S.E., Ersoy, S., Bozkurt, Y . and Salman, S., 2019, June. Additive manufacturing and 3D printer technology in aerospace industry. In 2019 9th International Conference on Recent Advances in Space Technologies (RAST) (pp. 689-694). IEEE
work page 2019
-
[8]
Jeong, J., Park, H., Lee, Y ., Kang, J. and Chun, J., 2021. Developing parametric design fashion products using 3D printing technology. Fashion and Textiles, 8, pp.1-25
work page 2021
-
[9]
Tsegay, F., Ghannam, R., Daniel, N. and Butt, H., 2023. 3D printing smart eyeglass frames: a review. ACS Applied Engineering Materials, 1(4), pp.1142-1163
work page 2023
- [10]
-
[11]
Journal of Manufacturing Processes, 82, pp.319-335
Flash light assisted additive manufacturing of 3D structural elec- tronics (FLAME). Journal of Manufacturing Processes, 82, pp.319-335
-
[12]
Laplume, A., Anzalone, G.C. and Pearce, J.M., 2016. Open-source, self- replicating 3-D printer factory for small-business manufacturing. The In- ternational Journal of Advanced Manufacturing Technology, 85, pp.633- 642
work page 2016
-
[13]
Building research equipment with free, open-source hardware
Pearce, J.M., 2012. Building research equipment with free, open-source hardware. Science, 337(6100), pp.1303-1304
work page 2012
-
[14]
Haghshenas Gorgani, H., Korani, H., Jahedan, R. and Shabani, S., 2021. A nonlinear error compensator for FDM 3D printed part dimensions using a hybrid algorithm based on GMDH neural network. Journal of Compu- tational Applied Mechanics, 52(3), pp.451-477
work page 2021
-
[15]
Dey, A. and Yodo, N., 2019. A systematic survey of FDM process pa- rameter optimization and their influence on part characteristics. Journal of Manufacturing and Materials Processing, 3(3), p.64
work page 2019
-
[16]
Cleeman, J. and Malhotra, R., 2023, June. Highly Parsimonious Multi-Fidelity Learning of Process Parameter-Performance Relation- ships: A Case Study With Fused Filament Fabrication. In International Manufacturing Science and Engineering Conference (V ol. 87240, p. V002T06A031). American Society of Mechanical Engineers
work page 2023
-
[17]
and Gunasekaran, J.J.M.T.P., 2021
Solomon, I.J., Sevvel, P. and Gunasekaran, J.J.M.T.P., 2021. A review on the various processing parameters in FDM. Materials Today: Proceed- ings, 37, pp.509-514
work page 2021
-
[18]
Zharylkassyn, B., Perveen, A. and Talamona, D., 2021. E ffect of process parameters and materials on the dimensional accuracy of FDM parts. Ma- terials Today: Proceedings, 44, pp.1307-1311
work page 2021
-
[19]
Maurya, N.K., Maurya, M., Dwivedi, S.P., Srivastava, A.K., Saxena, A., Chahuan, S., Tiwari, A. and Mishra, A., 2021. Investigation of ef- fect of process variable on dimensional accuracy of FDM component us- ing response surface methodology. World Journal of Engineering, 18(5), pp.710-719
work page 2021
-
[20]
Ajjarapu, K.P.K., Mishra, R., Malhotra, R. and Kate, K.H., 2024. Map- ping 3D printed part density and filament flow characteristics in the mate- rial extrusion (MEX) process for filled and unfilled polymers. Virtual and Physical Prototyping, 19(1), p.e2331206
work page 2024
-
[21]
Hıra, O., Y ¨uceda˘g, S., Samankan, S., C ¸ ic ¸ek,¨O.Y . and Altınkaynak, A.,
-
[22]
Progress in Additive Manufacturing, pp.1-16
Numerical and experimental analysis of optimal nozzle dimensions for FDM printers. Progress in Additive Manufacturing, pp.1-16
-
[23]
Lei, M., Wei, Q., Li, M., Zhang, J., Yang, R. and Wang, Y ., 2022. Numeri- cal simulation and experimental study of the effects of process parameters on filament morphology and mechanical properties of FDM 3D printed PLA/GNPs nanocomposite. Polymers, 14(15), p.3081
work page 2022
-
[24]
Baechle-Clayton, M., Loos, E., Taheri, M. and Taheri, H., 2022. Failures and flaws in fused deposition modeling (FDM) additively manufactured polymers and composites. Journal of Composites Science, 6(7), p.202
work page 2022
-
[25]
Kantaros, A. and Piromalis, D., 2021. Employing a low-cost desktop 3D printer: Challenges, and how to overcome them by tuning key process parameters. International Journal of Mechanics and Applications, 10(1), pp.11-19
work page 2021
-
[26]
Hsiang Loh, G., Pei, E., Gonzalez-Gutierrez, J. and Monz ´on, M., 2020. An overview of material extrusion troubleshooting. Applied Sciences, 10(14), p.4776
work page 2020
-
[27]
Ni, A., Yin, P., Zhao, Y ., Riddell, M., Feng, T., Shen, R., Yin, S., Liu, Y ., Yavuz, S., Xiong, C., Joty, S., Zhou, Y ., Radev, D. and Cohan, A., 2024. L2CEval: Evaluating language-to-code generation capabilities of large language models. Transactions of the Association for Computational Lin- guistics, 12, pp.1311-1329
work page 2024
-
[28]
Devunuri, S., Qiam, S. and Lehe, L.J., 2024. ChatGPT for GTFS: Bench- marking LLMs on GTFS Semantics and Retrieval. *Public Transport*, pp.1-25
work page 2024
-
[29]
Kevian, D., Syed, U., Guo, X., Havens, A., Dullerud, G., Seiler, P., Qin, L. and Hu, B., 2024. Capabilities of large language models in control engineering: A benchmark study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra. *arXiv preprint arXiv:2404.03647*
-
[30]
Sriwastwa, A., Ravi, P., Emmert, A., Chokshi, S., Kondor, S., Dhal, K., Patel, P., Chepelev, L.L., Rybicki, F.J. and Gupta, R., 2023. Generative AI for medical 3D printing: a comparison of ChatGPT outputs to reference standard education. 3D Printing in Medicine, 9(1), p.21
work page 2023
-
[31]
Badini, S., Regondi, S., Frontoni, E. and Pugliese, R., 2023. Assess- ing the capabilities of ChatGPT to improve additive manufacturing trou- bleshooting. Advanced Industrial and Engineering Polymer Research, 6(3), pp.278-287
work page 2023
-
[32]
Jignasu, A., Marshall, K., Ganapathysubramanian, B., Balu, A., Hegde, C. and Krishnamurthy, A., 2023. Towards foundational AI models for additive manufacturing: Language models for G-code debugging, manip- ulation, and comprehension. arXiv preprint arXiv:2309.02465
-
[33]
OpenAI, 2024. GPT-4o. Available at: https://platform.openai. com/docs/models/gpt-4
work page 2024
-
[34]
Anthropic, 2024. Claude 3.5 Sonnet. Available at: https://www. anthropic.com/claude/sonnet
work page 2024
-
[35]
Llama 3.1: Open Foundation and Fine-Tuned Chat Mod- els
Meta AI, 2024. Llama 3.1: Open Foundation and Fine-Tuned Chat Mod- els. Available at: https://ai.facebook.com/blog/llama-3-1/
work page 2024
-
[36]
LoRA: Low-Rank Adaptation of Large Language Models
Hu, E.J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L. and Chen, W., 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[37]
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V ., Goyal, N., K¨uttler, H., Lewis, M., Yih, W.T., Rockt ¨aschel, T. and Riedel, S., 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Ad- vances in Neural Information Processing Systems, 33, pp.9459-9474
work page 2020
-
[38]
Chandrasekhar, A., Chan, J., Ogoke, F., Ajenifujah, O. and Farimani, A.B., 2024. AMGPT: a Large Language Model for Contextual Querying in Additive Manufacturing. *arXiv preprint arXiv:2406.00031*
-
[39]
Li, J., Li, D., Savarese, S. and Hoi, S., 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large lan- guage models. In *Proceedings of the 40th International Conference on Machine Learning* (V ol. 202, pp. 19730-19742). PMLR
work page 2023
-
[40]
Liu, H., Li, C., Wu, Q. and Lee, Y .J., 2024. Visual instruction tuning. In *Advances in Neural Information Processing Systems* (V ol. 36)
work page 2024
-
[41]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Zhu, D., Chen, J., Shen, X., Li, X. and Elhoseiny, M., 2023. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. *arXiv preprint arXiv:2304.10592*
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Jadhav, Y ., Pak, P. and Farimani, A.B., 2024. LLM-3D Print: Large Language Models To Monitor and Control 3D Printing. *arXiv preprint arXiv:2408.14307*
-
[43]
Arora, S., Narayan, A., Chen, M.F., Orr, L., Guha, N., Bhatia, K., Chami, I., Sala, F. and R ´e, C., 2022. Ask me anything: A simple strategy for 8 prompting language models. *arXiv preprint arXiv:2210.02441*
-
[44]
Bhargava, A., Witkowski, C., Looi, S.Z. and Thomson, M., 2023. What’s the Magic Word? A Control Theory of LLM Prompting. *arXiv preprint arXiv:2310.04444*
-
[45]
Language Models are Few-Shot Learners
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert- V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, ...
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[46]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V . and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. *Advances in Neural Information Processing Systems*, 35, pp.24824-24837
work page 2022
-
[47]
Yao, S., Yu, D., Zhao, J., Shafran, I., Gri ffiths, T., Cao, Y . and Narasimhan, K., 2024. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. *Advances in Neural Information Process- ing Systems*, 36. 9
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.