Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models
Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3
The pith
Large language models with a two-stage multi-agent architecture can automate frame structural analysis across ETABS, SAP2000, and OpenSees with over 90 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a two-stage multi-agent LLM architecture enables reliable automation of frame structural analysis across three distinct platforms. Stage 1 agents collaborate to interpret user text and produce a unified JSON representation containing all geometric, material, boundary, and load data needed for finite-element modeling. Stage 2 agents then convert this JSON, in parallel, into executable scripts tailored to ETABS, SAP2000, and OpenSees by following each platform's syntax rules and modeling workflow. Evaluation on twenty representative frame problems, each run in ten repeated trials, yields accuracy consistently exceeding 90 percent on every platform.
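The paper does not publish its JSON schema, so the shape of the intermediate representation can only be sketched. A minimal guess at what Stage 1 might emit for a one-bay, one-story 2D frame follows; all field names and units are illustrative assumptions, not the paper's actual format:

```python
import json

# Hypothetical unified representation for a one-bay, one-story 2D frame.
# Field names are assumptions; the paper does not publish its schema.
frame_model = {
    "units": {"length": "m", "force": "kN"},
    "nodes": [
        {"id": 1, "x": 0.0, "y": 0.0},
        {"id": 2, "x": 6.0, "y": 0.0},
        {"id": 3, "x": 0.0, "y": 3.5},
        {"id": 4, "x": 6.0, "y": 3.5},
    ],
    "materials": [{"id": 1, "name": "steel", "E": 2.0e8}],  # kN/m^2
    "sections": [{"id": 1, "A": 0.01, "I": 8.0e-5}],        # m^2, m^4
    "elements": [
        {"id": 1, "type": "column", "nodes": [1, 3], "section": 1},
        {"id": 2, "type": "column", "nodes": [2, 4], "section": 1},
        {"id": 3, "type": "beam", "nodes": [3, 4], "section": 1},
    ],
    "boundary": [{"node": 1, "fix": [1, 1, 1]}, {"node": 2, "fix": [1, 1, 1]}],
    "loads": [{"node": 3, "fx": 10.0, "fy": 0.0, "mz": 0.0}],
}

# The serialized file is what the Stage-2 agents would consume.
print(json.dumps(frame_model)[:40])
```

Whatever the real schema looks like, it has to carry the same four categories the abstract names: geometry, materials, boundary conditions, and loads.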
What carries the argument
The two-stage multi-agent architecture: Stage 1 agents perform collaborative structured reasoning to extract and compile modeling information into a unified JSON file, while Stage 2 agents translate that file in parallel into platform-specific executable scripts.
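Stage 2's job can be illustrated with a deterministic stand-in: a small translator from a unified model dict to OpenSees Tcl commands. In the paper this step is performed by a prompted LLM agent, not hand-written code, and the input field names here are assumptions:

```python
def json_to_opensees_tcl(model: dict) -> str:
    """Emit an OpenSees Tcl script for a 2D elastic frame from a unified
    model dict. A deterministic sketch of what a Stage-2 agent outputs;
    in the paper this translation is done by a prompted LLM, and the
    input field names are assumptions."""
    lines = ["model BasicBuilder -ndm 2 -ndf 3"]
    for n in model["nodes"]:
        lines.append(f"node {n['id']} {n['x']} {n['y']}")
    for b in model["boundary"]:
        lines.append("fix {} {}".format(b["node"], " ".join(map(str, b["fix"]))))
    lines.append("geomTransf Linear 1")  # one linear transform for all members
    E = model["materials"][0]["E"]
    sections = {s["id"]: s for s in model["sections"]}
    for e in model["elements"]:
        s = sections[e["section"]]
        i, j = e["nodes"]
        lines.append(
            f"element elasticBeamColumn {e['id']} {i} {j} {s['A']} {E} {s['I']} 1"
        )
    load_lines = "\n".join(
        f"  load {q['node']} {q['fx']} {q['fy']} {q['mz']}" for q in model["loads"]
    )
    lines.append("pattern Plain 1 Linear {\n" + load_lines + "\n}")
    return "\n".join(lines)

# A single cantilever column as a smoke test:
demo = {
    "nodes": [{"id": 1, "x": 0.0, "y": 0.0}, {"id": 2, "x": 0.0, "y": 3.0}],
    "boundary": [{"node": 1, "fix": [1, 1, 1]}],
    "materials": [{"id": 1, "E": 2.0e8}],
    "sections": [{"id": 1, "A": 0.01, "I": 8.0e-5}],
    "elements": [{"id": 1, "nodes": [1, 2], "section": 1}],
    "loads": [{"node": 2, "fx": 10.0, "fy": 0.0, "mz": 0.0}],
}
script = json_to_opensees_tcl(demo)
print(script.splitlines()[0])  # model BasicBuilder -ndm 2 -ndf 3
```

An LLM agent is needed precisely where a translator like this breaks: ETABS and SAP2000 are driven through APIs rather than Tcl, and each platform's modeling workflow differs, which is why Stage 2 prompts each agent with its target's syntax rules.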
If this is right
- Engineers can generate correct analysis models for different platforms from a single natural-language description.
- The workflow removes the need to learn and write separate scripts for each finite-element tool.
- Repeated high accuracy on standard frame problems indicates the system can be deployed for routine modeling tasks.
- Companies using mixed software environments can adopt one LLM-assisted workflow instead of maintaining separate processes.
Where Pith is reading between the lines
- The same two-stage pattern could be tested on nonlinear or dynamic analysis problems once the basic frame capability is established.
- Connecting the JSON intermediate representation to parametric design tools might allow automatic iteration over geometry or load variations.
- Performance on incomplete or noisy real-project text would reveal whether additional clarification agents or external data lookup steps are required.
Load-bearing premise
The assumption that LLM agents can reliably extract complete and accurate geometric, material, boundary, and load information from arbitrary user text and translate it without syntax or modeling errors for real-world project inputs.
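One way to probe this premise is a completeness check on the extracted representation before any script generation. The paper does not describe such a validation step; this sketch, with assumed field names, only illustrates what "complete" could mean operationally:

```python
# Required keys per section of the (assumed) unified representation.
REQUIRED = {
    "nodes": ("id", "x", "y"),
    "materials": ("id", "E"),
    "elements": ("id", "nodes", "section"),
    "boundary": ("node", "fix"),
    "loads": ("node",),
}

def missing_fields(model: dict) -> list:
    """Return human-readable gaps in an extracted model dict.

    Illustrative completeness check only; the field names are assumptions
    and the paper does not describe its validation step.
    """
    gaps = []
    for section, keys in REQUIRED.items():
        items = model.get(section)
        if not items:
            gaps.append(f"{section}: missing or empty")
            continue
        for idx, item in enumerate(items):
            for k in keys:
                if k not in item:
                    gaps.append(f"{section}[{idx}]: no '{k}'")
    return gaps

# A description that omits everything but geometry fails the check:
print(missing_fields({"nodes": [{"id": 1, "x": 0, "y": 0}]}))
```

A check like this catches omissions, but not misreadings: a load attached to the wrong node passes validation, which is why downstream execution and inspection remain necessary.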
What would settle it
Running the system on frame descriptions drawn from actual engineering projects that contain ambiguous wording, omitted details, or non-standard elements and checking whether the generated scripts produce correct models or fail with errors.
Original abstract
Recent advances in large language models (LLMs) have shown the promise to significantly accelerate the workflow by automating structural modeling and analysis. However, existing studies primarily focus on enabling LLMs to operate a single structural analysis software platform. In practice, structural engineers often rely on multiple finite element analysis (FEA) tools, such as ETABS, SAP2000, and OpenSees, depending on project needs, user preferences, and company constraints. This limitation restricts the practical deployment of LLM-assisted engineering workflows. To address this gap, this study develops LLMs capable of automating frame structural analysis across multiple software platforms. The LLMs adopt a two-stage multi-agent architecture. In Stage 1, a cohort of agents collaboratively interpret user input and perform structured reasoning to infer geometric, material, boundary, and load information required for finite element modeling. The outputs of these agents are compiled into a unified JSON representation. In Stage 2, code translation agents operate in parallel to convert the JSON file into executable scripts across multiple structural analysis platforms. Each agent is prompted with the syntax rules and modeling workflows of its target software. The LLMs are evaluated using 20 representative frame problems across three widely used platforms: ETABS, SAP2000, and OpenSees. Results from ten repeated trials demonstrate consistently reliable performance, achieving accuracy exceeding 90% across all cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a two-stage multi-agent LLM architecture to automate frame structural analysis across ETABS, SAP2000, and OpenSees. Stage 1 employs collaborative agents to interpret user text and compile geometric, material, boundary, and load data into a unified JSON representation. Stage 2 uses parallel code-translation agents, each prompted with target-platform syntax, to generate executable scripts from the JSON. The system is evaluated on 20 representative frame problems using ten repeated trials per case, with the abstract reporting accuracy exceeding 90% across all platforms.
Significance. If the accuracy metric proves to be a well-defined measure of modeling fidelity (including cross-platform result equivalence within engineering tolerances) and the approach generalizes beyond the tested cases, the work could meaningfully advance practical multi-platform automation in structural engineering by reducing reliance on manual scripting. The unified JSON intermediate representation and use of repeated trials provide a reproducible empirical foundation that strengthens the reported performance claims.
Major comments (2)
- [Abstract] Abstract: The headline claim of 'consistently reliable performance, achieving accuracy exceeding 90%' lacks any definition of the accuracy metric. It is unclear whether accuracy refers to syntax-valid script generation, complete and correct extraction of all model parameters from arbitrary text, or equivalence of downstream analysis outputs (e.g., nodal displacements or member forces within tolerance) across platforms. This definition is load-bearing for the central claim of practical automation.
- [Evaluation] Evaluation description: The tests are confined to 20 'representative' frame problems with no reported selection criteria, coverage of input variability (incomplete descriptions, ambiguous phrasing, non-standard units), quantitative error breakdown by failure mode, or verification that generated models produce equivalent engineering results across ETABS/SAP2000/OpenSees. Without these, the generalization to 'real-world project inputs' does not follow from the reported trials.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and limitations of our work. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Abstract] Abstract: The headline claim of 'consistently reliable performance, achieving accuracy exceeding 90%' lacks any definition of the accuracy metric. It is unclear whether accuracy refers to syntax-valid script generation, complete and correct extraction of all model parameters from arbitrary text, or equivalence of downstream analysis outputs (e.g., nodal displacements or member forces within tolerance) across platforms. This definition is load-bearing for the central claim of practical automation.
Authors: We agree that a clear definition of the accuracy metric is essential in the abstract. In our evaluation, accuracy is measured as the success rate across ten trials per problem, where a trial counts as a success only if the generated script is syntactically correct, executes in the target software, and produces a model that reflects all specified geometric, material, boundary, and load parameters from the input description. This was verified through code inspection and simulation runs. We did not extend the comparison to downstream analysis results, such as displacements, across platforms in this study. We will revise the abstract to state this definition explicitly, along with the scope of the verification performed. Revision: yes.
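Under the definition the authors give here, per-platform accuracy reduces to a pooled success rate over problems and trials. A minimal sketch; the nesting of the results structure is an assumption about how the reported figures could be tallied:

```python
def platform_accuracy(results: dict) -> dict:
    """Per-platform success rate pooled over problems and trials.

    `results[platform][problem]` is a list of per-trial booleans; a trial
    succeeds only if the script executes and reproduces every specified
    geometric, material, boundary, and load parameter. The structure is
    an illustrative assumption, not the paper's evaluation harness.
    """
    return {
        platform: sum(sum(trials) for trials in probs)
        / sum(len(trials) for trials in probs)
        for platform, probs in results.items()
    }

# Two problems on one platform, ten trials each, one failure overall:
demo = {"ETABS": [[True] * 9 + [False], [True] * 10]}
print(platform_accuracy(demo))  # {'ETABS': 0.95}
```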
Referee: [Evaluation] Evaluation description: The tests are confined to 20 'representative' frame problems with no reported selection criteria, coverage of input variability (incomplete descriptions, ambiguous phrasing, non-standard units), quantitative error breakdown by failure mode, or verification that generated models produce equivalent engineering results across ETABS/SAP2000/OpenSees. Without these, the generalization to 'real-world project inputs' does not follow from the reported trials.
Authors: The 20 problems were chosen as representative examples covering simple to moderately complex frame structures with varying numbers of bays and stories, as described in the evaluation section. We acknowledge that the current manuscript lacks explicit selection criteria and does not cover input variability such as ambiguous phrasing or non-standard units. We will add a detailed description of the problem-selection process, include examples of the input variability tested, and provide a quantitative breakdown of the error types observed in failed trials. Regarding cross-platform result equivalence, our current focus was the correctness of the modeling process rather than post-analysis outputs; we will note this as a limitation and suggest it as a future extension. Revision: partial.
Circularity Check
No circularity; empirical evaluation on independent tests
Full rationale
The paper presents a procedural description of a two-stage multi-agent LLM system for cross-platform structural modeling, followed by direct empirical testing on 20 representative frame problems with repeated trials and reported accuracy rates. No equations, fitted parameters, predictions derived from inputs, self-definitional constructs, or load-bearing self-citations appear in the derivation or evaluation chain. Results are measured against external test cases rather than reducing to the method's own definitions or prior outputs by construction, making the central claims self-contained.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLM agents can accurately infer complete structural model parameters from natural-language user input.