(How) Do Large Language Models Understand High-Level Message Sequence Charts?
Pith reviewed 2026-05-15 05:45 UTC · model grok-4.3
The pith
Large language models achieve only about 52% accuracy when tested on the formal semantics of High-Level Message Sequence Charts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate through systematic testing that LLMs possess only a modest understanding of the formal semantics of High-Level Message Sequence Charts, attaining approximately 52% accuracy across 129 tasks. The models perform well on basic constructs such as events and their ordering, reaching about 88% accuracy, but falter on semantic-preserving abstractions and compositions (about 36%) and on traces and trace-equivalent labelled transition systems (about 42%). In particular, none of the models made use of co-regions or explicit causal dependencies in semantic-preserving transformations.
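To make the weakest task families concrete: the traces of an MSC are the linear extensions of its event partial order (per-instance ordering plus send-before-receive causality). The Python sketch below uses a toy encoding of our own, not the paper's task format, to enumerate the traces of a two-message chart; note how dropping one receive-order constraint, i.e. declaring a co-region, changes the answer.

```python
# Enumerate the traces of a small MSC as the linear extensions of its event
# partial order. Toy encoding for illustration; the paper's input format is
# not specified in this review.
from itertools import permutations

# Two-message MSC: process P sends m1 then m2 to process Q.
# "!m" is the send event of message m, "?m" its receive event.
events = ["!m1", "?m1", "!m2", "?m2"]

# Ordering constraints (a, b) meaning "a happens before b":
before = {
    ("!m1", "?m1"), ("!m2", "?m2"),  # each send precedes its receive
    ("!m1", "!m2"),                  # P's instance axis: m1 sent before m2
    ("?m1", "?m2"),                  # Q's instance axis: m1 received before m2
}

def is_trace(order):
    """A permutation of the events is a trace iff it respects every constraint."""
    pos = {e: i for i, e in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in before)

for trace in filter(is_trace, permutations(events)):
    print(" ".join(trace))
# Prints two traces: "!m1 ?m1 !m2 ?m2" and "!m1 !m2 ?m1 ?m2".
# Removing ("?m1", "?m2") -- a co-region on Q -- admits a third trace,
# "!m1 !m2 ?m2 ?m1". This is exactly the construct the models never used.
```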
What carries the argument
A benchmark of 129 semantic tasks applied to three LLMs to measure accuracy on HMSC concepts from basic ordering to abstraction, composition, traces, and LTS equivalence.
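The review does not reproduce the benchmark harness, but the setup implies one of roughly the following shape. Everything here (the task schema, `ask_model`, exact-match scoring) is a hypothetical sketch, not the authors' code.

```python
# Hypothetical benchmark harness: run each semantic task through each model,
# score against ground truth, and aggregate accuracy per (model, category).
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Task:
    category: str   # e.g. "ordering", "abstraction_composition", "traces_lts"
    prompt: str     # the HMSC plus the semantic question
    expected: str   # ground-truth answer derived from the formal semantics

def ask_model(model: str, prompt: str) -> str:
    """Placeholder for an API call to one of the evaluated LLMs."""
    raise NotImplementedError

def evaluate(models: list[str], tasks: list[Task]) -> dict:
    correct, total = defaultdict(int), defaultdict(int)
    for model in models:
        for task in tasks:
            answer = ask_model(model, task.prompt)
            key = (model, task.category)
            total[key] += 1
            # Naive exact match; real semantic tasks would need a checker
            # that accepts any answer equivalent under the formal semantics.
            correct[key] += answer.strip() == task.expected.strip()
    return {key: correct[key] / total[key] for key in total}
```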
If this is right
- LLMs can handle basic event and ordering queries in HMSCs with reasonable reliability but cannot be trusted alone for abstraction, composition, or trace calculations.
- Tools that embed LLMs in architectural design with HMSCs or UML diagrams require separate verification steps for any semantic manipulation.
- The complete absence of co-region and causal-dependency usage points to a specific gap that training data or prompting must address.
- Selective deployment of LLMs is warranted: they suit simple queries but need safeguards on tasks where accuracy falls below 40%.
- Widespread use of LLMs for HMSC-based specifications risks producing designs that deviate from formal semantics unless checked by dedicated verifiers.
Where Pith is reading between the lines
- Applying the same task battery to other formalisms such as state machines could show whether modest semantic understanding is common across visual modeling languages.
- Explicit prompting that forces consideration of co-regions and causal links might narrow the performance gap on abstraction and trace tasks.
- Hybrid pipelines that pair LLM output with automated semantic checkers for HMSCs would mitigate the observed weaknesses in practice (a minimal sketch follows this list).
- Scaling the task set to larger HMSCs would test whether accuracy declines further as diagram complexity increases.
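A minimal sketch of the hybrid pipeline flagged in the third point above: the LLM proposes a rewrite, and a deterministic checker accepts it only if trace equivalence holds. Both `trace_set` and `llm_transform` are placeholders; for finite HMSC fragments, `trace_set` could be the linear-extension enumeration sketched earlier.

```python
# Gate LLM-proposed HMSC transformations behind an automated semantic check:
# a rewrite is accepted only if it provably preserves the trace semantics.
def trace_set(hmsc) -> frozenset:
    """Compute the (finite) set of traces of an HMSC; implementation elided."""
    raise NotImplementedError

def llm_transform(hmsc, instruction: str):
    """Placeholder for asking an LLM to rewrite the HMSC."""
    raise NotImplementedError

def checked_transform(hmsc, instruction: str, retries: int = 3):
    reference = trace_set(hmsc)
    for _ in range(retries):
        candidate = llm_transform(hmsc, instruction)
        if trace_set(candidate) == reference:
            return candidate  # verified semantics-preserving
    raise ValueError("no verified rewrite found; fall back to manual review")
```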
Load-bearing premise
The 129 tasks form a representative sample of HMSC formal semantics, and the LLMs' answers reflect genuine semantic understanding rather than surface pattern matching.
What would settle it
A new or expanded collection of HMSC tasks, balanced across all semantic categories, that yields overall accuracy markedly higher or lower than 52%, or that changes the observed gaps between basic and advanced tasks.
Original abstract
Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs "understand" the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates whether LLMs understand the formal semantics of High-Level Message Sequence Charts (HMSCs) by evaluating three models (Gemini-3, GPT-5.4, Qwen-3.6) on 129 tasks spanning basic semantic constructs (events, ordering), semantic-preserving abstractions/compositions, and computation of traces and trace-equivalent labelled transition systems. It reports ~52% overall accuracy, with 88% on basics but only 36% on abstraction/composition and 42% on traces/LTS, and notes particular failures on co-regions and explicit causal dependencies.
Significance. If substantiated, the concrete accuracy figures across task categories would provide useful empirical data on LLM limitations for formal semantic reasoning in architectural models, with direct relevance to UML Sequence Diagrams and automated software design tasks. The work is a straightforward benchmarking study with no fitted parameters or derivations, and its value lies in the reported performance gaps rather than any theoretical advance.
Major comments (3)
- [Experimental design / task definition] The selection and coverage of the 129 tasks is not justified by any mapping to the formal semantics of HMSCs or by a documented generation protocol. Without this, the central claim that LLMs exhibit only 'modest understanding' (abstract) cannot be evaluated, as the accuracy numbers may reflect sampling bias rather than representative coverage of constructs such as co-regions and causal dependencies.
- [Methodology and results analysis] No controls are described to distinguish semantic understanding from surface-level pattern matching or training-data recall (e.g., adversarial rephrasings, non-semantic baselines, or variants that preserve syntax but alter semantics); one way to construct such variants is sketched after these comments. This directly undermines the interpretation of the 36% abstraction/composition and 42% traces/LTS accuracies as evidence of limited semantic reasoning.
- [Results section] Accuracy is reported as aggregate percentages per category without per-task error analysis, statistical tests for differences between categories, or details on scoring criteria and inter-rater reliability for LLM outputs. This makes the reported variability (88% vs. 36%) difficult to interpret as a robust finding.
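One way the controls asked for in the second major comment could be constructed, using the toy constraint encoding from the trace sketch above (our illustration, not a protocol from the paper): a consistent renaming preserves the trace structure, while reversing one instance-axis constraint alters it, so a model that merely pattern-matches should behave the same on both variant types while a model tracking semantics should not.

```python
# Build control variants of a task: (a) semantics-preserving relabelling,
# (b) syntax-preserving but semantics-altering reordering. Hypothetical
# protocol; the paper documents no such controls.
def rename_messages(constraints, mapping):
    """Consistently relabel message names; the trace structure is unchanged."""
    sub = lambda e: e[0] + mapping.get(e[1:], e[1:])  # keep the !/? prefix
    return {(sub(a), sub(b)) for a, b in constraints}

def flip_constraint(constraints, a, b):
    """Reverse one ordering constraint; the trace set changes."""
    assert (a, b) in constraints
    return (constraints - {(a, b)}) | {(b, a)}

base = {("!m1", "?m1"), ("!m2", "?m2"), ("!m1", "!m2"), ("?m1", "?m2")}
renamed = rename_messages(base, {"m1": "req", "m2": "ack"})  # same semantics
altered = flip_constraint(base, "?m1", "?m2")                # different traces
```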
Minor comments (2)
- [Experimental setup] Clarify the exact model versions and access dates (Gemini-3, GPT-5.4, Qwen-3.6 appear non-standard) and whether temperature or other generation parameters were fixed across runs.
- [Appendix or supplementary material] Add a table or appendix listing representative task examples from each category to allow readers to assess difficulty and coverage.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will make to improve the manuscript's rigor and clarity.
Point-by-point responses
Referee: [Experimental design / task definition] The selection and coverage of the 129 tasks is not justified by any mapping to the formal semantics of HMSCs or by a documented generation protocol. Without this, the central claim that LLMs exhibit only 'modest understanding' (abstract) cannot be evaluated, as the accuracy numbers may reflect sampling bias rather than representative coverage of constructs such as co-regions and causal dependencies.
Authors: We appreciate the referee's point on experimental design. The 129 tasks were systematically derived from the standard formal semantics of HMSCs (covering events, ordering, co-regions, causal dependencies, abstractions, compositions, traces, and LTS equivalences), but we acknowledge that the manuscript lacks an explicit mapping table or generation protocol. In the revised version, we will add a new subsection (Section 3.2) with a detailed breakdown mapping each task category to specific semantic constructs, along with the generation protocol and rationale for task distribution to demonstrate representative coverage and mitigate sampling bias concerns. revision: yes
Referee: [Methodology and results analysis] No controls are described to distinguish semantic understanding from surface-level pattern matching or training-data recall (e.g., adversarial rephrasings, non-semantic baselines, or variants that preserve syntax but alter semantics). This directly undermines the interpretation of the 36% abstraction/composition and 42% traces/LTS accuracies as evidence of limited semantic reasoning.
Authors: This is a valid methodological concern. While tasks such as full trace enumeration and LTS computation inherently demand systematic application of the semantics and are unlikely to be solved by surface pattern matching alone, we agree that explicit controls are needed for stronger claims. In revision, we will add a dedicated discussion of potential confounds and include baseline results (e.g., on syntax-preserving but semantically altered variants) in an appendix to show performance drops, thereby supporting the interpretation of the reported accuracies as evidence of limited semantic reasoning. revision: partial
Referee: [Results section] Accuracy is reported as aggregate percentages per category without per-task error analysis, statistical tests for differences between categories, or details on scoring criteria and inter-rater reliability for LLM outputs. This makes the reported variability (88% vs. 36%) difficult to interpret as a robust finding.
Authors: We agree that aggregate reporting limits robustness. In the revised Results section, we will include per-task accuracy breakdowns (in an appendix table), statistical tests (e.g., chi-squared) confirming significant differences across categories, and expanded details on scoring criteria (manual verification against formal semantics, with a sample double-checked by a second author and inter-rater agreement reported). These changes will make the observed variability (e.g., 88% vs. 36%) more interpretable as a reliable finding. revision: yes
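To illustrate the chi-squared test proposed above: the per-category counts below are placeholders chosen only to be consistent with the reported figures (129 tasks; roughly 88%, 36%, and 42% category accuracies; about 52% overall), since the review gives no per-category task counts.

```python
# Test whether accuracy differs significantly across semantic categories.
# Counts are illustrative placeholders, not data from the paper.
from scipy.stats import chi2_contingency

# Rows: semantic category; columns: [correct, incorrect].
counts = [
    [29, 4],   # basic events/ordering     (29/33 ~ 88%)
    [17, 30],  # abstraction/composition   (17/47 ~ 36%)
    [21, 28],  # traces/LTS                (21/49 ~ 43%)
]
chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.2g}")
# A small p-value licenses treating the cross-category gap as significant.
```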
Circularity Check
No circularity: purely empirical benchmarking with direct result reporting
Full rationale
The paper performs an empirical evaluation by running three LLMs on a fixed set of 129 semantic tasks for HMSCs and reports accuracy figures (overall ~52%, with breakdowns by category) scored directly against ground-truth semantics. It contains no derivations, equations, fitted parameters, or predictions that could reduce to their inputs by construction, and no self-citation chains are invoked to justify uniqueness or force results. The study is self-contained, comparing model responses against the formal semantics, and so satisfies the criteria for a non-circular empirical paper.