pith · machine review for the scientific record

arxiv: 2604.04009 · v1 · submitted 2026-04-05 · 💻 cs.SE

Recognition: no theorem link

Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:11 UTC · model grok-4.3

classification 💻 cs.SE
keywords SADU benchmark · software architecture diagrams · vision-language models · diagram understanding · VLMs · software engineering · visual relation grounding · benchmark evaluation

The pith

Vision-language models top out at roughly 70 percent accuracy on software architecture diagram tasks in a new benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Software architecture diagrams serve as key artifacts for communicating system structure, behavior, and data throughout development, yet vision-language models have received little targeted testing on them. The paper introduces the SADU benchmark, built from 154 curated diagrams across behavioral, structural, and ER types together with 2,431 structured question-answer pairs focused on counting and retrieval. Evaluation of eleven current VLMs shows the strongest result at 70.18 percent accuracy for gemini-3-flash-preview and only 17.77 percent for gpt-4o-mini. The performance gap traces to specific shortfalls in diagram reasoning and visual relation grounding. This establishes a concrete baseline for measuring progress toward diagram-aware systems that can support design-stage engineering work.

Core claim

The paper presents SADU as a benchmark containing 154 carefully curated software architecture diagrams of behavioral, structural, and ER types, each paired with structured annotations and 2,431 question-answer tasks focused on counting and retrieval reasoning. Evaluation across 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families demonstrates that even the top performer, gemini-3-flash-preview, attains only 70.18 percent accuracy while gpt-4o-mini reaches just 17.77 percent, exposing clear limitations in diagram reasoning and visual relation grounding.

What carries the argument

The SADU benchmark of 154 diagrams and 2,431 counting and retrieval questions that probes VLMs on structured software engineering artifacts rather than generic images.
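
To make the machinery concrete, here is a minimal sketch of what a SADU-style task and an exact-match accuracy metric could look like. The field names (diagram_path, question_type, gold_answer) and the normalization rule are illustrative assumptions, not the paper's released schema or scoring protocol.

```python
from dataclasses import dataclass

@dataclass
class QAItem:
    # Hypothetical fields for one SADU-style task; the released format may differ.
    diagram_path: str   # rendered architecture diagram image
    diagram_type: str   # "behavioral", "structural", or "ER"
    question_type: str  # "counting" or "retrieval"
    question: str
    gold_answer: str

def normalize(answer: str) -> str:
    """Light answer normalization before comparison (illustrative only)."""
    return answer.strip().lower()

def accuracy(items: list[QAItem], predictions: list[str]) -> float:
    """Fraction of questions where the model's answer matches the gold answer."""
    correct = sum(normalize(pred) == normalize(item.gold_answer)
                  for item, pred in zip(items, predictions))
    return correct / len(items)
```

Under scoring of this kind, roughly 1,706 correct answers out of the 2,431 questions would correspond to the reported 70.18 percent top result.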

If this is right

  • Software architecture diagram understanding remains challenging for all tested state-of-the-art VLMs.
  • Models show particular weaknesses in visual relation grounding and structured diagram reasoning.
  • SADU supplies a repeatable test for tracking improvements in diagram-aware AI systems.
  • Current performance levels fall short of the reliability needed for faithful AI assistance in design-stage workflows.
  • Progress on SADU tasks would directly support more consistent AI use across the software development lifecycle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • SADU-style questions could be added to VLM pre-training mixtures to reduce the domain gap between general images and engineering diagrams.
  • Hybrid pipelines that combine diagram parsing with code analysis might compensate for the specific weaknesses identified here; a minimal sketch follows this list.
  • Extending the benchmark to include sequence or deployment diagrams would test whether the same relation-grounding shortfalls appear in other standard software notations.
  • Low scores on ER diagrams suggest that data-modeling understanding may lag behind structural understanding in current VLMs.
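
As a sketch of the hybrid-pipeline idea above (the diagram-parsing half only): when a diagram also exists in a textual source format such as Mermaid, counting and retrieval questions can be answered structurally from the parsed graph rather than from pixels. The flowchart and the regex-based edge extraction below are toy illustrations, not a full Mermaid parser.

```python
import re
from collections import defaultdict

def parse_mermaid_edges(source: str) -> dict[str, set[str]]:
    """Toy extraction of 'A --> B' edges from a Mermaid flowchart body."""
    graph: dict[str, set[str]] = defaultdict(set)
    for src, dst in re.findall(r"(\w+)\s*-->\s*(\w+)", source):
        graph[src].add(dst)
    return graph

def count_components(graph: dict[str, set[str]]) -> int:
    """Counting-style question: how many components appear in the diagram?"""
    nodes = set(graph)
    for targets in graph.values():
        nodes |= targets
    return len(nodes)

def outgoing_relations(graph: dict[str, set[str]], node: str) -> set[str]:
    """Retrieval-style question: which components does a given component call?"""
    return graph.get(node, set())

diagram = """
flowchart LR
    Client --> Gateway
    Gateway --> OrderService
    Gateway --> PaymentService
    OrderService --> Database
"""
graph = parse_mermaid_edges(diagram)
print(count_components(graph))               # 5 components
print(outgoing_relations(graph, "Gateway"))  # the two services behind the gateway
```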

Load-bearing premise

The 154 curated diagrams and 2,431 questions sufficiently represent the range of real-world software architecture diagrams and the reasoning skills needed by practicing engineers.

What would settle it

A VLM achieving over 90 percent accuracy on the full SADU set after training only on general image-text data would indicate that the observed limitations are not inherent to current architectures.

Figures

Figures reproduced from arXiv: 2604.04009 by Adam Ziolkowski, Beum Seuk Lee, Botond Virginas, Gunel Jahangirova, Jack Johns, Jie M. Zhang, Jingzhi Gong, Joost Noppen, Mohammad Reza Mousavi, Shuyin Ouyang.

Figure 1: Overview of the SADU construction pipeline.
Figure 2: Example prompt used for requesting VLMs.
Figure 4: RQ1: Model accuracy across diagram complexity.
Figure 3: RQ1: Performance across question subtypes.
Original abstract

Software architecture diagrams are important design artifacts for communicating system structure, behavior, and data organization throughout the software development lifecycle. Although recent progress in large language models has substantially advanced code-centric software engineering tasks such as code generation, testing, and maintenance, the ability of modern vision-language models (VLMs) to understand software architecture diagrams remains underexplored. To address this gap, we present SADU, a benchmark for Software Architecture Diagram Understanding that evaluates VLMs on architecture diagrams as structured software engineering artifacts rather than generic images. SADU contains 154 carefully curated diagrams spanning behavioral, structural, and ER diagrams, paired with structured annotations and 2,431 question-answer tasks covering counting and retrieval reasoning. We evaluate 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families. Our results show that software architecture diagram understanding remains challenging for current models: the best-performing model gemini-3-flash-preview achieves only 70.18% accuracy, while gpt-4o-mini only achieves 17.77% accuracy. The results further reveal the weaknesses in diagram reasoning and visual relation grounding, highlighting a gap between current VLMs and the needs of design-stage software engineering. SADU provides a foundation for future research on diagram-aware AI systems and more faithful AI-assisted software engineering workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SADU, a benchmark for Software Architecture Diagram Understanding consisting of 154 carefully curated diagrams spanning behavioral, structural, and ER types, paired with 2,431 question-answer tasks focused on counting and retrieval reasoning. It evaluates 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families, reporting that the best model (Gemini-3-flash-preview) reaches only 70.18% accuracy while GPT-4o-mini reaches 17.77%, and concludes that current VLMs exhibit weaknesses in diagram reasoning and visual relation grounding, highlighting a gap for design-stage software engineering.

Significance. If the benchmark holds, the work is significant for providing the first dedicated empirical evaluation of VLMs on software architecture diagrams as structured SE artifacts rather than generic images. The multi-model comparison across 11 systems with no post-hoc exclusions or parameter fitting supplies a reproducible baseline that can guide future diagram-aware AI development in software engineering.

major comments (2)
  1. [§3 (Dataset Construction)] The claim that the 154 diagrams sufficiently represent real-world software architecture tasks rests on the description of them as 'carefully curated' spanning behavioral/structural/ER types, but the section provides no quantitative validation such as distributions of element counts, relation density, notation variants, or direct comparison to external corpora (e.g., open-source GitHub repositories). This directly affects the generalizability of the reported performance ceiling and the identified weaknesses in visual relation grounding.
  2. [§5 (Results)] The interpretation that low accuracies reveal 'weaknesses in diagram reasoning and visual relation grounding' is plausible from the aggregate numbers, but the section lacks a per-question-type or per-diagram-type error breakdown that would isolate whether failures stem from visual grounding, counting, or relation extraction; without this, the precise nature of the gap remains underspecified.
minor comments (2)
  1. [Abstract and §4] The mention of 'structured annotations' should be expanded with a brief description of their format (e.g., whether they include explicit relation graphs or bounding boxes) to allow readers to assess how the 2,431 tasks were derived.
  2. [Results tables] Adding 95% confidence intervals or standard deviations to the reported accuracy percentages would strengthen statistical interpretation of the model comparisons.
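
On the second minor comment, here is a minimal sketch of attaching 95 percent uncertainty to the reported accuracies via a Wilson score interval for a binomial proportion. The 2,431-question denominator comes from the paper; the choice of interval, and the assumption that questions are independent, are editorial.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion such as accuracy."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half

n = 2431  # number of SADU question-answer tasks
for name, acc in [("gemini-3-flash-preview", 0.7018), ("gpt-4o-mini", 0.1777)]:
    lo, hi = wilson_interval(round(acc * n), n)
    print(f"{name}: {acc:.2%} (95% CI {lo:.2%}-{hi:.2%})")
```

Because many questions share a diagram and are therefore correlated, a diagram-level bootstrap would be a more faithful, if costlier, alternative to the independence assumption made here.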

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements to strengthen the manuscript.

Point-by-point responses
  1. Referee: §3 (Dataset Construction): The claim that the 154 diagrams sufficiently represent real-world software architecture tasks rests on the description of them as 'carefully curated' spanning behavioral/structural/ER types, but the section provides no quantitative validation such as distributions of element counts, relation density, notation variants, or direct comparison to external corpora (e.g., open-source GitHub repositories). This directly affects the generalizability of the reported performance ceiling and the identified weaknesses in visual relation grounding.

    Authors: We agree that quantitative validation would improve the characterization of the benchmark's representativeness. In the revised manuscript, we will expand §3 with a new table and accompanying text providing summary statistics on element counts (nodes, edges, labels), relation densities, and notation variants across the 154 diagrams. Where feasible, we will also include a brief comparison to a sampled set of architecture diagrams from public GitHub repositories to support generalizability claims. revision: yes

  2. Referee: §5 (Results): The interpretation that low accuracies reveal 'weaknesses in diagram reasoning and visual relation grounding' is plausible from the aggregate numbers, but the section lacks a per-question-type or per-diagram-type error breakdown that would isolate whether failures stem from visual grounding, counting, or relation extraction; without this, the precise nature of the gap remains underspecified.

    Authors: We agree that a finer-grained error analysis would better isolate the sources of model failures. In the revision, we will augment §5 with per-question-type accuracy breakdowns (counting vs. retrieval) and per-diagram-type results (behavioral, structural, ER). We will also add a short qualitative subsection with representative error examples to clarify whether issues arise primarily from visual grounding, counting, or relation extraction. revision: yes
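
The per-question-type and per-diagram-type breakdowns promised above reduce to grouped accuracy computations. A minimal sketch, assuming per-item result records with hypothetical question_type and diagram_type fields rather than the paper's released format:

```python
from collections import defaultdict

def grouped_accuracy(records: list[dict], key: str) -> dict[str, float]:
    """Accuracy per group, e.g. key='question_type' or key='diagram_type'."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["gold_answer"])
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical per-question results for one model; in practice, 2,431 per model.
records = [
    {"question_type": "counting", "diagram_type": "ER",
     "prediction": "4", "gold_answer": "4"},
    {"question_type": "retrieval", "diagram_type": "structural",
     "prediction": "OrderService", "gold_answer": "PaymentService"},
]
print(grouped_accuracy(records, "question_type"))  # {'counting': 1.0, 'retrieval': 0.0}
print(grouped_accuracy(records, "diagram_type"))   # {'ER': 1.0, 'structural': 0.0}
```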

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct model evaluations

full rationale

The paper creates the SADU benchmark (154 diagrams, 2,431 QA tasks) and reports accuracy numbers for 11 external VLMs (Gemini, Claude, GPT, and Qwen families). No equations, derivations, fitted parameters, predictions, or self-citations are used to generate the central results; accuracies are measured outcomes on the curated test set. The evaluation depends only on external models, and by construction no claim reduces to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work rests on standard practices for benchmark curation and model evaluation.

pith-pipeline@v0.9.0 · 5574 in / 960 out tokens · 45653 ms · 2026-05-13T17:11:53.855171+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 conditional novelty 6.0

    LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.

  2. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 unverdicted novelty 5.0

    LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash

  2. [2]

    https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash-lite

  3. [3]

    https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview

  4. [4]

    https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite-preview

  5. [5]

    https://aws.amazon.com/what-is/sdlc/#how-does-sdlc-work–f1ezvt

  6. [6]

    https://developers.openai.com/api/docs/models/gpt-4o-mini

  7. [7]

    https://developers.openai.com/api/docs/models/gpt-5-nano

  8. [8]

    https://developers.openai.com/api/docs/models/gpt-5.4

  9. [9]

    https://doi.org/10.5281/zenodo.19339991

  10. [10]

    https://github.com/matovaro/pyunml-dataset

  11. [11]

    https://learn.microsoft.com/en-us/azure/architecture/

  12. [12]

    https://mermaid.com/

  13. [13]

    https://qwen.ai/blog?id=qwen2.5-vl

  14. [14]

    https://www.anthropic.com/news/claude-haiku-4-5

  15. [15]

    https://www.anthropic.com/news/claude-sonnet-4-5

  16. [16]

    https://www.figma.com

  17. [17]

    https://www.lucidchart.com

  18. [18]

    https://www.miro.com

  19. [19]

    State of software architecture report - 2024, 2024

  20. [20]

    State of software architecture report — 2025, 2026

  21. [21]

    Automatically recognizing the semantic elements from uml class diagram images

    Fangwei Chen, Li Zhang, Xiaoli Lian, and Nan Niu. Automatically recognizing the semantic elements from uml class diagram images. Journal of Systems and Software, 193:111431, 2022

  22. [22]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  23. [23]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  24. [24]

    The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

    Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235, 2025

  25. [25]

    Draft-ing architectural design decisions using llms

    Rudra Dhar, Adyansh Kakran, Amey Karan, Karthik Vaidhyanathan, and Vasudeva Varma. Draft-ing architectural design decisions using llms. arXiv preprint arXiv:2504.08207, 2025

  26. [26]

    Do vision-language models really understand visual language?

    Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan. Do vision-language models really understand visual language? arXiv preprint arXiv:2410.00193, 2024

  27. [27]

    Mitigating overthinking in large reasoning models via manifold steering

    Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, and Yinpeng Dong. Mitigating overthinking in large reasoning models via manifold steering. arXiv preprint arXiv:2505.22411, 2025

  28. [28]

    Will generative ai fill the automation gap in software architecting?

    James Ivers and Ipek Ozkaya. Will generative ai fill the automation gap in software architecting? In 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), pages 41–45. IEEE, 2025

  29. [29]

    UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents

    Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, and Maosong Sun. Unikie-bench: Benchmarking large multimodal models for key information extraction in visual documents. arXiv preprint arXiv:2602.07038, 2026

  30. [30]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  31. [31]

    Effects of defects in uml models: an experimental investigation

    Christian FJ Lange and Michel RV Chaudron. Effects of defects in uml models: an experimental investigation. In Proceedings of the 28th international conference on Software engineering, pages 401–411, 2006

  32. [32]

    Devbench: A comprehensive benchmark for software development

    Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 2024

  33. [33]

    On the perception bottleneck of vlms for chart understanding

    Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, and Junxian He. On the perception bottleneck of vlms for chart understanding. arXiv preprint arXiv:2503.18435, 2025

  34. [34]

    Wildvision: Evaluating vision-language models in the wild with human preferences

    Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, and Bill Yuchen Lin. Wildvision: Evaluating vision-language models in the wild with human preferences. Advances in Neural Information Processing Systems, 37:48224–48255, 2024

  35. [35]

    Argus: Vision-centric reasoning with grounded chain-of-thought

    Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. Argus: Vision-centric reasoning with grounded chain-of-thought. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14268–14280, 2025

  36. [36]

    Docvlm: Make your vlm an efficient reader

    Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, and Ron Litman. Docvlm: Make your vlm an efficient reader. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29005–29015, 2025

  37. [37]

    A survey into the rigor of uml use and its perceived impact on quality and productivity

    Ariadi Nugroho and Michel RV Chaudron. A survey into the rigor of uml use and its perceived impact on quality and productivity. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, pages 90–99, 2008

  38. [38]

    Dscodebench: A realistic benchmark for data science code generation

    Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, and Jie M Zhang. Dscodebench: A realistic benchmark for data science code generation. arXiv preprint arXiv:2505.15621, 2025

  39. [39]

    Math blind: Failures in diagram understanding undermine reasoning in mllms

    Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, and Anton van den Hengel. Math blind: Failures in diagram understanding undermine reasoning in mllms. arXiv preprint arXiv:2503.20745, 2025

  40. [40]

    From Charts to Code: A Hierarchical Benchmark for Multimodal Models

    Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Yifei Tao, Dongxing Mao, Yang Wan, Jingru Tan, Min Zeng, Min Li, and Alex Jinpeng Wang. From charts to code: A hierarchical benchmark for multimodal models. arXiv preprint arXiv:2510.17932, 2025

  41. [41]

    Document intelligence in the era of large language models: A survey

    Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, and Daniel Dahlmeier. Document intelligence in the era of large language models: A survey. arXiv preprint arXiv:2510.13366, 2025

  42. [42]

    Testeval: Benchmarking large language models for test case generation

    Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. Testeval: Benchmarking large language models for test case generation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3547–3562, 2025

  43. [43]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 37:113569–113697, 2024

  44. [44]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  45. [45]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

  46. [46]

    Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025