
arxiv: 2605.22079 · v2 · pith:Y5XIBTGBnew · submitted 2026-05-21 · 💻 cs.CL

Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Pith reviewed 2026-06-30 17:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords Information Delivery Specification · BIM · IDS generation · LLM benchmark · buildingSMART · IFC · structured generation · facet F1

The pith

Ishigaki-IDS-Bench is the first public benchmark for turning BIM information requirements into valid IDS XML files, where even top LLMs reach only 65.6 percent facet F1 and 33.1 percent content pass rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ishigaki-IDS-Bench to test large language models on generating Information Delivery Specifications from natural-language BIM requirements. The benchmark supplies 166 examples across 83 scenarios written in Japanese and English by six domain experts, each paired with a gold-standard IDS file and metadata on input format, IFC version, and construction domain. Evaluation splits into two checks: formal validity scored by the buildingSMART IDSAuditTool on processability, structure, and content, plus content fidelity measured by macro-F1 over individual facets against the gold files. Results across ten models in zero-shot settings show the best facet F1 at 65.6 percent and best content pass rate at 33.1 percent. The dataset and evaluation code are released publicly so others can measure progress on this structured-generation task that demands both IFC vocabulary compliance and external-validator agreement.

Core claim

Ishigaki-IDS-Bench is the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file. Evaluation proceeds in two stages: formal validity, scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content; and content fidelity, scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs evaluated in a zero-shot setting, the highest Facet F1 is 65.6 percent, achieved by GPT-5.5, while the highest Content pass rate is only 33.1 percent, achieved by Claude Opus 4.5.
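The facet-level macro-F1 named above can be pictured with a minimal sketch. This is one plausible reading, not the paper's released scorer: it assumes each IDS file is reduced to a set of facet tuples, that facets match only on exact equality, and that "macro" means an unweighted mean of per-example F1. The names `Facet`, `facet_f1`, and `macro_f1` are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical facet representation: (facet type, then its identifying values).
Facet = Tuple[str, ...]


def facet_f1(predicted: Iterable[Facet], gold: Iterable[Facet]) -> float:
    """F1 between predicted and gold facet sets for one example (exact match)."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # assumption: two empty sets count as perfect agreement
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs: List[Tuple[Iterable[Facet], Iterable[Facet]]]) -> float:
    """Unweighted mean of per-example F1 (assumed meaning of 'macro' here)."""
    scores = [facet_f1(pred, gold) for pred, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0


gold_a = [("entity", "IFCWALL"), ("property", "Pset_WallCommon", "LoadBearing")]
pred_a = [("entity", "IFCWALL"), ("property", "Pset_WallCommon", "FireRating")]
gold_b = [("entity", "IFCDOOR")]
pred_b = [("entity", "IFCDOOR")]

score_a = facet_f1(pred_a, gold_a)
overall = macro_f1([(pred_a, gold_a), (pred_b, gold_b)])
```

In this toy case the first prediction gets one of two facets right, and the second matches its gold file exactly. A scorer that averages per facet type rather than per example would give different numbers; the paper's evaluation code is the authority on which variant is used.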

What carries the argument

Ishigaki-IDS-Bench, a dataset of 83 expert-authored scenarios each linked to a gold IDS file and evaluated by a two-stage protocol of IDSAuditTool validity plus facet macro-F1.

If this is right

  • Current LLMs still fail to produce IDS files that both pass the official validator and match expert facet content at high rates.
  • Benchmarks for structured output must now incorporate domain-specific vocabulary checks and external-validator agreement rather than pure syntactic correctness.
  • Progress on IDS generation can be tracked quantitatively as new models are tested on the released 166-example set.
  • The dual evaluation protocol separates format compliance from semantic fidelity, allowing targeted diagnosis of model errors.
  • The public release under open licenses enables direct comparison of future methods against the reported zero-shot baselines.
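To illustrate how the three validity levels might be reported separately from content fidelity, here is a small sketch of pass-rate aggregation. The `AuditResult` record and `pass_rates` function are hypothetical and do not reflect the IDSAuditTool's actual output format; the four toy results are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AuditResult:
    """Hypothetical per-example outcome of the three validity checks."""
    processability: bool
    structure: bool
    content: bool


def pass_rates(results: List[AuditResult]) -> Dict[str, float]:
    """Fraction of examples passing each validity level."""
    levels = ("processability", "structure", "content")
    if not results:
        return {level: 0.0 for level in levels}
    total = len(results)
    return {
        level: sum(getattr(r, level) for r in results) / total
        for level in levels
    }


# Toy data, not taken from the paper.
toy_results = [
    AuditResult(True, True, True),
    AuditResult(True, True, False),
    AuditResult(True, False, False),
    AuditResult(False, False, False),
]
rates = pass_rates(toy_results)
```

Keeping the three levels as separate rates, instead of collapsing them into one score, is what lets a reader tell a model that emits malformed XML apart from one that emits well-formed XML with invalid IFC vocabulary.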

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark's bilingual design may reveal whether Japanese or English inputs produce systematically different error patterns in IFC vocabulary handling.
  • Similar dual-stage benchmarks could be created for other machine-checkable construction documents such as COBie or BCF to test the generality of the evaluation approach.
  • If models improve on this set, the same scenarios could serve as seed data for supervised fine-tuning aimed at IFC property-set conventions.

Load-bearing premise

The 83 scenarios and corresponding gold IDS files authored by the six experts are accurate, representative of practical use cases, and correctly validated against IFC and buildingSMART conventions.

What would settle it

Re-running the evaluation after independent expert re-validation of all 83 gold IDS files and finding that any model's content pass rate rises above 50 percent.

Original abstract

Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.
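For readers unfamiliar with IDS, the sketch below shows roughly what a small IDS document looks like and how its facets could be pulled out with Python's standard library. The XML fragment is a simplified illustration written for this example; it is not taken from the benchmark and is not guaranteed to pass the IDSAuditTool. The helper `extract_facets` and the tuple layout it returns are assumptions, not the paper's evaluation code.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace used by IDS documents.
IDS_NS = {"ids": "http://standards.buildingsmart.org/IDS"}

# Simplified, illustrative IDS fragment: walls must carry a LoadBearing property.
SAMPLE_IDS = """<ids xmlns="http://standards.buildingsmart.org/IDS">
  <info><title>Load-bearing walls</title></info>
  <specifications>
    <specification name="Walls declare LoadBearing" ifcVersion="IFC4">
      <applicability minOccurs="1" maxOccurs="unbounded">
        <entity><name><simpleValue>IFCWALL</simpleValue></name></entity>
      </applicability>
      <requirements>
        <property dataType="IFCBOOLEAN" cardinality="required">
          <propertySet><simpleValue>Pset_WallCommon</simpleValue></propertySet>
          <baseName><simpleValue>LoadBearing</simpleValue></baseName>
        </property>
      </requirements>
    </specification>
  </specifications>
</ids>"""


def extract_facets(ids_xml: str) -> List[Tuple[str, ...]]:
    """Return (section, facet type, simple values...) for every facet found."""
    root = ET.fromstring(ids_xml)
    facets: List[Tuple[str, ...]] = []
    for spec in root.iterfind(".//ids:specification", IDS_NS):
        for section in ("applicability", "requirements"):
            parent = spec.find(f"ids:{section}", IDS_NS)
            if parent is None:
                continue
            for facet in parent:
                # Strip the "{namespace}" prefix from the tag name.
                facet_type = facet.tag.split("}", 1)[-1]
                values = tuple(
                    (v.text or "").strip()
                    for v in facet.iterfind(".//ids:simpleValue", IDS_NS)
                )
                facets.append((section, facet_type) + values)
    return facets


facets = extract_facets(SAMPLE_IDS)
```

Here `facets` holds one entity facet under applicability and one property facet under requirements. This sketch ignores restriction-based values and attributes such as `dataType` and `cardinality`, which a faithful comparison against gold files would probably need to take into account.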

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Ishigaki-IDS-Bench, the first publicly released benchmark for generating Information Delivery Specification (IDS) XML files from natural-language BIM information requirements. It contains 166 examples spanning 83 scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata on input format, turn setting, target IFC versions, and construction domain. Evaluation uses two stages: formal validity via the buildingSMART IDSAuditTool (Processability, Structure, Content) and content fidelity via facet-level macro-F1 against the gold files. Across 10 LLMs in zero-shot, the best results are 65.6% Facet F1 (GPT-5.5) and 33.1% Content pass rate (Claude Opus 4.5). The benchmark is released on Hugging Face (DOI 10.57967/hf/8873) and evaluation code on Zenodo (DOI 10.5281/zenodo.20550510).

Significance. If the gold-standard IDS files are reliable and representative, the benchmark would provide a useful domain-specific resource that incorporates IFC vocabulary conformance and external-validator agreement requirements absent from general structured-generation benchmarks. The dual-metric evaluation and public release of data plus code would support reproducible research on LLM use in BIM compliance tasks and establish initial performance baselines for this specialized generation problem.

major comments (2)
  1. [Abstract] Abstract: The 83 scenarios and corresponding gold IDS files are described as authored by six BIM/IDS experts, but the manuscript supplies no inter-annotator agreement figures, external review process, or independent cross-check against buildingSMART/IFC conventions. Because both the IDSAuditTool Content score and the facet-level F1 treat the gold files as ground truth, the absence of such validation directly affects the reliability of the headline performance numbers (65.6% Facet F1, 33.1% Content pass rate).
  2. [Benchmark construction] Benchmark construction section (referenced in abstract): No information is given on the criteria used to select the 83 scenarios, their distribution across construction domains and IFC versions, or how edge cases were handled. This information is required to evaluate whether the benchmark is representative of practical use cases.
minor comments (1)
  1. [Abstract] Abstract: The relationship between the 83 scenarios and 166 examples should be clarified (e.g., whether each scenario contributes parallel Japanese and English versions).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the reliability of the gold-standard annotations and the representativeness of the benchmark. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The 83 scenarios and corresponding gold IDS files are described as authored by six BIM/IDS experts, but the manuscript supplies no inter-annotator agreement figures, external review process, or independent cross-check against buildingSMART/IFC conventions. Because both the IDSAuditTool Content score and the facet-level F1 treat the gold files as ground truth, the absence of such validation directly affects the reliability of the headline performance numbers (65.6% Facet F1, 33.1% Content pass rate).

    Authors: We agree that additional documentation of the gold-standard creation process is necessary to support the benchmark's credibility. The current manuscript does not report inter-annotator agreement statistics or a formal external validation step. In the revised manuscript we will add a new subsection under benchmark construction that describes the authoring protocol used by the six experts, the consensus process employed, and any cross-checks performed against buildingSMART guidelines and IFC conventions. revision: yes

  2. Referee: [Benchmark construction] Benchmark construction section (referenced in abstract): No information is given on the criteria used to select the 83 scenarios, their distribution across construction domains and IFC versions, or how edge cases were handled. This information is required to evaluate whether the benchmark is representative of practical use cases.

    Authors: We acknowledge that the manuscript lacks explicit details on scenario selection and coverage. The 83 scenarios were chosen by the expert authors to span common practical BIM use cases. In the revision we will expand the benchmark construction section with (i) the explicit selection criteria, (ii) a table or breakdown showing distribution across construction domains and target IFC versions, and (iii) a description of how edge cases were identified and incorporated. revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark and evaluations are self-contained against external references.

Full rationale

The paper constructs Ishigaki-IDS-Bench by authoring 83 scenarios and gold IDS files via six experts, then measures LLM outputs using the independent buildingSMART IDSAuditTool (Processability/Structure/Content) plus facet macro-F1 against those golds. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the reported 65.6% Facet F1 and 33.1% Content pass rate are direct empirical measurements. No self-citations are load-bearing for the central claims, and the derivation chain does not reduce to renaming or self-definition. The evaluation therefore stands as an independent benchmark release rather than a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper adds a new benchmark dataset and two-stage evaluation protocol on top of existing BIM/IFC/IDS standards without introducing fitted parameters or new postulated entities.

axioms (1)
  • domain assumption: The IFC vocabulary and the IDS XML format are established, machine-checkable standards for BIM information requirements.
    The benchmark construction and gold files rest on these pre-existing industry standards.

pith-pipeline@v0.9.1-grok · 5873 in / 1162 out tokens · 52714 ms · 2026-06-30T17:32:36.094185+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling

    cs.CL · 2026-06 · unverdicted · novelty 5.0

    Ishigaki-IDS is a verifier-aware LLM for generating validator-passing IDS files in BIM, reaching IDSAuditPass scores of 0.651-0.753 on a 166-case benchmark and cutting practitioner work time by 54.7%.
