Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements
Pith reviewed 2026-06-30 17:32 UTC · model grok-4.3
The pith
Ishigaki-IDS-Bench is the first public benchmark for turning BIM information requirements into valid IDS XML files. Among 10 LLMs tested zero-shot, the best Facet F1 is only 65.6 percent and the best Content pass rate only 33.1 percent, and the two maxima come from different models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ishigaki-IDS-Bench is the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file. Evaluation proceeds in two stages: formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6 percent, achieved by GPT-5.5, while the highest Content pass rate is only 33.1 percent, achieved by Claude Opus 4.5.
What carries the argument
Ishigaki-IDS-Bench, a dataset of 83 expert-authored scenarios each linked to a gold IDS file and evaluated by a two-stage protocol of IDSAuditTool validity plus facet macro-F1.
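To make the content-fidelity half of the protocol concrete, here is a minimal sketch of one plausible reading of facet-level macro-F1. It assumes each IDS file has already been reduced to a set of hashable facet tuples, that F1 is computed per example and then averaged, and that two empty facet sets count as a perfect match. The names `facet_f1` and `macro_f1` and the tuple layout are hypothetical; the released evaluation code may match and average facets differently.

```python
from typing import Hashable, Iterable, Sequence, Tuple

# Hypothetical facet representation, e.g. ("property", "Pset_WallCommon", "FireRating").
Facet = Hashable


def facet_f1(predicted: Iterable[Facet], gold: Iterable[Facet]) -> float:
    """F1 between the predicted and gold facet sets of a single example."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # assumption: two empty sets agree perfectly
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs: Sequence[Tuple[Iterable[Facet], Iterable[Facet]]]) -> float:
    """Unweighted mean of per-example F1 over (predicted, gold) pairs."""
    scores = [facet_f1(pred, gold) for pred, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, an output that gets one of two gold facets right and adds one wrong facet scores 0.5 on that example, so a model cannot reach a high score by emitting extra facets.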
If this is right
- Current LLMs still fail to produce IDS files that both pass the official validator and match expert facet content at high rates.
- Benchmarks for structured output should check domain-specific vocabulary conformance and agreement with external validators, not just syntactic correctness.
- Progress on IDS generation can be tracked quantitatively as new models are tested on the released 166-example set.
- The dual evaluation protocol separates format compliance from semantic fidelity, allowing targeted diagnosis of model errors.
- The public release under open licenses enables direct comparison of future methods against the reported zero-shot baselines.
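The separation of format compliance from semantic fidelity can be sketched as a small aggregation step. The example below assumes the three IDSAuditTool stage verdicts have already been recorded as booleans for each generated file; it does not call the audit tool. The `AuditOutcome` record and both helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

# Stage names as listed in the abstract, in the order they are reported.
STAGES = ("processability", "structure", "content")


@dataclass
class AuditOutcome:
    """Hypothetical record of pre-computed audit verdicts for one output."""
    processability: bool
    structure: bool
    content: bool


def pass_rates(outcomes: Sequence[AuditOutcome]) -> Dict[str, float]:
    """Fraction of outputs passing each stage, reported separately."""
    total = len(outcomes)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {
        stage: sum(getattr(outcome, stage) for outcome in outcomes) / total
        for stage in STAGES
    }


def first_failed_stage(outcome: AuditOutcome) -> Optional[str]:
    """Name of the earliest failing stage, or None if every stage passes."""
    for stage in STAGES:
        if not getattr(outcome, stage):
            return stage
    return None
```

Reporting a rate per stage, rather than a single pass/fail figure, shows whether a model mostly fails on malformed XML or on vocabulary-level content, which is the kind of targeted diagnosis described above.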
Where Pith is reading between the lines
- The benchmark's bilingual design may reveal whether Japanese or English inputs produce systematically different error patterns in IFC vocabulary handling.
- Similar dual-stage benchmarks could be created for other machine-checkable construction documents such as COBie or BCF to test the generality of the evaluation approach.
- If models improve on this set, the same scenarios could serve as seed data for supervised fine-tuning aimed at IFC property-set conventions.
Load-bearing premise
The 83 scenarios and corresponding gold IDS files authored by the six experts are accurate, representative of practical use cases, and correctly validated against IFC and buildingSMART conventions.
What would settle it
Re-running the evaluation after independent expert re-validation of all 83 gold IDS files and finding that any model's content pass rate rises above 50 percent.
Original abstract
Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.
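For readers unfamiliar with the format, the sketch below shows roughly what an IDS file looks like and how its facets could be pulled out for comparison. The XML fragment is a simplified illustration written for this example: it is not taken from the benchmark, has not been run through the IDSAuditTool, and may omit elements or attributes that the official schema requires. The namespace URI, the `extract_facets` helper, and the tuple layout are assumptions, not the paper's implementation.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed IDS namespace; check against the official schema before relying on it.
IDS_NS = {"ids": "http://standards.buildingsmart.org/IDS"}

# Simplified, illustrative IDS fragment (not validated against the schema).
SAMPLE_IDS = """<ids xmlns="http://standards.buildingsmart.org/IDS">
  <info><title>Wall fire rating</title></info>
  <specifications>
    <specification name="Walls need a fire rating" ifcVersion="IFC4">
      <applicability>
        <entity><name><simpleValue>IFCWALL</simpleValue></name></entity>
      </applicability>
      <requirements>
        <property>
          <propertySet><simpleValue>Pset_WallCommon</simpleValue></propertySet>
          <baseName><simpleValue>FireRating</simpleValue></baseName>
        </property>
      </requirements>
    </specification>
  </specifications>
</ids>"""

FacetTuple = Tuple[str, str, Tuple[str, ...]]


def extract_facets(ids_xml: str) -> List[FacetTuple]:
    """Return (section, facet type, simple values) for every facet found."""
    root = ET.fromstring(ids_xml)
    facets: List[FacetTuple] = []
    for spec in root.iterfind(".//ids:specification", IDS_NS):
        for section in ("applicability", "requirements"):
            container = spec.find(f"ids:{section}", IDS_NS)
            if container is None:
                continue
            for facet in container:
                # Strip the "{namespace}" prefix that ElementTree adds to tags.
                facet_type = facet.tag.split("}", 1)[-1]
                values = tuple(
                    node.text.strip()
                    for node in facet.iterfind(".//ids:simpleValue", IDS_NS)
                    if node.text
                )
                facets.append((section, facet_type, values))
    return facets
```

Calling `extract_facets(SAMPLE_IDS)` yields one entity facet under applicability and one property facet under requirements. This sketch only reads `simpleValue` entries, so restriction-based values and facet attributes are ignored.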
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Ishigaki-IDS-Bench, the first publicly released benchmark for generating Information Delivery Specification (IDS) XML files from natural-language BIM information requirements. It contains 166 examples spanning 83 scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata on input format, turn setting, target IFC versions, and construction domain. Evaluation uses two stages: formal validity via the buildingSMART IDSAuditTool (Processability, Structure, Content) and content fidelity via facet-level macro-F1 against the gold files. Across 10 LLMs in zero-shot, the best results are 65.6% Facet F1 (GPT-5.5) and 33.1% Content pass rate (Claude Opus 4.5). The benchmark is released on Hugging Face (DOI 10.57967/hf/8873) and evaluation code on Zenodo (DOI 10.5281/zenodo.20550510).
Significance. If the gold-standard IDS files are reliable and representative, the benchmark would provide a useful domain-specific resource that incorporates IFC vocabulary conformance and external-validator agreement requirements absent from general structured-generation benchmarks. The dual-metric evaluation and public release of data plus code would support reproducible research on LLM use in BIM compliance tasks and establish initial performance baselines for this specialized generation problem.
major comments (2)
- [Abstract] Abstract: The 83 scenarios and corresponding gold IDS files are described as authored by six BIM/IDS experts, but the manuscript supplies no inter-annotator agreement figures, external review process, or independent cross-check against buildingSMART/IFC conventions. Because both the IDSAuditTool Content score and the facet-level F1 treat the gold files as ground truth, the absence of such validation directly affects the reliability of the headline performance numbers (65.6% Facet F1, 33.1% Content pass rate).
- [Benchmark construction] Benchmark construction section (referenced in abstract): No information is given on the criteria used to select the 83 scenarios, their distribution across construction domains and IFC versions, or how edge cases were handled. This information is required to evaluate whether the benchmark is representative of practical use cases.
minor comments (1)
- [Abstract] Abstract: The relationship between the 83 scenarios and 166 examples should be clarified (e.g., whether each scenario contributes parallel Japanese and English versions).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the reliability of the gold-standard annotations and the representativeness of the benchmark. We address each major comment below and outline the revisions we will make.
Point-by-point responses
-
Referee: [Abstract] Abstract: The 83 scenarios and corresponding gold IDS files are described as authored by six BIM/IDS experts, but the manuscript supplies no inter-annotator agreement figures, external review process, or independent cross-check against buildingSMART/IFC conventions. Because both the IDSAuditTool Content score and the facet-level F1 treat the gold files as ground truth, the absence of such validation directly affects the reliability of the headline performance numbers (65.6% Facet F1, 33.1% Content pass rate).
Authors: We agree that additional documentation of the gold-standard creation process is necessary to support the benchmark's credibility. The current manuscript does not report inter-annotator agreement statistics or a formal external validation step. In the revised manuscript we will add a new subsection under benchmark construction that describes the authoring protocol used by the six experts, the consensus process employed, and any cross-checks performed against buildingSMART guidelines and IFC conventions.
Revision: yes
-
Referee: [Benchmark construction] Benchmark construction section (referenced in abstract): No information is given on the criteria used to select the 83 scenarios, their distribution across construction domains and IFC versions, or how edge cases were handled. This information is required to evaluate whether the benchmark is representative of practical use cases.
Authors: We acknowledge that the manuscript lacks explicit details on scenario selection and coverage. The 83 scenarios were chosen by the expert authors to span common practical BIM use cases. In the revision we will expand the benchmark construction section with (i) the explicit selection criteria, (ii) a table or breakdown showing distribution across construction domains and target IFC versions, and (iii) a description of how edge cases were identified and incorporated.
Revision: yes
Circularity Check
No circularity; benchmark and evaluations are self-contained against external references.
Full rationale
The paper constructs Ishigaki-IDS-Bench by authoring 83 scenarios and gold IDS files via six experts, then measures LLM outputs using the independent buildingSMART IDSAuditTool (Processability/Structure/Content) plus facet macro-F1 against those golds. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the reported 65.6% Facet F1 and 33.1% Content pass rate are direct empirical measurements. No self-citations are load-bearing for the central claims, and the derivation chain does not reduce to renaming or self-definition. The evaluation therefore stands as an independent benchmark release rather than a closed loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the IFC vocabulary and the IDS XML format are established, machine-checkable standards for BIM information requirements.
Forward citations
Cited by 1 Pith paper
- Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling
Ishigaki-IDS is a verifier-aware LLM for generating validator-passing IDS files in BIM, reaching IDSAuditPass scores of 0.651-0.753 on a 166-case benchmark and cutting practitioner work time by 54.7%.