Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Atomu Kondo; Chenguang Wang; Daiho Nishioka; Dayuan Jiang; Koki Arakawa; Koyo Hidaka; Naofumi Fujita; Ryo Kanazawa; Shuhei Saitoh; Takayuki Kato

arxiv: 2605.22079 · v2 · pith:Y5XIBTGBnew · submitted 2026-05-21 · 💻 cs.CL

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

Pith reviewed 2026-05-22 06:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords: IDS generation, BIM, large language models, structured output, benchmark, Information Delivery Specification, IFC, XML compliance

The pith

A new benchmark shows LLMs can partly turn BIM requirements into IDS XML but rarely produce fully compliant outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Ishigaki-IDS-Bench, a set of 166 expert-verified examples drawn from 83 practical BIM scenarios and supplied in both Japanese and English along with matching gold IDS files. It tests ten large language models in a zero-shot setting and measures results with audits for processability, structure, and content plus a content-agreement score against the gold files. The strongest model reaches 65.6 percent macro F1 on content agreement, yet only 27.7 percent of its outputs pass the full content audit. These findings indicate that current models can express some of the needed building information but have difficulty generating XML that consistently obeys the IDS standard and IFC vocabulary rules. The released benchmark and scripts are intended to support further work on reliable structured generation for construction data exchange.

Core claim

Ishigaki-IDS-Bench supplies 166 expert-authored and verified BIM information requirement examples paired with gold IDS XML files. In zero-shot evaluation across ten LLMs, the best model attains 65.6 percent macro F1 for content agreement, yet only 27.7 percent of its outputs pass the Content audit. Models can therefore capture portions of the requirements but still struggle to generate stable XML that satisfies the IDS standard and IFC vocabulary constraints.

What carries the argument

Ishigaki-IDS-Bench, a dataset of 166 expert-verified BIM-to-IDS examples evaluated by IDSAuditTool audits for processability, structure and content together with content-agreement metrics against gold files.
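The content-agreement idea can be sketched roughly as follows. This is a minimal illustration, not the paper's scoring script: the facet tuple encoding, the function names `facet_f1` and `macro_f1`, the exact-match comparison, and the choice to average per example are all assumptions made here.

```python
from typing import Hashable, Iterable, List, Set, Tuple

# Hypothetical facet encoding, e.g. ("property", "Pset_WallCommon", "FireRating").
Facet = Tuple[Hashable, ...]


def facet_f1(predicted: Iterable[Facet], gold: Iterable[Facet]) -> float:
    """F1 between two facet collections, compared as sets by exact match."""
    pred_set: Set[Facet] = set(predicted)
    gold_set: Set[Facet] = set(gold)
    if not pred_set and not gold_set:
        return 1.0  # nothing required, nothing produced: treat as agreement
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs: List[Tuple[List[Facet], List[Facet]]]) -> float:
    """Unweighted mean of per-example F1 scores (assumed averaging scheme)."""
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    return sum(facet_f1(pred, gold) for pred, gold in pairs) / len(pairs)
```

A score of this kind rewards partially correct outputs, which is why it can sit well above an all-or-nothing audit pass rate for the same model.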

If this is right

Current LLMs can capture some information requirements but require advances to meet exact XML and IFC constraints reliably.
The benchmark enables direct comparison of models and systematic failure analysis for structured output tasks.
It can be used to test new constrained generation techniques aimed at domain standards.
The multilingual and multi-domain coverage allows evaluation of generalization across construction contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the examples prove representative, the benchmark could serve as a standard test suite for tools that automate information delivery specifications.
Similar evaluation setups may prove useful in other engineering fields that demand outputs conform to precise industry XML schemas.
Closing the observed gap could reduce manual effort in creating BIM information exchanges for real projects.

Load-bearing premise

The 83 practical scenarios, when expanded by experts into 166 verified examples, accurately represent real-world BIM information requirements across languages and construction domains.

What would settle it

A test in which any model or method produces outputs that pass the Content audit on at least 80 percent of the 166 examples would show that LLMs can stably generate IDS XML meeting the required standards.
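For a sense of scale, the short sketch below converts between pass rates and example counts on a 166-example benchmark. The helper names are hypothetical, and the counts 46 and 55 are inferred here from the reported percentages under the assumption that rates are computed over all 166 examples; the page does not state them.

```python
import math

TOTAL_EXAMPLES = 166  # benchmark size stated in the paper


def min_passes_for_percent(percent: int, total: int = TOTAL_EXAMPLES) -> int:
    """Smallest number of passing outputs that reaches `percent` of `total`."""
    return math.ceil(percent * total / 100)


def pass_rate_percent(passes: int, total: int = TOTAL_EXAMPLES) -> float:
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * passes / total, 1)


print(min_passes_for_percent(80))  # 133
print(pass_rate_percent(46))       # 27.7 (inferred count, not reported)
print(pass_rate_percent(55))       # 33.1 (inferred count, not reported)
```

Under that assumption, reaching the 80 percent mark would mean passing the Content audit on 133 examples, well over twice the inferred 55 behind the best reported rate.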

Original abstract

Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.
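To make the target format concrete, the sketch below parses a small IDS-like fragment with Python's standard library and lists its property requirements. The fragment is a simplified illustration written for this example and has not been validated against the official schema; the namespace URI and element names reflect the author's understanding of IDS and should be treated as assumptions. A well-formedness check like this is far weaker than the IDSAuditTool audits described above.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed IDS namespace; check against the official schema before relying on it.
IDS_NS = {"ids": "http://standards.buildingsmart.org/IDS"}

# Illustrative fragment only: walls must carry Pset_WallCommon.FireRating.
SAMPLE_IDS = """
<ids:ids xmlns:ids="http://standards.buildingsmart.org/IDS">
  <ids:info><ids:title>Wall fire rating</ids:title></ids:info>
  <ids:specifications>
    <ids:specification name="Walls need a fire rating" ifcVersion="IFC4">
      <ids:applicability>
        <ids:entity><ids:name><ids:simpleValue>IFCWALL</ids:simpleValue></ids:name></ids:entity>
      </ids:applicability>
      <ids:requirements>
        <ids:property>
          <ids:propertySet><ids:simpleValue>Pset_WallCommon</ids:simpleValue></ids:propertySet>
          <ids:baseName><ids:simpleValue>FireRating</ids:simpleValue></ids:baseName>
        </ids:property>
      </ids:requirements>
    </ids:specification>
  </ids:specifications>
</ids:ids>
""".strip()


def is_well_formed(xml_text: str) -> bool:
    """True if the text parses as XML at all (not a schema or content check)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def property_requirements(xml_text: str) -> List[Tuple[str, str, str]]:
    """Collect (ifcVersion, property set, property name) for each requirement."""
    root = ET.fromstring(xml_text)
    found = []
    for spec in root.iterfind("ids:specifications/ids:specification", IDS_NS):
        version = spec.get("ifcVersion", "")
        for prop in spec.iterfind("ids:requirements/ids:property", IDS_NS):
            pset = prop.findtext("ids:propertySet/ids:simpleValue", default="", namespaces=IDS_NS)
            name = prop.findtext("ids:baseName/ids:simpleValue", default="", namespaces=IDS_NS)
            found.append((version, pset, name))
    return found
```

Tuples extracted this way are one possible input to a facet-level comparison against a gold file, while vocabulary conformance (for example, whether the property set exists for the stated IFC version) still needs an external validator.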

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ishigaki-IDS-Bench, a benchmark for evaluating LLMs on generating Information Delivery Specification (IDS) XML from BIM information requirements. The benchmark comprises 166 expert-authored and verified examples, created by expanding 83 practical scenarios into Japanese/English pairs with corresponding gold IDS files and metadata. In zero-shot evaluations across 10 LLMs, the top-performing model achieves 65.6% macro F1 for content agreement with gold standards, yet only 27.7% of generated outputs pass the Content audit. The authors conclude that while LLMs can capture some aspects of the requirements, they struggle to produce stable XML outputs that fully comply with the IDS standard and IFC vocabulary constraints. The benchmark data and gold files are released publicly under CC BY 4.0, and the evaluation code is released under Apache-2.0.

Significance. Assuming the test set is representative, this benchmark addresses an important gap in evaluating structured generation for domain-specific XML formats that must adhere to both syntactic standards and specialized vocabularies from the construction industry. The reported results provide concrete evidence of current limitations in LLM outputs for such tasks, which may motivate research into better constrained generation techniques. The public release of the full benchmark, including scripts for audits and content agreement evaluation, is a notable strength that enhances reproducibility and allows for community-driven extensions.

major comments (2)

§3 (Benchmark Construction): The selection and expansion of the 83 practical scenarios into 166 examples is not accompanied by explicit selection criteria, a defined sampling frame, or coverage statistics (e.g., distribution across construction domains, IFC versions, or requirement types). This omission is load-bearing for the central claim, as the low pass rate on the Content audit (27.7%) is interpreted as evidence of general LLM struggles with IDS generation; without representativeness evidence, the results may instead reflect the specific scope of the chosen scenarios.
§5 (Experiments and Results): The manuscript provides numeric results and describes the use of IDSAuditTool for audits, but does not include details on the validation of the audit tool itself or how inter-expert agreement was measured during gold standard creation. This affects the reliability of the reported Processability, Structure, and Content audit outcomes.

minor comments (2)

Abstract: The phrase 'macro F1 for content agreement' would benefit from a brief parenthetical explanation or reference to the exact computation method used (e.g., averaging over requirement categories).
Introduction: Additional citations to recent work on LLM structured output generation (beyond JSON/SQL/code) would strengthen the positioning of this benchmark within the broader literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving transparency in benchmark construction and evaluation methodology. We address each major comment below and indicate planned revisions.

Referee: §3 (Benchmark Construction): The selection and expansion of the 83 practical scenarios into 166 examples is not accompanied by explicit selection criteria, a defined sampling frame, or coverage statistics (e.g., distribution across construction domains, IFC versions, or requirement types). This omission is load-bearing for the central claim, as the low pass rate on the Content audit (27.7%) is interpreted as evidence of general LLM struggles with IDS generation; without representativeness evidence, the results may instead reflect the specific scope of the chosen scenarios.

Authors: We agree that explicit selection criteria, a defined sampling frame, and coverage statistics are necessary to support claims about the generalizability of the observed LLM performance limitations. The 83 scenarios were drawn from real-world BIM information requirements collected through consultations with construction industry experts and standardized templates commonly used in Japanese and international projects. In the revised manuscript, we will expand §3 with a dedicated subsection describing the selection criteria (focusing on scenarios that exercise core IDS features such as property sets, classifications, and material specifications while ensuring verifiability), the sampling frame (starting from a larger pool of candidate requirements and filtering for practicality and expert feasibility), and quantitative coverage statistics including distributions across construction domains, IFC versions, and requirement types. These additions will allow readers to better evaluate whether the 27.7% Content audit pass rate reflects broader challenges in IDS generation. revision: yes
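The kind of coverage statistics discussed here could be tabulated along the following lines. This is a sketch only: the helper name `coverage_table` and the metadata field names used in the usage note are hypothetical and are not taken from the released dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def coverage_table(records: Iterable[Mapping[str, str]], field: str) -> Dict[str, float]:
    """Percentage share of each value of `field`, most frequent first."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to summarise")
    return {value: round(100 * n / total, 1) for value, n in counts.most_common()}
```

Calling it once per metadata field, for example `coverage_table(records, "domain")` and `coverage_table(records, "ifc_version")` with those assumed field names, would give the per-domain and per-version distributions the referee asks for.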
Referee: §5 (Experiments and Results): The manuscript provides numeric results and describes the use of IDSAuditTool for audits, but does not include details on the validation of the audit tool itself or how inter-expert agreement was measured during gold standard creation. This affects the reliability of the reported Processability, Structure, and Content audit outcomes.

Authors: We thank the referee for identifying this gap in methodological detail. The IDSAuditTool implements validation rules directly from the buildingSMART IDS standard specification for Processability and Structure audits, with Content audit checks based on IFC vocabulary constraints; we will add a description of its development and internal validation against official test cases in the revised §5. For gold standard creation, each example was authored by a primary expert and independently reviewed by two additional experts, with discrepancies resolved via discussion to achieve consensus. We did not compute quantitative inter-expert agreement metrics such as Cohen’s or Fleiss’ kappa. In the revision we will provide a fuller account of the verification workflow and note the consensus process. We are prepared to report the number of revisions required during verification and, if requested, to perform a post-hoc agreement analysis on a representative subset for the final version. revision: partial
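A post-hoc agreement analysis of the kind mentioned here could use Cohen's kappa for two reviewers, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The sketch below is a plain-Python illustration; the function name and the idea of labelling each example with a categorical verdict are assumptions, not part of the paper's workflow.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With three experts per example, as described in the response, Fleiss' kappa or pairwise averages of this statistic would be the more natural choice.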

Circularity Check

0 steps flagged

No circularity: benchmark results derive from external gold standards

The paper introduces Ishigaki-IDS-Bench by expanding 83 practical scenarios into 166 expert-verified Japanese/English examples with independently authored gold IDS files. Zero-shot LLM evaluation computes macro F1 for content agreement and pass rates on IDSAuditTool audits directly against these gold references. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear; performance numbers are not forced by construction from the inputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new free parameters or invented entities. It relies on existing IDS and IFC standards plus standard LLM evaluation practices.

axioms (1)

domain assumption: Expert-authored and verified scenarios provide reliable ground truth for evaluating structured generation against industry standards. The benchmark construction begins from 83 practical scenarios expanded and verified by BIM/IDS experts.
