Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery

Luuk Versteeg; Martin R. Oswald; Rob G.J. Wijnhoven

arxiv: 2605.26370 · v1 · pith:O25BZQTNnew · submitted 2026-05-25 · 💻 cs.CV

Pith reviewed 2026-06-29 22:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords instance segmentation · roof attribute regression · aerial imagery · 3D building reconstruction · Mask R-CNN · geometric attributes · LoD2 models

The pith

Joint prediction of roof segment masks with height, slope and azimuth from one aerial image enables reconstruction of simplified 3D building models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an extension of Mask R-CNN that adds a regression branch to output both per-instance roof masks and three continuous attributes per segment: building height, roof slope and roof azimuth. Two adjustments stabilize training: a conditional loss that withholds azimuth supervision on flat roofs where the label is noisy, and a log-normalized encoding that counters the skewed distribution of building heights. The resulting masks and attributes together suffice to produce LoD2-level 3D models directly from a single overhead orthophoto, so that costly 3D reference data is required only during training on the Dutch aerial-plus-3DBAG dataset.

Core claim

The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
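
To make the claim concrete, here is a minimal sketch of how a slope and an azimuth could define a roof plane once a reference height is known. The axis convention (east, north, up), the azimuth convention (clockwise from north, pointing in the direction the plane faces), and both function names are assumptions made for illustration; the paper's actual reconstruction procedure is not described in the abstract.

```python
import math


def roof_plane_normal(slope_deg, azimuth_deg):
    """Unit normal (east, north, up) of a tilted roof plane.

    Assumed convention: azimuth is the compass direction the plane faces,
    measured clockwise from north; slope is the tilt from horizontal.
    """
    slope = math.radians(slope_deg)
    azimuth = math.radians(azimuth_deg)
    return (
        math.sin(slope) * math.sin(azimuth),
        math.sin(slope) * math.cos(azimuth),
        math.cos(slope),
    )


def plane_height_at(x, y, anchor, normal):
    """Height of the plane through `anchor` with `normal` at position (x, y)."""
    ax, ay, az = anchor
    nx, ny, nz = normal
    # Plane equation n . (p - anchor) = 0, solved for z. Requires nz != 0,
    # which holds for any slope below 90 degrees.
    return az - (nx * (x - ax) + ny * (y - ay)) / nz
```

With these conventions, an east-facing plane at 45 degrees anchored at 10 m drops by about 1 m for every metre travelled east, which is the behaviour a LoD2 roof surface built from the predicted attributes would need to reproduce.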

What carries the argument

An attribute regression branch attached to Mask R-CNN together with a conditional azimuth loss that skips flat-roof segments and a log-normalized height representation.
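
A conditional azimuth loss of this kind might look like the sketch below. The 5-degree flat-roof threshold, the function names, and the wrapped absolute-difference error are all illustrative assumptions; the abstract does not specify the threshold or the exact loss form the authors use.

```python
def angular_diff_deg(pred, target):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = (pred - target) % 360.0
    return min(d, 360.0 - d)


def conditional_azimuth_loss(pred_az, gt_az, gt_slope, flat_threshold_deg=5.0):
    """Mean azimuth error over non-flat segments only.

    flat_threshold_deg is a hypothetical cut-off: segments whose ground-truth
    slope falls below it contribute no azimuth supervision.
    """
    errors = [
        angular_diff_deg(p, t)
        for p, t, s in zip(pred_az, gt_az, gt_slope)
        if s >= flat_threshold_deg
    ]
    # If every segment is flat there is nothing to supervise.
    return sum(errors) / len(errors) if errors else 0.0
```

The wrap-around in `angular_diff_deg` matters because azimuths of 350 and 10 degrees are only 20 degrees apart, not 340.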

If this is right

Simplified 3D building models can be produced from a single aerial orthophoto at inference time.
Only the training stage requires the expensive 3D reference data.
Reported errors are approximately 4 degrees for roof slope, 7 degrees for azimuth and 1 meter for height.
Instance segmentation reaches AP50 of 0.566 using a DINOv3 ConvNeXt-Base backbone.
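
For readers less familiar with the segmentation metric: AP50 treats a predicted mask as a correct detection when its intersection-over-union (IoU) with a ground-truth mask is at least 0.5. The sketch below shows only that matching criterion, not the full average-precision computation, and it represents masks as sets of pixel coordinates purely for illustration.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def is_match_at_50(pred_mask, gt_mask):
    """True when the prediction would count as a hit at the 0.5 IoU threshold."""
    return mask_iou(pred_mask, gt_mask) >= 0.5
```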

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same joint mask-plus-attribute output could be applied to other man-made objects whose 3D form can be approximated from overhead imagery.
Large-scale city modeling pipelines could shift from repeated 3D acquisition to periodic 2D imagery plus a trained regressor.
Performance on imagery from different countries or sensors would test how much the learned mapping depends on Dutch building stock and acquisition conditions.

Load-bearing premise

The automatically derived ground truth labels from the 3DBAG nationwide LiDAR-based dataset are accurate and consistent enough to serve as reliable supervision for the continuous attribute regression tasks.

What would settle it

Reconstruct LoD2 models from the network outputs on a held-out set of buildings and compare the resulting geometry directly against independent high-accuracy LiDAR or manual survey measurements of the same structures.

Figures

Figures reproduced from arXiv: 2605.26370 by Luuk Versteeg, Martin R. Oswald, Rob G.J. Wijnhoven.

**Figure 3.** Distribution of building attributes across dataset splits.

**Figure 5.** Motivation for the conditional azimuth loss (ground truth ...

**Figure 6.** Distribution of roof segment angles (left) and heights ...

**Figure 7.** Ground truth (blue) and predicted (orange) distributions for roof segment angle, azimuth, and height on the test set. The predicted ...

**Figure 8.** Instance segmentation results. Left: predicted masks (green). Middle: matched ground-truth masks (light blue) and missed ...

**Figure 9.** Roof angle prediction examples. (a) Well-performing case on structured row houses with small boundary deviations. (b) The ...

**Figure 10.** Azimuth prediction examples. (a) The model performs well on structured row houses with consistent orientations. (b) Failure ...

**Figure 11.** Height prediction examples. (a) The model correctly distinguishes low residential buildings from taller structures. (b) Adjacent ...

**Figure 12.** Per-cluster error distributions for (a) roof angle, (b) azimuth, and (c) height. Each row of subplots groups roof segments by ...

Original abstract

We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes -- building height, roof slope, and roof azimuth -- from a single aerial orthophoto. Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset. Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds two sensible loss tweaks to Mask R-CNN for roof attributes and reports concrete MAEs, but never checks whether the outputs actually produce usable LoD2 models.

The main thing to know is that this work gets attribute regression numbers from single aerial images using a Mask R-CNN extension, but the claim that those outputs suffice for LoD2 reconstruction is not tested.

They add regression heads for height, slope, and azimuth. The conditional azimuth loss skips supervision on flat roofs where labels are noisy, and the log-normalized height handles the skewed distribution of building sizes. Both are straightforward fixes for this specific setting.
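
A log-normalized height target could be sketched as follows. The `log1p` form, the 100 m scaling bound, and the function names are assumptions chosen for illustration; the paper's exact normalization is not given in the abstract.

```python
import math

H_MAX_M = 100.0  # hypothetical upper bound, used only to scale targets to [0, 1]


def encode_height(h_m, h_max=H_MAX_M):
    """Map a height in metres to a log-compressed regression target."""
    return math.log1p(h_m) / math.log1p(h_max)


def decode_height(z, h_max=H_MAX_M):
    """Invert encode_height, mapping a network output back to metres."""
    return math.expm1(z * math.log1p(h_max))
```

Under these assumed settings a 10 m building maps to roughly 0.52 even though it sits at a tenth of the linear range, so the many low buildings are spread over more of the target interval than they would be with linear scaling.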

The data comes from Dutch orthophotos paired with labels automatically pulled from the 3DBAG LiDAR dataset. They use a DINOv3 ConvNeXt-Base backbone and report roughly 1 m height MAE, 4° slope, 7° azimuth, and 0.566 AP50 on instances. The scale of the training set is a clear positive.

The actual novelty is limited to those two loss modifications rather than a new framework. The evaluation stays with standard supervised metrics on external data, so there is no circularity issue.

The soft spot is the missing end-to-end check. The abstract states that the masks plus attributes are sufficient to reconstruct LoD2 models from one image, yet the only numbers are the separate per-attribute errors. No surface error, volume difference, or visual fidelity metric on the resulting 3D models is given, so the sufficiency statement remains an untested step. The abstract also omits baselines and ablations, which makes it hard to judge how much the two losses actually move the needle.
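
One rough way to see why per-attribute errors do not translate directly into geometric fidelity is to propagate a slope error to the ridge of a simple pitched roof. The sketch below is a back-of-envelope illustration, not an analysis from the paper; the function name and the example run and slope are hypothetical.

```python
import math


def ridge_height_error(run_m, slope_deg, slope_error_deg):
    """Vertical error at the ridge when the slope is off by slope_error_deg.

    run_m is the horizontal distance from eave to ridge.
    """
    true_rise = run_m * math.tan(math.radians(slope_deg))
    predicted_rise = run_m * math.tan(math.radians(slope_deg + slope_error_deg))
    return predicted_rise - true_rise
```

For a hypothetical 4 m run at 35 degrees, a 4-degree overestimate lifts the ridge by roughly 0.44 m. That is the same order as the reported 1 m height error, which suggests the two error sources could compound in a reconstructed model, although an MAE is an average and individual segments will differ.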

This is for people working on scalable building extraction in remote sensing or urban modeling who want a practical pipeline and some numbers to compare against. It is not foundational but could serve as a reference point.

I would send it for peer review. The motivation and data are solid, and the gaps are fixable with additional experiments rather than fatal.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a multi-task extension of Mask R-CNN for joint roof instance segmentation and regression of three geometric attributes (building height, roof slope, roof azimuth) from single aerial orthophotos. It introduces a conditional azimuth loss to handle noisy flat-roof labels and a log-normalized height representation for skewed distributions. The model is trained on a large Dutch dataset with automatically derived ground truth from the 3DBAG LiDAR-based 3D building collection, reporting instance AP50 of 0.566 together with MAEs of approximately 1 m (height), 4° (slope), and 7° (azimuth) using a DINOv3 ConvNeXt-Base backbone. The authors state that the per-segment masks and attributes suffice to reconstruct simplified LoD2 3D building models from a single overhead image.

Significance. If the reconstruction sufficiency claim is substantiated, the work would enable scalable LoD2 building model generation from widely available 2D aerial imagery, with expensive 3D reference data required only during training. The use of a nationwide real-world dataset and the domain-specific loss modifications for attribute regression represent practical contributions to remote-sensing computer vision. The absence of direct end-to-end reconstruction metrics, however, limits the assessed significance of the central claim.

major comments (2)

[Abstract] The central claim that 'the predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2)' is unsupported by any quantitative end-to-end evaluation. No metric is reported that compares reconstructed LoD2 geometry (surface error, volume difference, or visual fidelity) derived from the model's outputs against 3DBAG reference models; only separate segmentation AP50 and per-attribute MAEs are provided, rendering the sufficiency statement an untested extrapolation.
The manuscript provides no validation or error analysis of the automatically derived continuous attribute labels from the 3DBAG LiDAR dataset. Because these labels serve as the sole supervision for the regression tasks, the lack of reported label accuracy or consistency checks for height, slope, and azimuth directly affects the reliability of the reported MAEs.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address the major comments point by point below.

Referee: [Abstract] The central claim that 'the predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2)' is unsupported by any quantitative end-to-end evaluation. No metric is reported that compares reconstructed LoD2 geometry (surface error, volume difference, or visual fidelity) derived from the model's outputs against 3DBAG reference models; only separate segmentation AP50 and per-attribute MAEs are provided, rendering the sufficiency statement an untested extrapolation.

Authors: We agree that a direct quantitative evaluation of the end-to-end LoD2 reconstruction would provide stronger evidence for the claim. The manuscript focuses on the prediction task, and the statement is intended to highlight the practical utility of the outputs for LoD2 modeling, as these are the defining parameters. To address this, we will revise the abstract to qualify the claim as 'enable the reconstruction of simplified LoD2 models' and include a short discussion on how the predicted attributes can be used for reconstruction, along with the expected impact of the reported errors. This will make the claim more precise without overstatement.

revision: yes

Referee: [—] The manuscript provides no validation or error analysis of the automatically derived continuous attribute labels from the 3DBAG LiDAR dataset. Because these labels serve as the sole supervision for the regression tasks, the lack of reported label accuracy or consistency checks for height, slope, and azimuth directly affects the reliability of the reported MAEs.

Authors: This is a valid point. The ground truth attributes are automatically derived from 3DBAG, and while 3DBAG is a high-quality nationwide dataset, we did not include an analysis of the derivation accuracy or potential label noise beyond the conditional loss for azimuth. We will add a dedicated paragraph or subsection describing the label extraction process from 3DBAG and discuss known limitations, such as potential inaccuracies in slope and azimuth for certain roof types. This will help contextualize the reported MAEs.

revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation is standard supervised learning on external data

The paper describes a Mask R-CNN extension with an added regression branch, trained end-to-end on aerial images paired with independent LiDAR-derived ground truth from the external 3DBAG dataset. Reported results consist of direct test-set metrics (instance AP50 and per-attribute MAEs) with no equations or claims showing that any output reduces to a fitted input by construction. The two listed innovations (conditional azimuth loss and log-normalized height) are explicit modeling choices, not self-referential definitions. The LoD2 sufficiency statement is an unmeasured extrapolation but does not create a circular derivation chain. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the quality of automatically generated 3DBAG labels and on the assumption that the proposed loss modifications meaningfully improve regression under label noise; these are not independently validated in the abstract.

free parameters (2)

multi-task loss weighting
Weights balancing segmentation and regression losses are typically tuned on validation data and directly affect the reported MAEs.
log-normalization scale parameters
Parameters controlling the log transform of height are chosen to match the data distribution and affect the height regression performance.
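
The first of these free parameters can be pictured as a weighted sum over task losses. This is a generic multi-task sketch; the task names and weight values are hypothetical and are not taken from the paper.

```python
def total_loss(losses, weights):
    """Weighted sum of per-task losses.

    losses and weights are dicts keyed by task name; every task in losses
    must have a weight, so a forgotten weight fails loudly.
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())
```

For example, `total_loss({"mask": 1.0, "height": 0.5}, {"mask": 1.0, "height": 2.0})` gives 2.0. Changing the height weight shifts how much the optimizer trades segmentation quality against height accuracy, which is why such weights affect the reported MAEs.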

axioms (2)

domain assumption: Automatically derived 3DBAG labels provide sufficiently accurate supervision for height, slope, and azimuth.
The entire training and evaluation pipeline rests on this without reported label noise analysis.
domain assumption: A single orthophoto contains enough visual cues to regress continuous 3D roof attributes at the reported accuracy.
Implicit premise enabling the single-image reconstruction claim.
