Pith · machine review for the scientific record

arxiv: 2605.11863 · v1 · submitted 2026-05-12 · 💻 cs.CV · eess.IV

Recognition: 2 Lean theorem links

GATA2Floor: Graph attention for floor counting in street-view facades

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:23 UTC · model grok-4.3

classification 💻 cs.CV · eess.IV
keywords graph attention · facade analysis · floor counting · street view · graph neural networks · computer vision · self-supervised learning

The pith

Graph attention networks count building floors from street-view facades by assigning windows to latent levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models each facade as a graph whose nodes are detected windows and doors and whose edges follow a vertical prior. It introduces GATA2Floor, a multi-head GATv2 architecture that outputs both the total floor count and a soft assignment of each element to one of several latent floor slots via learned cross-attention queries. The approach is shown to remain interpretable and to handle irregular layouts better than isolated detection methods. It further demonstrates that the same graph reasoning can run label-free by generating proposals from self-supervised features scored with vision-language models. The resulting relational structure is positioned as useful for downstream urban analytics tasks that need to understand building scale.

Core claim

GATA2Floor predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns detected facade elements to latent floor slots. The model is built on a graph whose nodes are window and door detections and whose edges incorporate a vertical prior; multi-head GATv2 layers propagate information across this structure to produce both the scalar count and the per-element floor assignments.
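One plausible rendering of this graph construction, with a threshold-plus-decay vertical prior of the kind the simulated rebuttal describes, is sketched below. The coordinates, threshold tau, and decay sigma are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical detections: (x, y) box centers in normalized image
# coordinates (y grows downward); values chosen for illustration.
boxes = np.array([
    [0.2, 0.15], [0.5, 0.16], [0.8, 0.14],   # top-floor windows
    [0.2, 0.45], [0.5, 0.47], [0.8, 0.44],   # middle-floor windows
    [0.2, 0.78], [0.5, 0.80],                # ground-floor window + door
])

def vertical_prior_edges(centers, tau=0.1, sigma=0.05):
    """Connect detections whose normalized vertical distance is below
    tau; weight each edge by exp(-|y_i - y_j| / sigma) so closer rows
    couple more strongly. tau and sigma are illustrative placeholders."""
    edges, weights = [], []
    n = len(centers)
    for i in range(n):
        for j in range(n):
            dy = abs(centers[i, 1] - centers[j, 1])
            if i != j and dy < tau:
                edges.append((i, j))
                weights.append(float(np.exp(-dy / sigma)))
    return edges, weights

edges, weights = vertical_prior_edges(boxes)
# Same-floor detections (small dy) are densely connected, while
# cross-floor pairs exceed tau and receive no edge at all.
```

Under this construction, detections on the same floor form connected clusters before any attention layer runs, which is what makes a downstream floor count plausible.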

What carries the argument

GATA2Floor, a multi-head GATv2 network that uses learnable cross-attention queries to assign facade graph nodes to latent floor slots while predicting the total floor count.
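As a sketch of the machinery involved, a single-head GATv2-style layer (after Brody et al., "How attentive are graph attention networks?") can be written in plain NumPy. The dimensions and the value projection below are illustrative choices, not the paper's architecture:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gatv2_layer(h, neighbors, W, a):
    """Single-head GATv2-style attention (illustrative).
    h: (N, d) node features; neighbors[i]: nodes that node i attends to;
    W: (d_out, 2d) shared weight; a: (d_out,) attention vector.
    Scores follow GATv2's form e_ij = a . LeakyReLU(W [h_i ; h_j]),
    which, unlike the original GAT, applies the nonlinearity before a."""
    n, d = h.shape
    out = np.zeros((n, W.shape[0]))
    for i in range(n):
        nbrs = neighbors[i]
        z = np.stack([W @ np.concatenate([h[i], h[j]]) for j in nbrs])
        e = leaky_relu(z) @ a                # unnormalized attention scores
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                 # softmax over the neighborhood
        # Second half of W serves as an (illustrative) value projection.
        out[i] = sum(a_ij * (W[:, d:] @ h[j]) for a_ij, j in zip(alpha, nbrs))
    return out

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
neighbors = [[1, 2], [0, 2], [0, 1, 3], [2]]
out = gatv2_layer(h, neighbors, rng.normal(size=(5, 6)), rng.normal(size=5))
```

A multi-head version would run several such layers in parallel and concatenate or average their outputs; the paper's model additionally stacks residual blocks on top.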

Load-bearing premise

Modeling facades as graphs with a vertical prior on edges plus GATv2 attention will reliably capture floor structure even in irregular or occluded real-world images.

What would settle it

The claim would be falsified by a set of street-view facades with irregular or occluded window patterns on which the model's floor-count errors exceed those of a simple vertical-sorting baseline.

read the original abstract

Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GATA2Floor, a multi-head GATv2-based graph attention model that represents building facades as graphs over window/door detections with a vertical prior on edges. It predicts the global floor count while using learnable cross-attention queries to softly assign detections to latent floor slots for interpretability and robustness to irregular designs. A label-free proposal mechanism based on self-supervised features and vision-language scoring is introduced to address the lack of annotated data, demonstrating the utility of relational graph reasoning for facade understanding.

Significance. If the central claims hold, the work would advance automated facade analysis for urban analytics, energy assessment, and emergency planning by showing how graph attention with vertical priors and cross-attention queries can yield both accurate counts and interpretable floor assignments. The label-free self-supervised component is a clear strength that could broaden applicability where labeled data is scarce.

major comments (3)
  1. [Abstract] Abstract: the approach is described but no performance numbers, error analysis, ablation studies, or validation details are supplied, so it is impossible to verify whether the graph construction, GATv2 attention, and cross-attention queries actually support the floor-counting and assignment claims.
  2. [Method] Method section (graph construction): the vertical prior is invoked but its precise definition (e.g., how y-coordinate differences are turned into edge weights or adjacency) is not given; this is load-bearing because missing or spurious detections under occlusion would sever vertical connections and break both count prediction and slot assignment.
  3. [Experiments] Experiments: no ablation on detection failure modes (occlusion, shadows, irregular spacing) is reported, yet the central claim of robustness to irregular designs rests on the assumption that the graph plus GATv2 plus cross-attention queries can recover from such failures.
minor comments (1)
  1. [Method] Notation for the number of latent floor slots and attention heads should be introduced explicitly with their default values or ranges.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to improve clarity, detail, and verifiability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the approach is described but no performance numbers, error analysis, ablation studies, or validation details are supplied, so it is impossible to verify whether the graph construction, GATv2 attention, and cross-attention queries actually support the floor-counting and assignment claims.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will update the abstract to report the primary performance metrics (floor-count accuracy and soft-assignment precision on the evaluated datasets) along with a concise statement of the validation protocol. This change will make the claims immediately verifiable while remaining within length constraints. revision: yes

  2. Referee: [Method] Method section (graph construction): the vertical prior is invoked but its precise definition (e.g., how y-coordinate differences are turned into edge weights or adjacency) is not given; this is load-bearing because missing or spurious detections under occlusion would sever vertical connections and break both count prediction and slot assignment.

    Authors: The referee correctly identifies that the vertical prior requires an explicit definition. We will revise the method section to state that nodes are connected when their normalized vertical distance is below threshold τ, with edge weights w_ij = exp(−|y_i − y_j|/σ). The revised text will also include the concrete values of τ and σ used in experiments and explain how GATv2 multi-head attention combined with the cross-attention queries enables information propagation even when some vertical edges are absent due to occlusion. revision: yes

  3. Referee: [Experiments] Experiments: no ablation on detection failure modes (occlusion, shadows, irregular spacing) is reported, yet the central claim of robustness to irregular designs rests on the assumption that the graph plus GATv2 plus cross-attention queries can recover from such failures.

    Authors: We acknowledge that a targeted analysis of detection failure modes would strengthen the robustness claim. Our current experiments already evaluate the model on real-world facades containing occlusions and irregular spacing, where it outperforms non-graph baselines. For the revision we will add a dedicated subsection presenting both qualitative examples of challenging cases and quantitative results on synthetically perturbed detections (random removal of 15–25% of nodes to simulate occlusion), thereby directly illustrating the contribution of the graph structure and cross-attention queries. revision: partial
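The perturbation protocol proposed in this simulated rebuttal is straightforward to sketch: delete a random fraction of detections and re-run the pipeline on what remains. The removal fraction and detections below are illustrative:

```python
import numpy as np

def drop_detections(boxes, frac, rng):
    """Simulate occlusion by deleting a random fraction `frac` of
    detections, mirroring the proposed 15-25% node-removal ablation."""
    n = len(boxes)
    n_keep = n - int(round(frac * n))
    keep = np.sort(rng.permutation(n)[:n_keep])
    return boxes[keep]

rng = np.random.default_rng(42)
boxes = np.array([[x, y] for y in (0.15, 0.45, 0.78) for x in (0.2, 0.5, 0.8)])
perturbed = drop_detections(boxes, 0.22, rng)   # 9 detections -> 7 kept
```

Comparing floor-count error on `boxes` versus `perturbed` across many seeds would quantify how much the graph structure compensates for missing nodes.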

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs facade graphs from detections, applies a vertical prior on edges, and uses standard GATv2 attention plus learnable cross-attention queries to predict floor count and assign elements to slots. No derivation step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation chain; the components are independent applications of established graph attention techniques without tautological renaming or imported uniqueness theorems from the same authors. The approach remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that vertical graph structure plus attention suffices for floor reasoning and that self-supervised features can substitute for labels; no new physical entities are postulated.

free parameters (1)
  • number of attention heads and latent floor slots
    Architectural hyperparameters of the multi-head GATv2 and cross-attention modules; their default values or ranges are not stated explicitly.
axioms (1)
  • domain assumption Facades can be represented as graphs with a vertical prior on edges between window/door detections
    Invoked to justify the graph construction step before applying attention.
invented entities (1)
  • latent floor slots no independent evidence
    purpose: Soft assignment targets for detected elements to produce interpretable floor-level outputs
    Introduced via learnable cross-attention queries to handle irregular facade designs.
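A minimal sketch of what "latent floor slots" amount to computationally, assuming dot-product cross-attention between learned slot queries and node embeddings (the embedding size and slot count K are invented for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_assignment(node_feats, slot_queries):
    """Soft element-to-floor assignment: score each detection against
    each learned floor-slot query, then normalize per detection so every
    element carries a distribution over slots (the interpretable output)."""
    scores = node_feats @ slot_queries.T    # (N, K) similarities
    return softmax(scores, axis=1)

rng = np.random.default_rng(1)
nodes = rng.normal(size=(8, 16))     # 8 detections, 16-d embeddings
queries = rng.normal(size=(5, 16))   # K = 5 latent floor slots
A = slot_assignment(nodes, queries)  # each row is a slot distribution
```

Because each row of `A` sums to one, an element's floor membership can be read off directly, which is the interpretability property the review credits to the slots.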

pith-pipeline@v0.9.0 · 5468 in / 1301 out tokens · 48357 ms · 2026-05-13T07:23:18.107476+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

  1. [1]

    GATA2Floor: Graph attention for floor counting in street-view facades

    INTRODUCTION Street view imagery (SVI) offers a valuable resource in building facades with multiple potential applications (energy estimation, construction cost/style prediction etc.), where accurate building-level information is critical. Estimating floors, however, requires reasoning over spatially arranged elements (windows/doors) rather than tre...

  2. [2]

    PROPOSED METHODOLOGY The proposed GATA2Floor operates on precomputed window and door bounding boxes obtained either from a supervised detector or from a lightweight label-free proposal mechanism (Section 2.5) when annotations are unavailable, and builds a graph over those boxes rather than on the raw image. Concretely, given a set of N element detections ...

  3. [3]

    [Architecture figure] GATA2Floor pipeline: build vertical-aware graph → input embedding + positional encoding → residual GAT blocks (GATv2Conv, GraphNorm, LeakyReLU, dropout, LayerNorm, FFN) with vertical attention → multi-head cross-attention over floor queries into the assignment head, plus global mean pooling of global features into the counting and confidence heads.

  4. [4]

    [Fig. 1] Facade element detection: either a pretrained supervised detector (Mask R-CNN or YOLO) or the label-free proposal path — dense DINOv2 patch embeddings, grayscale, edge, saliency, spatial-variance, and coherence maps, GMM (×2), and VLM prompting on crops to score the resulting proposals.

  5. [5]

    Datasets

    EXPERIMENTS AND RESULTS 3.1. Datasets We use multiple common labeled datasets in the facade detection field like the Amsterdam Facade, ECP, eTRIMS, and ParisArtDecoFacades [13, 14]. We perform manual labeling for the floor-level ground truth generation. 3.2. Graph-based representation We first evaluate the proposed graph-based representation before th...

  6. [6]

    CONCLUSION This work models facades as vertical-aware graphs over window/door detections and introduces GATA2Floor, a multi-head GATv2 architecture that jointly performs global floor counting and soft element-to-floor assignment. Extensive experiments across public and a large unlabeled datasets show that GATA2Floor outperforms clustering-based basel...

  7. [7]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in 29th Annual Conference on Neural Information Processing Systems (NeurIPS), 2015, pp. 91–99.

  8. [8]

    Mask R-CNN

    K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.

  9. [9]

    You only look once: Unified, real-time object detection

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.

  10. [10]

    Deep learning-based door and window detection from building façade

    G. Sezen, M. Çakır, M. E. Atik, and Z. Duran, “Deep learning-based door and window detection from building façade,” in The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), 2022, vol. XLIII-B4-2022, pp. 315–320.

  11. [11]

    Zero-shot building attribute extraction from large-scale vision and language models

    F. Pan, S. Jeon, B. Wang, F. Mckenna, and S. X. Yu, “Zero-shot building attribute extraction from large-scale vision and language models,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 8632–8641.

  12. [12]

    The graph neural network model

    F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.

  13. [13]

    Graph attention networks

    P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations (ICLR), 2018.

  14. [14]

    How attentive are graph attention networks?

    S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?,” in International Conference on Learning Representations (ICLR), 2022.

  15. [15]

    Floorlevel-net: Recognizing floor-level lines with height-attention-guided multi-task learning

    M. Wu, W. Zeng, and C.-W. Fu, “Floorlevel-net: Recognizing floor-level lines with height-attention-guided multi-task learning,” IEEE Transactions on Image Processing, vol. 30, pp. 6686–6699, 2021.

  16. [16]

    Geodata-based number of floor estimation for urban residential buildings as an input parameter for energy modelling

    F. Moubayed, R. Becker, and J. Blankenbach, “Geodata-based number of floor estimation for urban residential buildings as an input parameter for energy modelling,” Geo-spatial Information Science, vol. 0, pp. 1–27, 2025.

  17. [17]

    Semi-supervised learning from street-view images and openstreetmap for automatic building height estimation

    H. Li, Z. Yuan, G. Dax, G. Kong, H. Fan, A. Zipf, and M. Werner, “Semi-supervised learning from street-view images and openstreetmap for automatic building height estimation,” arXiv preprint arXiv:2307.02574, 2023.

  18. [18]

    Building floor number estimation from crowdsourced street-level images: Munich dataset and baseline method

    Y. Sun, S. Chen, Y. Tian, and X. X. Zhu, “Building floor number estimation from crowdsourced street-level images: Munich dataset and baseline method,” arXiv preprint arXiv:2505.18021, 2025.

  19. [19]

    eTRIMS image database for interpreting images of man-made scenes

    F. Korč and W. Förstner, “eTRIMS image database for interpreting images of man-made scenes,” Tech. Rep. TR-IGG-P-2009-01, Dept. of Photogrammetry, University of Bonn, 2009.

  20. [20]

    Learning grammars for architecture-specific facade parsing

    R. Gadde, R. Marlet, and N. Paragios, “Learning grammars for architecture-specific facade parsing,” International Journal of Computer Vision, vol. 117, no. 3, pp. 290–316, 2016.

  21. [21]

    Fast R-CNN

    R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.

  22. [22]

    A 3×3 isotropic gradient operator for image processing

    I. Sobel and G. Feldman, “A 3×3 isotropic gradient operator for image processing,” in Pattern Classification and Scene Analysis, pp. 271–272, 1973.

  23. [23]

    Dinov2: Learning robust visual features without supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” Transactions on Machine Learning Research, 2024.

  24. [24]

    Maximum likelihood from incomplete data via the EM algorithm

    A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.

  25. [25]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), PMLR, 2021, pp. 8748–8763.

  26. [26]

    GPT-4o System Card

    OpenAI, “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.

  27. [27]

    Decoupled weight decay regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019.

  28. [28]

    Floor count from street view imagery using learning-based façade parsing

    D. J. Dobson, “Floor count from street view imagery using learning-based façade parsing,” Master’s thesis, TU Delft, 2023.

  29. [29]

    Yolo-world: Real-time open-vocabulary object detection

    T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.