Pith · machine review for the scientific record

arxiv: 2605.11863 · v1 · submitted 2026-05-12 · 💻 cs.CV · eess.IV

Recognition: 2 Lean theorem links

GATA2Floor: Graph attention for floor counting in street-view facades

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:23 UTC · model grok-4.3

classification 💻 cs.CV · eess.IV
keywords graph attention · facade analysis · floor counting · street view · graph neural networks · computer vision · self-supervised learning

The pith

Graph attention networks count building floors from street-view facades by assigning windows to latent levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models each facade as a graph whose nodes are detected windows and doors and whose edges follow a vertical prior. It introduces GATA2Floor, a multi-head GATv2 architecture that outputs both the total floor count and a soft assignment of each element to one of several latent floor slots via learned cross-attention queries. The approach is shown to remain interpretable and to handle irregular layouts better than isolated detection methods. It further demonstrates that the same graph reasoning can run label-free by generating proposals from self-supervised features scored with vision-language models. The resulting relational structure is positioned as useful for downstream urban analytics tasks that need to understand building scale.

Core claim

GATA2Floor predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns detected facade elements to latent floor slots. The model is built on a graph whose nodes are window and door detections and whose edges incorporate a vertical prior; multi-head GATv2 layers propagate information across this structure to produce both the scalar count and the per-element floor assignments.
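One plausible rendering of this graph construction, with a threshold-plus-decay vertical prior of the kind the simulated rebuttal describes, is sketched below. The coordinates, threshold tau, and decay sigma are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical detections: (x, y) box centers in normalized image
# coordinates (y grows downward); values chosen for illustration.
boxes = np.array([
    [0.2, 0.15], [0.5, 0.16], [0.8, 0.14],   # top-floor windows
    [0.2, 0.45], [0.5, 0.47], [0.8, 0.44],   # middle-floor windows
    [0.2, 0.78], [0.5, 0.80],                # ground-floor window + door
])

def vertical_prior_edges(centers, tau=0.1, sigma=0.05):
    """Connect detections whose normalized vertical distance is below
    tau; weight each edge by exp(-|y_i - y_j| / sigma) so closer rows
    couple more strongly. tau and sigma are illustrative placeholders."""
    edges, weights = [], []
    n = len(centers)
    for i in range(n):
        for j in range(n):
            dy = abs(centers[i, 1] - centers[j, 1])
            if i != j and dy < tau:
                edges.append((i, j))
                weights.append(float(np.exp(-dy / sigma)))
    return edges, weights

edges, weights = vertical_prior_edges(boxes)
# Same-floor detections (small dy) are densely connected, while
# cross-floor pairs exceed tau and receive no edge at all.
```

Under this construction, detections on the same floor form connected clusters before any attention layer runs, which is what makes a downstream floor count plausible.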

What carries the argument

GATA2Floor, a multi-head GATv2 network that uses learnable cross-attention queries to assign facade graph nodes to latent floor slots while predicting the total floor count.
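As a sketch of the machinery involved, a single-head GATv2-style layer (after Brody et al., "How attentive are graph attention networks?") can be written in plain NumPy. The dimensions and the value projection below are illustrative choices, not the paper's architecture:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gatv2_layer(h, neighbors, W, a):
    """Single-head GATv2-style attention (illustrative).
    h: (N, d) node features; neighbors[i]: nodes that node i attends to;
    W: (d_out, 2d) shared weight; a: (d_out,) attention vector.
    Scores follow GATv2's form e_ij = a . LeakyReLU(W [h_i ; h_j]),
    which, unlike the original GAT, applies the nonlinearity before a."""
    n, d = h.shape
    out = np.zeros((n, W.shape[0]))
    for i in range(n):
        nbrs = neighbors[i]
        z = np.stack([W @ np.concatenate([h[i], h[j]]) for j in nbrs])
        e = leaky_relu(z) @ a                # unnormalized attention scores
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                 # softmax over the neighborhood
        # Second half of W serves as an (illustrative) value projection.
        out[i] = sum(a_ij * (W[:, d:] @ h[j]) for a_ij, j in zip(alpha, nbrs))
    return out

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
neighbors = [[1, 2], [0, 2], [0, 1, 3], [2]]
out = gatv2_layer(h, neighbors, rng.normal(size=(5, 6)), rng.normal(size=5))
```

A multi-head version would run several such layers in parallel and concatenate or average their outputs; the paper's model additionally stacks residual blocks on top.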

Load-bearing premise

Modeling facades as graphs with a vertical prior on edges plus GATv2 attention will reliably capture floor structure even in irregular or occluded real-world images.

What would settle it

The claim would be falsified by a set of street-view facades with irregular or occluded window patterns on which the model's floor-count errors exceed those of a simple vertical-sorting baseline.

read the original abstract

Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GATA2Floor, a multi-head GATv2-based graph attention model that represents building facades as graphs over window/door detections with a vertical prior on edges. It predicts the global floor count while using learnable cross-attention queries to softly assign detections to latent floor slots for interpretability and robustness to irregular designs. A label-free proposal mechanism based on self-supervised features and vision-language scoring is introduced to address the lack of annotated data, demonstrating the utility of relational graph reasoning for facade understanding.

Significance. If the central claims hold, the work would advance automated facade analysis for urban analytics, energy assessment, and emergency planning by showing how graph attention with vertical priors and cross-attention queries can yield both accurate counts and interpretable floor assignments. The label-free self-supervised component is a clear strength that could broaden applicability where labeled data is scarce.

major comments (3)
  1. [Abstract] Abstract: the approach is described but no performance numbers, error analysis, ablation studies, or validation details are supplied, so it is impossible to verify whether the graph construction, GATv2 attention, and cross-attention queries actually support the floor-counting and assignment claims.
  2. [Method] Method section (graph construction): the vertical prior is invoked but its precise definition (e.g., how y-coordinate differences are turned into edge weights or adjacency) is not given; this is load-bearing because missing or spurious detections under occlusion would sever vertical connections and break both count prediction and slot assignment.
  3. [Experiments] Experiments: no ablation on detection failure modes (occlusion, shadows, irregular spacing) is reported, yet the central claim of robustness to irregular designs rests on the assumption that the graph plus GATv2 plus cross-attention queries can recover from such failures.
minor comments (1)
  1. [Method] Notation for the number of latent floor slots and attention heads should be introduced explicitly with their default values or ranges.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to improve clarity, detail, and verifiability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the approach is described but no performance numbers, error analysis, ablation studies, or validation details are supplied, so it is impossible to verify whether the graph construction, GATv2 attention, and cross-attention queries actually support the floor-counting and assignment claims.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will update the abstract to report the primary performance metrics (floor-count accuracy and soft-assignment precision on the evaluated datasets) along with a concise statement of the validation protocol. This change will make the claims immediately verifiable while remaining within length constraints. revision: yes

  2. Referee: [Method] Method section (graph construction): the vertical prior is invoked but its precise definition (e.g., how y-coordinate differences are turned into edge weights or adjacency) is not given; this is load-bearing because missing or spurious detections under occlusion would sever vertical connections and break both count prediction and slot assignment.

    Authors: The referee correctly identifies that the vertical prior requires an explicit definition. We will revise the method section to state that nodes are connected when their normalized vertical distance is below threshold τ, with edge weights w_ij = exp(−|y_i − y_j|/σ). The revised text will also include the concrete values of τ and σ used in experiments and explain how GATv2 multi-head attention combined with the cross-attention queries enables information propagation even when some vertical edges are absent due to occlusion. revision: yes

  3. Referee: [Experiments] Experiments: no ablation on detection failure modes (occlusion, shadows, irregular spacing) is reported, yet the central claim of robustness to irregular designs rests on the assumption that the graph plus GATv2 plus cross-attention queries can recover from such failures.

    Authors: We acknowledge that a targeted analysis of detection failure modes would strengthen the robustness claim. Our current experiments already evaluate the model on real-world facades containing occlusions and irregular spacing, where it outperforms non-graph baselines. For the revision we will add a dedicated subsection presenting both qualitative examples of challenging cases and quantitative results on synthetically perturbed detections (random removal of 15–25% of nodes to simulate occlusion), thereby directly illustrating the contribution of the graph structure and cross-attention queries. revision: partial
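The perturbation protocol proposed in this simulated rebuttal is straightforward to sketch: delete a random fraction of detections and re-run the pipeline on what remains. The removal fraction and detections below are illustrative:

```python
import numpy as np

def drop_detections(boxes, frac, rng):
    """Simulate occlusion by deleting a random fraction `frac` of
    detections, mirroring the proposed 15-25% node-removal ablation."""
    n = len(boxes)
    n_keep = n - int(round(frac * n))
    keep = np.sort(rng.permutation(n)[:n_keep])
    return boxes[keep]

rng = np.random.default_rng(42)
boxes = np.array([[x, y] for y in (0.15, 0.45, 0.78) for x in (0.2, 0.5, 0.8)])
perturbed = drop_detections(boxes, 0.22, rng)   # 9 detections -> 7 kept
```

Comparing floor-count error on `boxes` versus `perturbed` across many seeds would quantify how much the graph structure compensates for missing nodes.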

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs facade graphs from detections, applies a vertical prior on edges, and uses standard GATv2 attention plus learnable cross-attention queries to predict floor count and assign elements to slots. No derivation step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation chain; the components are independent applications of established graph attention techniques without tautological renaming or imported uniqueness theorems from the same authors. The approach remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that vertical graph structure plus attention suffices for floor reasoning and that self-supervised features can substitute for labels; no new physical entities are postulated.

free parameters (1)
  • number of attention heads and latent floor slots
    Architectural hyperparameters of the multi-head GATv2 and cross-attention modules; their default values or ranges are not stated explicitly.
axioms (1)
  • domain assumption Facades can be represented as graphs with a vertical prior on edges between window/door detections
    Invoked to justify the graph construction step before applying attention.
invented entities (1)
  • latent floor slots no independent evidence
    purpose: Soft assignment targets for detected elements to produce interpretable floor-level outputs
    Introduced via learnable cross-attention queries to handle irregular facade designs.
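A minimal sketch of what "latent floor slots" amount to computationally, assuming dot-product cross-attention between learned slot queries and node embeddings (the embedding size and slot count K are invented for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_assignment(node_feats, slot_queries):
    """Soft element-to-floor assignment: score each detection against
    each learned floor-slot query, then normalize per detection so every
    element carries a distribution over slots (the interpretable output)."""
    scores = node_feats @ slot_queries.T    # (N, K) similarities
    return softmax(scores, axis=1)

rng = np.random.default_rng(1)
nodes = rng.normal(size=(8, 16))     # 8 detections, 16-d embeddings
queries = rng.normal(size=(5, 16))   # K = 5 latent floor slots
A = slot_assignment(nodes, queries)  # each row is a slot distribution
```

Because each row of `A` sums to one, an element's floor membership can be read off directly, which is the interpretability property the review credits to the slots.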

pith-pipeline@v0.9.0 · 5468 in / 1301 out tokens · 48357 ms · 2026-05-13T07:23:18.107476+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

  1. [1]

    GATA2Floor: Graph attention for floor counting in street-view facades

    INTRODUCTION Street view imagery (SVI) offers a valuable resource in building facades with multiple potential applications (energy estimation, construction cost/style prediction etc.), where accurate building-level information is critical. Estimating floors, however, requires reasoning over spatially arranged elements (windows/doors) rather than tre...

  2. [2]

    PROPOSED METHODOLOGY The proposed GATA2Floor operates on precomputed window and door bounding boxes obtained either from a supervised detector or from a lightweight label-free proposal mechanism (Section 2.5) when annotations are unavailable, and builds a graph over those boxes rather than on the raw image. Concretely, given a set of N element detections ...

  3. [3]

    [Architecture figure] GATA2Floor pipeline: build vertical-aware graph → input embedding + positional encoding → residual GAT blocks (GATv2Conv, GraphNorm, LeakyReLU, dropout, LayerNorm, FFN) with vertical attention → multi-head cross-attention over floor queries into the assignment head, plus global mean pooling of global features into the counting and confidence heads.

  4. [4]

    [Fig. 1] Facade element detection: either a pretrained supervised detector (Mask R-CNN or YOLO) or the label-free proposal path — dense DINOv2 patch embeddings, grayscale, edge, saliency, spatial-variance, and coherence maps, GMM (×2), and VLM prompting on crops to score the resulting proposals.

  5. [5]

    Datasets

    EXPERIMENTS AND RESULTS 3.1. Datasets We use multiple common labeled datasets in the facade detection field like the Amsterdam Facade, ECP, eTRIMS, and ParisArtDecoFacades [13, 14]. We perform manual labeling for the floor-level ground truth generation. 3.2. Graph-based representation We first evaluate the proposed graph-based representation before th...

  6. [6]

    CONCLUSION This work models facades as vertical-aware graphs over window/door detections and introduces GATA2Floor, a multi-head GATv2 architecture that jointly performs global floor counting and soft element-to-floor assignment. Extensive experiments across public and a large unlabeled datasets show that GATA2Floor outperforms clustering-based basel...

  7. [7]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in 29th Annual Conference on Neural Information Processing Systems (NeurIPS), 2015, pp. 91–99.

  8. [8]

    Mask R-CNN

    K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.

  9. [9]

    You only look once: Unified, real-time object detection

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.

  10. [10]

    Deep learning-based door and window detection from building façade

    G. Sezen, M. Çakır, M. E. Atik, and Z. Duran, “Deep learning-based door and window detection from building façade,” in The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), 2022, vol. XLIII-B4-2022, pp. 315–320.

  11. [11]

    Zero-shot building attribute extraction from large-scale vision and language models

    F. Pan, S. Jeon, B. Wang, F. Mckenna, and S. X. Yu, “Zero-shot building attribute extraction from large-scale vision and language models,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 8632–8641.

  12. [12]

    The graph neural network model

    F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.

  13. [13]

    Graph attention networks

    P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations (ICLR), 2018.

  14. [14]

    How attentive are graph attention networks?

    S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?,” in International Conference on Learning Representations (ICLR), 2022.

  15. [15]

    Floorlevel-net: Recognizing floor-level lines with height-attention-guided multi-task learning

    M. Wu, W. Zeng, and C.-W. Fu, “Floorlevel-net: Recognizing floor-level lines with height-attention-guided multi-task learning,” IEEE Transactions on Image Processing, vol. 30, pp. 6686–6699, 2021.

  16. [16]

    Geodata-based number of floor estimation for urban residential buildings as an input parameter for energy modelling

    F. Moubayed, R. Becker, and J. Blankenbach, “Geodata-based number of floor estimation for urban residential buildings as an input parameter for energy modelling,” Geo-spatial Information Science, vol. 0, pp. 1–27, 2025.

  17. [17]

    Semi-supervised learning from street-view images and openstreetmap for automatic building height estimation

    H. Li, Z. Yuan, G. Dax, G. Kong, H. Fan, A. Zipf, and M. Werner, “Semi-supervised learning from street-view images and openstreetmap for automatic building height estimation,” arXiv preprint arXiv:2307.02574, 2023.

  18. [18]

    Building floor number estimation from crowdsourced street-level images: Munich dataset and baseline method

    Y. Sun, S. Chen, Y. Tian, and X. X. Zhu, “Building floor number estimation from crowdsourced street-level images: Munich dataset and baseline method,” arXiv preprint arXiv:2505.18021, 2025.

  19. [19]

    eTRIMS image database for interpreting images of man-made scenes

    F. Korč and W. Förstner, “eTRIMS image database for interpreting images of man-made scenes,” Tech. Rep. TR-IGG-P-2009-01, Dept. of Photogrammetry, University of Bonn, 2009.

  20. [20]

    Learning grammars for architecture-specific facade parsing

    R. Gadde, R. Marlet, and N. Paragios, “Learning grammars for architecture-specific facade parsing,” International Journal of Computer Vision, vol. 117, no. 3, pp. 290–316, 2016.

  21. [21]

    Fast R-CNN

    R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.

  22. [22]

    A 3×3 isotropic gradient operator for image processing

    I. Sobel and G. Feldman, “A 3×3 isotropic gradient operator for image processing,” in Pattern Classification and Scene Analysis, pp. 271–272, 1973.

  23. [23]

    Dinov2: Learning robust visual features without supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” Transactions on Machine Learning Research, 2024.

  24. [24]

    Maximum likelihood from incomplete data via the EM algorithm

    A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.

  25. [25]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), PMLR, 2021, pp. 8748–8763.

  26. [26]

    GPT-4o System Card

    OpenAI, “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.

  27. [27]

    Decoupled weight decay regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019.

  28. [28]

    Floor count from street view imagery using learning-based façade parsing

    D. J. Dobson, “Floor count from street view imagery using learning-based façade parsing,” Master’s thesis, TU Delft, 2023.

  29. [29]

    Yolo-world: Real-time open-vocabulary object detection

    T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.