
arxiv: 2606.23669 · v1 · pith:ESTHI426new · submitted 2026-06-22 · 💻 cs.CV

GeoFidelity-Bench: Evaluating Segment-Level Geographic Fidelity in Text-to-Image Street-View Generation

Pith reviewed 2026-06-26 09:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · street view · geographic fidelity · benchmark · segment retrieval · prompt evaluation · Mapillary · OpenStreetMap

The pith

Adding street and neighborhood names to the prompt raises top-1 retrieval accuracy by 5.5 percentage points in generated street views, yet the similarity margin between the target and the nearest same-city segment stays near zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GeoFidelity-Bench to test whether text-to-image models can produce street views that match a specific requested road segment rather than a generic city appearance. It assembles 7,117 real Mapillary images from 109 named OpenStreetMap segments across 25 cities and, for each generated panel, ranks the target reference panel against panels from the nearest same-city segment, other same-city segments, and other-city segments. Experiments across six models show that adding street and neighborhood names to city-only prompts lifts top-1 retrieval accuracy by 5.5 percentage points (95% confidence interval: 3.4 to 7.7 percentage points). The same experiments find almost no extra similarity for the exact target over its nearest same-city alternative, indicating that local names mainly improve neighborhood plausibility. Held-out real-image queries recover segment identity, confirming that the reference panels carry usable segment-level signal.

Core claim

GeoFidelity-Bench ranks generated panels by similarity to the target reference panel versus the nearest same-city segment, other same-city segments, and other-city segments. City-only prompts yield low top-1 accuracy; adding correct street and neighborhood names raises accuracy by 5.5 percentage points, while the similarity margin between target and nearest same-city segment remains near zero. Appending raw GPS coordinates as text yields no statistically clear gain, and prompts with incorrect local names still confer partial improvement. Held-out real-image queries recover segment identity, validating that the references contain recoverable segment-level signal.
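The prompt conditions compared above can be pictured with a small sketch. The templates, the function name `build_prompts`, and the condition keys are hypothetical; the paper's actual prompt wording is not given on this page, and the place names below are placeholders rather than benchmark segments.

```python
def build_prompts(city, street, neighborhood, lat, lon,
                  wrong_street, wrong_neighborhood):
    """Return one prompt per condition (hypothetical templates)."""
    city_only = f"A street-level photo in {city}"
    local = f"A street-level photo of {street} in {neighborhood}, {city}"
    return {
        # City name only.
        "city_only": city_only,
        # Correct street and neighborhood names added.
        "street_and_neighborhood": local,
        # Raw coordinates appended as ordinary text.
        "gps_augmented": f"{local}, GPS {lat:.5f}, {lon:.5f}",
        # Same city, deliberately wrong local names (control condition).
        "incorrect_local_names": (
            f"A street-level photo of {wrong_street} "
            f"in {wrong_neighborhood}, {city}"
        ),
    }


prompts = build_prompts(
    city="Example City",
    street="Example Street",
    neighborhood="Example District",
    lat=12.34567,
    lon=-76.54321,
    wrong_street="Other Street",
    wrong_neighborhood="Other District",
)
for name, text in prompts.items():
    print(f"{name}: {text}")
```

Holding the city fixed in every condition is what lets the incorrect-names control separate the effect of the correct local names from the effect of having any local name at all.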

What carries the argument

The reference-panel ranking protocol that scores each generated image against target, nearest same-city, other same-city, and other-city panels to isolate segment-level geographic fidelity rather than absolute similarity.
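A minimal sketch of such a ranking step follows. It assumes each panel is summarized by the mean of its image embeddings and compared by cosine similarity; the embedding model, similarity metric, and aggregation rule actually used are not stated on this page, so these choices, the function names, and the toy vectors are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of vectors (assumed panel aggregation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_generated_panel(generated, references, target_id, nearest_id):
    """Rank reference panels by similarity to one generated panel.

    generated:  list of embedding vectors for the generated images.
    references: dict mapping segment id -> list of embedding vectors.
    """
    query = mean_vector(generated)
    scores = {
        segment: cosine(query, mean_vector(vectors))
        for segment, vectors in references.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return {
        "top1_hit": ranking[0] == target_id,
        "target_rank": ranking.index(target_id) + 1,
        # Similarity margin: target minus nearest same-city segment.
        "margin": scores[target_id] - scores[nearest_id],
    }


# Toy 3-dimensional embeddings, invented for illustration only.
references = {
    "city_a/target": [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2]],
    "city_a/nearest": [[0.9, 0.2, 0.2], [1.0, 0.1, 0.3]],
    "city_a/other": [[0.5, 0.8, 0.1]],
    "city_b/other": [[0.0, 0.2, 1.0]],
}
generated = [[0.95, 0.12, 0.22], [0.9, 0.15, 0.25]]
result = rank_generated_panel(
    generated, references, "city_a/target", "city_a/nearest"
)
print(result)
```

In this toy data the generated panel is far closer to both same-city neighbors than to the other-city panel, yet the nearest same-city segment narrowly outranks the target. That is the pattern the benchmark is designed to expose: a small or negative margin despite plausible local appearance.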

If this is right

  • Local names improve broad local plausibility more than exact segment identity.
  • Raw GPS coordinates appended as text yield no statistically clear additional benefit.
  • Only part of the accuracy gain depends on using the correct local names rather than any local name.
  • The benchmark distinguishes real images by segment, confirming usable signal in the references.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future generators may need mechanisms beyond plain text prompts to encode segment-specific visual features.
  • The ranking protocol could be reused to measure progress on other fine-grained visual control tasks.
  • Persistent near-zero margins suggest that training corpora may under-represent the distinctive appearance of individual road segments.

Load-bearing premise

The curated reference panels contain usable segment-level signal recoverable by real-image queries, and the ranking isolates geographic fidelity without being dominated by lighting, season, or camera angle.

What would settle it

If queries with held-out real images fail to rank the correct segment first at rates above chance, the reference panels lack recoverable segment-level signal.
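One way to make "above chance" concrete is an exact one-sided binomial test against the uniform-guess rate. The sketch below assumes every query is ranked against the same number of candidate segments, so chance is one over that number; the candidate-set size per query is not stated on this page, and the counts used are invented, not results from the paper.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def beats_chance(successes, trials, n_candidates, alpha=0.05):
    """Test whether a top-1 hit rate exceeds uniform guessing."""
    chance = 1.0 / n_candidates  # assumes one correct segment per query
    p_value = binomial_tail(successes, trials, chance)
    return {
        "rate": successes / trials,
        "chance": chance,
        "p_value": p_value,
        "above_chance": p_value < alpha,
    }


# Invented counts: 30 correct top-1 retrievals out of 200 held-out queries,
# each ranked against 109 candidate segments.
strong = beats_chance(successes=30, trials=200, n_candidates=109)
# Invented counts for a control that is indistinguishable from guessing.
weak = beats_chance(successes=2, trials=200, n_candidates=109)
print(strong["above_chance"], weak["above_chance"])
```

If the candidate set differs per query (for example, only same-city segments), the chance rate would have to be computed per query rather than as a single constant.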

Figures

Figures reproduced from arXiv: 2606.23669 by Hanzhe Hong, Kaizhen Tan, Siru Tao.

Figure 1: Benchmark design and evaluation workflow. Each named OSM road segment is evaluated under city-only (L0), ...
Figure 2: Reference hierarchy validation. Bars report mean ...
Figure 3: Cross-city similarity of real references. The matrix ...
Figure 4: Qualitative segment-fidelity failures. Examples compare real reference images with outputs from six open-weight ...
Original abstract

Text-to-image models can generate visually plausible city streets, but whether their outputs correspond to a requested road segment rather than a generic city prior remains unclear. We introduce GeoFidelity-Bench, a reference-panel benchmark for segment-conditioned geographic fidelity in street-view generation. It contains 7,117 curated Mapillary images covering 109 named OpenStreetMap road segments in 25 cities across six continents. For each generated panel, the benchmark ranks the target reference panel against panels from the nearest segment in the same city, other segments in the same city, and segments from other cities, making local discrimination rather than absolute target similarity the primary test. We evaluate six open-weight text-to-image generators under city-only, street-and-neighborhood, and GPS-augmented prompts. Adding street and neighborhood names is associated with an increase of 5.5 percentage points in top-1 retrieval accuracy over city-only prompts, with a 95% confidence interval from 3.4 to 7.7 percentage points. However, the similarity margin between the target and the nearest segment in the same city remains near zero, indicating that local names improve broad local plausibility more than exact segment identity. Prompts that keep the city fixed but use incorrect street or neighborhood names further show that only part of the gain depends on the correct local names, while appending raw GPS coordinates as ordinary text yields no statistically clear additional benefit. Held-out real-image queries successfully recover segment identity, showing that the curated references contain usable segment-level signal. GeoFidelity-Bench thus reveals a persistent gap between city- or neighborhood-plausible street-view generation and faithful generation for a specific road segment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces GeoFidelity-Bench, a reference-panel benchmark with 7,117 curated Mapillary images spanning 109 named OpenStreetMap road segments across 25 cities on six continents. It evaluates six open-weight text-to-image models under city-only, street-and-neighborhood, and GPS-augmented prompts by ranking each generated panel against the target reference, the nearest same-city segment, other same-city segments, and other-city segments. The central empirical claim is that adding street and neighborhood names raises top-1 retrieval accuracy by 5.5 percentage points (95% CI 3.4–7.7) relative to city-only prompts, yet the similarity margin between target and nearest same-city segment remains near zero; incorrect local names still confer partial gains while raw GPS text adds none. A held-out real-image validation confirms that the reference panels contain recoverable segment-level signal under the same protocol.

Significance. If the reported effect sizes and validation hold, the work supplies a concrete, multi-continent benchmark that quantifies the gap between city- or neighborhood-plausible street-view generation and segment-level fidelity. The inclusion of a real-image control, a confidence interval that excludes zero, and an explicit distinction between broad-plausibility and exact-identity gains strengthens the empirical contribution and provides a reproducible testbed for future conditional-generation research.

minor comments (3)
  1. [§4] §4 (evaluation protocol): the exact embedding model, similarity metric, and aggregation rule used for the reported top-1 accuracy and margins are not stated explicitly; adding one sentence or a short pseudocode block would remove ambiguity without altering the central claim.
  2. [§3] Table 1 or §3: the abstract refers to the six evaluated models only as open-weight text-to-image generators; listing their names, exact checkpoints, and parameter counts in the main text or a table would improve reproducibility.
  3. [Results] The manuscript reports a 95% CI but does not indicate whether the interval accounts for multiple comparisons across prompt conditions; a brief note on the statistical procedure would be helpful.
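
To illustrate the kind of procedure comment 3 asks the authors to document, here is a sketch of a paired percentile bootstrap for the difference in top-1 accuracy between two prompt conditions. It is not the paper's method, which is not described on this page; the resampling unit, the function name, and the 0/1 hit vectors are assumptions, and the data are synthetic.

```python
import random


def paired_bootstrap_ci(hits_a, hits_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(b) - accuracy(a), in points.

    hits_a, hits_b: 0/1 top-1 outcomes for the same evaluation units
    under two prompt conditions (paired by position).
    """
    if len(hits_a) != len(hits_b):
        raise ValueError("conditions must be paired unit by unit")
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = [b - a for a, b in zip(hits_a, hits_b)]
    point = 100.0 * sum(diffs) / n
    boots = []
    for _ in range(n_boot):
        # Resample evaluation units with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boots.append(100.0 * sum(sample) / n)
    boots.sort()
    lower = boots[int((alpha / 2) * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Synthetic outcomes for 100 units: 10 hits with city-only prompts,
# 16 hits once local names are added (the same 10 plus 6 more).
hits_city_only = [0] * 90 + [1] * 10
hits_local_names = [0] * 84 + [1] * 16
point, lower, upper = paired_bootstrap_ci(hits_city_only, hits_local_names)
print(f"difference: {point:.1f} points, 95% CI [{lower:.1f}, {upper:.1f}]")
```

A simple way to account for several prompt-condition comparisons is a Bonferroni adjustment: pass `alpha` divided by the number of comparisons, which widens each interval.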

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their detailed summary of the work, positive assessment of its significance, and recommendation for minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark

full rationale

This paper introduces an empirical benchmark (GeoFidelity-Bench) consisting of curated Mapillary reference panels and a retrieval-ranking protocol. All reported results—top-1 accuracy gains of 5.5 pp with CI, near-zero target-vs-nearest margins, and held-out real-image validation—are direct statistical outputs of applying the fixed protocol to model generations under varying prompts. No derivations, fitted parameters renamed as predictions, self-citation chains, or ansatzes appear; the central claims follow from the described data collection and ranking procedure without reduction to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the Mapillary-sourced reference panels carry recoverable segment identity and that the ranking protocol, which compares the target against three tiers of distractor panels, measures geographic fidelity rather than visual style or metadata artifacts. No free parameters or invented entities are described.

axioms (2)
  • domain assumption: The curated 7,117-image reference set contains usable segment-level signal recoverable by real-image queries.
    Stated in the abstract as the control result that validates the benchmark.
  • domain assumption: Top-1 retrieval accuracy against nearest same-city, other same-city, and other-city panels isolates geographic fidelity.
    Core of the evaluation design described in the abstract.

