
arxiv: 2606.01192 · v1 · pith:ROQSU2WEnew · submitted 2026-05-31 · 💻 cs.CV

PairedGTA: Generating Driving Datasets for Controlled Photometric Shift Analysis

Pith reviewed 2026-06-28 17:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords paired datasets · photometric shifts · semantic segmentation · game engine · autonomous driving · weather conditions · illumination changes · synthetic data

The pith

A GTA-based framework produces perfectly paired driving images that differ only in weather and lighting to isolate photometric effects on perception models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a generation method that uses the GTA game engine to render multiple versions of the exact same driving scene, with identical geometry, camera position, and object placements but altered illumination or weather. Real datasets rarely supply such pairs because traffic and viewpoint change between captures, so model errors cannot be cleanly traced to photometric factors alone. The new approach samples locations, places dynamic objects procedurally, and renders pixel-aligned outputs under controlled adverse conditions. This setup lets researchers measure how semantic segmentation models degrade when only lighting or weather changes, rather than when geometry or semantics also shift.

Core claim

By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.

What carries the argument

Procedural instantiation and rendering inside the GTA engine that changes only photometric parameters while fixing all other scene elements.
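
The "fix everything, vary only the condition" idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `SceneDescriptor`, `build_scene`, `render`, and the condition strings are hypothetical names, and `render` is a stand-in that returns a record instead of calling the game engine.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SceneDescriptor:
    """Everything that must stay fixed across the paired renders of one scene."""
    location_id: int
    camera_pose: Tuple[float, float, float]
    objects: Tuple[Tuple[str, float, float], ...]  # (kind, x, y) per dynamic object


def build_scene(location_id: int, n_objects: int, seed: int) -> SceneDescriptor:
    """Sample the camera pose and the dynamic objects once per location."""
    rng = random.Random(seed + location_id)
    pose = (rng.uniform(-100, 100), rng.uniform(-100, 100), rng.uniform(0, 360))
    objects = tuple(
        (rng.choice(["car", "pedestrian"]), rng.uniform(-20, 20), rng.uniform(2, 50))
        for _ in range(n_objects)
    )
    return SceneDescriptor(location_id, pose, objects)


def render(scene: SceneDescriptor, condition: str) -> Dict[str, object]:
    """Hypothetical stand-in for the engine call; returns a record, not pixels."""
    return {"scene": scene, "condition": condition}


def generate_pairs(location_ids: List[int], conditions: List[str], seed: int = 0):
    """Return one group of renders per location; only the condition varies."""
    dataset = []
    for loc in location_ids:
        scene = build_scene(loc, n_objects=5, seed=seed)  # built once, reused below
        dataset.append([render(scene, c) for c in conditions])
    return dataset


# Condition labels are illustrative, not the paper's identifiers.
pairs = generate_pairs([0, 1, 2], ["day_sunny", "sunset_sunny", "night_sunny"])
```

The design point is that the scene descriptor is constructed before the loop over conditions, so geometry, pose, and object placement cannot drift between renders of the same location.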

If this is right

  • Semantic segmentation output changes can be attributed more directly to photometric shifts.
  • Systematic evaluation of perception models becomes possible across many adverse conditions without confounding geometric or semantic variation.
  • Pixel-aligned image sets allow controlled measurement of model robustness to illumination and weather alone.
  • The generated data supports analysis that separates photometric from layout-related sources of error in autonomous driving perception.
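
Because the pairs are pixel-aligned, the prediction on a clean image can be compared directly with the prediction on its adverse counterpart. The sketch below assumes the clean-condition prediction is used as the reference map; the page does not state the paper's exact protocol, and the function names and toy label maps are hypothetical.

```python
from typing import Dict, List


def per_class_iou(reference: List[int], prediction: List[int],
                  num_classes: int) -> Dict[int, float]:
    """IoU per class between two flattened, pixel-aligned label maps."""
    if len(reference) != len(prediction):
        raise ValueError("paired maps must have the same number of pixels")
    ious = {}
    for c in range(num_classes):
        inter = sum(1 for r, p in zip(reference, prediction) if r == c and p == c)
        union = sum(1 for r, p in zip(reference, prediction) if r == c or p == c)
        if union > 0:  # skip classes absent from both maps
            ious[c] = inter / union
    return ious


def mean_iou(ious: Dict[int, float]) -> float:
    """Average IoU over the classes that appear in at least one map."""
    return sum(ious.values()) / len(ious)


# Toy flattened label maps (placeholders, not data from the paper).
clean_pred = [0, 0, 1, 1, 2, 2]
night_pred = [0, 0, 1, 2, 2, 2]

ious = per_class_iou(clean_pred, night_pred, num_classes=3)
score = mean_iou(ious)
```

With aligned pairs, any drop in this score comes from the change in condition, since the two maps describe the same geometry.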

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same paired-generation technique could be applied to other perception tasks such as object detection or optical flow to check whether photometric sensitivity is task-dependent.
  • If the synthetic shifts prove close enough to real ones, the datasets could serve as a cheap way to augment scarce real paired data for robustness training.
  • The framework implicitly suggests that game-engine control over rendering parameters offers a route to test camera-invariant features without collecting new physical footage.

Load-bearing premise

The assumption that photometric shifts produced inside the GTA engine affect perception models in ways that represent real-world camera behavior.

What would settle it

A side-by-side test in which the same segmentation model is run on both PairedGTA pairs and any real-world driving pairs captured under matching condition changes; if degradation patterns diverge sharply, the claim that the synthetic pairs isolate representative photometric effects would be weakened.
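
One way to make "diverge sharply" operational is to compare per-class degradation on the synthetic pairs with per-class degradation on the real pairs and flag classes whose gap exceeds a tolerance. This is a hedged sketch: the tolerance, the function name, and the numbers are placeholders chosen for illustration, not measurements or a criterion from the paper.

```python
from typing import Dict


def divergent_classes(synthetic_drop: Dict[str, float],
                      real_drop: Dict[str, float],
                      tolerance: float) -> Dict[str, float]:
    """Return real-minus-synthetic degradation gaps larger than `tolerance`.

    Only classes present in both dictionaries are compared.
    """
    shared = sorted(set(synthetic_drop) & set(real_drop))
    gaps = {c: real_drop[c] - synthetic_drop[c] for c in shared}
    return {c: g for c, g in gaps.items() if abs(g) > tolerance}


# Placeholder per-class IoU drops (clean minus adverse); not results.
synthetic = {"road": 0.05, "person": 0.30, "sky": 0.10}
real = {"road": 0.06, "person": 0.55, "car": 0.20}

flagged = divergent_classes(synthetic, real, tolerance=0.10)
```

An empty result under a reasonable tolerance would support the premise; a long list of flagged classes would weaken it.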

Figures

Figures reproduced from arXiv: 2606.01192 by Alessandro Biondi, Andrea Chianese, Giorgio Buttazzo, Giulio Rossolini, Marco Cococcioni.

Figure 1: Examples of paired images produced by the proposed framework. Dynamic objects are generated once …
Figure 2: Example of a daytime sunny image and a corresponding sunset sunny variant generated by the framework, highlighting object poses and placement constraints.
… set of photometric conditions considered in the dataset. Each condition c_i specifies environmental rendering parameters, such as time of day, illumination, rain, fog, or cloud coverage. For each scene k, the framework first constructs an internal scene d…
Figure 3: Communication pipeline between the proposed framework (blue block), other third-party software tools, and the GTA game engine.
The interaction with the game engine is mediated by VPilot [37], which provides a Python interface for client-side communication and scenario orchestration. VPilot communicates with the DeepGTAV server, exposed by the game plugin through a TCP connection on port 8000, by default. T…
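
Since the Figure 3 text says the DeepGTAV server listens on TCP port 8000 by default, a client can check that the port is reachable before launching a generation run. The sketch uses only the Python standard library and does not speak the DeepGTAV message protocol; the helper name, host, and timeout are assumptions.

```python
import socket


def server_reachable(host: str, port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A typical use would be `server_reachable("127.0.0.1")` on the machine running the game, with the host being whatever address the plugin is exposed on.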
Figure 4: Dataset-level distribution of pseudo-labels generated with SegFormer-B5. The bottom panels show an example of a reference image and the segmentation map.
Dataset. The evaluation is conducted on a dataset generated using the pipeline described in Section 3. The dataset contains more than 100 unique spatial locations. For each location k, we generate nine spatially aligned images corresponding to all combi…
Figure 5: Illustration of photometric shifts across three sunny scenarios under different illumination conditions: day, sunset, and night. In …
Figure 6: Consistency analysis on our dataset and ACDC [30], using SegFormer (left) and MaskFormer (right). Frames are ranked according to the number of high-confidence predictions in clean scenarios for each class. Higher ρ_valid indicates stronger preservation of the clean-to-adverse ranking.
4.4 Cost analysis of the generation steps. This section reports a computational cost analysis for generating paired images wi…
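
The ranking comparison in the Figure 6 caption can be illustrated with a rank correlation. The sketch below assumes ρ is a Spearman coefficient computed, for one class, between per-frame counts of high-confidence pixels in the clean and adverse renders; the page does not define ρ_valid, so this is an interpretation, and the counts are placeholders.

```python
from math import sqrt
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # ranking is undefined when all values are tied
    return cov / sqrt(vx * vy)


# Placeholder counts of high-confidence "person" pixels per frame.
clean_counts = [120, 80, 300, 10]
adverse_counts = [90, 95, 210, 5]
rho = spearman(clean_counts, adverse_counts)
```

A value near 1 means frames keep their relative order when the condition changes; lower values mean the adverse condition reshuffles which frames the model is confident about.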
Figure 7: Low-consistency examples for the person class from our dataset, on the left, and ACDC, on the right, with segmentation outputs produced by SegFormer-B5.
Figure 8: Computational cost analysis of the proposed framework. Left: execution time over 100 scenarios for the pipeline, external overhead, and total process. Right: average pipeline-time decomposition across phases.
Plot values: pipeline 29.8 s, overhead 4.0 s, total 33.8 s; Phase 1: 0.0 s (0.0%), Phase 2: 6.1 s (20.5%), Phase 3: 23.7 s (79.5%).
This enables the effect of photometric changes to be studied in isolation, reducing the confounding factors that typically affect real-world adverse-condition datasets. We used the generated data to e…
Original abstract

Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometry and semantic changes. Unfortunately, real-world datasets rarely provide images of the same scene under different environmental conditions, because, normally, camera pose, traffic, and locations of dynamic objects (vehicles, pedestrians, etc.) vary over time, thus yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high-fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces PairedGTA, a framework that uses GTA game engine APIs to generate perfectly paired driving-scene images under controlled changes in illumination and weather. The method procedurally instantiates dynamic objects at sampled locations and renders multiple pixel-aligned images while holding scene geometry, camera pose, and object identities/placements fixed, enabling analysis of semantic segmentation degradation attributable to photometric shifts rather than geometric or semantic confounders.

Significance. If the generated pairs function as described, the framework supplies a controlled testbed for isolating photometric effects on perception models, addressing a practical limitation of real-world driving datasets that rarely contain exact scene matches across conditions. The procedural instantiation step is a concrete engineering contribution that could support reproducible robustness studies in autonomous driving.

major comments (1)
  1. [Abstract] Abstract: the claim that the framework 'demonstrates' the benefit for semantic segmentation analysis is unsupported by any quantitative results, error metrics, or comparison to real data in the provided description, leaving the utility claim only partially substantiated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the framework's utility for controlled photometric analysis. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the framework 'demonstrates' the benefit for semantic segmentation analysis is unsupported by any quantitative results, error metrics, or comparison to real data in the provided description, leaving the utility claim only partially substantiated.

    Authors: We agree that the abstract, as a concise summary, does not itself contain quantitative metrics or direct comparisons. The manuscript body (Sections 4–5) presents the systematic analysis with mIoU and per-class degradation metrics across controlled photometric conditions on multiple segmentation models. To resolve the concern, we will revise the abstract wording to state that the benefit 'is demonstrated through systematic analysis' (removing any implication that metrics appear in the abstract) and will ensure the claim is fully supported by the experiments section. No comparison to real data is claimed or required, as the contribution is the controlled synthetic pairs themselves.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper describes a procedural framework for generating paired driving images via GTA engine APIs that control illumination/weather while fixing geometry, pose, and object placement. No equations, fitted parameters, predictions, or derivation chain exist in the provided text; the central mechanism is a direct engineering construction whose correctness does not reduce to self-definition or self-citation. External validity of the generated shifts is a separate question outside the pairing mechanism itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the fidelity of the game engine simulation and the assumption that procedural placement matches the statistical properties needed for driving scenarios.

axioms (1)
  • Domain assumption: The GTA game engine rendering produces photometric shifts whose impact on perception models is representative of real-world conditions.
    Invoked in the description of how the framework enables attribution of degradation to photometric shifts.

pith-pipeline@v0.9.1-grok · 5753 in / 1120 out tokens · 30255 ms · 2026-06-28T17:21:25.945344+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 4 canonical work pages · 3 internal anchors

  [1] A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool. Night-to-day image translation for retrieval-based localization. arXiv preprint arXiv:1809.09767, 2018.
  [2] S. Baik, S. Kim, and E. Kim. Weatherflux: Universal weather translation with diffusion models. ICLR, 2025.
  [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. A theory of learning from different domains. Machine Learning, 79(1–2):151–175, 2010.
  [4] M. Cao and R. Ramezani. Data generation using simulation technology to improve perception mechanism of autonomous vehicles, 2022.
  [5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  [6] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
  [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  [8] B. H. K. Czarnecki and S. Waslander. Precise synthetic image and lidar (presil) dataset for autonomous vehicle perception. Computer Vision and Pattern Recognition, arXiv:1905.00160, 2019.
  [9] G. D’Amico, F. Nesti, G. Rossolini, M. Marinoni, S. Sabina, and G. Buttazzo. Syndra: Synthetic dataset for railway applications. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 3437–3446, February 2025.
  [10] aitorzip: https://github.com/aitorzip/DeepGTAV
  [11] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. Carla: An open urban driving simulator. CoRL, 2017.
  [12] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  [13] B. Gella, H. Zhang, R. Upadhyay, T. Chang, M. Waliman, Y. Ba, A. Wong, and A. Kadambi. Weatherproof: A paired-dataset approach to semantic segmentation in adverse weather. arXiv preprint arXiv:2312.09534, 2023.
  [14] U. Gurbindo, A. Brando, J. Abella, and C. König. Object detection in adverse weather conditions for autonomous vehicles using instruct pix2pix. In 2025 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2025.
  [15] H. Ha, X. Jin, J. Kim, J. Liu, Z. Wang, K. D. Nguyen, A. Blume, N. Peng, K.-W. Chang, and H. Ji. Synthia: Novel concept design with affordance composition. CVPR, 2021.
  [16] Hendrycks and Dietterich. Benchmarking neural network robustness to common corruptions. In ICLR, 2019.
  [17] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649, 2016.
  [18] Y. Hong, H. Pan, W. Sun, and Y. Jia. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. In CVPR, 2021.
  [19] Rockstar Games: Policy on posting copyrighted Rockstar Games material: http://tinyurl.com/pjfoqo5r
  [20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
  [21] Y. Jia, L. Hoyer, S. Huang, T. Wang, L. Van Gool, K. Schindler, and A. Obukhov. Dginstyle: Domain-generalizable semantic segmentation with image diffusion models and stylized semantic control. In European Conference on Computer Vision (ECCV), 2024.
  [22] B. Kiefer, D. Ott, and A. Zell. Leveraging synthetic data in object detection on unmanned aerial vehicles, 2021.
  [23] M. Martinez, C. Sitawarin, K. Finch, L. Meincke, A. Yablonski, and A. Kornhauser. Beyond grand theft auto v for training, testing and enhancing deep learning in self driving cars, 2017.
  [24] C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bringmann, A. S. Ecker, M. Bethge, and W. Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. In NeurIPS Workshop on Machine Learning for Autonomous Driving, 2019.
  [25] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
  [26] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  [27] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision (ECCV), pages 102–118, 2016.
  [28] C. Sakaridis, D. Dai, and L. Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126(9):973–992, 2018.
  [29] C. Sakaridis, D. Dai, and L. Van Gool. Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  [30] C. Sakaridis, D. Dai, and L. Van Gool. Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In ICCV, 2021.
  [31] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3752–3761, 2018.
  [32] Alexander Blade: http://www.dev-c.com/gtav/scripthookv/
  [33] T. Sun, M. Segu, J. Postels, Y. Wang, L. Van Gool, B. Schiele, F. Tombari, and F. Yu. Shift: A synthetic driving dataset for continuous multi-task domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21371–21382, 2022.
  [34] R. Taori et al. Measuring robustness to natural distribution shifts in image classification. In NeurIPS, 2020.
  [35] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
  [36] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  [37] aitorzip: https://github.com/aitorzip/VPilot
  [38] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  [39] J. Xu, E. Xie, X. Liu, W. Chen, D. Liang, and P. Luo. Pidnet: A real-time semantic segmentation network inspired from pid controller. In CVPR, 2023.
  [40] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  [41] O. Zendel, M. Murschitz, M. Zeilinger, D. Steininger, and C. Beleznai. Analyzing computer vision data - the good, the bad and the ugly. In CVPR Workshops, 2018.
  [42] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.