
arxiv: 2507.10236 · v2 · pith:HNLG54IYnew · submitted 2025-07-14 · 💻 cs.CV

Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

Pith reviewed 2026-05-21 23:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detection · real-world evaluation · social media images · design choices · low-level traces · high-level semantics · ITW-SM dataset · AUC improvement

The pith

Optimizing each design choice to propagate low-level traces and high-level semantics improves AI-generated image detection AUC by 26.87% under real-world conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the ITW-SM dataset of real and AI-generated images collected from major social media platforms to test detectors outside laboratory benchmarks. It systematically varies architecture, pre-trained latent spaces, training data volume, and preprocessing steps to measure their effects. Simple increases in model scale or data quantity do not reliably raise performance. Instead, the authors find that tuning the full pipeline so it can carry forward both low-level generation artifacts and high-level semantic content produces consistent gains. This yields an average 26.87 percent AUC lift across several existing detectors when evaluated on the challenging social-media images.

Core claim

By curating the ITW-SM dataset from major social media platforms and performing ablation studies on detector components, the authors establish that effective AI-generated image detection in uncontrolled environments requires a processing pipeline optimized to transmit and utilize both low-level forensic traces and high-level image semantics, rather than relying on larger models or more data alone; this optimization produces an average AUC improvement of 26.87 percent across multiple state-of-the-art approaches.
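The headline figure is an average of per-detector AUC changes. A minimal sketch of that bookkeeping follows, with AUC computed as the pairwise rank statistic (the fraction of generated/real pairs in which the generated image gets the higher score). The detector names and scores are hypothetical and not taken from the paper, and the sketch assumes the gain is an absolute difference in percentage points, which this page does not state.

```python
from statistics import mean


def auc(scores_real, scores_fake):
    """AUC as the probability that a generated image scores higher than a
    real one; ties count one half (the normalised Mann-Whitney U statistic)."""
    wins = 0.0
    for fake in scores_fake:
        for real in scores_real:
            if fake > real:
                wins += 1.0
            elif fake == real:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))


def average_auc_gain(baseline, optimized):
    """Mean AUC change over detectors, in percentage points (assumed absolute)."""
    return mean(100.0 * (optimized[name] - baseline[name]) for name in baseline)


# Hypothetical per-detector AUCs, for illustration only.
baseline = {"detector_a": 0.60, "detector_b": 0.70}
optimized = {"detector_a": 0.90, "detector_b": 0.80}

print(round(average_auc_gain(baseline, optimized), 2))
```

The quadratic loop is kept for readability; a rank-based implementation would be preferable for large test sets.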

What carries the argument

The optimized detection pipeline that propagates and analyzes both low-level traces and high-level image semantics.

If this is right

  • Naively scaling pre-training or adding more training data does not always improve detection performance.
  • Effective real-world detectors must balance low-level trace analysis with high-level semantic understanding.
  • The same optimizations improve performance across multiple existing state-of-the-art detection approaches.
  • These choices supply a practical roadmap for constructing more resilient detectors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dual-level optimization may help detectors for other generative media such as video or audio.
  • Test datasets will require regular refresh as new generative models appear.
  • Adding explicit semantic modules could further strengthen purely artifact-based detectors.
  • Benchmark-only evaluations are likely to overestimate real-world robustness.

Load-bearing premise

The ITW-SM dataset and the specific experimental conditions used are representative enough of real-world social media images and future AI generators for the observed gains to generalize.

What would settle it

A new test collection drawn from previously unseen social media platforms or generated by newer AI models shows no AUC improvement or a performance drop when the optimized pipeline is applied.

Figures

Figures reproduced from arXiv: 2507.10236 by Christos Koutlis, Despina Konstantinidou, Dimitrios Karageorgiou, Emmanouil Schinas, Olga Papadopoulou, Symeon Papadopoulos.

Figure 1: Framework illustrating the factors impacting expected performance and generalization in ...
Figure 2: Real (a-e) and generated (f-j) images from the ITW-SM Dataset.
Figure 3: Model performance (AUC) on benchmark datasets, as reported in original papers, and ...
Figure 4: Original and updated model performance (AUC)
Original abstract

As generative Artificial Intelligence (AI) advances, the realism of AI generated imagery has reached a threshold capable of deceiving even vigilant human observers. Yet, while current AI-generated Image Detection (AID) approaches perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world cases. To study this behavior we introduce the ITW-SM dataset, a curated collection of real and AI-generated images originating from major social media platforms. We employ it to analyze the effects of key design choices typically considered when building a detector, involving its architecture, pre-trained latent spaces, training data as well as pre-processing approaches. We indicate that naively scaling the pre-training stage or opting for more training data does not always lead to better detection performance. Instead, our work reveals that it is crucial to optimize each design choice to enable the processing pipeline to propagate and effectively analyze both low-level traces as well as high-level image semantics. Building on our findings, we achieve a substantial average improvement of 26.87% in AUC across multiple state-of-the-art detection approaches and under real-world conditions, providing a roadmap for developing more resilient detectors. Our assets are available on https://mever-team.github.io/itw-sm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the ITW-SM dataset of real and AI-generated images collected from major social media platforms. It systematically examines the impact of design choices in AI-generated image detectors—architecture, pre-trained latent spaces, training data, and pre-processing—and concludes that naive scaling of pre-training or data volume does not reliably improve performance. Instead, optimizing each choice to allow the pipeline to analyze both low-level forensic traces and high-level semantics produces a 26.87% average AUC gain across multiple state-of-the-art detectors under real-world conditions. The assets are released publicly.

Significance. If the reported gains prove robust, the work is significant for the field of multimedia forensics and computer vision. It supplies an empirical roadmap that prioritizes balanced feature propagation over simple scaling, introduces a new in-the-wild benchmark, and demonstrates concrete performance lifts on social-media imagery where existing detectors degrade. The public release of the dataset and code supports reproducibility and follow-on research.

major comments (3)
  1. §4 (Experimental Evaluation) and associated tables: The abstract and results claim a 26.87% average AUC improvement, yet the manuscript provides neither per-detector variance, number of random seeds, nor statistical significance tests (e.g., paired t-test or Wilcoxon test across runs). Without these, it is impossible to determine whether the reported gain exceeds experimental noise and therefore supports the central claim that the optimized pipeline is reliably superior.
  2. §3.1 (ITW-SM Dataset Construction): The generalization argument rests on ITW-SM being representative of unseen platforms and future generators. The curation details (exact generator versions, platform-specific compression pipelines, and any post-processing filters) are not sufficiently quantified. If the dataset inadvertently emphasizes particular artifacts, the observed low-level/high-level balance may be partly tuned to ITW-SM rather than reflecting a transferable principle.
  3. §4.3 (Ablation Studies): The paper states that naive scaling does not always help and that balancing low- and high-level cues is crucial, but the ablation tables do not isolate the marginal contribution of each optimized component (pre-processing, latent space, architecture) to the final 26.87% gain. This weakens the causal link between the design principle and the measured improvement.
minor comments (2)
  1. Figure 3 and §4.2: Axis labels and legend entries are too small for comfortable reading; increasing font size and adding a short caption explaining the color coding would improve clarity.
  2. Related Work section: Several recent works on social-media image forensics (post-2023) are missing; adding them would better situate the contribution.
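Major comment 1 asks for significance tests across runs. As a rough illustration, the sketch below runs an exact sign-flip permutation test on paired AUC differences. This is a stand-in chosen because it needs only the standard library; it is not the paired t-test or Wilcoxon test the referee names. The per-seed AUCs are hypothetical and not values from the paper, and pairing baseline and optimized runs by seed is an assumption.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(baseline_aucs, optimized_aucs):
    """Two-sided exact sign-flip permutation test on paired differences.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so every sign pattern is enumerated and the share of
    patterns whose |mean| is at least the observed |mean| is returned.
    """
    diffs = [o - b for b, o in zip(baseline_aucs, optimized_aucs)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for rounding
            extreme += 1
    return extreme / total


# Hypothetical AUCs over five seeds, for illustration only.
baseline_runs = [0.61, 0.63, 0.60, 0.62, 0.64]
optimized_runs = [0.85, 0.88, 0.86, 0.87, 0.89]

print(paired_sign_flip_pvalue(baseline_runs, optimized_runs))
```

With five paired runs that all favour one pipeline, this test cannot report a two-sided p-value below 2/32 = 0.0625, so more seeds (or pairing across detectors as well) would be needed to reach conventional thresholds.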

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment below and will incorporate revisions to strengthen the empirical support and clarity of our claims.

point-by-point responses
  1. Referee: §4 (Experimental Evaluation) and associated tables: The abstract and results claim a 26.87% average AUC improvement, yet the manuscript provides neither per-detector variance, number of random seeds, nor statistical significance tests (e.g., paired t-test or Wilcoxon test across runs). Without these, it is impossible to determine whether the reported gain exceeds experimental noise and therefore supports the central claim that the optimized pipeline is reliably superior.

    Authors: We agree that the absence of variance estimates and statistical tests limits the strength of the central claim. In the revised manuscript we will report AUC results averaged over five independent random seeds with standard deviations for each detector and include paired t-tests (or Wilcoxon signed-rank tests where appropriate) comparing the baseline and optimized pipelines. These additions will allow readers to assess whether the 26.87% average gain exceeds experimental variability. revision: yes

  2. Referee: §3.1 (ITW-SM Dataset Construction): The generalization argument rests on ITW-SM being representative of unseen platforms and future generators. The curation details (exact generator versions, platform-specific compression pipelines, and any post-processing filters) are not sufficiently quantified. If the dataset inadvertently emphasizes particular artifacts, the observed low-level/high-level balance may be partly tuned to ITW-SM rather than reflecting a transferable principle.

    Authors: We concur that additional quantitative details on curation are required to support claims of representativeness. In the revised §3.1 we will enumerate the specific generator versions and release dates used, document the exact JPEG quality factors and resizing pipelines applied by each platform, and describe the post-processing filters (e.g., duplicate removal, resolution thresholds). These expansions will clarify the artifact distribution and help readers evaluate the transferability of the low-/high-level balance principle. revision: yes

  3. Referee: §4.3 (Ablation Studies): The paper states that naive scaling does not always help and that balancing low- and high-level cues is crucial, but the ablation tables do not isolate the marginal contribution of each optimized component (pre-processing, latent space, architecture) to the final 26.87% gain. This weakens the causal link between the design principle and the measured improvement.

    Authors: We acknowledge that the current ablation tables do not isolate the incremental effect of each component. In the revised version we will add a set of incremental ablation experiments that successively introduce the optimized pre-processing, latent-space choice, and architecture, reporting the marginal AUC gain at each step. This will provide a clearer decomposition of how each design decision contributes to the overall 26.87% improvement. revision: yes
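The incremental ablation promised in response 3 amounts to differencing AUCs along a chain of configurations. A minimal sketch of that decomposition follows; the step labels mirror the three components named above, while the AUC values and the order in which components are added are hypothetical.

```python
def marginal_gains(steps):
    """Given (label, auc) pairs in the order components are introduced,
    return (label, gain over the previous step) for each added component."""
    gains = []
    for (_, prev_auc), (label, auc) in zip(steps, steps[1:]):
        gains.append((label, auc - prev_auc))
    return gains


# Hypothetical AUCs, for illustration only.
steps = [
    ("baseline", 0.60),
    ("+ pre-processing", 0.70),
    ("+ latent space", 0.78),
    ("+ architecture", 0.85),
]

for label, gain in marginal_gains(steps):
    print(f"{label}: {100 * gain:+.1f} points")
```

The marginal gains always sum to the total improvement over the baseline, but the share attributed to each component can depend on the order in which components are added, so reporting more than one ordering may be worthwhile when components interact.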

Circularity Check

0 steps flagged

No circularity: empirical ablation study with self-contained experimental results

full rationale

This paper is an empirical ablation study that introduces the ITW-SM dataset and measures the effects of architecture, pre-trained latent spaces, training data, and pre-processing choices on AI-generated image detection performance. The central result of a 26.87% average AUC improvement is obtained directly from experimental evaluations on the collected real-world images rather than from any mathematical derivation, first-principles prediction, or quantity that reduces to its own inputs by construction. No self-definitional patterns, fitted-input-called-predictions, or load-bearing self-citation chains appear in the reported analysis; the work remains self-contained against its own benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work to force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the newly collected social-media dataset and on the assumption that the tested design choices capture the dominant factors affecting real-world detector performance.

axioms (1)
  • domain assumption: The ITW-SM dataset is representative of real-world social media images and generators.
    Generalization of the 26.87% AUC improvement rests on this assumption.

pith-pipeline@v0.9.0 · 5769 in / 1237 out tokens · 38360 ms · 2026-05-21T23:27:32.111784+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings:
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Automated In-the-Wild Data Collection for Continual AI Generated Image Detection

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    An automated fact-check-based pipeline for in-the-wild AI image data, when mixed with generator data in continual learning, lets detectors adapt to new generators while avoiding forgetting and delivers 8-9% accuracy g...

  2. Boosting Robust AIGI Detection with LoRA-based Pairwise Training

    cs.CV · 2026-04 · unverdicted · novelty 4.0

    LoRA-based pairwise training with distortion and size simulations boosts robust AIGI detection under severe distortions, placing third in the NTIRE challenge.
