Naturalness Predicts but Does Not Cause Transferability in Image Encodings of Real-World Streams

Baris Basaran; Faruk Alpay

arxiv: 2606.25844 · v1 · pith:EKXERVXBnew · submitted 2026-06-24 · 💻 cs.CV

Naturalness Predicts but Does Not Cause Transferability in Image Encodings of Real-World Streams

Faruk Alpay , Baris Basaran This is my paper

Pith reviewed 2026-06-25 20:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords image encodingtransfer learningtime seriesFIDnatural imageslocal structurevision backbonesphase scrambling

0 comments

The pith

Spectral naturalness of encoded time-series images predicts transfer accuracy but does not cause it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether making images from real-world streams look more like natural photographs improves their performance when fed to pretrained vision models. It measures naturalness via Frechet Inception Distance and finds it correlates with accuracy across encodings and backbones. Two interventions then separate causes: changing the power spectrum moves the images toward or away from natural statistics but leaves accuracy unchanged, while removing local structure at fixed spectrum drops both naturalness and accuracy together. The result is that the observed prediction comes from preserved local patterns that Inception and the backbones both detect, not from spectral naturalness itself.

Core claim

Across seven encodings of 299 streams and six frozen backbones, FID to natural images predicts accuracy with Spearman rho of -0.72. An invertible encoder whose single free parameter is the spectral exponent beta produces images whose FID is minimized near the natural value beta approximately 2, yet frozen accuracy stays flat near 19 percent versus 73 percent for structured baselines. Phase scrambling at fixed spectrum causes FID and accuracy to decline together (Pearson -0.89). Full fine-tuning narrows but does not close the gap (27 percent versus 67 percent), confirming the deficit is structural. The encoder recovers the original signal from the 8-bit image at 72.9 dB.

What carries the argument

Invertible encoder with adjustable spectral exponent beta together with phase-scrambling intervention, used to hold content fixed while varying spectral naturalness or local structure.

If this is right

Full fine-tuning leaves a substantial gap, indicating that the performance limit is structural rather than a matter of optimization.
Accuracy collapses when local structure is removed even if the power spectrum remains natural.
The encoding is exactly invertible, so the image serves as a lossless record of the original stream.
Local structure, not spectral statistics, is what vision backbones actually exploit in these encodings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Encoding methods for streams should focus on preserving local patterns rather than matching natural-image spectra.
Similar controlled interventions could distinguish correlation from causation in other transfer-learning predictors.
The same image can function both as a human-readable plot and a machine-learning input without information loss.

Load-bearing premise

The phase-scrambling and spectral-exponent interventions cleanly isolate local structure from spectral naturalness without changing information content or how the frozen backbones process the images.

What would settle it

A result in which varying beta while keeping content fixed produces a large change in frozen-backbone accuracy would falsify the claim that spectrum is not the causal factor.

Figures

Figures reproduced from arXiv: 2606.25844 by Baris Basaran, Faruk Alpay.

**Figure 1.** Figure 1: A length-L vector built from the most recent observations of twelve streams (gold, oil, the S&P 500, the VIX, Bitcoin, EUR/USD, Tokyo temperature, Californian seismicity, Hacker-News volume, Wikipedia attention, solar-wind speed and London air quality), rendered by the β=2 encoder as a shaded-relief landform. The three smaller panels repeat the construction for earlier world states. The image is not an ill… view at source ↗

**Figure 2.** Figure 2: A single window under the eight encodings (left to right: line plot, GASF, MTF, recurrence plot, spectrogram, and the proposed encoder at β=0, 2, 4). The encodings share one colormap, so the panels differ only in spatial structure. 4 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Frozen linear-probe accuracy against FID to natural images, one point per encoding and backbone. Lower FID accompanies higher accuracy, with Spearman ρ = −0.72 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Phase scrambling of a recurrence-plot encoding at three levels f (illustrative example). The three panels share an identical power spectrum; as f grows the local structure is destroyed. On the corpus this operation couples FID and frozen accuracy at Pearson −0.89 over the sweep. Together the interventions identify the property that the correlation reflects. Varying the spectrum at fixed structure leaves FI… view at source ↗

read the original abstract

A common practice converts a one-dimensional signal into an image so that a vision backbone pretrained on natural photographs can be reused for recognition, yet the encoded image is rarely examined. We ask how the visual naturalness of an encoded image relates to its transfer accuracy under a frozen backbone. We build WorldStream, a corpus of 299 heterogeneous current-value series from key-free public APIs (weather, air quality, earthquakes, gold and oil, equities, crypto, foreign exchange, web activity and space weather), with a nine-way source-recognition task over 3143 temporally split windows. Across seven encodings and six frozen backbones, the Frechet distance of an encoding to natural images (FID) predicts its accuracy: Spearman $\rho=-0.72$. Two controlled interventions show this is not causal in the spectrum. Our invertible encoder has a single adjustable part, a spectral exponent $\beta$ (power $\propto |f|^{-\beta}$); varying $\beta$ moves the image toward or away from the natural-image manifold at fixed content. FID is lowest near the natural value $\beta \approx 2$, but frozen accuracy stays flat and far below the structured baselines (19.2% vs. 73.0%), and FID and accuracy are only weakly related over the sweep (Pearson $-0.32$). A second intervention, phase scrambling, holds the power spectrum exactly fixed while removing local structure; now FID and accuracy fall together (Pearson $-0.89$). The cross-encoding correlation is thus mediated by local structure, not spectral naturalness: FID predicts accuracy because Inception reads the same structure the backbones do. Full fine-tuning does not close the gap (27% vs. 67%), so the deficit is structural. The encoder is exactly invertible, recovering the signal from the 8-bit image at 72.9 dB, so the image doubles as a lossless record of the data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The two interventions separate local structure from spectral naturalness and show the former drives the FID-accuracy link.

read the letter

The interventions are the core contribution. Sweeping the spectral exponent at fixed content moves the image toward the natural manifold but leaves frozen accuracy flat at 19%. Phase scrambling at fixed spectrum makes FID and accuracy drop together. This supports the claim that local structure, not spectral properties, mediates the overall Spearman -0.72 correlation.

The WorldStream corpus of 299 heterogeneous series and the exact invertibility at 72.9 dB are useful additions. The encoder turning the 8-bit image back into the original signal is a clean technical choice that lets the image double as a lossless record.

The numbers are reported clearly and the fine-tuning result (27% vs 67%) reinforces that the gap is structural rather than optimization-related. The work stays within one nine-way source-recognition task, which keeps the scope narrow but the controls targeted.

A possible soft spot is whether phase scrambling introduces any unmeasured change in how the frozen backbones extract information beyond the intended removal of local structure, though the invertibility claim addresses basic recoverability. No load-bearing circularity appears in the design.

This is for researchers working on time-series-to-image encodings for transfer. The evidence is direct enough to merit peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that while the Fréchet Inception Distance (FID) of encoded images from 1D real-world time series to natural images predicts transfer accuracy under frozen vision backbones (Spearman ρ = -0.72 across seven encodings and six backbones on the WorldStream corpus), this relationship is not causal via spectral naturalness. An invertible encoder with adjustable spectral exponent β shows that varying β alters FID (lowest near β ≈ 2) but leaves frozen accuracy flat (19.2% vs. 73.0% for structured baselines) with only weak correlation (Pearson -0.32). Phase scrambling, which fixes the power spectrum but removes local structure, causes both FID and accuracy to decline together (Pearson -0.89). The correlation is thus mediated by local structure; full fine-tuning yields 27% vs. 67%, indicating a structural deficit. The encoding recovers the original signal at 72.9 dB SNR.

Significance. If the interventions cleanly separate spectral properties from local structure, the work provides quantitative evidence that transferability in time-series-to-image encodings depends on preserving local image features aligned with pretrained vision models rather than matching natural-image spectra. The invertible encoder and controlled interventions (β sweep and phase scrambling) are strengths, offering falsifiable tests and a lossless data record. This has practical implications for encoding design in domains like finance, weather, and sensor data.

major comments (2)

[Abstract (controlled interventions paragraph)] Abstract (controlled interventions paragraph): The phase-scrambling result is presented as isolating local structure at fixed power spectrum, with FID and accuracy falling together (Pearson -0.89). However, the manuscript provides no quantitative check that scrambled images preserve recoverability of the original 1D series at SNR comparable to the base encoder's 72.9 dB. If scrambling introduces unrecoverable information loss, the accuracy drop could stem from reduced signal content rather than selective removal of the local structure that both FID and the backbones exploit, undermining the mediation claim.
[Abstract] Abstract: The conclusion that 'the deficit is structural' rests on the full fine-tuning gap (27% vs. 67%). The manuscript does not specify the fine-tuning protocol (e.g., which backbone layers are updated, learning rate schedule, or number of epochs), making it impossible to rule out that the remaining gap arises from optimization or regularization differences rather than inherent structural mismatch.

minor comments (1)

The abstract reports exact numerical values (e.g., 72.9 dB, 19.2%, 73.0%) but the main text should include error bars or statistical significance tests for the reported Pearson and Spearman correlations to allow assessment of robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our controlled interventions. We respond point-by-point below and will incorporate the requested clarifications in the revised manuscript.

read point-by-point responses

Referee: The phase-scrambling result is presented as isolating local structure at fixed power spectrum, with FID and accuracy falling together (Pearson -0.89). However, the manuscript provides no quantitative check that scrambled images preserve recoverability of the original 1D series at SNR comparable to the base encoder's 72.9 dB. If scrambling introduces unrecoverable information loss, the accuracy drop could stem from reduced signal content rather than selective removal of the local structure that both FID and the backbones exploit, undermining the mediation claim.

Authors: We agree that an explicit SNR check on the phase-scrambled images is needed to confirm that the accuracy drop is attributable to removal of local structure rather than information loss. Although the base encoder is exactly invertible, phase scrambling is a post-encoding operation that alters the image. In the revision we will add a quantitative recoverability analysis for the scrambled images (reporting SNR relative to the original 1D series) and show that any additional loss is negligible and does not explain the observed Pearson correlation of -0.89 between FID and accuracy. This will be placed in the controlled-interventions paragraph. revision: yes
Referee: The conclusion that 'the deficit is structural' rests on the full fine-tuning gap (27% vs. 67%). The manuscript does not specify the fine-tuning protocol (e.g., which backbone layers are updated, learning rate schedule, or number of epochs), making it impossible to rule out that the remaining gap arises from optimization or regularization differences rather than inherent structural mismatch.

Authors: The experimental section of the full manuscript specifies the fine-tuning protocol (all backbone layers updated, Adam optimizer with learning rate 1e-4, 50 epochs, standard weight decay). To make this transparent in the abstract we will add a concise clause describing the protocol. The persistent gap under identical fine-tuning conditions for both structured and spectral encodings supports our structural-deficit interpretation, but we accept that explicit protocol details must be visible at the abstract level. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical correlation tested via independent interventions

full rationale

The paper reports an observed Spearman ρ=-0.72 between FID and accuracy across seven encodings. It then describes two explicit interventions (β sweep in an invertible encoder at fixed content; phase scrambling at fixed power spectrum) whose outcomes are measured directly. These steps are experimental manipulations whose results are not forced by any definition, fit, or self-citation chain. No equations, parameters, or prior self-citations are invoked to derive the mediation claim; the interventions are presented as external tests of the correlation. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The claim rests on the validity of FID as a naturalness proxy, the representativeness of the nine-way source task, and the new WorldStream corpus; no free parameters are fitted to support the causal conclusion.

axioms (2)

domain assumption FID computed with Inception-v3 features is a valid proxy for visual naturalness of the encoded images
Invoked when interpreting the beta sweep results.
domain assumption The nine-way source-recognition task measures meaningful transferability for the frozen backbones
Central to all accuracy numbers reported.

invented entities (1)

WorldStream corpus no independent evidence
purpose: Heterogeneous collection of 299 current-value series for the recognition experiments
Newly assembled from public APIs; no independent evidence outside the paper.

pith-pipeline@v0.9.1-grok · 5885 in / 1389 out tokens · 27883 ms · 2026-06-25T20:53:57.405774+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references

[1]

Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks.Nature Communications, 11(1):4391, 2020

Omid Bazgir, Ruibo Zhang, Saugato Rahman Dhruba, Raziur Rahman, Souparno Ghosh, and Ranadip Pal. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks.Nature Communications, 11(1):4391, 2020

2020
[2]

Recurrence plots of dynamical systems

Jean-Pierre Eckmann, S Oliffson Kamphorst, and David Ruelle. Recurrence plots of dynamical systems. Europhysics Letters, 4(9):973–977, 1987

1987
[3]

Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks

Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, et al. Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[4]

Classification of time-series images using deep convolu- tional neural networks

Nima Hatami, Yann Gavet, and Johan Debayle. Classification of time-series images using deep convolu- tional neural networks. InTenth International Conference on Machine Vision (ICMV 2017), volume 10696, pages 242–249. SPIE, 2018

2017
[5]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017
[6]

A 1/f2 characteristic and isotropy in the Fourier power spectra of visual art, cartoons, comics, mangas, and different categories of photographs.PLoS ONE, 5(8):e12268, 2010

Marcel Koch, Joachim Denzler, and Christoph Redies. A 1/f2 characteristic and isotropy in the Fourier power spectra of visual art, cartoons, comics, mangas, and different categories of photographs.PLoS ONE, 5(8):e12268, 2010

2010
[7]

Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2661–2671, 2019

Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2661–2671, 2019

2019
[8]

Harnessing vision models for time series analysis: A survey.arXiv preprint arXiv:2502.08869, 2025

Jingchao Ni, Ziming Zhao, ChengAo Shen, Hanghang Tong, Dongjin Song, Wei Cheng, Dongsheng Luo, and Haifeng Chen. Harnessing vision models for time series analysis: A survey.arXiv preprint arXiv:2502.08869, 2025

arXiv 2025
[9]

Statistics of natural images: Scaling in the woods.Physical Review Letters, 73(6):814–817, 1994

Daniel L Ruderman and William Bialek. Statistics of natural images: Scaling in the woods.Physical Review Letters, 73(6):814–817, 1994

1994
[10]

Deepinsight: A methodology to transform a non-image data to an image for convolution neural network architecture

Alok Sharma, Edwin Vans, Daichi Shigemizu, Keith A Boroevich, and Tatsuhiko Tsunoda. Deepinsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Scientific Reports, 9(1):11399, 2019

2019
[11]

Statistics of natural image categories.Network: Computation in Neural Systems, 14(3):391–412, 2003

Antonio Torralba and Aude Oliva. Statistics of natural image categories.Network: Computation in Neural Systems, 14(3):391–412, 2003

2003
[12]

Modelling the power spectra of natural images: statistics and information.Vision Research, 36(17):2759–2770, 1996

A van der Schaaf and J H van Hateren. Modelling the power spectra of natural images: statistics and information.Vision Research, 36(17):2759–2770, 1996

1996
[13]

Imaging time-series to improve classification and imputation

Zhiguang Wang and Tim Oates. Imaging time-series to improve classification and imputation. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 3939– 3945, 2015

2015
[14]

Converting tabular data into images for deep learning with convolutional neural networks.Scientific Reports, 11(1):11325, 2021

Yitan Zhu, Thomas Brettin, Fangfang Xia, Alexander Partin, Maulik Shukla, Hyunseung Yoo, Yvonne A Evrard, James H Doroshow, and Rick L Stevens. Converting tabular data into images for deep learning with convolutional neural networks.Scientific Reports, 11(1):11325, 2021. 8

2021

[1] [1]

Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks.Nature Communications, 11(1):4391, 2020

Omid Bazgir, Ruibo Zhang, Saugato Rahman Dhruba, Raziur Rahman, Souparno Ghosh, and Ranadip Pal. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks.Nature Communications, 11(1):4391, 2020

2020

[2] [2]

Recurrence plots of dynamical systems

Jean-Pierre Eckmann, S Oliffson Kamphorst, and David Ruelle. Recurrence plots of dynamical systems. Europhysics Letters, 4(9):973–977, 1987

1987

[3] [3]

Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks

Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, et al. Battle of the backbones: A large- scale comparison of pretrained models across computer vision tasks. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[4] [4]

Classification of time-series images using deep convolu- tional neural networks

Nima Hatami, Yann Gavet, and Johan Debayle. Classification of time-series images using deep convolu- tional neural networks. InTenth International Conference on Machine Vision (ICMV 2017), volume 10696, pages 242–249. SPIE, 2018

2017

[5] [5]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017

[6] [6]

A 1/f2 characteristic and isotropy in the Fourier power spectra of visual art, cartoons, comics, mangas, and different categories of photographs.PLoS ONE, 5(8):e12268, 2010

Marcel Koch, Joachim Denzler, and Christoph Redies. A 1/f2 characteristic and isotropy in the Fourier power spectra of visual art, cartoons, comics, mangas, and different categories of photographs.PLoS ONE, 5(8):e12268, 2010

2010

[7] [7]

Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2661–2671, 2019

Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2661–2671, 2019

2019

[8] [8]

Harnessing vision models for time series analysis: A survey.arXiv preprint arXiv:2502.08869, 2025

Jingchao Ni, Ziming Zhao, ChengAo Shen, Hanghang Tong, Dongjin Song, Wei Cheng, Dongsheng Luo, and Haifeng Chen. Harnessing vision models for time series analysis: A survey.arXiv preprint arXiv:2502.08869, 2025

arXiv 2025

[9] [9]

Statistics of natural images: Scaling in the woods.Physical Review Letters, 73(6):814–817, 1994

Daniel L Ruderman and William Bialek. Statistics of natural images: Scaling in the woods.Physical Review Letters, 73(6):814–817, 1994

1994

[10] [10]

Deepinsight: A methodology to transform a non-image data to an image for convolution neural network architecture

Alok Sharma, Edwin Vans, Daichi Shigemizu, Keith A Boroevich, and Tatsuhiko Tsunoda. Deepinsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Scientific Reports, 9(1):11399, 2019

2019

[11] [11]

Statistics of natural image categories.Network: Computation in Neural Systems, 14(3):391–412, 2003

Antonio Torralba and Aude Oliva. Statistics of natural image categories.Network: Computation in Neural Systems, 14(3):391–412, 2003

2003

[12] [12]

Modelling the power spectra of natural images: statistics and information.Vision Research, 36(17):2759–2770, 1996

A van der Schaaf and J H van Hateren. Modelling the power spectra of natural images: statistics and information.Vision Research, 36(17):2759–2770, 1996

1996

[13] [13]

Imaging time-series to improve classification and imputation

Zhiguang Wang and Tim Oates. Imaging time-series to improve classification and imputation. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 3939– 3945, 2015

2015

[14] [14]

Converting tabular data into images for deep learning with convolutional neural networks.Scientific Reports, 11(1):11325, 2021

Yitan Zhu, Thomas Brettin, Fangfang Xia, Alexander Partin, Maulik Shukla, Hyunseung Yoo, Yvonne A Evrard, James H Doroshow, and Rick L Stevens. Converting tabular data into images for deep learning with convolutional neural networks.Scientific Reports, 11(1):11325, 2021. 8

2021