arxiv: 2605.21268 · v1 · pith:YW23WDVYnew · submitted 2026-05-20 · 💻 cs.CV

Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

Pith reviewed 2026-05-21 05:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords land use scene classification · remote sensing · vision transformers · convolutional neural networks · UC Merced dataset · EuroSAT dataset · deep learning comparison · scene classification accuracy

The pith

CNNs handle small land use datasets with local textures well while Vision Transformers capture global relationships when given enough data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares convolutional neural networks and vision transformers for classifying land use scenes in remote sensing images. It tests both approaches on the UC Merced and EuroSAT benchmarks to measure accuracy, precision, recall, F1-score, and computational demands. The central finding is that CNNs remain effective with limited training samples and strong local patterns, whereas vision transformers better model long-range spatial dependencies in complex scenes once sufficient data is supplied. These differences matter for applications such as environmental monitoring and urban planning, where the amount of labeled imagery often varies. The study concludes by noting that vision transformers generally demand more computation and larger datasets to reach their best results.

Core claim

Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance.
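
The four classification metrics the study reports can all be derived from counts of (true label, predicted label) pairs. The following is a minimal sketch, not the paper's code: the function name `classification_metrics` and the toy labels are illustrative, and macro-averaging across classes is an assumption, since the abstract does not say how per-class scores are aggregated.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Illustrative helper (hypothetical name). Macro-averaging is an
    assumption; the paper's aggregation method is not stated.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    classes = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true label, predicted label) -> count
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(pairs[(t, c)] for t in classes if t != c)  # predicted c, was not c
        fn = sum(pairs[(c, p)] for p in classes if p != c)  # was c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(pairs[(c, c)] for c in classes) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels, invented for illustration only.
truth = ["forest", "forest", "river", "river"]
predicted = ["forest", "river", "river", "river"]
print(classification_metrics(truth, predicted))
```

Reporting the per-class values alongside the macro averages would show whether an architecture's advantage is concentrated in a few scene classes.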

What carries the argument

Side-by-side evaluation of classification metrics and computational complexity for AlexNet as a CNN representative against a Vision Transformer model, run on the UC Merced Land Use and EuroSAT datasets.

If this is right

  • CNNs remain a practical choice for land use classification when training samples are scarce or local texture dominates.
  • Vision Transformers gain an edge once training data becomes plentiful enough to support modeling of long-range spatial dependencies.
  • Vision Transformers carry higher computational costs, which may restrict their deployment in settings with limited processing power.
  • Model selection for remote sensing tasks should weigh dataset size and scene complexity rather than default to a single architecture family.
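
The computational-cost point can be made concrete with a rough count of multiply-accumulate operations (MACs). The sketch below is a back-of-the-envelope estimate under stated assumptions, not a measurement from the paper: the function names are hypothetical, the formulas ignore biases, normalization, and the MLP blocks, and the example sizes (224-pixel input, 16-pixel patches, embedding width 768) are assumed because the summary does not say which ViT variant was used.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel):
    """MACs for one convolution layer: every output value needs
    c_in * kernel * kernel multiply-accumulates (bias ignored)."""
    return h_out * w_out * c_out * c_in * kernel * kernel


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches (tokens) in a square image."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2


def self_attention_macs(tokens, dim):
    """MACs for one self-attention layer.

    Q, K, V, and output projections cost 4 * tokens * dim^2 (linear in tokens).
    Attention scores and the weighted sum cost 2 * tokens^2 * dim (quadratic).
    """
    projections = 4 * tokens * dim * dim
    attention = 2 * tokens * tokens * dim
    return projections + attention


# Assumed example sizes, for illustration only.
tokens = num_patches(224, 16)
print(tokens, self_attention_macs(tokens, 768))
```

Under these assumptions a 224-pixel image yields 196 tokens. Doubling the image side quadruples the token count, so the quadratic attention term grows sixteenfold, whereas the cost of a convolution layer grows only in proportion to the output area.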

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern could guide selection of models in other domains that mix local detail with global layout, such as medical image analysis.
  • Hybrid networks that combine early CNN layers for texture with later transformer blocks for context might reduce the data hunger of pure ViTs while retaining their global modeling benefit.
  • Further tests on datasets with controlled mixtures of local and global features would sharpen the boundary conditions under which one architecture overtakes the other.

Load-bearing premise

The performance differences seen with AlexNet and the chosen Vision Transformer on the UC Merced and EuroSAT benchmarks will generalize to other remote sensing datasets and real-world conditions.

What would settle it

The central claim could be tested by repeating the comparison on a fresh remote sensing dataset while deliberately varying the number of training samples per class, then checking whether the same pattern (CNN strength on small sets, ViT strength on large sets) still appears.
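
One way to set up such a test is to draw class-balanced training subsets of increasing size from the same pool, while keeping the held-out test set fixed. The sketch below is a hypothetical helper, not part of the paper: the name `subsample_per_class`, the seed handling, and the example sizes are assumptions made for illustration.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, samples_per_class, seed=0):
    """Return sorted indices containing at most `samples_per_class`
    randomly chosen items for each label (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so both models see the same subset
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = min(samples_per_class, len(indices))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels, invented for illustration only.
labels = ["residential"] * 5 + ["river"] * 3
for size in (1, 2, 4):
    print(size, subsample_per_class(labels, size, seed=1))
```

Training both architectures on each subset and plotting test accuracy against samples per class would show where, if anywhere, the two curves cross.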

Figures

Figures reproduced from arXiv: 2605.21268 by Arun D. Kulkarni.

[Figures 1, 2, 3, 4, 9, and 11: captions not available]
Figure 15. Classified images, AlexNet, 10-class EuroSAT dataset.
Figure 16. Classified images, ViT, 10-class EuroSAT dataset.

Original abstract

Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architectures for remote sensing land use scene classification. A representative CNN model, AlexNet, is evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript compares Vision Transformers (ViT) and CNN-based architectures for land use scene classification on remote sensing imagery. It evaluates AlexNet as a representative CNN alongside ViT on the UC Merced and EuroSAT benchmarks, examining accuracy, precision, recall, F1-score, and computational complexity. The central claim is that CNNs perform robustly on limited training samples with strong local textures, while ViTs excel at capturing global spatial relationships in complex scenes when sufficient data are available, albeit with higher computational demands.

Significance. If the empirical results hold under expanded baselines, the study could provide practical guidance for architecture selection in remote sensing applications, clarifying data-dependent trade-offs between local convolutional features and global self-attention. The use of public benchmark datasets and standard metrics strengthens the contribution.

major comments (1)
  1. [Abstract] The claim that 'CNNs perform robustly on datasets with limited training samples and strong local texture characteristics' rests exclusively on AlexNet evaluations; without modern CNN baselines (such as ResNet-50, EfficientNet, or ConvNeXt) under matched training regimes and data splits, observed differences may reflect AlexNet's lower capacity rather than inherent convolutional inductive biases, undermining generalization to the CNN family.
minor comments (2)
  1. [Abstract] Directional conclusions are stated without accompanying numerical accuracies, confidence intervals, statistical tests, or explicit details on training protocols and data splits, which limits immediate assessment of result robustness.
  2. Consider including a table or subsection explicitly reporting per-class metrics and computational complexity (e.g., FLOPs or inference time) to support the complexity comparison.
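
As one possible form for the statistical test requested in the first minor comment, McNemar's exact test compares two classifiers that were scored on the same test images. This is a suggestion under stated assumptions, not a procedure named by the paper or the referee: the function name `mcnemar_exact` and the toy outcomes are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two classifiers on one test set.

    correct_a[i] and correct_b[i] say whether each model classified test
    image i correctly. Only images where the models disagree carry
    information. Hypothetical helper for illustration only.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same test images")
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree
    # Binomial tail under the null hypothesis that disagreements split 50/50.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy outcomes, invented for illustration only.
cnn_correct = [True, True, True, False]
vit_correct = [False, False, False, True]
print(mcnemar_exact(cnn_correct, vit_correct))
```

A small p-value would indicate that the two models' error patterns differ by more than chance on that test set; it says nothing about whether the difference generalizes to other datasets.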

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address the major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'CNNs perform robustly on datasets with limited training samples and strong local texture characteristics' rests exclusively on AlexNet evaluations; without modern CNN baselines (such as ResNet-50, EfficientNet, or ConvNeXt) under matched training regimes and data splits, observed differences may reflect AlexNet's lower capacity rather than inherent convolutional inductive biases, undermining generalization to the CNN family.

    Authors: We agree that the current reliance on AlexNet alone weakens the generalization of our claims to the broader CNN family. AlexNet was chosen as a representative early CNN to emphasize the contrast in local feature extraction and data efficiency relative to ViTs. To strengthen the manuscript, we will add ResNet-50 and EfficientNet-B0 as additional CNN baselines. These models will be trained and evaluated under identical data splits, augmentation, and optimization settings as the existing experiments. Updated results and discussion will be included to better isolate the role of convolutional inductive biases versus model capacity.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison on public benchmarks

full rationale

The paper conducts a standard experimental comparison of AlexNet (as representative CNN) and ViT on the UC Merced and EuroSAT datasets, reporting accuracy, precision, recall, F1, and complexity metrics. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the central claims. Results follow directly from training and evaluation on external public datasets, satisfying the self-contained benchmark criterion with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparison study using off-the-shelf deep learning architectures and public benchmark datasets; no free parameters, mathematical axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5772 in / 1070 out tokens · 32859 ms · 2026-05-21T05:23:03.178702+00:00 · methodology


