Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification
Pith reviewed 2026-05-21 05:23 UTC · model grok-4.3
The pith
CNNs handle small land use datasets with local textures well while Vision Transformers capture global relationships when given enough data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance.
What carries the argument
Side-by-side evaluation of classification metrics and computational complexity for AlexNet as a CNN representative against a Vision Transformer model, run on the UC Merced Land Use and EuroSAT datasets.
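As a rough illustration of the classification metrics named above, the sketch below derives accuracy, per-class precision, recall, and F1, and macro-F1 from paired label lists. The function name and the example labels are hypothetical and are not taken from the paper; the paper's own evaluation code is not shown on this page.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return (accuracy, per-class report, macro-F1) for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    report = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[c] = {"precision": prec, "recall": rec, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(r["f1"] for r in report.values()) / len(labels)
    return accuracy, report, macro_f1


if __name__ == "__main__":
    # Illustrative labels only; these are not results from the paper.
    truth = ["Forest", "Forest", "River", "River", "Highway", "Highway"]
    preds = ["Forest", "River", "River", "River", "Highway", "Forest"]
    acc, report, macro = per_class_metrics(truth, preds)
    print(f"accuracy={acc:.3f} macro_f1={macro:.3f}")
    for name, scores in report.items():
        print(name, scores)
```

Running the same function on each model's predictions over an identical test set would give the side-by-side numbers this kind of comparison depends on.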
If this is right
- CNNs remain a practical choice for land use classification when training samples are scarce or local texture dominates.
- Vision Transformers gain an edge once training data becomes plentiful enough to support modeling of long-range spatial dependencies.
- Vision Transformers carry higher computational costs, which may restrict their deployment in settings with limited processing power.
- Model selection for remote sensing tasks should weigh dataset size and scene complexity rather than default to a single architecture family.
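To make the computational-cost point concrete, here is a back-of-the-envelope sketch of multiply-accumulate (MAC) counts for one self-attention layer and one convolutional layer. The defaults (patch size 16, embedding width 768) are assumptions matching the ViT-Base/16 configuration of Dosovitskiy et al.; the configuration used in the paper is not stated on this page. Biases, softmax, and MLP blocks are ignored.

```python
def attention_macs(image_size, patch_size=16, dim=768):
    """Token count and approximate MACs for one self-attention layer."""
    tokens = (image_size // patch_size) ** 2 + 1  # patches plus class token
    projections = 4 * tokens * dim * dim  # Q, K, V, and output projections
    mixing = 2 * tokens * tokens * dim    # QK^T scores, then weighted sum of V
    return tokens, projections + mixing


def conv_macs(out_h, out_w, in_ch, out_ch, kernel):
    """MACs for one convolutional layer with a square kernel."""
    return out_h * out_w * out_ch * in_ch * kernel * kernel


if __name__ == "__main__":
    for size in (64, 224, 448):
        tokens, macs = attention_macs(size)
        print(f"{size}px: {tokens} tokens, {macs / 1e6:.1f} M MACs per layer")
    # First layer of the original AlexNet: 96 kernels of 11x11x3, 55x55 output.
    # The AlexNet variant used in the paper may differ.
    print(f"conv1: {conv_macs(55, 55, 3, 96, 11) / 1e6:.1f} M MACs")
```

The `mixing` term grows with the square of the token count, which is the usual source of the higher cost attributed to ViTs on larger inputs.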
Where Pith is reading between the lines
- The pattern could guide selection of models in other domains that mix local detail with global layout, such as medical image analysis.
- Hybrid networks that combine early CNN layers for texture with later transformer blocks for context might reduce the data hunger of pure ViTs while retaining their global modeling benefit.
- Further tests on datasets with controlled mixtures of local and global features would sharpen the boundary conditions under which one architecture overtakes the other.
Load-bearing premise
The performance differences seen with AlexNet and the chosen Vision Transformer on the UC Merced and EuroSAT benchmarks will generalize to other remote sensing datasets and real-world conditions.
What would settle it
Repeating the comparison on a fresh remote sensing dataset while deliberately varying the number of training samples per class and checking whether the same pattern of CNN strength on small sets and ViT strength on large sets still appears would test the central claim.
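One way to set up such a test is to draw class-balanced training subsets of increasing size and retrain both models on each. The sketch below covers only the subsampling step; the function name, the `(item, label)` sample format, and the subset sizes are assumptions made for illustration.

```python
import random
from collections import defaultdict


def subsample_per_class(samples, n_per_class, seed=0):
    """Pick at most n_per_class (item, label) pairs for every label."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed so both models see the same subset
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        k = min(n_per_class, len(items))
        subset.extend((item, label) for item in rng.sample(items, k))
    return subset


if __name__ == "__main__":
    # Hypothetical file names and class labels, for illustration only.
    pool = [(f"img_{c}_{i}.png", c) for c in ("a", "b", "c") for i in range(100)]
    for n in (5, 20, 80):
        train = subsample_per_class(pool, n)
        print(n, "per class ->", len(train), "training samples")
```

Plotting each model's test accuracy against `n_per_class` would show whether, and where, the two curves cross.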
Original abstract
Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architectures for remote sensing land use scene classification. AlexNet, as a representative CNN model, is evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, namely the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares Vision Transformers (ViT) and CNN-based architectures for land use scene classification on remote sensing imagery. It evaluates AlexNet as a representative CNN alongside ViT on the UC Merced and EuroSAT benchmarks, examining accuracy, precision, recall, F1-score, and computational complexity. The central claim is that CNNs perform robustly on limited training samples with strong local textures, while ViTs excel at capturing global spatial relationships in complex scenes when sufficient data are available, albeit with higher computational demands.
Significance. If the empirical results hold under expanded baselines, the study could provide practical guidance for architecture selection in remote sensing applications, clarifying data-dependent trade-offs between local convolutional features and global self-attention. The use of public benchmark datasets and standard metrics strengthens the contribution.
major comments (1)
- [Abstract] Abstract: The claim that 'CNNs perform robustly on datasets with limited training samples and strong local texture characteristics' rests exclusively on AlexNet evaluations; without modern CNN baselines (such as ResNet-50, EfficientNet, or ConvNeXt) under matched training regimes and data splits, observed differences may reflect AlexNet's lower capacity rather than inherent convolutional inductive biases, undermining generalization to the CNN family.
minor comments (2)
- [Abstract] Abstract: Directional conclusions are stated without accompanying numerical accuracies, confidence intervals, statistical tests, or explicit details on training protocols and data splits, which limits immediate assessment of result robustness.
- Consider including a table or subsection explicitly reporting per-class metrics and computational complexity (e.g., FLOPs or inference time) to support the complexity comparison.
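The request for confidence intervals could be met in several ways. One common option, sketched here under the assumption that both models are scored on the same test images, is a paired bootstrap over per-image correctness. The function name and inputs are hypothetical; nothing on this page indicates which procedure, if any, the authors would adopt.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) on a shared test set.

    correct_a and correct_b are equal-length lists of 0/1 flags, one per
    test image, marking whether each model classified that image correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same images")
    n = len(correct_a)
    rng = random.Random(seed)

    diffs = []
    for _ in range(n_resamples):
        # Resample image indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()

    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = (sum(correct_a) - sum(correct_b)) / n
    return point, lower, upper
```

An interval that excludes zero would suggest the accuracy gap is unlikely to be an artifact of the particular test split.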
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address the major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract] Abstract: The claim that 'CNNs perform robustly on datasets with limited training samples and strong local texture characteristics' rests exclusively on AlexNet evaluations; without modern CNN baselines (such as ResNet-50, EfficientNet, or ConvNeXt) under matched training regimes and data splits, observed differences may reflect AlexNet's lower capacity rather than inherent convolutional inductive biases, undermining generalization to the CNN family.
Authors: We agree that the current reliance on AlexNet alone weakens the generalization of our claims to the broader CNN family. AlexNet was chosen as a representative early CNN to emphasize the contrast in local feature extraction and data efficiency relative to ViTs. To strengthen the manuscript, we will add ResNet-50 and EfficientNet-B0 as additional CNN baselines. These models will be trained and evaluated under identical data splits, augmentation, and optimization settings as the existing experiments. Updated results and discussion will be included to better isolate the role of convolutional inductive biases versus model capacity.
Revision: yes
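Identical data splits across baselines can be enforced by generating the split once from a fixed seed and reusing the resulting indices for every model. The following is a minimal sketch of a seeded, stratified train/test split; the function name, the 80/20 ratio, and the seed are assumptions, not settings reported by the authors.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, seed=42):
    """Return (train_indices, test_indices), preserving class proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # same seed gives the same split for every model
    train, test = [], []
    for label in sorted(by_label):
        idxs = by_label[label][:]
        rng.shuffle(idxs)
        cut = int(round(train_frac * len(idxs)))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Hypothetical label list: 21 classes with 100 images each.
    labels = [c for c in range(21) for _ in range(100)]
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), "train /", len(test_idx), "test")
```

Saving the returned index lists to disk and loading them for each baseline would remove split variation as an explanation for any accuracy difference.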
Circularity Check
No circularity: direct empirical comparison on public benchmarks
The paper conducts a standard experimental comparison of AlexNet (as representative CNN) and ViT on the UC Merced and EuroSAT datasets, reporting accuracy, precision, recall, F1, and complexity metrics. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the central claims. Results follow directly from training and evaluation on external public datasets, satisfying the self-contained benchmark criterion with no reduction of outputs to inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.