pith. machine review for the scientific record.

arxiv: 2605.10772 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI · eess.IV

Recognition: no theorem link

Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · eess.IV
keywords synthetic aperture radar · automatic target recognition · visual question answering · MSTAR dataset · language-vision models · parameter-efficient fine-tuning · remote sensing · target classification

The pith

A fine-tuned language-vision model identifies fine-grained targets in SAR imagery at 98 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language-vision models can be adapted for automatic target recognition in synthetic aperture radar images. The authors build a new benchmark by extending the MSTAR dataset with descriptive captions and visual question-answer pairs focused on nuanced vehicle details. Through parameter-efficient fine-tuning of architectures such as CLIP and LLaVA, they report 98 percent accuracy at identifying fine-grained target qualities. This line of work matters because distinguishing specific military targets in radar data under varied conditions has long required months or years of specialized human training.

Core claim

The authors construct a SAR training and evaluation benchmark derived from the MSTAR Public Dataset that includes text captions and question-answer pairs for visual question answering. They then apply parameter-efficient fine-tuning to large language-vision models to achieve 98 percent accuracy when identifying fine-grained target qualities in the SAR imagery. This setup is presented as a step toward machine-assisted remote sensing ATR suitable for military and intelligence applications where environmental complexity makes recognition difficult.

What carries the argument

Parameter-efficient fine-tuning of a large language-vision model on a custom SAR visual question answering benchmark created by extending MSTAR imagery with captions and QA pairs.
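To make the load-bearing machinery concrete, the following is a minimal sketch of parameter-efficient fine-tuning with LoRA adapters on a CLIP-style vision-language model, assuming the Hugging Face transformers and peft libraries. The checkpoint name, target modules, adapter rank, and toy batch are illustrative assumptions, not the authors' reported configuration.

```python
# Sketch: LoRA-based parameter-efficient fine-tuning of a CLIP-style model on
# caption-image pairs. All hyperparameters here are assumed for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# LoRA adapters on the attention projections; only these low-rank matrices are trained.
lora_cfg = LoraConfig(
    r=8,                               # adapter rank (assumed, not from the paper)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()     # typically well under 1% of the full model

# One contrastive training step on a toy batch standing in for MSTAR-derived chips.
images = [Image.new("RGB", (224, 224)) for _ in range(4)]
captions = ["a BMP-2 infantry fighting vehicle centered in a SAR image"] * 4
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)

outputs = model(**inputs, return_loss=True)   # CLIP's symmetric contrastive loss
outputs.loss.backward()                       # gradients flow only into the adapters
```

A full run would wrap this step in an optimizer loop over the benchmark's caption and VQA splits; the point of the sketch is only that the adapted parameter count stays small relative to the frozen backbone.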

Load-bearing premise

The new SAR benchmark with added captions and VQA pairs contains no data leakage or selection bias across its training and evaluation splits, and the reported accuracy reflects genuine generalization rather than overfitting to the particular MSTAR-derived images.
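A premise like this can be checked mechanically. Below is a minimal leakage-audit sketch: verify that no image identifier and no vehicle serial number appears in both the training and evaluation splits of the derived benchmark. The record fields ("image_id", "serial_number") and the JSONL layout are assumed names for illustration, not the paper's schema.

```python
# Sketch: audit the train/eval splits of the derived benchmark for hard leakage
# (identical images) and soft leakage (same physical vehicle serial number).
import json

def audit_splits(train_path: str, eval_path: str) -> None:
    def load(path):
        with open(path) as f:
            return [json.loads(line) for line in f]   # one JSON record per line (assumed)

    train, evaluation = load(train_path), load(eval_path)

    shared_ids = {r["image_id"] for r in train} & {r["image_id"] for r in evaluation}
    shared_serials = ({r["serial_number"] for r in train}
                      & {r["serial_number"] for r in evaluation})

    print(f"image ids shared across splits:       {sorted(shared_ids) or 'none'}")
    print(f"serial numbers shared across splits:  {sorted(shared_serials) or 'none'}")
    assert not shared_ids, "identical images in both splits -- hard leakage"
    # Shared serial numbers are softer leakage: the same physical vehicle imaged at
    # different azimuths, which MSTAR's fixed collection geometry makes easy to memorize.

# audit_splits("mstar_vqa_train.jsonl", "mstar_vqa_eval.jsonl")   # hypothetical filenames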

What would settle it

Testing the fine-tuned model on an independent collection of SAR images gathered from a different radar platform or under substantially altered environmental conditions and measuring accuracy well below 98 percent on comparable target-identification questions would indicate that the result does not generalize.
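In practice such a test should report uncertainty, not just a point estimate. The sketch below scores an answer-producing model on an out-of-distribution SAR question set and attaches a bootstrap confidence interval, so "well below 98 percent" becomes a statistical comparison. The `model_answers` callable and the dataset structure are placeholders, not the paper's interface.

```python
# Sketch: out-of-distribution evaluation with a bootstrap 95% confidence interval.
import random

def evaluate(model_answers, dataset, n_boot=1000, seed=0):
    correct = [int(model_answers(ex["image"], ex["question"]).strip().lower()
                   == ex["answer"].strip().lower())
               for ex in dataset]
    acc = sum(correct) / len(correct)

    rng = random.Random(seed)
    boot = []
    for _ in range(n_boot):
        resample = [correct[rng.randrange(len(correct))] for _ in correct]
        boot.append(sum(resample) / len(resample))
    boot.sort()
    lo, hi = boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot)]
    return acc, (lo, hi)

# acc, (lo, hi) = evaluate(model_answers, ood_sar_vqa)   # ood_sar_vqa: hypothetical OOD set
# print(f"out-of-distribution accuracy: {acc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```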

Figures

Figures reproduced from arXiv: 2605.10772 by Andreas Spanias, David F. Ramirez, Kristen Jaskie, Marv Kleine, Tim L. Overman.

Figure 1. Separately trained language encoder, vision encoder, and language decoder.

Figure 2. The MSTAR dataset produced by AFRL and DARPA includes military vehicles centered in SAR images [3].

Figure 3. Generatively pre-trained transformers (GPT) are synonymous with LLM and represent state-of-the-art methods for language understanding [10]. A GPT is a causal and recursive system that uses only preceding words or characters, known as tokens, to predict the next single token in an auto-regressive manner. The entire token history, including appended predictions, is then repeatedly ingested until a complete s…

Figure 4. The CLIP training method aligns pre-trained language and vision encoders for joint understanding.
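The alignment described in Figure 4 is also what makes the zero-shot baseline the referee asks for cheap to run: score a SAR chip against text prompts for each MSTAR class and take the best match. A minimal sketch follows, assuming the Hugging Face transformers library; the checkpoint, prompt wording, and class subset are illustrative assumptions.

```python
# Sketch: zero-shot CLIP classification of a SAR chip against MSTAR class prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["BMP-2", "BTR-70", "T-72", "ZSU-23-4", "2S1"]          # a subset of MSTAR targets
prompts = [f"a synthetic aperture radar image of a {c}" for c in classes]

image = Image.new("RGB", (224, 224))                              # stand-in for a SAR chip
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image                     # scaled cosine similarities
probs = logits.softmax(dim=-1).squeeze(0)
print({c: round(p.item(), 3) for c, p in zip(classes, probs)})
```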
read the original abstract

Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a SAR-specific training and evaluation benchmark derived from the MSTAR Public Dataset, augmented with descriptive captions and VQA pairs. It applies parameter-efficient fine-tuning to large language-vision models (CLIP and LLaVA) and reports 98% accuracy on identifying fine-grained target qualities in SAR imagery, while stating that potential pitfalls have been addressed in the data setup and experiments.

Significance. If the 98% accuracy reflects genuine generalization on an independent test set free of leakage, the work would represent a meaningful step toward integrating LLVMs into remote-sensing ATR, where fine-grained vehicle discrimination has historically required extensive analyst training. The creation of an open SAR VQA benchmark is itself a useful community resource, even if the current empirical validation remains incomplete.

major comments (3)
  1. Abstract and Experiments section: The central claim of 98% accuracy after parameter-efficient fine-tuning is presented without any reported train/test split statistics, instance-level separation details, sample counts per class, baselines (zero-shot LLVM or conventional SAR ATR), or error bars. Given MSTAR's small size and fixed imaging geometry, these omissions make it impossible to determine whether the result demonstrates generalization or merely memorization of split-specific artifacts.
  2. Data Setup section: The construction of captions and VQA pairs is described at a high level but lacks explicit documentation of the question-generation method, any leakage audit (e.g., ensuring no vehicle serial number or configuration overlap between splits), or ablation on template-based versus human-authored questions. This directly undermines the weakest assumption that the benchmark tests SAR-specific invariances rather than correlations introduced during dataset creation.
  3. Experiments section: No ablation studies on the parameter-efficient fine-tuning components, no comparison against non-LLVM baselines, and no analysis of failure modes on the reported 98% figure are provided. These omissions leave the contribution of the LLVM architecture itself unisolated and the generalization claim unsupported.
minor comments (2)
  1. The abstract refers to the benchmark as 'work-in-progress' yet presents a definitive 98% accuracy figure; clarifying the maturity of the dataset and any planned expansions would improve reader expectations.
  2. Notation for the LLVM architectures (CLIP vs. LLaVA) and the specific PEFT method (e.g., LoRA rank, adapter placement) should be defined consistently in the methods section to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript, presented as a work-in-progress, requires expanded documentation and additional analyses to strengthen the claims of generalization. We will revise the paper to address each point.

read point-by-point responses
  1. Referee: Abstract and Experiments section: The central claim of 98% accuracy after parameter-efficient fine-tuning is presented without any reported train/test split statistics, instance-level separation details, sample counts per class, baselines (zero-shot LLVM or conventional SAR ATR), or error bars. Given MSTAR's small size and fixed imaging geometry, these omissions make it impossible to determine whether the result demonstrates generalization or merely memorization of split-specific artifacts.

    Authors: We acknowledge that the current draft summarizes the experimental setup at a high level without these details. In the revised manuscript we will add a dedicated subsection reporting train/test split statistics (including per-class sample counts), explicit instance-level separation criteria (e.g., no shared vehicle serial numbers or configurations across splits), zero-shot LLVM and conventional SAR ATR baselines, and error bars computed over multiple random seeds. This will allow readers to evaluate whether the 98% figure reflects genuine generalization. revision: yes

  2. Referee: Data Setup section: The construction of captions and VQA pairs is described at a high level but lacks explicit documentation of the question-generation method, any leakage audit (e.g., ensuring no vehicle serial number or configuration overlap between splits), or ablation on template-based versus human-authored questions. This directly undermines the weakest assumption that the benchmark tests SAR-specific invariances rather than correlations introduced during dataset creation.

    Authors: We will expand the Data Setup section with a precise description of the question-generation procedure, including the templates employed and the extent of human authoring (a minimal sketch of such template-based generation appears after these responses). A leakage audit will be documented that verifies separation by vehicle serial number and configuration. We will also include an ablation comparing model performance on purely template-generated versus human-authored questions to confirm that the benchmark evaluates SAR-specific invariances. revision: yes

  3. Referee: Experiments section: No ablation studies on the parameter-efficient fine-tuning components, no comparison against non-LLVM baselines, and no analysis of failure modes on the reported 98% figure are provided. These omissions leave the contribution of the LLVM architecture itself unisolated and the generalization claim unsupported.

    Authors: The revised Experiments section will incorporate ablation studies on the PEFT components (e.g., varying LoRA rank and comparing alternative adapters). We will add direct comparisons to non-LLVM baselines such as CNN classifiers and traditional SAR ATR methods. We will also provide a failure-mode analysis that examines misclassified examples and relates errors to SAR imaging characteristics. revision: yes
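For concreteness, here is a minimal sketch of the kind of template-based VQA pair generation the second response promises to document, driven by per-image metadata and tagged so a template-versus-human-authored ablation can be run. The metadata fields ("target_class", "mobility", "depression_deg") and the question templates are illustrative assumptions, not the paper's actual procedure.

```python
# Sketch: generate VQA pairs from per-image metadata using fixed templates.
def generate_vqa_pairs(record: dict) -> list[dict]:
    cls = record["target_class"]           # e.g. "BMP-2"
    depression = record["depression_deg"]  # e.g. 15 or 17 in MSTAR collections
    pairs = [
        {"question": "What class of vehicle is centered in this SAR image?",
         "answer": cls, "source": "template"},
        {"question": "Is the target in this chip a tracked or a wheeled vehicle?",
         "answer": record["mobility"], "source": "template"},
        {"question": f"Was this image collected at a {depression}-degree depression angle?",
         "answer": "yes", "source": "template"},
    ]
    return [dict(p, image_id=record["image_id"]) for p in pairs]

# Hypothetical record schema, for illustration only:
example = {"image_id": "hb03333.004", "target_class": "BMP-2",
           "serial_number": "c21", "mobility": "tracked", "depression_deg": 15}
for qa in generate_vqa_pairs(example):
    print(qa)
```

Human-authored questions would carry `source: "human"` in the same format, so the ablation reduces to filtering on that field at evaluation time.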

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning accuracy on custom benchmark

full rationale

The paper reports training an LLVM via parameter-efficient fine-tuning on a newly constructed MSTAR-derived SAR VQA benchmark and measuring 98% accuracy for fine-grained target identification. This is a standard experimental outcome (train on constructed data, evaluate accuracy) rather than any derivation, equation, or quantity that reduces to its inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling are present. The result is self-contained as an empirical measurement on the described benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the work rests on standard assumptions of supervised fine-tuning and benchmark construction from public data.

pith-pipeline@v0.9.0 · 5581 in / 1186 out tokens · 35548 ms · 2026-05-12T05:32:04.946352+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 7 internal anchors

  1. [1]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, et al., “Language Models are Few-Shot Learners,” Proc. 34th Neural Information Processing Systems (NeurIPS), pp 1877-1901, 6 Dec. 2020, https://doi.org/10.48550/arXiv.2005.14165

  2. [2]

    MSTAR Extended Operating Conditions: A Tutorial,

    Eric R. Keydel, Shung Wu Lee, and John T. Moore, "MSTAR Extended Operating Conditions: A Tutorial," Proc. SPIE 2757, Algorithms for Synthetic Aperture Radar Imagery III, 10 June 1996, https://doi.org/10.1117/12.242059

  3. [3]

    Moving and Stationary Target Acquisition and Recognition (MSTAR) Public Release,

    DARPA and AFRL, Sep. 1995, "Moving and Stationary Target Acquisition and Recognition (MSTAR) Public Release," Sensor Data Management System. [Online]. https://www.sdms.afrl.af.mil/index.php?collection=mstar

  4. [4]

    A Comprehensive Survey on SAR ATR in Deep-Learning Era,

    Jianwei Li, Zhentao Yu, Lu Yu, Pu Cheng, Jie Chen, and Cheng Chi, "A Comprehensive Survey on SAR ATR in Deep-Learning Era," Remote Sensing, 15(5), pp 1454, 5 March 2023. https://doi.org/10.3390/rs15051454

  5. [5]

    Unsupervised SAR Representation Learning Improves Classification Performance,

    Nolan Vaughn, Bo Sullivan, and Kristen Jaskie, "Unsupervised SAR Representation Learning Improves Classification Performance," Proc. SPIE 13039, ATR XXXIV, 130390J, 7 June 2024, https://doi.org/10.1117/12.3013982

  6. [6]

    Quantum Classification for Synthetic Aperture Radar,

    Salil Naik, Nolan Vaughn, Glen Uehara, Andreas Spanias, and Kristen Jaskie, “Quantum Classification for Synthetic Aperture Radar,” Proc. SPIE 13039, ATR XXXIV, 130390H, 7 June 2024, https://doi.org/10.1117/12.3016462

  7. [7]

    Sparse Manifold Learning with Applications to SAR Image Classification,

    Visar Berisha, et al., "Sparse Manifold Learning with Applications to SAR Image Classification," IEEE Intl. Conf. Acoustics, Speech and Signal Processing (ICASSP), 15 Apr. 2007, https://doi.org/10.1109/ICASSP.2007.366873

  8. [8]

    Sparse Representations for Automatic Target Classification in SAR Images,

    Jayaraman J. Thiagarajan, et al., “Sparse Representations for Automatic Target Classification in SAR Images,” Intl. Symp. on Communications, Control and Signal Processing, 2010, https://doi.org/10.1109/ISCCSP.2010.5463416

  9. [9]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is All you Need,” Proc. 31st NeurIPS, 2017, https://doi.org/10.48550/arXiv.1706.03762

  10. [10]

    A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges,

    Mohaimenul Azam Khan Raiaan, et al., "A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges," IEEE Access, 2024, https://doi.org/10.1109/ACCESS.2024.3365742

  11. [11]

    Improving Language Understanding by Generative Pre-Training,

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving Language Understanding by Generative Pre-Training,” OpenAI, 11 June 2018. [Online]. https://openai.com/index/language-unsupervised/

  12. [12]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al. “Mistral 7B,” arXiv, 10 Oct. 2023, https://doi.org/10.48550/arXiv.2310.06825

  13. [13]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, et al. “Learning Transferable Visual Models from Natural Language Supervision,” Proc. 38th Intl. Conf. on Machine Learning (ICML), 18 July 2021, https://doi.org/10.48550/arXiv.2103.00020

  14. [14]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, “Visual Instruction Tuning,” Proc. 37th NeurIPS, 10 Dec. 2023, https://doi.org/10.48550/arXiv.2304.08485

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” Intl. Conf. on Learning Representations (ICLR), 3 May 2021, https://doi.org/10.48550/arXiv.2010.11929

  16. [16]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” Proc. 37th NeurIPS, 10 Dec. 2023, https://doi.org/10.48550/arXiv.2305.14314

  17. [17]

    RSVQA: Visual Question Answering for Remote Sensing Data,

    Sylvain Lobry, et al., “RSVQA: Visual Question Answering for Remote Sensing Data,” IEEE Transactions on Geoscience and Remote Sensing, 58(12), pp 8555-8566, 7 May 2020, https://doi.org/10.1109/TGRS.2020.2988782

  18. [18]

    RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing,

    Zilun Zhang, et al., “RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing,” IEEE Transactions on Geoscience and Remote Sensing, 62, 12 Sep. 2024

  19. [19]

    RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery,

    Yakoub Bazi, et al. “RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery,” Remote Sensing, 16(9), pp 1477, 23 April 2024, https://doi.org/10.3390/rs16091477

  20. [20]

    Scattering Prompt Tuning: A Fine-tuned Foundation Model for SAR Object Recognition,

    Weilong Guo, Shengyang Liv, and Jian Yang, “Scattering Prompt Tuning: A Fine-tuned Foundation Model for SAR Object Recognition,” Proc. IEEE/CVF Computer Vision and Pattern Recognition Workshops (CVPRW), 2024

  21. [21]

    SARATR-X: Toward Building a Foundation Model for SAR Target Recognition,

    Weijie Li, et al., “SARATR-X: Toward Building a Foundation Model for SAR Target Recognition,” IEEE Transactions on Image Processing, 34, 28 Jan. 2025, https://doi.org/10.1109/TIP.2025.3531988

  22. [22]

    Leveraging Visual Language Model and Generative Diffusion Model for Zero-Shot SAR Target Recognition,

    Junyu Wang, Hao Sun, Tao Tang, et al., “Leveraging Visual Language Model and Generative Diffusion Model for Zero-Shot SAR Target Recognition,” Remote Sensing, 9 Aug. 2024, 16(16), https://doi.org/10.3390/rs16162927

  23. [23]

    LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild,

    Bo Li, Kaichen Zhang, Hao Zhang, et al., “LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild,” LLaVA-NeXT, May 2024. [Online]. https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/

  24. [24]

    Improved Baselines with Visual Instruction Tuning,

    Haotian Liu, et al., “Improved Baselines with Visual Instruction Tuning,” Proc. IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 16 June 2024, https://doi.org/10.1109/CVPR52733.2024.02484

  25. [25]

    Model Memory Calculator,

    Hugging Face and HF-Accelerate, “Model Memory Calculator,” Hugging Face Spaces, Accessed: 24 Mar. 2025, [Online]. https://huggingface.co/spaces/hf-accelerate/model-memory-usage

  26. [26]

    Tiktokenizer,

    dqbd, “Tiktokenizer,” Vercel, Accessed: 26 Mar. 2025, [Online]. https://tiktokenizer.vercel.app

  27. [27]

    Mistral Tokenizer,

    Lunary, “Mistral Tokenizer,” Lunary.AI, Accessed: 26 Mar. 2025, [Online]. https://lunary.ai/mistral-tokenizer