Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model
Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3
The pith
Converting heavy construction annotations from Sentinel-2 satellite images into over two million visual question-answering examples lets multimodal language models reason about activity states, progression, and future developments at fixed geospatial sites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SMART-HC-VQA redefines an existing construction-site annotation collection as a temporally extended automatic target recognition and visual question answering challenge in which a fixed geospatial site functions as the target whose attributes and activity states evolve across sparse satellite observations, with the dataset and associated multi-image MLLM framework providing the concrete means to train models that reason about activity progression and potential future states.
What carries the argument
SMART-HC-VQA dataset of Sentinel-2 image chips paired with VQA triplets derived from construction annotations, temporal-phase labels, and observation relationships, plus the Image-Pairwise Combinatorial Augmentation method that expands it to 2.3 million temporal comparison examples.
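The augmentation is easiest to see as code. A minimal sketch of Image-Pairwise Combinatorial Augmentation, assuming each site carries a date-sorted list of chips with phase labels; the function name, record fields, and template wording are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def pairwise_augment(site_observations):
    """Expand one site's dated chips into two-image temporal VQA examples.

    site_observations: date-sorted list of records such as
        {"chip": "site_042/2021-03-14.tif", "date": "2021-03-14",
         "phase": "Active Construction"}
    Every pair of observations of the same site becomes one templated
    question about how activity changed between the two dates.
    """
    examples = []
    for a, b in combinations(site_observations, 2):  # a precedes b (sorted input)
        question = (f"<image_1> was captured on {a['date']} and <image_2> on "
                    f"{b['date']} over the same site. How did construction "
                    f"activity change between the two observations?")
        answer = (f"The site moved from the '{a['phase']}' phase "
                  f"to the '{b['phase']}' phase.")
        examples.append({"images": [a["chip"], b["chip"]],
                         "question": question, "answer": answer})
    return examples
```

A site with n observations yields n(n-1)/2 unordered pairs (n(n-1) ordered ones), which is how 21,837 chips can fan out to roughly 2.3 million temporal comparison examples.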
If this is right
- Models trained this way can answer questions about construction phases and activity evolution instead of only reporting presence or absence of change.
- The same conversion workflow can be applied to other remote-sensing annotation sets to create language-supervised temporal reasoning tasks.
- Fixed-site analysis becomes possible in which the same geographic location is tracked across multiple dated observations as a single evolving target.
- The multi-image input adaptation of LLaVA-NeXT demonstrates a practical route for incorporating dated satellite pairs directly into existing multimodal training pipelines.
Where Pith is reading between the lines
- The dataset could support downstream tasks such as forecasting construction timelines or flagging anomalous activity sequences once models are fine-tuned on the temporal pairs.
- Extending the same annotation-to-VQA pipeline to non-construction domains like agriculture or disaster recovery would test whether the approach generalizes beyond heavy-equipment sites.
- Because the examples are generated from existing annotations rather than new human labeling, the method offers a low-cost way to bootstrap large temporal VQA corpora from any labeled satellite time series.
Load-bearing premise
Transforming construction annotations and temporal relationships into natural language VQA triplets retains enough spatiotemporal detail for a multimodal model to learn genuine reasoning about activity states and progression.
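Concretely, the premise is that a deterministic template fills question and answer slots straight from annotation fields, so the label should survive reformulation by construction. A minimal sketch under hypothetical field names, phase vocabulary, and template wording; the paper's actual templates are not shown in the abstract.

```python
def annotation_to_triplets(ann):
    """Map one SMART-HC-style annotation record to templated single-image
    VQA triplets. Field names and wording are hypothetical."""
    qa_pairs = [
        ("What type of construction is visible at this site?",
         f"The site shows {ann['construction_type']} construction."),
        ("What activity phase is this site in on the observation date?",
         f"On {ann['date']}, the site is in the '{ann['phase']}' phase."),
        ("Is heavy construction activity present in this image?",
         "Yes." if ann["phase"] != "No Activity" else "No."),
    ]
    return [{"image": ann["chip"], "question": q, "answer": a}
            for q, a in qa_pairs]
```

Because the mapping is deterministic, fidelity can be audited by parsing answers back into labels, as sketched in the rebuttal section below.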
What would settle it
Train the described LLaVA-NeXT-based model on SMART-HC-VQA and test whether its answers to questions about next-phase construction activity or progression order match human-annotated ground truth on held-out image pairs at rates clearly above a simple change-detection baseline that ignores the language component; performance at or below that baseline would undercut the core claim.
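A minimal sketch of that settling experiment, assuming predictions from the trained model and from a change-detection baseline have already been collected on held-out pairs; the record fields, exact-match scoring, and margin are illustrative.

```python
def exact_match(pred: str, gold: str) -> float:
    """Crude answer scorer; a real evaluation would normalize phrasing."""
    return float(pred.strip().lower() == gold.strip().lower())

def accuracy(records, key):
    """records: held-out pairs carrying a 'gold' answer plus 'mllm' and
    'change_baseline' predictions (hypothetical field names)."""
    return sum(exact_match(r[key], r["gold"]) for r in records) / len(records)

def claim_survives(records, margin=0.05):
    """The core claim needs the language-trained model to clearly beat a
    baseline that detects change but ignores the language component."""
    return accuracy(records, "mllm") > accuracy(records, "change_baseline") + margin
```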
read the original abstract
We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
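The tile-to-chip segmentation the abstract describes reduces to a windowed read around each annotated site center. A minimal sketch using rasterio; the window size, coordinate handling, and field layout are assumptions rather than the authors' pipeline.

```python
import rasterio
from rasterio.windows import Window

def extract_site_chip(tile_path, site_x, site_y, chip_px=128):
    """Cut a site-centered chip from a large Sentinel-2 tile.

    site_x, site_y: site center in the tile's CRS (e.g., UTM meters).
    Returns the pixel array plus the window's geotransform so the chip
    stays traceable to the source tile and its SMART-HC annotation.
    """
    with rasterio.open(tile_path) as src:
        row, col = src.index(site_x, site_y)      # CRS coords -> pixel indices
        half = chip_px // 2
        window = Window(col - half, row - half, chip_px, chip_px)
        chip = src.read(window=window, boundless=True, fill_value=0)
        transform = src.window_transform(window)  # keeps georeferencing
    return chip, transform
```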
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SMART-HC-VQA, a Sentinel-2-based VQA dataset derived from the IARPA SMART Heavy Construction annotations. It transforms construction-site labels, temporal-phase information, geographic metadata, and observation relationships into natural-language question-answer triplets, yielding 65,511 single-image and ~2.3 million two-image temporal examples via Image-Pairwise Combinatorial Augmentation. The manuscript details the Sentinel-2 retrieval/processing workflow, site-centered chip segmentation, label distributions, and an adapted multi-image LLaVA-NeXT Mistral-7B training framework intended to support reasoning about activity states, progression, and future developments.
Significance. If the generated VQA triplets preserve sufficient spatiotemporal structure and the framework can be shown to learn non-trivial reasoning, the contribution would supply a large-scale, reproducible resource for language-guided remote-sensing activity analysis. The scale of the temporal-comparison examples and the explicit linkage to existing heavy-construction annotations represent a concrete step toward process-level understanding rather than isolated change detection.
major comments (2)
- [Abstract / dataset-construction workflow] The central claim that the annotation-to-VQA conversion supplies a foundation for MLLMs to reason about activity progression rests on the untested assumption that natural-language reformulation preserves temporal-phase and observation-relationship information without distortion or loss; no quantitative check (e.g., inter-annotator agreement on temporal ordering or information-retrieval metrics on the generated triplets) is reported.
- [Multi-image MLLM training framework] The manuscript describes the LLaVA-NeXT adaptation and training setup on the 2.3 million examples yet provides no training runs, accuracy metrics, baseline comparisons, or error analysis on held-out VQA examples, leaving the utility of the dataset and framework for spatiotemporal sensemaking unsupported by evidence.
minor comments (2)
- [Abstract] The exact counts and generation statistics for the two-image temporal examples should be stated precisely rather than approximated, and the Image-Pairwise Combinatorial Augmentation procedure should include pseudocode or a small worked example for reproducibility (a worked count check follows this list).
- [dataset analysis] Distributions of site size, observation count, temporal coverage, construction type, and phase labels are analyzed but not accompanied by tables or figures in the provided text; adding them would improve clarity.
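As a worked version of the count check the first minor comment asks for: the single-image total in the abstract is exactly three examples per chip, while the two-image total depends on how observations cluster per site. A minimal sketch; the per-site observation counts are placeholders, not the paper's distribution.

```python
from math import comb

chips, single = 21_837, 65_511
assert single == 3 * chips  # exactly three templated QA examples per chip

# A site with n dated chips yields comb(n, 2) unordered pairs
# (n * (n - 1) ordered ones). Placeholder illustration over four sites:
obs_per_site = [12, 30, 45, 8]
print(sum(comb(n, 2) for n in obs_per_site))  # 1519 unordered pairs

# Back-of-envelope: reaching ~2.3 million pairs from 21,837 chips under an
# equal split would need roughly 106 observations per site for ordered
# pairs, or 212 for unordered ones -- the distribution tables should make
# the actual clustering explicit.
```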
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for highlighting the potential of the SMART-HC-VQA dataset for advancing language-guided remote sensing analysis. We address each of the major comments below.
read point-by-point responses
Referee: [Abstract / dataset-construction workflow] The central claim that the annotation-to-VQA conversion supplies a foundation for MLLMs to reason about activity progression rests on the untested assumption that natural-language reformulation preserves temporal-phase and observation-relationship information without distortion or loss; no quantitative check (e.g., inter-annotator agreement on temporal ordering or information-retrieval metrics on the generated triplets) is reported.
Authors: The VQA triplets are produced via a fully automated, deterministic pipeline that directly maps the original SMART annotations—including temporal phases, construction types, and observation relationships—into templated natural-language questions and answers. This design ensures that the information is preserved by construction rather than through subjective reformulation. Nevertheless, we agree that explicit validation would strengthen the claim. In the revised manuscript, we will add quantitative checks, such as retrieval metrics measuring how accurately key attributes (e.g., phase labels and temporal ordering) can be recovered from the generated triplets, to empirically demonstrate fidelity. revision: yes
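A minimal sketch of the promised fidelity check: parse the templated answers back into labels and measure exact recovery against the source annotations. The regular expression assumes the quoted-phase answer convention sketched earlier in this review; field names are hypothetical.

```python
import re

PHASE_RE = re.compile(r"'([^']+)' phase")

def phase_recovery_rate(examples, annotations):
    """Fraction of generated answers from which the source phase label can
    be re-extracted verbatim; 1.0 would support 'preserved by construction'."""
    hits = 0
    for ex, ann in zip(examples, annotations):
        m = PHASE_RE.search(ex["answer"])
        hits += bool(m and m.group(1) == ann["phase"])
    return hits / len(examples)
```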
Referee: [Multi-image MLLM training framework] The manuscript describes the LLaVA-NeXT adaptation and training setup on the 2.3 million examples yet provides no training runs, accuracy metrics, baseline comparisons, or error analysis on held-out VQA examples, leaving the utility of the dataset and framework for spatiotemporal sensemaking unsupported by evidence.
Authors: The manuscript's primary focus is the creation and characterization of the SMART-HC-VQA dataset together with the specification of an adapted multi-image LLaVA-NeXT training framework. While the framework has been implemented, the current version of the paper does not report training experiments or performance metrics. We concur that providing such evidence would better substantiate the utility for spatiotemporal sensemaking. Accordingly, the revised manuscript will incorporate preliminary experimental results, including accuracy metrics on held-out temporal VQA examples and comparisons against single-image baselines, along with a brief error analysis. revision: yes
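For the promised error analysis, one simple cut is to tally wrong answers by the gold phase transition of each held-out pair, exposing which temporal patterns the model confuses most. A sketch under the same hypothetical record fields as the evaluation sketch above.

```python
from collections import Counter

def errors_by_transition(records):
    """Count mistakes per (earlier phase -> later phase) transition."""
    errs = Counter()
    for r in records:
        if r["mllm"].strip().lower() != r["gold"].strip().lower():
            errs[(r["phase_t0"], r["phase_t1"])] += 1
    return errs.most_common()
```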
Circularity Check
No circularity: work is external dataset curation plus public MLLM adaptation
full rationale
The manuscript describes creation of SMART-HC-VQA by transforming annotations from the external IARPA SMART Heavy Construction dataset into VQA triplets, plus adaptation of the publicly available LLaVA-NeXT model for multi-image input. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear. The claimed foundation is the reproducibility of the curation workflow and training setup, which by construction does not depend on the paper's own results. This matches the default non-circular case for dataset papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sentinel-2 imagery can be segmented into site-centered chips while maintaining traceability to SMART-HC annotations
Reference graph
Works this paper leans on
[1] C. R. Ratto, M. T. Kelbaugh, T. A. Stout, C. D. Piatko, and H. R. Goldberg, "Evaluating broad area search and classification of heavy construction activity from multi-source, multi-temporal satellite image sequences," in Geospatial Informatics XV, Proc. SPIE, vol. 13461, Art. no. 1346107, 2025, doi.org/10.1117/12.3053632
[2] European Space Agency, "Sentinel-2," Copernicus Sentinel Missions, ESA. Accessed: May 8, 2026. [Online]. Available: esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2
[3] H. R. Goldberg, C. R. Ratto, A. Banerjee, M. T. Kelbaugh, M. Giglio, and E. F. Vermote, "Automated global-scale detection and characterization of anthropogenic activity using multi-source satellite-based remote sensing imagery," in Geospatial Informatics XIII, Proc. SPIE, vol. 12525, Art. no. 1252502, 2023, doi.org/10.1117/12.2663071
[4] W. Xuan, J. Wang, H. Qi, Z. Chen, Z. Zheng, Y. Zhong, J. Xia, and N. Yokoya, "DynamicVL: Benchmarking multimodal large language models for dynamic city understanding," arXiv:2505.21076, 2025, doi.org/10.48550/arXiv.2505.21076
[5] Y. Li, W. Xu, Y. Zhang, Z. Wei, and M. Peng, "BTCChat: Advancing remote sensing bi-temporal change captioning with multimodal large language model," arXiv:2509.05895, 2025, doi.org/10.48550/arXiv.2509.05895
[6] S. Soni, A. Dudhane, H. Debary, M. Fiaz, M. A. Munir, M. S. Danish, P. Fraccaro, C. D. Watson, L. J. Klein, F. S. Khan, and S. Khan, "EarthDial: Turning multi-sensory Earth observations to interactive dialogues," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2025, doi.org/10.1109/CVPR52734.2025.01334
[7] X. Guo, J. Lao, B. Dang, Y. Zhang, L. Yu, L. Ru, L. Zhong, Z. Huang, K. Wu, D. Hu, H. He, J. Wang, J. Chen, M. Yang, Y. Zhang, and Y. Li, "SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27662–27673, doi.org/10.1109/CVPR52733...
[8] H. Herzog, F. Bastani, Y. Zhang, G. Tseng, J. Redmon, H. Sablon, R. Park, J. Morrison, A. Buraczynski, K. Farley, J. Hansen, A. Howe, P. A. Johnson, M. Otterlee, T. Schmitt, H. Pitelka, S. Daspit, R. Ratner, C. Wilhelm, S. Wood, M. Jacobi, H. Kerner, E. Shelhamer, A. Farhadi, R. Krishna, and P. Beukema, "OlmoEarth: Stable latent image modeling for multimo...
[9] K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. H. Khan, and F. S. Khan, "GeoChat: Grounded large vision-language model for remote sensing," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27831–27840, doi.org/10.1109/CVPR52733.2024.02629
[10] Y. Bazi, L. Bashmal, M. M. Al Rahhal, R. Ricci, and F. Melgani, "RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery," Remote Sensing, vol. 16, no. 9, Art. no. 1477, 2024, doi.org/10.3390/rs16091477
[11] W. Zhang, M. Cai, T. Zhang, Z. Yin, and X. Mao, "EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain," IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–20, 2024, doi.org/10.1109/TGRS.2024.3409624
[12] D. Muhtar, Z. Li, F. Gu, X. Zhang, and P. Xiao, "LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model," in Computer Vision – ECCV 2024, Lecture Notes in Computer Science, Cham, Switzerland: Springer, 2024, pp. 440–457, doi.org/10.1007/978-3-031-72904-1_26
[13] Z. Xiong, Y. Wang, F. Zhang, A. J. Stewart, J. Hanna, D. Borth, I. Papoutsis, B. Le Saux, G. Camps-Valls, and X. X. Zhu, "Neural plasticity-inspired foundation model for observing the Earth crossing modalities," arXiv:2403.15356, 2024, doi.org/10.48550/arXiv.2403.15356
[14] C. Wen, Y. Lin, X. Qu, N. Li, Y. Liao, H. Lin, and X. Li, "Remote sensing retrieval-augmented generation: Bridging remote sensing imagery and comprehensive knowledge with a multimodal dataset and retrieval-augmented generation model," IEEE Geosci. Remote Sens. Mag., pp. 2–20, 2026, doi.org/10.1109/MGRS.2025.3645852
[15] F. Wang, M. Chen, Y. Li, D. Wang, H. Wang, Z. Guo, Z. Wang, B. Shan, L. Lan, Y. Wang, H. Wang, W. Yang, B. Du, and J. Zhang, "GeoLLaVA-8K: Scaling remote-sensing multimodal large language models to 8K resolution," arXiv:2505.21375, 2025, doi.org/10.48550/arXiv.2505.21375
[16] L. Lan, F. Wang, X.-T. Zheng, Z. Wang, and X. Liu, "Efficient prompt tuning of large vision-language model for fine-grained ship classification," IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–10, 2025, doi.org/10.1109/TGRS.2024.3509721
[17] M. Guo, M. Wu, Y. Shen, H. Li, and C. Tao, "IFShip: Interpretable fine-grained ship classification with domain knowledge-enhanced vision-language models," Pattern Recognition, vol. 166, Art. no. 111672, 2025, doi.org/10.1016/j.patcog.2025.111672
[18] H. Zhan, Y. Song, X. Huang, X. Tan, and T. Zhang, "CARP: Cloud-adaptive robust prompting of vision-language models for ship classification under cloud occlusion," Frontiers in Remote Sensing, vol. 6, Art. no. 1662024, 2025, doi.org/10.3389/frsen.2025.1662024
[19] D. Liu, X. Liang, Y. Qi, Y. Xi, J. Jin, and J. Zhang, "VLPRSDet: A vision-language pretrained model for remote sensing object detection," Neurocomputing, vol. 658, Art. no. 131712, 2025, doi.org/10.1016/j.neucom.2025.131712
[20] G. Wang, J. Xie, T. Zhang, Y. Sun, H. Chen, Z. Yin, and J. Li, "LLaMA-Unidetector: A LLaMA-based universal framework for open-vocabulary object detection in remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–18, 2025, doi.org/10.1109/TGRS.2025.3564332
[21] T. Zhang, S. Wei, S. Chen, W. Yu, M. Luo, and S. Ji, "VectorLLM: Human-like extraction of structured building contours via multimodal LLMs," ISPRS J. Photogramm. Remote Sens., vol. 233, pp. 55–68, 2026, doi.org/10.1016/j.isprsjprs.2026.01.025
[22] J. Wang, H. Sun, T. Tang, Y. Sun, Q. He, L. Lei, and K. Ji, "Leveraging visual language model and generative diffusion model for zero-shot SAR target recognition," Remote Sensing, vol. 16, no. 16, Art. no. 2927, 2024, doi.org/10.3390/rs16162927
[23] Q. Ma, Z. Wang, W. Liu, X. Lu, B. Deng, P. Duan, X. Kang, and S. Li, "SARVLM: A vision language foundation model for semantic understanding and target recognition in SAR imagery," arXiv:2510.22665, 2025, doi.org/10.48550/arXiv.2510.22665
[24] W. Zhang, M. Cai, T. Zhang, G. Lei, Y. Zhuang, and X. Mao, "Popeye: A unified visual-language model for multi-source ship detection from remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 17, pp. 19813–19826, 2024, doi.org/10.1109/JSTARS.2024.3488034
[25] DARPA and AFRL, "Moving and Stationary Target Acquisition and Recognition (MSTAR) Public Release," Sensor Data Management System, Sep. 1995. [Online]. Available: sdms.afrl.af.mil/index.php?collection=mstar
[26] D. F. Ramirez, T. L. Overman, K. Jaskie, M. Kleine, and A. Spanias, "Towards a large language-vision question answering model for MSTAR automatic target recognition," in Automatic Target Recognition XXXV, K. Chen, R. I. Hammoud, and T. L. Overman, Eds., Proc. SPIE, vol. 13463, Art. no. 134630D, 2025, doi.org/10.1117/12.3053859
[27] B. Li, K. Zhang, H. Zhang, D. Guo, R. Zhang, F. Li, Y. Zhang, Z. Liu, and C. Li, "LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild," LLaVA Blog, May 2024. [Online]. Available: llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/
[28] D. F. Ramirez, T. Overman, K. Jaskie, J. Marvin, and A. Spanias, "SAR-RAG: ATR visual question answering by semantic search, retrieval, and MLLM generation," arXiv:2602.04712, 2026, doi.org/10.48550/arXiv.2602.04712
[29] Y. Wei, A. Xiao, Y. Ren, Y. Zhu, H. Chen, J. Xia, and N. Yokoya, "SARLANG-1M: A benchmark for vision-language modeling in SAR image understanding," IEEE Trans. Geosci. Remote Sens., 2026, doi.org/10.1109/TGRS.2026.3652099
[30] Z. Ma, X. Xiao, S. Dong, P. Wang, H. Wang, and Q. Pan, "SARChat-Bench-2M: A multi-task vision-language benchmark for SAR image interpretation," arXiv:2502.08168, 2025, doi.org/10.48550/arXiv.2502.08168
[31] JHU/APL PubGeo, "IARPA SMART Heavy Construction Dataset," GitHub repository. Accessed: May 8, 2026. [Online]. Available: github.com/pubgeo/IARPA-SMART
[32] European Space Agency, "Sentinel-1," Copernicus Sentinel Missions, ESA. Accessed: May 8, 2026. [Online]. Available: esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-1
[33] Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al., "Gemma 3 technical report," arXiv:2503.19786, 2025, doi.org/10.48550/arXiv.2503.19786
[34] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., "The Llama 3 herd of models," arXiv:2407.21783, 2024, doi.org/10.48550/arXiv.2407.21783
[35] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, et al., "Qwen3 technical report," arXiv:2505.09388, 2025, doi.org/10.48550/arXiv.2505.09388
[36] X. Cheng, W. Zeng, D. Dai, Q. Chen, B. Wang, Z. Xie, K. Huang, X. Yu, Z. Hao, Y. Li, H. Zhang, H. Zhang, D. Zhao, and W. Liang, "Conditional memory via scalable lookup: A new axis of sparsity for large language models," arXiv:2601.07372, 2026, doi.org/10.48550/arXiv.2601.07372
[37] DeepSeek-AI, "DeepSeek-V4: Towards highly efficient million-token context intelligence," Tech. Rep., Apr. 2026. [Online]. Available: huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf