pith. machine review for the scientific record.

arxiv: 2605.10739 · v1 · submitted 2026-05-11 · 📡 eess.IV · cs.AI · cs.CV

Recognition: no theorem link

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

Andreas Spanias, David F. Ramirez, Kristen Jaskie, Tim Overman

Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3

classification 📡 eess.IV · cs.AI · cs.CV
keywords remote sensing · visual question answering · multimodal large language models · Sentinel-2 · construction activity detection · spatiotemporal analysis · dataset creation · change reasoning

The pith

Converting heavy-construction annotations on Sentinel-2 satellite imagery into over two million visual question-answering examples lets multimodal language models reason about activity states, progression, and future developments at fixed geospatial sites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SMART-HC-VQA by transforming the IARPA SMART Heavy Construction dataset's site annotations, construction types, temporal phases, and geographic metadata into natural-language question-answer pairs tied to Sentinel-2 image chips. This reframes fixed geospatial sites as evolving targets whose attributes change across sparse observations, turning change detection into a task of understanding ongoing processes. The authors supply 21,837 image chips, 65,511 single-image VQA items, and roughly 2.3 million two-image temporal examples generated by pairwise combinatorial augmentation. They also implement a multi-image training setup based on LLaVA-NeXT Mistral-7B that accepts dated image inputs and learns from the metadata-derived questions. The result is positioned as a reproducible foundation for language-guided remote sensing, aimed at reasoning about process evolution rather than isolated detections.
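
As a concrete illustration of the conversion step, here is a minimal sketch of how one dated site annotation could be templated into single-image VQA items. The record fields and question templates are hypothetical stand-ins; the paper's exact schema is not reproduced above.

```python
# Minimal sketch of annotation-to-VQA conversion; field names and question
# templates are illustrative assumptions, not the paper's actual schema.
from dataclasses import dataclass

@dataclass
class SiteObservation:
    site_id: str            # hypothetical identifier tracing back to SMART-HC
    chip_path: str          # site-centered Sentinel-2 image chip
    date: str               # acquisition date (ISO format)
    construction_type: str  # e.g. "industrial facility"
    phase: str              # temporal-phase label, e.g. "active construction"

def single_image_vqa(obs):
    """Template one dated observation into question-answer pairs."""
    return [
        {"image": obs.chip_path,
         "question": f"What type of construction is visible at this site on {obs.date}?",
         "answer": obs.construction_type},
        {"image": obs.chip_path,
         "question": f"What construction phase is this site in on {obs.date}?",
         "answer": obs.phase},
    ]

obs = SiteObservation("site_0001", "chips/site_0001_2020-03-14.png",
                      "2020-03-14", "industrial facility", "active construction")
print(single_image_vqa(obs))
```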

Core claim

SMART-HC-VQA redefines an existing construction-site annotation collection as a temporally extended automatic target recognition and visual question answering challenge in which a fixed geospatial site functions as the target whose attributes and activity states evolve across sparse satellite observations, with the dataset and associated multi-image MLLM framework providing the concrete means to train models that reason about activity progression and potential future states.

What carries the argument

SMART-HC-VQA dataset of Sentinel-2 image chips paired with VQA triplets derived from construction annotations, temporal-phase labels, and observation relationships, plus the Image-Pairwise Combinatorial Augmentation method that expands it to 2.3 million temporal comparison examples.
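
A minimal sketch of what Image-Pairwise Combinatorial Augmentation could look like, assuming it simply enumerates date-ordered pairs of observations at the same site; the paper's actual pairing rules and question templates may differ.

```python
# Sketch of pairwise combinatorial expansion over one site's observations.
from itertools import combinations

def pairwise_temporal_examples(site_observations):
    """Expand one site's dated observations into two-image comparison examples.

    site_observations: list of (date, chip_path, phase) tuples for a single site.
    """
    examples = []
    # combinations over date-sorted observations yield earlier-then-later pairs;
    # whether reversed pairs are also generated is an assumption left open here.
    for (d1, img1, p1), (d2, img2, p2) in combinations(sorted(site_observations), 2):
        examples.append({
            "images": [img1, img2],
            "question": f"Between {d1} and {d2}, how did construction activity at this site progress?",
            "answer": f"The site progressed from '{p1}' to '{p2}'.",
        })
    return examples

# A site with n observations yields n*(n-1)/2 unordered pairs, which is how a
# modest chip count can expand into millions of temporal examples once one or
# more question templates are applied per pair.
site = [("2020-01-05", "a.png", "site preparation"),
        ("2020-06-18", "b.png", "active construction"),
        ("2021-02-02", "c.png", "post construction")]
print(len(pairwise_temporal_examples(site)))  # 3 pairs
```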

If this is right

  • Models trained this way can answer questions about construction phases and activity evolution instead of only reporting presence or absence of change.
  • The same conversion workflow can be applied to other remote-sensing annotation sets to create language-supervised temporal reasoning tasks.
  • Fixed-site analysis becomes possible in which the same geographic location is tracked across multiple dated observations as a single evolving target.
  • The multi-image input adaptation of LLaVA-NeXT demonstrates a practical route for incorporating dated satellite pairs directly into existing multimodal training pipelines.
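
To make the last point concrete, here is a minimal sketch of packaging a dated image pair as a LLaVA-style multi-image training record; the placeholder tokens and date-tagging scheme are assumptions, not the authors' exact input format.

```python
def temporal_vqa_record(example_id, img_before, date_before,
                        img_after, date_after, question, answer):
    """Assemble one two-image, dated training record in a LLaVA-style
    conversation layout (the placeholder convention is an assumption)."""
    return {
        "id": example_id,
        "images": [img_before, img_after],   # two site-centered Sentinel-2 chips
        "conversations": [
            {
                "from": "human",
                # one <image> placeholder per chip, each prefixed with its
                # acquisition date so the model can anchor temporal order
                "value": (f"Observation {date_before}: <image>\n"
                          f"Observation {date_after}: <image>\n{question}"),
            },
            {"from": "gpt", "value": answer},
        ],
    }

record = temporal_vqa_record(
    "site_0001_pair_003",
    "chips/site_0001_2020-01-05.png", "2020-01-05",
    "chips/site_0001_2020-06-18.png", "2020-06-18",
    "How did construction activity at this site progress between these dates?",
    "The site progressed from 'site preparation' to 'active construction'.")
print(record["conversations"][0]["value"])
```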

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could support downstream tasks such as forecasting construction timelines or flagging anomalous activity sequences once models are fine-tuned on the temporal pairs.
  • Extending the same annotation-to-VQA pipeline to non-construction domains like agriculture or disaster recovery would test whether the approach generalizes beyond heavy-equipment sites.
  • Because the examples are generated from existing annotations rather than new human labeling, the method offers a low-cost way to bootstrap large temporal VQA corpora from any labeled satellite time series.

Load-bearing premise

Transforming construction annotations and temporal relationships into natural language VQA triplets retains enough spatiotemporal detail for a multimodal model to learn genuine reasoning about activity states and progression.

What would settle it

Train the described LLaVA-NeXT-based model on SMART-HC-VQA and test whether its answers to questions about next-phase construction activity and progression order match human-annotated ground truth on held-out image pairs at rates better than a simple change-detection baseline that ignores the language component; if the language-guided model cannot beat that baseline, the core claim does not hold.
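
A minimal sketch of that head-to-head comparison, assuming answers can be scored by normalized exact match against the annotated ground truth; the scoring rule and the baseline are illustrative only.

```python
def accuracy(predictions, ground_truth):
    """Normalized exact-match accuracy; a deliberately simple scoring rule."""
    assert len(predictions) == len(ground_truth)
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

def settle_it(mllm_answers, baseline_answers, ground_truth):
    """The claim holds only if the language-guided model beats the
    change-detection baseline on held-out temporal VQA pairs."""
    mllm_acc = accuracy(mllm_answers, ground_truth)
    base_acc = accuracy(baseline_answers, ground_truth)
    return {"mllm_accuracy": mllm_acc,
            "baseline_accuracy": base_acc,
            "claim_supported": mllm_acc > base_acc}

# toy usage with one held-out pair
print(settle_it(["active construction"], ["change detected"], ["active construction"]))
```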

read the original abstract

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SMART-HC-VQA, a Sentinel-2-based VQA dataset derived from the IARPA SMART Heavy Construction annotations. It transforms construction-site labels, temporal-phase information, geographic metadata, and observation relationships into natural-language question-answer triplets, yielding 65,511 single-image and ~2.3 million two-image temporal examples via Image-Pairwise Combinatorial Augmentation. The manuscript details the Sentinel-2 retrieval/processing workflow, site-centered chip segmentation, label distributions, and an adapted multi-image LLaVA-NeXT Mistral-7B training framework intended to support reasoning about activity states, progression, and future developments.

Significance. If the generated VQA triplets preserve sufficient spatiotemporal structure and the framework can be shown to learn non-trivial reasoning, the contribution would supply a large-scale, reproducible resource for language-guided remote-sensing activity analysis. The scale of the temporal-comparison examples and the explicit linkage to existing heavy-construction annotations represent a concrete step toward process-level understanding rather than isolated change detection.

major comments (2)
  1. [Abstract / dataset-construction workflow] The central claim that the annotation-to-VQA conversion supplies a foundation for MLLMs to reason about activity progression rests on the untested assumption that natural-language reformulation preserves temporal-phase and observation-relationship information without distortion or loss; no quantitative check (e.g., inter-annotator agreement on temporal ordering or information-retrieval metrics on the generated triplets) is reported.
  2. [Implemented multi-image MLLM training framework] The manuscript describes the LLaVA-NeXT adaptation and the training setup for the 2.3 million examples yet provides no training runs, accuracy metrics, baseline comparisons, or error analysis on held-out VQA examples, leaving the utility of the dataset and framework for spatiotemporal sensemaking unsupported by evidence.
minor comments (2)
  1. [Abstract] The exact counts and generation statistics for the two-image temporal examples should be stated precisely rather than approximated, and the Image-Pairwise Combinatorial Augmentation procedure should include pseudocode or a small worked example for reproducibility.
  2. [dataset analysis] Distributions of site size, observation count, temporal coverage, construction type, and phase labels are analyzed but not accompanied by tables or figures in the provided text; adding them would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the potential of the SMART-HC-VQA dataset for advancing language-guided remote sensing analysis. We address each of the major comments below.

read point-by-point responses
  1. Referee: The central claim that the annotation-to-VQA conversion supplies a foundation for MLLMs to reason about activity progression rests on the untested assumption that natural-language reformulation preserves temporal-phase and observation-relationship information without distortion or loss; no quantitative check (e.g., inter-annotator agreement on temporal ordering or information-retrieval metrics on the generated triplets) is reported.

    Authors: The VQA triplets are produced via a fully automated, deterministic pipeline that directly maps the original SMART annotations—including temporal phases, construction types, and observation relationships—into templated natural-language questions and answers. This design ensures that the information is preserved by construction rather than through subjective reformulation. Nevertheless, we agree that explicit validation would strengthen the claim. In the revised manuscript, we will add quantitative checks, such as retrieval metrics measuring how accurately key attributes (e.g., phase labels and temporal ordering) can be recovered from the generated triplets, to empirically demonstrate fidelity. revision: yes

  2. Referee: The manuscript describes the LLaVA-NeXT adaptation and the training setup for the 2.3 million examples yet provides no training runs, accuracy metrics, baseline comparisons, or error analysis on held-out VQA examples, leaving the utility of the dataset and framework for spatiotemporal sensemaking unsupported by evidence.

    Authors: The manuscript's primary focus is the creation and characterization of the SMART-HC-VQA dataset together with the specification of an adapted multi-image LLaVA-NeXT training framework. While the framework has been implemented, the current version of the paper does not report training experiments or performance metrics. We concur that providing such evidence would better substantiate the utility for spatiotemporal sensemaking. Accordingly, the revised manuscript will incorporate preliminary experimental results, including accuracy metrics on held-out temporal VQA examples and comparisons against single-image baselines, along with a brief error analysis. revision: yes
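
The fidelity check proposed in the first response could take the form of a round-trip test: parse each generated answer back into the attribute it encodes and compare it with the source annotation. A minimal sketch follows, with a hypothetical phase vocabulary and field names, restricted to single-image triplets whose answer names one phase.

```python
# Round-trip fidelity check; PHASE_VOCAB and the triplet/annotation field names
# are assumptions, not the paper's schema.
PHASE_VOCAB = {"site preparation", "active construction", "post construction"}

def recover_phase(answer_text):
    """Recover a phase label from templated answer text, if one is present."""
    text = answer_text.lower()
    for phase in PHASE_VOCAB:
        if phase in text:
            return phase
    return None

def triplet_fidelity(triplets, source_annotations):
    """Fraction of generated single-phase triplets whose answer still encodes
    the phase label recorded in the corresponding source annotation."""
    recovered = sum(
        recover_phase(t["answer"]) == source_annotations[t["site_id"]]["phase"]
        for t in triplets)
    return recovered / len(triplets)
```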

Circularity Check

0 steps flagged

No circularity: work is external dataset curation plus public MLLM adaptation

full rationale

The manuscript describes creation of SMART-HC-VQA by transforming annotations from the external IARPA SMART Heavy Construction dataset into VQA triplets, plus adaptation of the publicly available LLaVA-NeXT model for multi-image input. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear. The claimed foundation is the reproducibility of the curation workflow and training setup, which does not reduce to any input by construction. This matches the default non-circular case for dataset papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Sentinel-2 imagery processing and annotation-to-VQA conversion faithfully retain spatiotemporal information without introducing artifacts that would prevent meaningful MLLM learning.

axioms (1)
  • domain assumption Sentinel-2 imagery can be segmented into site-centered chips while maintaining traceability to SMART-HC annotations
    Invoked in the workflow for retrieving, processing, and segmenting large satellite tiles

pith-pipeline@v0.9.0 · 5580 in / 1328 out tokens · 85956 ms · 2026-05-12T04:30:06.587007+00:00 · methodology

