WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring
Pith reviewed 2026-05-10 01:04 UTC · model grok-4.3
The pith
WildFireVQA introduces a benchmark of 6,097 aerial RGB-thermal samples with 207,298 questions to test multimodal models on wildfire monitoring tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WildFireVQA supplies 6,097 RGB-thermal samples, each containing an RGB image, a color-mapped thermal visualization, and a radiometric TIFF, paired with 34 questions per sample for a total of 207,298 multiple-choice items. The benchmark spans six operational categories and uses a hybrid annotation method that merges MLLM-generated answers with deterministic sensor labeling, manual verification, and intra- and inter-frame consistency checks. Evaluation of representative MLLMs under RGB-only, thermal-only, and retrieval-augmented settings demonstrates that RGB currently yields the highest accuracy across tasks, yet thermal retrieval produces measurable gains for stronger models and exposes the limitations of existing MLLMs' temperature-grounded reasoning in safety-critical wildfire scenarios.
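To make the data layout concrete, here is a minimal sketch of how one benchmark sample might be represented in code. The schema, field names, and category identifiers are illustrative assumptions inferred from the abstract, not the released format.

```python
from dataclasses import dataclass, field
from typing import List

# The six operational categories named in the abstract.
CATEGORIES = [
    "presence_detection", "classification", "distribution_segmentation",
    "localization_direction", "cross_modal_reasoning", "flight_planning",
]

@dataclass
class WildFireVQASample:
    """One of the 6,097 RGB-thermal samples (hypothetical schema).

    Each sample pairs three views of the same scene with 34
    multiple-choice questions: 6,097 x 34 = 207,298 items total.
    """
    sample_id: str
    rgb_path: str          # standard RGB aerial image
    thermal_vis_path: str  # color-mapped thermal visualization
    radiometric_tiff: str  # per-pixel radiometric temperature TIFF
    questions: List[dict] = field(default_factory=list)  # 34 MCQs, each tagged
                                                         # with one CATEGORIES entry
```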
What carries the argument
The WildFireVQA benchmark itself, which supplies aligned RGB images, color-mapped thermal visualizations, radiometric TIFF files, and verified question-answer pairs across six wildfire intelligence categories.
If this is right
- Developers can now measure and improve temperature-grounded reasoning in MLLMs using a public wildfire-specific testbed.
- Retrieval of radiometric statistics becomes a concrete, testable technique for boosting multimodal performance on operational tasks (a sketch of such a retrieval step follows this list).
- The six task categories supply a structured way to diagnose where current models fail in detection, localization, and planning for fires.
- Open release of the dataset and evaluation code allows direct comparison of future models against the reported RGB and thermal baselines.
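Per the paper's supplementary prompt, the retrieval-augmented setting hands the model a compact temperature summary (minimum, maximum, and standard deviation) computed from the paired radiometric TIFF. The sketch below shows one plausible implementation of that step; the `tifffile` reader, the unit handling, and the exact wording are assumptions rather than the authors' released code.

```python
import numpy as np
import tifffile  # assumed reader for the radiometric thermal TIFFs

def thermal_context(tiff_path: str) -> str:
    """Build the retrieved temperature summary appended to the MLLM prompt.

    Mirrors the min/max/std summary described in the paper's supplementary
    prompt template; formatting and units are illustrative.
    """
    temps = tifffile.imread(tiff_path).astype(np.float32)
    return (
        "You are also given a compact temperature summary computed from "
        "the paired radiometric thermal TIFF:\n"
        f"- Minimum temperature: {temps.min():.1f}\n"
        f"- Maximum temperature: {temps.max():.1f}\n"
        f"- Temperature standard deviation: {temps.std():.1f}"
    )
```

The appeal of this protocol is its lightness: the model never ingests raw radiometric data, and three scalar statistics ride along in the text prompt without any multimodal retraining.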
Where Pith is reading between the lines
- The benchmark could be extended to video sequences to test temporal reasoning in evolving fire scenarios.
- Limitations in thermal handling may motivate creation of specialized thermal feature encoders rather than reliance on general vision-language pretraining.
- Operational drone systems might adopt the retrieval-augmented protocol as a lightweight way to incorporate temperature data without full multimodal retraining.
Load-bearing premise
The hybrid annotation process of MLLM generation, sensor-driven deterministic labels, and consistency checks produces ground-truth answers reliable enough for safety-critical wildfire tasks.
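The paper's actual consistency rules are not spelled out in the material above, so the toy sketch below only illustrates the kind of deterministic check such a stage could apply; the field names and threshold are invented for illustration.

```python
def intra_frame_consistent(answers: dict) -> bool:
    """Toy intra-frame rule: answers within one sample must not contradict.

    Example: a sample whose presence question says "no fire" cannot also
    report a nonzero burning-area fraction. (Hypothetical fields.)
    """
    if answers.get("fire_present") == "no":
        return answers.get("fire_area_fraction", 0.0) == 0.0
    return True

def inter_frame_consistent(prev: dict, curr: dict, max_jump: float = 0.3) -> bool:
    """Toy inter-frame rule: quantities such as burning-area fraction
    should change smoothly between consecutive frames of one flight."""
    delta = abs(prev.get("fire_area_fraction", 0.0)
                - curr.get("fire_area_fraction", 0.0))
    return delta <= max_jump
```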
What would settle it
Independent expert review of a random subset of the dataset answers that finds error rates above 5 percent, or a follow-on study in which models scoring above 80 percent on the benchmark still produce unsafe recommendations in controlled live-fire drone flights.
Original abstract
Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB-thermal samples, where each sample includes an RGB image, a color-mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple-choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)-based answer generation with sensor-driven deterministic labeling, manual verification, and intra-frame and inter-frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval-augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature-grounded reasoning and the limitations of existing MLLMs in safety-critical wildfire scenarios. The dataset and benchmark code are open-source at https://github.com/mobiiin/WildFire_VQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring comprising 6,097 RGB-thermal samples (each with RGB image, color-mapped thermal visualization, and radiometric TIFF) paired with 207,298 multiple-choice questions across six task categories: presence/detection, classification, distribution/segmentation, localization/direction, cross-modal reasoning, and flight planning. It describes a multi-stage annotation pipeline that combines MLLM-based answer generation, sensor-driven deterministic labeling, manual verification, and intra-/inter-frame consistency checks. The authors evaluate representative MLLMs under RGB-only, thermal-only, and retrieval-augmented settings using radiometric statistics, reporting that RGB remains the strongest modality overall while retrieved thermal context improves performance for stronger models.
Significance. If the ground-truth annotations prove reliable, this benchmark would fill an important gap by providing the first large-scale VQA resource that grounds wildfire reasoning in radiometric thermal measurements, supporting development of multimodal models for safety-critical aerial monitoring. The open release of the dataset and benchmark code is a clear strength that enables reproducibility and community follow-up. The empirical finding that current MLLMs still struggle with temperature-grounded reasoning even when thermal context is supplied is a useful signal for the field.
Major comments (2)
- [Section 3] Annotation pipeline (Section 3): The central claim that the multi-stage process (MLLM generation + sensor-driven labels + manual verification + consistency checks) produces sufficiently reliable ground truth for safety-critical wildfire intelligence tasks is load-bearing, yet the manuscript reports no quantitative metrics such as inter-annotator agreement, fraction of the 207,298 questions that received manual inspection, or measured error rate on a held-out expert sample. Without these numbers it is impossible to assess residual hallucination or inconsistency rates.
- [Section 5] Experimental results (Section 5): The modality-comparison claims rest on reported performance differences across task categories, but the manuscript provides neither complete per-model/per-task accuracy tables nor statistical significance tests for the stated gains from retrieved thermal context. This weakens the ability to evaluate the strength of the conclusion that RGB remains strongest while thermal retrieval helps stronger MLLMs.
Minor comments (2)
- [Abstract / Section 3] The abstract states the total question count but does not break down the number of questions per task category; a small table or sentence in Section 3 would improve clarity.
- [Figures 1-3] Figure captions for the sample visualizations could explicitly note the radiometric temperature range and color-mapping function used, to aid readers in interpreting the thermal channel.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight key areas where additional transparency will strengthen the presentation of the benchmark's reliability and experimental findings. We address each major comment point-by-point below, outlining the specific revisions we will incorporate.
Point-by-point responses
Referee: [Section 3] Annotation pipeline (Section 3): The central claim that the multi-stage process (MLLM generation + sensor-driven labels + manual verification + consistency checks) produces sufficiently reliable ground truth for safety-critical wildfire intelligence tasks is load-bearing, yet the manuscript reports no quantitative metrics such as inter-annotator agreement, fraction of the 207,298 questions that received manual inspection, or measured error rate on a held-out expert sample. Without these numbers it is impossible to assess residual hallucination or inconsistency rates.
Authors: We agree that explicit quantitative metrics are essential to substantiate the reliability of the ground-truth annotations, especially given the safety-critical nature of wildfire monitoring. While the manuscript describes the multi-stage pipeline, it does not report the requested numerical details. In the revised version, we will expand Section 3 to include: the fraction of questions that received manual inspection, inter-annotator agreement computed on a sampled subset of the data, and an estimated error rate based on consistency checks together with validation on a held-out expert-annotated sample. These additions will enable readers to better evaluate residual error rates.
Revision: yes
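For readers unfamiliar with the promised metric, a minimal sketch of inter-annotator agreement via Cohen's kappa is shown below; the labels and the use of scikit-learn are illustrative, not the authors' verification pipeline.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical answers from two independent annotators on the same
# sampled subset of multiple-choice items (options encoded A-D).
annotator_1 = ["A", "C", "B", "A", "D", "B", "A", "C"]
annotator_2 = ["A", "C", "B", "B", "D", "B", "A", "C"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```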
Referee: [Section 5] Experimental results (Section 5): The modality-comparison claims rest on reported performance differences across task categories, but the manuscript provides neither complete per-model/per-task accuracy tables nor statistical significance tests for the stated gains from retrieved thermal context. This weakens the ability to evaluate the strength of the conclusion that RGB remains strongest while thermal retrieval helps stronger MLLMs.
Authors: We acknowledge that complete per-model/per-task tables and statistical significance tests are necessary for a rigorous evaluation of the modality comparisons. The current manuscript summarizes key trends but omits the full tables and formal tests. In the revised manuscript, we will include exhaustive accuracy tables for all models and task categories (in the main text or as an appendix) and report the results of appropriate statistical significance tests (e.g., McNemar's test for paired comparisons) on the observed performance differences, including gains from retrieved thermal context. This will strengthen the evidential basis for our conclusions.
Revision: yes
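A minimal sketch of the McNemar test the rebuttal names, applied to hypothetical paired per-question correctness for one model with and without retrieved thermal context (the data here is invented for illustration):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-question correctness (1 = correct) for the same items,
# scored without and with retrieved thermal context.
base = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
retr = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 0])

# 2x2 table of paired outcomes; McNemar tests the discordant cells.
table = [
    [np.sum((base == 1) & (retr == 1)), np.sum((base == 1) & (retr == 0))],
    [np.sum((base == 0) & (retr == 1)), np.sum((base == 0) & (retr == 0))],
]
result = mcnemar(table, exact=True)  # exact binomial test, suited to small counts
print(f"McNemar p-value: {result.pvalue:.3f}")
```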
Circularity Check
No circularity: empirical dataset and benchmark paper with no derivations
Full rationale
This is a dataset introduction and benchmarking paper. It defines WildFireVQA by describing data collection (6,097 RGB-thermal samples), question generation (34 questions per sample yielding 207k MCQs), and an annotation pipeline (MLLM generation + deterministic labeling + manual verification + consistency checks). Experiments report direct empirical accuracies of existing MLLMs under RGB, thermal, and retrieval settings. No equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or self-citation chains appear in the provided text. All claims reduce to measurements on the newly constructed data rather than any self-referential derivation, satisfying the self-contained criterion.
Reference graph
Works this paper leans on
- [1] Fatemeh Afghah, Abolfazl Razi, Jacob Chakareski, and Jonathan Ashdown. Wildfire monitoring in remote areas using autonomous unmanned aerial vehicles. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 835–840. IEEE, 2019.
- [2] Mohamad M Al Rahhal, Yakoub Bazi, Sara O Alsaleh, Muna Al-Razgan, Mohamed Lamine Mekhalfi, Mansour Al Zuair, and Naif Alajlan. Open-ended remote sensing visual question answering with transformers. International Journal of Remote Sensing, 43(18):6809–6823, 2022.
- [3] Niloufar Alipour Talemi, Hossein Kashiani, Hossein R Nowdeh, and Fatemeh Afghah. DiSa: Directional saliency-aware prompt learning for generalizable vision-language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 37–46, 2025.
- [4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- [5] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [6] Kilian Carolan, Laura Fennelly, and Alan F Smeaton. A review of multi-modal large language and vision models. arXiv preprint arXiv:2404.01322, 2024.
- [7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [8] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024.
- [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [10] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024.
- [11] Tom G Farr, Paul A Rosen, Edward Caro, Robert Crippen, Riley Duren, Scott Hensley, Michael Kobrick, Mimi Paller, Ernesto Rodriguez, Ladislav Roth, et al. The shuttle radar topography mission. Reviews of Geophysics, 45(2), 2007.
- [12] Bryce Hopkins, Leo O'Neill, Fatemeh Afghah, Abolfazl Razi, Eric Rowell, Adam Watts, Peter Fule, and Janice Coen. FLAME 2: Fire detection and modeling: Aerial multi-spectral image dataset. IEEE DataPort, 2023.
- [13] Bryce Hopkins, Leo O'Neill, Michael Marinaccio, Eric Rowell, Russell Parsons, Sarah Flanary, Irtija Nazim, Carl Seielstad, and Fatemeh Afghah. FLAME 3 dataset: Unleashing the power of radiometric thermal UAV imagery for wildfire management. arXiv preprint arXiv:2412.02831, 2024.
- [14] Andrew Jong, Mukai Yu, Devansh Dhrafani, Siva Kailas, Brady Moon, Katia Sycara, and Sebastian Scherer. WIT-UAS: A wildland-fire infrared thermal dataset to detect crew assets from aerial views. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11464–11471. IEEE, 2023.
- [15] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024.
- [16] Kun Li, George Vosselman, and Michael Ying Yang. HRVQA: A visual question answering benchmark for high-resolution aerial images. ISPRS Journal of Photogrammetry and Remote Sensing, 214:65–81, 2024.
- [17] Xiang Li, Jian Ding, and Mohamed Elhoseiny. VRSBench: A versatile vision-language benchmark dataset for remote sensing image understanding. Advances in Neural Information Processing Systems, 37:3229–3242, 2024.
- [18] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
- [19] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
- [20] Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. RSVQA: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020.
- [21] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
- [22] Julius Pesonen, Anna-Maria Raita-Hakola, Jukka Joutsalainen, Teemu Hakala, Waleed Akhtar, Niko Koivumäki, Lauri Markelin, Juha Suomalainen, Raquel Alves de Oliveira, Ilkka Pölönen, et al. Boreal forest fire: UAV-collected wildfire detection and smoke segmentation dataset. Scientific Data, 12(1):1419, 2025.
- [23] Mayamin Hamid Raha, Ali Reza Tavakkoli, Chris Webb, Mobin Habibpour, Janice Coen, Eric Rowell, and Fatemeh Afghah. FireTwin: Digital twin advancing multi-modal sensing, interactive analytics for tactical wildfire response. In 2025 IEEE 30th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), pages 1–…, 2025.
- [24] Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Roberson Murphy. FloodNet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access, 9:89644–89654, 2021.
- [25] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–…, 2011.
- [26] Argho Sarkar and Maryam Rahnemoonfar. RescueNet-VQA: A large-scale visual question answering benchmark for damage assessment. In IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium, pages 1150–…, 2023.
- [27] Alireza Shamsoshoara, Fatemeh Afghah, Abolfazl Razi, Liming Zheng, Peter Z Fulé, and Erik Blasch. Aerial imagery pile burn detection using deep learning: The FLAME dataset. Computer Networks, 193:108001, 2021.
- [28] Niloufar Alipour Talemi, Hossein Kashiani, and Fatemeh Afghah. Style-Pro: Style-guided prompt learning for generalizable vision-language models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6207–6216. IEEE, 2025.
- [29] Chris Webb, Mobin Habibpour, Mayamin Hamid Raha, Ali Reza Tavakkoli, Janice Coen, and Fatemeh Afghah. Fire-VLM: A vision-language-driven reinforcement learning framework for UAV wildfire tracking in a physics-grounded fire digital twin. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 1493–1502, 2026.
- [30] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024.
- [31] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.
- [32] Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. MM-LLMs: Recent advances in multimodal large language models. Findings of the Association for Computational Linguistics: ACL 2024, pages 12401–12430, 2024.
- [33] Xiangtao Zheng, Binqiang Wang, Xingqian Du, and Xiaoqiang Lu. Mutual attention inception network for remote sensing visual question answering. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021.
- [34] Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, and Mukesh Prasad. RSVLM-QA: A benchmark dataset for remote sensing vision language model-based question answering. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12905–12911.