A review of multi-modal large language and vision models

Carolan, K · 2024 · arXiv 2404.01322

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

WildFireVQA is a new large-scale visual question answering benchmark that pairs RGB imagery with radiometric thermal measurements for aerial wildfire monitoring across six task categories.

Rethinking Video-Language Model from the Language Input Perspective

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

Introduces a plug-and-play framework that generates varied texts and uses attribute reasoning plus video-guided loss to improve state-of-the-art Video-Language Models.

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

cs.DC · 2026-05-03 · unverdicted · novelty 4.0 · 2 refs

On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

citing papers explorer

WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

cs.CV · 2026-04-22 · unverdicted · none · ref 6

Rethinking Video-Language Model from the Language Input Perspective

cs.CV · 2026-05-27 · unverdicted · none · ref 5

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

cs.DC · 2026-05-03 · unverdicted · none · ref 4 · 2 links
