MIRCaps supplies 141,364 images, 981,947 image-level captions, 1,742,264 region-level captions, and 1,391,779 bounding boxes to enable fine-grained vision-language learning across general and surveillance domains.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
MIRCaps: A Large-Scale Mixed-Domain Dataset with Image-Level and Region-Level Captions for Fine-Grained Vision-Language Learning
MIRCaps supplies 141,364 images, 981,947 image-level captions, 1,742,264 region-level captions, and 1,391,779 bounding boxes to enable fine-grained vision-language learning across general and surveillance domains.