DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
hub Canonical reference
Kosmos-2: Grounding Multimodal Large Language Models to the World
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and per
co-cited works
representative citing papers
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
IQA-Spider unifies reasoning, grounding, and referring for multi-granularity image quality assessment via a four-task paradigm and two-stage LMM training with training-free text-to-point mapping.
HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.
OR-VSKC provides 28,190 synthetic operating room images plus an expert subset to expose and reduce visual-semantic knowledge conflicts in multimodal models for surgical risk detection.
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
GeoSearcher introduces anchor-centric reasoning supervised fine-tuning and process-faithful group relative policy optimization to improve MLLM-based remote sensing visual grounding.
InstanceControl uses VLMs to auto-generate instance masks from text and visual conditions, with adaptive refinement, to enable controllable multi-object image generation without manual labeling.
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
A plug-in Q-Synth Module plus Attention-to-Point Decoder converts text-conditioned attention in frozen MLLMs into point heatmaps, improving part-level grounding accuracy on multiple datasets.
citing papers explorer
-
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.
-
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
-
Vision Language Model Helps Private Information De-Identification in Vision Data
VisShield with OPTIC dataset enables VLMs to localize and mask private text in vision data via instruction tuning for privacy preservation.