Kosmos-2: Grounding Multimodal Large Language Models to the World
39 Pith papers cite this work. Polarity classification is still indexing.
abstract
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent referring expressions as links in Markdown, i.e., "[text span](bounding boxes)", where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays the foundation for the development of Embodied AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.
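The Markdown-link format described in the abstract is concrete enough to sketch. Below is a minimal, unofficial illustration of how a pixel-space bounding box can be quantized into discrete location tokens on a 32×32 grid (the grid size Kosmos-2 uses); the `<loc_*>` token names and the `box_to_location_tokens` helper are illustrative, not the model's exact vocabulary.

```python
# Minimal, unofficial sketch of Kosmos-2-style box quantization: a pixel-space
# box is mapped to two discrete grid cells (top-left and bottom-right corners)
# on a num_bins x num_bins grid, then rendered as location tokens inside a
# Markdown-style link. The <loc_*> token names are illustrative.

def box_to_location_tokens(box, image_size, num_bins=32):
    """Map a box (x1, y1, x2, y2) in pixels to (top-left, bottom-right) tokens."""
    x1, y1, x2, y2 = box
    width, height = image_size

    def cell_index(x, y):
        col = min(int(x / width * num_bins), num_bins - 1)   # clamp to grid
        row = min(int(y / height * num_bins), num_bins - 1)
        return row * num_bins + col                          # row-major index

    return f"<loc_{cell_index(x1, y1)}>", f"<loc_{cell_index(x2, y2)}>"

tl, br = box_to_location_tokens((100, 60, 380, 420), image_size=(512, 512))
print(f"[a snowman]({tl}{br})")  # -> [a snowman](<loc_102><loc_855>)
```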
claims ledger
- abstract: We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent referring expressions as links in Markdown, i.e., "[text span](bounding boxes)", where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications.
representative citing papers
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.
MMR-AD is a new benchmark dataset showing that current generalist MLLMs fall short of industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256×256 image generation, beating popular diffusion models without visual inductive biases.
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Controlled counterfactual perturbations reveal no correlation between embedding cosine similarity and approximation behavior in two visual grounding models (a generic sketch of such a probe follows this list).
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
CoNewsReader integrates user comments with an LLM to improve critical news reading on social media, with a 24-participant study showing gains in comprehension and critical thinking over baseline interfaces.
Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware evaluation.
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment (see the intermediate-layer sketch after this list).
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
A new 1.2M-caption dataset generated via GPT-4V improves LMMs by 222.8/22.0/22.3 points on MME and 2.7/1.3/1.5 points on MMBench when used for supervised fine-tuning.
Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
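The counterfactual probe mentioned above is generic enough to sketch without guessing at that paper's internals. In this hedged, model-agnostic sketch, `model_embed`, `model_predict`, and `perturb` are hypothetical stand-ins for a grounding model's encoder, output head, and perturbation scheme.

```python
# Hedged, model-agnostic sketch of a counterfactual probe: perturb an input,
# then compare (a) embedding cosine similarity against (b) the change in the
# model's output, and report their correlation. `model_embed`, `model_predict`,
# and `perturb` are hypothetical stand-ins, not any paper's actual API.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def probe(model_embed, model_predict, x, perturb, n=100, seed=0):
    """Return corr(embedding similarity, behavior change) over n counterfactuals."""
    rng = np.random.default_rng(seed)
    base_e, base_y = model_embed(x), model_predict(x)
    sims, deltas = [], []
    for _ in range(n):
        x_cf = perturb(x, rng)                          # controlled perturbation
        sims.append(cosine(base_e, model_embed(x_cf)))  # embedding-space similarity
        deltas.append(float(np.linalg.norm(base_y - model_predict(x_cf))))  # behavior shift
    return float(np.corrcoef(sims, deltas)[0, 1])
```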
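Likewise, the intermediate-layer embedding result above rests on a mechanism that is easy to demonstrate with Hugging Face transformers: hidden states of a CLIP-style vision tower can be read off any block via `output_hidden_states=True`. The checkpoint and the choice of the middle block below are illustrative assumptions, not the cited paper's recipe.

```python
# Sketch of reading an intermediate-layer embedding from a CLIP-style vision
# encoder with Hugging Face transformers. The checkpoint and the choice of the
# middle block are illustrative assumptions, not the cited paper's recipe.
import torch
from PIL import Image
from transformers import AutoProcessor, CLIPVisionModel

name = "openai/clip-vit-base-patch32"
model = CLIPVisionModel.from_pretrained(name)
processor = AutoProcessor.from_pretrained(name)

image = Image.new("RGB", (224, 224))          # placeholder image for the demo
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states holds the embedding output plus one tensor per transformer block
mid = out.hidden_states[len(out.hidden_states) // 2]   # pick a middle block
embedding = mid[:, 0]                                  # CLS token as the embedding
print(embedding.shape)                                 # torch.Size([1, 768]) for ViT-B/32
```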
citing papers explorer
- CoNewsReader: Supporting Comprehensive Understanding and Raising Critical Thoughts on Social Media News Through Comments
CoNewsReader integrates user comments with an LLM to improve critical news reading on social media, with a 24-participant study showing gains in comprehension and critical thinking over baseline interfaces.