hub

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Krishna, R. · 2016 · cs.CV · arXiv 1602.07332

13 Pith papers cite this work. Polarity classification is still indexing.

abstract

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.
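The carriage example in the abstract can be read as a lookup over relationship triples. The following is a minimal sketch of that reading only. The `Relationship` class and the helpers `objects_of` and `subjects_of` are hypothetical names chosen for illustration; they are not the dataset's file format or any published API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One pairwise relationship, written predicate(subject, object)."""
    predicate: str
    subject: str
    object: str


# The two relationships named in the abstract (illustrative data only).
scene = [
    Relationship("riding", "man", "carriage"),
    Relationship("pulling", "horse", "carriage"),
]


def objects_of(relationships, predicate, subject):
    """Return every o such that predicate(subject, o) is in the scene."""
    return [r.object for r in relationships
            if r.predicate == predicate and r.subject == subject]


def subjects_of(relationships, predicate, obj):
    """Return every s such that predicate(s, obj) is in the scene."""
    return [r.subject for r in relationships
            if r.predicate == predicate and r.object == obj]


# "What vehicle is the person riding?"
vehicles = objects_of(scene, "riding", "man")
# What pulls that vehicle? This second hop is what yields "horse-drawn".
pullers = subjects_of(scene, "pulling", vehicles[0])
```

Answering the question takes two hops: the first finds what the man is riding, the second finds what pulls it. This is why the abstract argues that object labels alone are not enough.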

citation-role summary

dataset 2

citation-polarity summary

use dataset 2

representative citing papers

Deep Modular Co-Attention Networks for Visual Question Answering

cs.CV · 2019-06-25 · conditional · novelty 7.0

MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA

cs.CV · 2025-09-12 · unverdicted · novelty 6.0

LaV-CoT introduces a multi-stage visual CoT pipeline and GRPO training with language-consistency rewards, delivering up to 9.5% accuracy gains on multilingual VQA benchmarks over similar-sized open models.

Otter: A Multi-Modal Model with In-Context Instruction Tuning

cs.CV · 2023-05-05 · unverdicted · novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

Florence: A New Foundation Model for Computer Vision

cs.CV · 2021-11-22 · unverdicted · novelty 6.0

Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

PIQA: Reasoning about Physical Commonsense in Natural Language

cs.CL · 2019-11-26 · accept · novelty 6.0

PIQA is a new benchmark showing that current AI models achieve 77% on physical commonsense questions versus humans at 95%.

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.

Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI

cs.AI · 2026-02-09 · unverdicted · novelty 5.0

VIRF combines a deterministic logic tutor with LLM planners to achieve zero hazardous action rates in home safety tasks through iterative plan repairs.

VeriGraph: Scene Graphs for Execution Verifiable Robot Planning

cs.RO · 2024-11-15 · unverdicted · novelty 5.0

VeriGraph integrates VLMs with scene-graph verification to raise robot task success rates by 30-58% over baselines in manipulation scenarios.

GIT: A Generative Image-to-text Transformer for Vision and Language

cs.CV · 2022-05-27 · unverdicted · novelty 5.0

GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.

UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

cs.CV · 2026-04-25 · unverdicted · novelty 5.0

UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in some cases.

Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification

cs.CV · 2024-02-29 · unverdicted · novelty 4.0

Hybrid knowledge graph embeddings fused with vision transformer features outperform standard techniques on abstract concept classification by integrating situated perceptual knowledge from a new cultural image resource.

Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

cs.CV · 2026-04-21 · unverdicted · novelty 2.0

The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.

citing papers explorer

Showing 13 of 13 citing papers.

Deep Modular Co-Attention Networks for Visual Question Answering · cs.CV · 2019-06-25 · conditional · none · ref 18 · internal anchor
MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models · cs.CV · 2026-04-27 · unverdicted · none · ref 20
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA · cs.CV · 2025-09-12 · unverdicted · none · ref 25 · internal anchor
LaV-CoT introduces a multi-stage visual CoT pipeline and GRPO training with language-consistency rewards, delivering up to 9.5% accuracy gains on multilingual VQA benchmarks over similar-sized open models.
Otter: A Multi-Modal Model with In-Context Instruction Tuning · cs.CV · 2023-05-05 · unverdicted · none · ref 42 · internal anchor
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
Florence: A New Foundation Model for Computer Vision · cs.CV · 2021-11-22 · unverdicted · none · ref 13 · internal anchor
Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.
PIQA: Reasoning about Physical Commonsense in Natural Language · cs.CL · 2019-11-26 · accept · none · ref 61 · internal anchor
PIQA is a new benchmark showing that current AI models achieve 77% on physical commonsense questions versus humans at 95%.
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning · cs.CV · 2026-04-16 · unverdicted · none · ref 23
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.
Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI · cs.AI · 2026-02-09 · unverdicted · none · ref 2 · internal anchor
VIRF combines a deterministic logic tutor with LLM planners to achieve zero hazardous action rates in home safety tasks through iterative plan repairs.
VeriGraph: Scene Graphs for Execution Verifiable Robot Planning · cs.RO · 2024-11-15 · unverdicted · none · ref 16 · internal anchor
VeriGraph integrates VLMs with scene-graph verification to raise robot task success rates by 30-58% over baselines in manipulation scenarios.
GIT: A Generative Image-to-text Transformer for Vision and Language · cs.CV · 2022-05-27 · unverdicted · none · ref 17 · internal anchor
GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks · cs.CV · 2026-04-25 · unverdicted · none · ref 22
UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in some cases.
Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification · cs.CV · 2024-02-29 · unverdicted · none · ref 39 · internal anchor
Hybrid knowledge graph embeddings fused with vision transformer features outperform standard techniques on abstract concept classification by integrating situated perceptual knowledge from a new cultural image resource.
Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge · cs.CV · 2026-04-21 · unverdicted · none · ref 19
The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.
