
Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen · 2024 · arXiv 2407.06581

14 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background: 1 · dataset: 1 · method: 1

citation-polarity summary

background: 1 · use dataset: 1 · use method: 1

representative citing papers

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

Decomposes VLM distillation loss into orthogonal language and visual components and introduces Visual Gradient Steering to prioritize visual grounding over standard monolithic optimization.

Binding Visual Features Point by Point

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

Training VLMs to point via text induces serial processing that eliminates binding errors and enables compositional generalization on multi-object tasks.

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

cs.CV · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

VLMs achieve 53-97% on rearrangement planning but only 6-45% on occlusion and under 7% on reflections, with failures localized to visual token compression after the vision encoder.

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

ReflectCAP: Detailed Image Captioning with Reflective Memory

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

cs.CV · 2025-07-02 · unverdicted · novelty 6.0

Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

cs.CV · 2025-03-19 · unverdicted · novelty 6.0

MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.

Context Unrolling in Omni Models

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

citing papers explorer

Showing 14 of 14 citing papers.

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
cs.CV · 2026-05-29 · unverdicted · none · ref 18

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
cs.CV · 2026-06-01 · unverdicted · none · ref 36

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding
cs.CV · 2026-05-30 · unverdicted · none · ref 24

Binding Visual Features Point by Point
cs.CV · 2026-05-25 · unverdicted · none · ref 11

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?
cs.CV · 2026-05-19 · unverdicted · none · ref 30 · 2 links

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?
cs.AI · 2026-05-09 · unverdicted · none · ref 25

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
cs.CV · 2026-04-20 · unverdicted · none · ref 118

ReflectCAP: Detailed Image Captioning with Reflective Memory
cs.AI · 2026-04-14 · unverdicted · none · ref 29

MiMo-Embodied: X-Embodied Foundation Model Technical Report
cs.RO · 2025-11-20 · unverdicted · none · ref 43

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
cs.CV · 2025-07-02 · unverdicted · none · ref 48

Grounded Reinforcement Learning for Visual Reasoning
cs.CV · 2025-05-29 · unverdicted · none · ref 49

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
cs.CV · 2025-03-19 · unverdicted · none · ref 52

Context Unrolling in Omni Models
cs.CV · 2026-04-23 · unverdicted · none · ref 34

Seed1.5-VL Technical Report
cs.CV · 2025-05-11 · unverdicted · none · ref 109
