EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei · 2024 · arXiv 2406.05756

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

ReVSI rebuilds 3D spatial reasoning benchmarks for VLMs by re-annotating objects and geometry across 381 scenes and creating verified QA pairs that match actual model inputs like 16-64 frames.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

cs.RO · 2025-08-19 · conditional · novelty 6.0

Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

cs.CV · 2025-05-22 · unverdicted · novelty 6.0

Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

cs.CV · 2026-05-18

citing papers explorer

Showing 6 of 6 citing papers.

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

cs.CV · 2026-05-22 · unverdicted · none · ref 51

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

cs.CV · 2026-04-27 · unverdicted · none · ref 1

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · none · ref 15

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

cs.RO · 2025-08-19 · conditional · none · ref 8

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

cs.CV · 2025-05-22 · unverdicted · none · ref 20

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

cs.CV · 2026-05-18 · unreviewed · ref 5
