What’s "up" with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785

10 Pith papers cite this work. Polarity classification is still indexing.


Citing papers by year: 2026 (5), 2025 (5)

representative citing papers

PhotoFlow: Agentic 3D Virtual Photography Missions

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

PhotoFlow is a closed-loop agent framework that searches for camera parameters in 3D scenes according to language intent and outperforms one-shot, reflection, and random baselines on the new VPhotoBench of 47 scenes and 141 missions.

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

cs.AI · 2025-10-09 · unverdicted · novelty 6.0

Introduces a group matching score for better evaluation of compositional reasoning and the Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving state-of-the-art gains, including surpassing GPT-4.1 and estimated human performance.

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

cs.RO · 2025-08-19 · conditional · novelty 6.0

Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

GeoWorld-VLM distills geometric structure from camera-conditioned world models into VLMs by aligning visual features, improving spatial reasoning by about 4% on What'sUp and VSR benchmarks across two architectures while preserving language capabilities.
