train on the test set

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie · 2025 · arXiv 2511.04655

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

cs.CV · 2026-02-05 · unverdicted · novelty 7.0

GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.

Cambrian-P: Pose-Grounded Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.

Lifting Unlabeled Internet-level Data for 3D Scene Understanding

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

citing papers explorer

Showing 7 of 7 citing papers.

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens? cs.CV · 2026-05-20 · unverdicted · none · ref 5
The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 73
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning cs.CV · 2026-02-05 · unverdicted · none · ref 3
GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning cs.CV · 2026-05-28 · unverdicted · none · ref 5
GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 12
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
Lifting Unlabeled Internet-level Data for 3D Scene Understanding cs.CV · 2026-04-02 · unverdicted · none · ref 13
Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs cs.CV · 2026-05-04 · unverdicted · none · ref 7
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

train on the test set

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer