Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al · 2023

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

representative citing papers

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

cs.CV · 2026-05-11 · conditional · novelty 7.0 · 2 refs

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.

Video models are zero-shot learners and reasoners

cs.LG · 2025-09-24 · unverdicted · novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

ImgEdit: A Unified Image Editing Dataset and Benchmark

cs.CV · 2025-05-26 · conditional · novelty 6.0

ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

cs.CV · 2025-07-03 · unverdicted · novelty 5.0

MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

cs.CV · 2025-05-14 · conditional · novelty 5.0

BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

citing papers explorer

Showing 8 of 8 citing papers.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views cs.CV · 2026-05-08 · unverdicted · none · ref 32
GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 40
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CV · 2026-05-11 · conditional · none · ref 17 · 2 links
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs cs.CV · 2026-05-11 · unverdicted · none · ref 19
Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.
Video models are zero-shot learners and reasoners cs.LG · 2025-09-24 · unverdicted · none · ref 11
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
ImgEdit: A Unified Image Editing Dataset and Benchmark cs.CV · 2025-05-26 · conditional · none · ref 31
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details cs.CV · 2025-07-03 · unverdicted · none · ref 29
MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset cs.CV · 2025-05-14 · conditional · none · ref 13
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

Segment anything

fields

years

verdicts

representative citing papers

citing papers explorer