YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
8 Pith papers cite this work.

Fields: cs.CV

8 representative citing papers
- Robust Promptable Video Object Segmentation
  The paper creates a real-world corruption benchmark for promptable video object segmentation and proposes MoGA, which uses object-specific memory to improve robustness and temporal consistency under adverse conditions.
- VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
  VEBench is the first benchmark to measure LMMs on video editing technique recognition and operation simulation; its 3.9K videos and 3,080 human-verified QA pairs reveal a large gap to human performance.
- YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
  YOSE accelerates DiT-based video object removal by up to 2.5x, using BVI for adaptive token selection and DiffSim to simulate the effect of unmasked tokens, while preserving visual quality.
- Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
  LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.
- X2SAM: Any Segmentation in Images and Videos
  X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
- CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
  Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.
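As a rough illustration of the modulation idea (not CMTM's actual mechanism; the coefficients below are hand-picked stand-ins for learned projections), one stream can scale and shift the other stream's tokens, FiLM-style:

```python
def modulate(appearance_tokens, motion_tokens):
    """Fuse two token streams: each motion token produces a scale and a
    shift that modulate the matching appearance token (FiLM-style gating).
    The 0.5 / 0.25 coefficients are illustrative, not learned weights."""
    fused = []
    for a, m in zip(appearance_tokens, motion_tokens):
        scale = 1.0 + 0.5 * m
        shift = 0.25 * m
        fused.append(scale * a + shift)
    return fused

print(modulate([1.0, 2.0], [0.0, 1.0]))  # → [1.0, 3.25]
```

Note that where motion is zero, the appearance token passes through unchanged; real two-stream models learn the projection from motion tokens to (scale, shift) end to end.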
- PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation
  PanoSAM2 adapts SAM2 with a Pano-Aware Decoder, Distortion-Guided Mask Loss, and Long-Short Memory Module to improve 360 video object segmentation, reporting +5.6 and +6.7 gains over base SAM2 on two benchmarks.
- SAM 2: Segment Anything in Images and Videos
  SAM 2 delivers more accurate video segmentation with 3x fewer user interactions, and 6x faster image segmentation than the original SAM, by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
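The streaming-memory pattern shared by several of these papers (SAM 2's memory bank, X2SAM's Mask Memory, PanoSAM2's Long-Short Memory) can be caricatured as a small FIFO bank that is read before, and written after, each frame. The class below is a toy sketch with scalar "features" standing in for embeddings; none of the names or sizes come from the papers:

```python
from collections import deque

class StreamingMemory:
    """Toy FIFO memory bank in the spirit of streaming-memory video
    segmentation. Real modules use cross-attention over spatial feature
    maps; here a scalar feature and nearest-match lookup stand in."""

    def __init__(self, capacity=4):
        self.bank = deque(maxlen=capacity)  # (feature, mask) pairs

    def write(self, feature, mask):
        self.bank.append((feature, mask))   # oldest entry evicted at capacity

    def read(self, query):
        # Return the mask of the stored frame whose feature is closest
        # to the query (stand-in for attention-weighted readout).
        best = min(self.bank, key=lambda fm: abs(fm[0] - query))
        return best[1]

# Propagate a prompted first-frame mask through a stream of frames.
memory = StreamingMemory(capacity=2)
memory.write(0.0, "mask_frame_0")           # user-prompted first frame
stream = [(0.1, "mask_frame_1"), (0.9, "mask_frame_2")]
predictions = []
for feat, mask in stream:
    predictions.append(memory.read(feat))   # predict from memory, then
    memory.write(feat, mask)                # update the bank for later frames
print(predictions)  # → ['mask_frame_0', 'mask_frame_1']
```

The read-then-write ordering is the point: each frame is segmented using only past frames, which is what lets such models run on live video streams.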