YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that capture temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset contains only 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. To our knowledge, this is by far the largest video object segmentation dataset, and it has been released at http://youtube-vos.org. We further evaluate several existing state-of-the-art video object segmentation algorithms on this dataset to establish baselines for the development of new algorithms in the future.
Forward citations
Cited by 9 Pith papers
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
-
Robust Promptable Video Object Segmentation
The paper creates a real-world corruption benchmark for promptable video object segmentation and proposes MoGA, which uses object-specific memory to improve robustness and temporal consistency under adverse conditions.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.
-
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentatio...
-
X2SAM: Any Segmentation in Images and Videos
X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
-
CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.
-
PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation
PanoSAM2 adapts SAM2 with a Pano-Aware Decoder, Distortion-Guided Mask Loss, and Long-Short Memory Module to improve 360 video object segmentation, reporting +5.6 and +6.7 gains over base SAM2 on two benchmarks.
-
SAM 2: Segment Anything in Images and Videos
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...