Videomae v2: Scaling video masked autoencoders with dual masking

URL https://arxiv · 2023 · arXiv 2303.16727

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.

The TIME Machine: On The Power of Motion for Efficient Perception

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

cs.CV · 2026-06-08 · unverdicted · novelty 5.0

Video foundation models encode intuitive physics knowledge that is strongest in V-JEPA at intermediate-to-late layers and depends on pretraining type and probe design.

citing papers explorer

Showing 1 of 1 citing paper after filters.

SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition cs.CV · 2026-05-03 · unverdicted · none · ref 25
SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.

Videomae v2: Scaling video masked autoencoders with dual masking

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer