VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

arXiv · 2022 · arXiv 2203.12602

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 2 · background 1

citation-polarity summary

use method 2 · background 1

representative citing papers

SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

VTI-CoT proposes a visual-textual interleaved chain-of-thought method for video reasoning, built via automated annotation and OCR compression, claiming SOTA performance and better training efficiency on same-scale models.

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

eess.SP · 2026-05-16 · unverdicted · novelty 6.0

Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.

Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.

Zero-shot World Models Are Developmentally Efficient Learners

cs.AI · 2026-04-11 · unverdicted · novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

cs.CV · 2023-10-03 · unverdicted · novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

cs.CV · 2026-06-08 · unverdicted · novelty 5.0

Video foundation models encode intuitive physics knowledge that is strongest in V-JEPA at intermediate-to-late layers and depends on pretraining type and probe design.

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

cs.CV · 2026-06-07 · unverdicted · novelty 5.0

BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.

EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

EVA-Net improves subject-independent EEG motor decoding by using video action priors via cross-modal contrastive alignment and knowledge distillation, reporting an 8.66% LOSO accuracy gain on EEGMMI.

FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

A hybrid motion estimation framework combines optimal stopping theory with foundation model semantic scores to reduce computation while maintaining accuracy and semantic coverage in video analysis.

Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

cs.CV · 2026-04-08 · unverdicted · novelty 5.0

The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance

cs.CV · 2026-06-28 · accept · novelty 3.0

An empirical evaluation of a multi-modal touch detector using MediaPipe, HSV skin filtering, motion differencing, and Canny edges finds low F1 scores on staged video and excessive false positives on real videos, concluding the approach does not enable reliable keystroke reconstruction outside contro

citing papers explorer

SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition
cs.CV · 2026-05-03 · unverdicted · none · ref 23

Mask World Model: Predicting What Matters for Robust Robot Policy Learning
cs.RO · 2026-04-21 · unverdicted · none · ref 34

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning
cs.CV · 2026-06-04 · unverdicted · none · ref 31

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection
cs.CV · 2026-05-17 · unverdicted · none · ref 22

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
eess.SP · 2026-05-16 · unverdicted · none · ref 28

Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
cs.CV · 2026-04-16 · unverdicted · none · ref 25

Zero-shot World Models Are Developmentally Efficient Learners
cs.AI · 2026-04-11 · unverdicted · none · ref 22

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
cs.CV · 2023-10-03 · unverdicted · none · ref 124

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
cs.CV · 2026-06-08 · unverdicted · none · ref 19

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension
cs.CV · 2026-06-07 · unverdicted · none · ref 32

EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors
cs.AI · 2026-06-01 · unverdicted · none · ref 11

FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis
cs.CV · 2026-05-22 · unverdicted · none · ref 21

Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
cs.CV · 2026-04-08 · unverdicted · none · ref 75

Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance
cs.CV · 2026-06-28 · accept · none · ref 3
