SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.
hub
Videomae: Masked au- toencoders are data-efficient learners for self-supervised video pre-training
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
VTI-CoT proposes a visual-textual interleaved chain-of-thought method for video reasoning, built via automated annotation and OCR compression, claiming SOTA performance and better training efficiency on same-scale models.
SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.
A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
Video foundation models encode intuitive physics knowledge that is strongest in V-JEPA at intermediate-to-late layers and depends on pretraining type and probe design.
BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.
EVA-Net improves subject-independent EEG motor decoding by using video action priors via cross-modal contrastive alignment and knowledge distillation, reporting an 8.66% LOSO accuracy gain on EEGMMI.
A hybrid motion estimation framework combines optimal stopping theory with foundation model semantic scores to reduce computation while maintaining accuracy and semantic coverage in video analysis.
The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.
An empirical evaluation of a multi-modal touch detector using MediaPipe, HSV skin filtering, motion differencing, and Canny edges finds low F1 scores on staged video and excessive false positives on real videos, concluding the approach does not enable reliable keystroke reconstruction outside contro
citing papers explorer
-
SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition
SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
-
VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning
VTI-CoT proposes a visual-textual interleaved chain-of-thought method for video reasoning, built via automated annotation and OCR compression, claiming SOTA performance and better training efficiency on same-scale models.
-
SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection
SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.
-
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
-
Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.
-
Zero-shot World Models Are Developmentally Efficient Learners
A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
-
Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
Video foundation models encode intuitive physics knowledge that is strongest in V-JEPA at intermediate-to-late layers and depends on pretraining type and probe design.
-
BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension
BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.
-
EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors
EVA-Net improves subject-independent EEG motor decoding by using video action priors via cross-modal contrastive alignment and knowledge distillation, reporting an 8.66% LOSO accuracy gain on EEGMMI.
-
FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis
A hybrid motion estimation framework combines optimal stopping theory with foundation model semantic scores to reduce computation while maintaining accuracy and semantic coverage in video analysis.
-
Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.
-
Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance
An empirical evaluation of a multi-modal touch detector using MediaPipe, HSV skin filtering, motion differencing, and Canny edges finds low F1 scores on staged video and excessive false positives on real videos, concluding the approach does not enable reliable keystroke reconstruction outside contro