V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Advances in neural information processing systems , volume=
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3representative citing papers
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
SplAttN uses Gaussian soft splatting and attention to avoid sparse projection collapse in point cloud completion, achieving SOTA results and demonstrating genuine visual cue reliance on KITTI.
citing papers explorer
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
-
SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion
SplAttN uses Gaussian soft splatting and attention to avoid sparse projection collapse in point cloud completion, achieving SOTA results and demonstrating genuine visual cue reliance on KITTI.