EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
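One way to picture "gaze-visual context interaction" is gaze features attending over frame features. The sketch below is a minimal, hypothetical cross-attention head, not EyeCue's published architecture; the module name, feature shapes, and two-way (distracted/attentive) output are all assumptions.

```python
# Illustrative sketch only: one plausible way to model gaze-visual context
# interactions with cross-attention. Not EyeCue's published architecture;
# all module names and shapes here are assumptions.
import torch
import torch.nn as nn

class GazeVisualCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)  # distracted vs. attentive

    def forward(self, gaze_feats, frame_feats):
        # gaze_feats: (B, T, D) features around the gaze point per frame
        # frame_feats: (B, N, D) patch features of the egocentric frame
        ctx, _ = self.attn(gaze_feats, frame_feats, frame_feats)
        return self.head(ctx.mean(dim=1))  # (B, 2) clip-level logits

logits = GazeVisualCrossAttention()(torch.randn(2, 8, 256), torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 2])
```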
12 Pith papers cite this work.
citing papers explorer
- LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations
  LIMSSR reformulates incomplete multimodal learning as LLM-driven sequence-to-score reasoning with prompt-guided imputation and mask-aware aggregation, outperforming baselines on action quality assessment without complete training data.
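Mask-aware aggregation can be pictured as fusing per-modality features while substituting a placeholder for whatever is missing. A minimal sketch, assuming a learned imputation token; LIMSSR's actual prompt-guided, LLM-based imputation is more involved.

```python
# Sketch of mask-aware aggregation over possibly-missing modalities.
# Illustrative only: a learned placeholder token stands in for a missing
# modality instead of LIMSSR's prompt-guided LLM imputation.
import torch

def mask_aware_aggregate(feats: torch.Tensor, present: torch.Tensor,
                         placeholder: torch.Tensor) -> torch.Tensor:
    # feats: (B, M, D) per-modality features; present: (B, M) 0/1 mask
    # placeholder: (D,) learned imputation token for absent modalities
    mask = present.unsqueeze(-1)                       # (B, M, 1)
    imputed = feats * mask + placeholder * (1 - mask)  # fill missing slots
    return imputed.mean(dim=1)                         # (B, D) fused feature

B, M, D = 4, 3, 128
fused = mask_aware_aggregate(torch.randn(B, M, D),
                             torch.tensor([[1., 1., 0.]] * B),
                             torch.zeros(D))
print(fused.shape)  # torch.Size([4, 128])
```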
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
  A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
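The core of a shared video-image tokenizer is that an image is just a one-frame video, so both map into the same discrete codebook. Below is a toy vector-quantization step under that assumption; the paper's actual tokenizer architecture and codebook construction are not reproduced here.

```python
# Toy vector-quantization step of a shared image/video tokenizer: an image is
# treated as a single-frame video, so both map to the same discrete codebook.
import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # latents: (B, T, H, W, D) continuous encoder outputs; codebook: (K, D)
    flat = latents.reshape(-1, latents.shape[-1])   # (B*T*H*W, D)
    dists = torch.cdist(flat, codebook)             # distance to every code
    ids = dists.argmin(dim=1)                       # nearest-code index
    return ids.reshape(latents.shape[:-1])          # discrete token ids

codebook = torch.randn(1024, 64)
video_ids = quantize(torch.randn(2, 8, 4, 4, 64), codebook)  # video clip
image_ids = quantize(torch.randn(2, 1, 4, 4, 64), codebook)  # image = 1 frame
print(video_ids.shape, image_ids.shape)  # (2, 8, 4, 4) (2, 1, 4, 4)
```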
- Video Diffusion Models
  A diffusion model for video generation extends image diffusion architectures with joint image-video training and improved conditional sampling, delivering the first results on large-scale text-conditioned video generation and state-of-the-art performance on video prediction and unconditional generation benchmarks.
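Joint image-video training can be sketched by folding images into the pipeline as length-1 clips, so a single backbone trains on both. Schematic only: the noising below is illustrative rather than a real diffusion schedule, and the denoiser is left as a comment.

```python
# Sketch of the joint image-video training trick: images are folded into the
# video batch as length-1 clips so one diffusion backbone sees both.
import torch

def to_clip(x: torch.Tensor) -> torch.Tensor:
    # (B, C, H, W) image -> (B, C, 1, H, W) single-frame video
    return x.unsqueeze(2)

videos = torch.randn(2, 3, 16, 32, 32)      # (B, C, T, H, W)
images = to_clip(torch.randn(2, 3, 32, 32))
for batch in (videos, images):
    t = torch.rand(batch.shape[0])           # diffusion times in [0, 1)
    noise = torch.randn_like(batch)
    noisy = batch + t.view(-1, 1, 1, 1, 1) * noise  # schematic noising
    # loss = mse(denoiser(noisy, t), noise)  # shared weights for both cases
```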
- EAST: Early Action Prediction Sampling Strategy with Token Masking
  EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU60, SSv2, and UCF101.
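The sampling strategy can be sketched as drawing a random observation ratio per clip and masking all tokens past the cutoff, so one encoder sees every partial-video setting during training. The mask token and uniform ratio here are assumptions, not EAST's exact recipe.

```python
# Sketch of training with a randomized observation ratio: tokens after a
# sampled cutoff are masked so one encoder learns every partial-video setting.
import torch

def mask_by_ratio(tokens: torch.Tensor, mask_token: torch.Tensor):
    B, T, _ = tokens.shape
    ratio = torch.rand(B)                     # observation ratio per clip
    cutoff = (ratio * T).long().clamp(min=1)  # observe at least 1 token
    keep = torch.arange(T).expand(B, T) < cutoff.unsqueeze(1)
    return torch.where(keep.unsqueeze(-1), tokens, mask_token), keep

tokens = torch.randn(4, 16, 256)              # (B, T, D) frame tokens
masked, keep = mask_by_ratio(tokens, torch.zeros(256))
print(keep.float().mean(dim=1))               # observed fraction per clip
```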
- From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage
  HELIX is the first end-to-end neural codec jointly optimizing video compression and DNA encoding via tokens, achieving 1.91 bits per nucleotide with Kronecker mixing and FSM mapping.
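To see why bits per nucleotide sits below the 2-bit ceiling, consider a toy finite-state mapping that forbids immediate base repeats, a common DNA-synthesis constraint. This is not HELIX's learned FSM: a fixed strict no-repeat scheme like this one tops out at log2(3) ≈ 1.585 bits/nt, while HELIX's 1.91 bits/nt comes from a learned mapping under its own constraint set.

```python
# Toy finite-state mapping from digits to nucleotides that forbids immediate
# base repeats: the FSM state is the previous base, and each ternary digit
# selects one of the remaining bases. Purely illustrative, not HELIX's FSM.
BASES = "ACGT"

def encode(digits: list[int]) -> str:
    out, prev = [], None
    for d in digits:                # each d is a ternary digit in {0, 1, 2}
        allowed = [b for b in BASES if b != prev]
        base = allowed[d % len(allowed)]
        out.append(base)
        prev = base                 # FSM transition: remember the last base
    return "".join(out)

print(encode([0, 2, 1, 1, 0]))  # 'ATCGA' -- no two adjacent bases are equal
```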
- ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
  ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.
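A bidirectional scan can be pictured as running a linear-time recurrence over the token sequence in both directions and fusing the results. The exponential-decay recurrence below is a stand-in for a Mamba SSM block, and fusion by addition is an assumption, not ABMamba's aligned hierarchical design.

```python
# Sketch of a bidirectional scan: an O(T) recurrence is run left-to-right and
# right-to-left over the tokens, then the two context streams are fused.
import torch

def scan(x: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    # x: (B, T, D); h_t = decay * h_{t-1} + x_t  (linear-time recurrence)
    h, out = torch.zeros_like(x[:, 0]), []
    for t in range(x.shape[1]):
        h = decay * h + x[:, t]
        out.append(h)
    return torch.stack(out, dim=1)

def bidirectional_scan(x: torch.Tensor) -> torch.Tensor:
    fwd = scan(x)                                 # left-to-right context
    bwd = scan(x.flip(dims=[1])).flip(dims=[1])   # right-to-left context
    return fwd + bwd                              # fuse both directions

print(bidirectional_scan(torch.randn(2, 16, 64)).shape)  # (2, 16, 64)
```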
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
  V-JEPA 2, pre-trained on massive unlabeled video, achieves strong results on motion understanding and action anticipation, sets state-of-the-art video QA performance at the 8B-parameter scale, and enables zero-shot robotic planning on Franka arms from only 62 hours of unlabeled robot video.
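The JEPA objective can be sketched as a predictor regressing the latent features of masked video regions produced by a momentum (EMA) target encoder, so learning happens in representation space rather than pixels. Linear layers stand in for the actual ViT encoders, and for brevity the context encoder sees all tokens here.

```python
# Sketch of a JEPA-style objective: predict target-encoder latents at masked
# positions; the target encoder is an EMA copy and receives no gradients.
import torch
import torch.nn as nn

enc = nn.Linear(64, 128)                  # context encoder (trained)
target = nn.Linear(64, 128)               # target encoder (EMA copy)
target.load_state_dict(enc.state_dict())
pred = nn.Linear(128, 128)                # predictor over latents

x = torch.randn(2, 32, 64)                # (B, tokens, D) video patches
masked = torch.rand(2, 32) < 0.5          # which tokens are hidden
with torch.no_grad():
    goal = target(x)                      # target latents, detached
loss = ((pred(enc(x)) - goal) ** 2)[masked].mean()
loss.backward()

# EMA update of the target encoder after each optimizer step
with torch.no_grad():
    for p, q in zip(enc.parameters(), target.parameters()):
        q.mul_(0.999).add_(p, alpha=0.001)
```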
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
  EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
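Zero-shot ImageNet classification with a CLIP-style model reduces to nearest-text-embedding lookup: embed one prompt per class and pick the most similar class for each image. The encoders are hypothetical stand-ins; only the scoring step is shown.

```python
# Sketch of CLIP-style zero-shot classification: cosine similarity between an
# image embedding and one text embedding per class, then argmax.
import torch

def zero_shot_classify(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    # img_emb: (B, D) image embeddings; txt_emb: (C, D), one row per class
    img = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img @ txt.T).argmax(dim=-1)   # cosine similarity -> class id

# e.g. txt_emb[c] = text_encoder(f"a photo of a {classname[c]}")
preds = zero_shot_classify(torch.randn(8, 512), torch.randn(1000, 512))
print(preds.shape)  # torch.Size([8])
```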
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
  CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms publicly available models in both machine and human evaluations.
- Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
  Motion separation modules plus negative prompts improve CLIP-based zero-shot video action recognition on standard benchmarks.
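The negative-prompt idea can be sketched as scoring each class by similarity to a positive prompt minus similarity to a paired negative prompt. The combination weight and prompt embeddings below are assumptions; the paper's motion-separation module is not modeled.

```python
# Sketch of zero-shot action scoring with negative prompts: each class score
# rewards similarity to its positive prompt and penalizes its negative one.
import torch

def score_with_negatives(v: torch.Tensor, pos: torch.Tensor,
                         neg: torch.Tensor, alpha: float = 0.5):
    # v: (B, D) video embedding; pos/neg: (C, D) prompt embeddings per class
    v = v / v.norm(dim=-1, keepdim=True)
    pos = pos / pos.norm(dim=-1, keepdim=True)
    neg = neg / neg.norm(dim=-1, keepdim=True)
    return v @ pos.T - alpha * (v @ neg.T)   # (B, C) class scores

scores = score_with_negatives(torch.randn(4, 512),
                              torch.randn(400, 512), torch.randn(400, 512))
print(scores.argmax(dim=-1))                 # predicted action per video
```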