CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and跨
Clip4clip: An empirical study of CLIP for end to end video clip retrieval.CoRR, abs/2104.08860
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6representative citing papers
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional training.
citing papers explorer
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and跨
-
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
-
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
-
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search
Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional training.