CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li · 2021 · arXiv 2104.08860

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1 · method: 1

citation-polarity summary

background: 1 · baseline: 1 · use method: 1

representative citing papers

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

OmniRetriever-7B uses fusion-as-teacher distillation plus Tuple-InfoNCE to improve any-to-any audio-video-text retrieval over prior open and closed models.

Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and cross-…

Adapting MLLMs for Nuanced Video Retrieval

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

cs.CV · 2022-04-01 · unverdicted · novelty 7.0

Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference, speeding up inference by at least 25% while preserving accuracy.

MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

MP-ISMoE uses Gaussian-noise-perturbed iterative quantization and an interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

LLaVA-Video: Video Instruction Tuning With Synthetic Data

cs.CV · 2024-10-03 · unverdicted · novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional training.

SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

cs.CV · 2024-01-15

citing papers explorer

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation cs.CV · 2026-05-26 · unverdicted · none · ref 28
OmniRetriever-7B uses fusion-as-teacher distillation plus Tuple-InfoNCE to improve any-to-any audio-video-text retrieval over prior open and closed models.
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations cs.CV · 2026-05-12 · unverdicted · none · ref 24 · 2 links
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and cross-…
Adapting MLLMs for Nuanced Video Retrieval cs.CV · 2025-12-15 · unverdicted · none · ref 52
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language cs.CV · 2022-04-01 · unverdicted · none · ref 93
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation cs.CV · 2026-04-10 · unverdicted · none · ref 64
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference, speeding up inference by at least 25% while preserving accuracy.
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning cs.LG · 2026-04-10 · unverdicted · none · ref 76
MP-ISMoE uses Gaussian-noise-perturbed iterative quantization and an interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG cs.CV · 2026-04-07 · unverdicted · none · ref 21
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
LLaVA-Video: Video Instruction Tuning With Synthetic Data cs.CV · 2024-10-03 · unverdicted · none · ref 210
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 96
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search cs.CV · 2026-05-07 · unverdicted · none · ref 10
Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional training.
SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels cs.CV · 2024-01-15 · unreviewed · ref 33
