TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval

Fengzong Lian; Hu Hu; Kaibin Tian; Ruixiang Zhao; Runquan Xie; Xirong Li; Zhanhui Kang

arxiv: 2308.01217 · v1 · pith:JIODPJYNnew · submitted 2023-08-02 · 💻 cs.CV

Kaibin Tian, Ruixiang Zhao, Hu Hu, Runquan Xie, Fengzong Lian, Zhanhui Kang, Xirong Li

classification 💻 cs.CV

keywords efficient · retrieval · student · t2vr · teaching · clip4clip · fine-grained · learning

For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods are dominating. Compared to CLIP4Clip, which is efficient and compact, the state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip-based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool. To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage or computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the weights: letting them imitate the frame-text relevance estimated by the teacher network. As such, AFA provides a fine-grained learning (teaching) channel for the student (teacher). Extensive experiments on multiple public datasets justify the viability of the proposed method.
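
The AFA idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single scoring vector `w`, the function names `afa_pool` and `fine_grained_loss`, the temperature `tau`, and the choice of cross-entropy as the imitation loss are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the two roles of the attention weights: pooling frame features into one video feature, and serving as a distribution that is pushed toward the teacher's frame-text relevance during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def afa_pool(frame_feats, w):
    """Attention-pool frame features into one video feature.

    frame_feats: (n_frames, dim) frame-level features.
    w: (dim,) hypothetical learned scoring vector (an assumption,
       standing in for whatever the real AFA block learns).
    """
    weights = softmax(frame_feats @ w)   # (n_frames,), sums to 1
    video_feat = weights @ frame_feats   # (dim,) weighted average
    return video_feat, weights

def fine_grained_loss(student_weights, teacher_relevance, tau=1.0):
    """Cross-entropy between the teacher's softmaxed frame-text
    relevance and the student's attention weights (assumed loss)."""
    target = softmax(teacher_relevance / tau)
    return float(-(target * np.log(student_weights + 1e-12)).sum())

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))   # 8 frames, 16-dim features (toy data)
w = rng.normal(size=16)
video_feat, attn = afa_pool(frames, w)

teacher_rel = rng.normal(size=8)    # toy teacher frame-text relevance scores
loss = fine_grained_loss(attn, teacher_rel)
```

In this sketch the weights depend only on the video, so the pooled `video_feat` could be computed once offline and compared with a text feature by a single dot product at query time, which is consistent with the abstract's claim of no extra overhead at the retrieval stage.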

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins
cs.CV 2026-06 conditional novelty 7.0

OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.