SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

Darshan Singh; Makarand Tapaswi; Zeeshan Khan

arxiv: 2401.07669 · v3 · pith:M2PM7S4Snew · submitted 2024-01-15 · 💻 cs.CV

Pith reviewed 2026-05-24 04:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords CLIP adaptation · video understanding · semantic role labels · contrastive finetuning · zero-shot retrieval · structured captions · efficient adaptation · video benchmarks

The pith

Structured semantic role label captions let CLIP adapt to video tasks with only 23k pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that dense captions generated from semantic role labels supply a richer training signal than the sparse narrations typical in large video datasets. Simple contrastive finetuning on the resulting 23k video-caption pairs produces representations that transfer across video tasks needing different levels of detail. This approach reaches performance on zero-shot retrieval and other benchmarks that matches or exceeds models trained on orders of magnitude more data. The result follows directly from treating the structured labels as a holistic description of each video rather than relying on incomplete text.

Core claim

The authors generate rule-based captions from semantic role labels, which encode actions, people or objects, attributes, adverbs, and locations in structured form. Contrastive finetuning on 23k such video-caption pairs yields an adapted CLIP model whose zero-shot text-to-video retrieval performance is comparable or superior to that of state-of-the-art models that have 4-8 times more parameters and are post-pretrained on up to 6000 times more data. The adapted model also surpasses the original CLIP on multiple video benchmarks.
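For readers unfamiliar with how zero-shot text-to-video retrieval is usually scored, the sketch below computes Recall@K from a text-to-video similarity matrix. It is an illustration only: the function name `recall_at_k`, the toy similarity values, and the assumption that query `i` has exactly one correct video at index `i` are ours and are not taken from the paper.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """Fraction of text queries whose ground-truth video is ranked in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption (ours): the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank video indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_sim, k=1)
r2 = recall_at_k(toy_sim, k=2)
```

In the toy matrix, query 1 ranks the wrong video first, so it counts as a miss at K=1 but as a hit at K=2.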

What carries the argument

Rule-based captions generated from semantic role labels that represent each video holistically through actions, objects, attributes, manner, and location.
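A rough sketch of what rule-based caption generation from semantic role labels could look like follows. The role keys (`agent`, `attribute`, `verb`, `patient`, `manner`, `location`), the clause template, and the ", then " connective are hypothetical choices made for illustration; the paper's actual rules are not described on this page.

```python
from typing import Dict, List


def event_to_clause(event: Dict[str, str]) -> str:
    """Render one SRL event as a clause with a fixed template (hypothetical)."""
    subject = event.get("agent", "someone")
    if "attribute" in event:
        subject = f"{subject} {event['attribute']}"
    parts = [subject, event["verb"]]
    # Append optional roles in a fixed order: object, manner, location.
    for role in ("patient", "manner", "location"):
        if role in event:
            parts.append(event[role])
    return " ".join(parts)


def video_to_caption(events: List[Dict[str, str]]) -> str:
    """Join per-event clauses, in annotation order, into one dense caption."""
    text = ", then ".join(event_to_clause(e) for e in events)
    return text[:1].upper() + text[1:] + "."


# Made-up annotation for illustration only.
example_events = [
    {"agent": "a man", "attribute": "in a red jacket", "verb": "opens",
     "patient": "the door", "manner": "slowly", "location": "in a hallway"},
    {"agent": "a woman", "verb": "waves"},
]
caption = video_to_caption(example_events)
# caption == "A man in a red jacket opens the door slowly in a hallway, then a woman waves."
```

The point of the template is that every annotated role ends up in the text, which is what makes the caption denser than a typical narration.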

If this is right

The adapted model matches or exceeds larger models on zero-shot text-to-video retrieval despite using far less data and fewer parameters.
Performance improves over the base CLIP model on a range of video understanding benchmarks.
Representations learned this way transfer to tasks that require different degrees of perceptual detail.
Post-pretraining for video adaptation can be performed with two to three orders of magnitude fewer samples than current large-scale narration datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same structured-label approach could be tested on domains where detailed annotations already exist, such as instructional or surveillance video.
Replacing rule-based caption generation with learned captioning from the same labels might further improve the signal without increasing data volume.
If the efficiency holds, video adaptation pipelines could shift toward smaller, higher-quality annotated sets instead of web-scale scraping.

Load-bearing premise

Captions produced by rules from semantic role label annotations give a learning signal that is rich enough to replace the sparse narrations found in much larger video datasets.

What would settle it

Training an otherwise identical model on the same 23k videos but paired with random or minimally descriptive captions and finding that retrieval and benchmark performance remain comparable would falsify the claim that the structured labels are responsible for the efficiency.
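One way to build the random-caption control described above is to keep the same videos but give each one a caption that belongs to a different video. The sketch below does this with a seeded cyclic shift over a shuffled order; the function name `mismatch_captions` and the `(video_id, caption)` pair format are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def mismatch_captions(pairs: List[Tuple[str, str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Pair every video with the caption of a different video.

    The caption multiset is unchanged, so any performance gap can be
    attributed to the video-caption alignment rather than to the text itself.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to mismatch captions")
    rng = random.Random(seed)
    order = list(range(len(pairs)))
    rng.shuffle(order)
    out: List[Tuple[str, str]] = list(pairs)
    for pos, idx in enumerate(order):
        # Take the caption from the next index in the shuffled cycle,
        # which is never idx itself when there are at least two pairs.
        donor = order[(pos + 1) % len(order)]
        out[idx] = (pairs[idx][0], pairs[donor][1])
    return out


toy_pairs = [("v0", "c0"), ("v1", "c1"), ("v2", "c2"), ("v3", "c3")]
control_pairs = mismatch_captions(toy_pairs)
```

Finetuning on `control_pairs` with an otherwise identical recipe would then give the baseline that the falsification test calls for.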

Figures

Figures reproduced from arXiv: 2401.07669 by Darshan Singh, Makarand Tapaswi, Zeeshan Khan.

**Figure 1.** We illustrate the qualitative performance of FiGCLIP, a fine-grained adaptation of the popular CLIP model across multiple

**Figure 2.** We visualize an overview of our CLIP adaptation strategy. On the

**Figure 3.** Video Situation Recognition on 5 videos. FiGCLIP performs much better than CLIP in picking the right attribute of an entity.

**Figure 4.** Zero-shot text-to-video retrieval on the MSRVTT dataset. We show three frames of the top-1 retrieved video for each query. We

**Figure 5.** Zero-shot text-to-video retrieval on the LSMDC dataset. We show three frames of the top-1 retrieved video for each query. We

**Figure 6.** Zero-shot action recognition on Kinetics-400 dataset. We show the top 5 predicted actions by CLIP and FiGCLIP

**Figure 7.** Attribution, Relation, and Order (ARO) benchmark for vision-language compositionality. For each image, we show the better

**Figure 8.** Qualitative results of the challenging SugarCrepe benchmark. The top box shows the caption predicted by CLIP, while the

Original abstract

Adapting CLIP for videos has gained popularity due to its semantic and rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g. HowTo100M, WebVid2.5M). However, such narrations or captions often lack comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, the visual learning is inefficient and adaptation requires millions of samples to post-pretrain. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based captions from SRLs and demonstrate that simple contrastive finetuning on a mere 23k video-caption pairs is adequate to learn powerful, transferable representations applicable across a diverse range of video understanding tasks that require varying levels of perceptual granularity. Our adapted CLIP model, SRL-CLIP, exhibits comparable or superior performance on zero-shot text-to-video retrieval compared to state-of-the-art models that possess 4-8x more parameters and are post-pretrained on up to 6000x more data. SRL-CLIP surpasses CLIP on multiple video benchmarks, underscoring the efficient learning and improved representations.
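The "simple contrastive finetuning" mentioned in the abstract is not spelled out on this page. The sketch below shows the standard CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired video and text embeddings, under the assumption that the paper uses something of this form. The helper names and the fixed temperature of 0.07 are illustrative choices, not details confirmed by the source.

```python
import math
from typing import List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Vector) -> Vector:
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_emb: List[Vector], text_emb: List[Vector],
                               temperature: float = 0.07) -> float:
    """Average of video-to-text and text-to-video cross-entropy losses."""
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    logits = [[_dot(vi, tj) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-video
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy embeddings: the loss is low when pairs align and high when they are swapped.
videos = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

Each video is pulled toward its own caption and pushed away from the other captions in the batch, so a caption that describes more of the video gives the loss more to align against.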

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SRL captions let them adapt CLIP on 23k pairs but the rule-based conversion may drop temporal and interaction details.

The main takeaway is that turning structured semantic role labels into captions lets them run contrastive finetuning on just 23k video pairs and still claim retrieval performance that matches or beats models trained on millions of narrations or much larger parameter counts. The work replaces the usual large-scale sparse text with rule-generated captions that pack in actions, agents, attributes, adverbs, and locations. That shift is the concrete step forward, and the efficiency numbers, if they hold, directly tackle the data volume problem in video-language adaptation. The paper does a clean job of framing why standard narrations are weak signals and then showing a structured alternative can work at small scale. The downstream task results across different granularity levels are the part worth checking in the tables.

The soft spot is exactly the one the stress test flags. Rule-based caption generation from SRL tuples can flatten time order and miss overlapping events or complex relations that videos routinely contain. Without ablations that test caption fidelity, compare against non-structured captions, or measure what the structure itself adds, it is difficult to know whether the gains come from the SRL format or from the particular 23k set. The abstract also withholds the actual scores, baselines, and statistical details, so the strength of the evidence stays provisional until the full results are examined.

This is for groups working on data-efficient video-language models who want to reduce reliance on massive narration corpora. A reader already thinking about structured supervision would get practical value from the pipeline if the experiments check out. I would send it to referees to verify the caption quality and the controls.

Referee Report

1 major / 1 minor

Summary. The paper proposes SRL-CLIP, which adapts CLIP to video by generating rule-based captions from structured Semantic Role Labels (SRLs) on a 23k video-caption dataset and performing simple contrastive finetuning. It claims this yields powerful, transferable video representations that achieve comparable or superior zero-shot text-to-video retrieval to SOTA models with 4-8x more parameters trained on up to 6000x more data, while also surpassing the original CLIP on multiple video benchmarks.

Significance. If the results hold, the work shows that dense structured annotations can support efficient CLIP adaptation for video with orders-of-magnitude less data than current narration-based pipelines, offering a practical route to strong video representations when large-scale video-text corpora are unavailable.

major comments (1)

§3 (method): The central claim that SRL-derived captions supply a richer holistic signal than sparse narrations rests on the rule-based generation process, yet the manuscript provides no ablation that isolates caption fidelity (e.g., temporal ordering or multi-event relations) from the SRL structure itself; without such a control the attribution of gains on the 23k set to the proposed signal remains untested.

minor comments (1)

Abstract and §4: Performance claims are stated without reference to the specific tables or statistical tests that support them; adding explicit pointers would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting an important methodological point. We address the comment below and commit to revisions that will strengthen the attribution of results.

Referee: §3 (method): The central claim that SRL-derived captions supply a richer holistic signal than sparse narrations rests on the rule-based generation process, yet the manuscript provides no ablation that isolates caption fidelity (e.g., temporal ordering or multi-event relations) from the SRL structure itself; without such a control the attribution of gains on the 23k set to the proposed signal remains untested.

Authors: We agree that the current manuscript does not contain an explicit ablation separating the benefits of the rule-based generation procedure (which encodes temporal ordering and multi-event relations) from the underlying SRL annotations. The 23k dataset provides SRL annotations, so we can generate control captions by applying simplified concatenation rules that omit ordering and relational constraints. We will add this ablation (new table and discussion in §4) to the revised manuscript to more directly attribute performance gains to caption fidelity enabled by SRL structure.

Revision: yes
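The control captions proposed in the rebuttal could be approximated as below: all role fillers are pooled across events and concatenated in alphabetical order, so neither event order nor the link between an agent and its action survives. The function name `control_caption`, the role keys, and the choice of alphabetical sorting are hypothetical; the rebuttal does not specify the simplified rules.

```python
from typing import Dict, List


def control_caption(events: List[Dict[str, str]]) -> str:
    """Concatenate role fillers with ordering and relations removed.

    Fillers from all events are pooled into a set (dropping duplicates)
    and sorted alphabetically, so the output is independent of event order.
    """
    fillers = sorted({value for event in events for value in event.values()})
    return " ".join(fillers)


# Made-up annotation for illustration only.
annotated = [
    {"agent": "a man", "verb": "opens", "patient": "the door"},
    {"agent": "a woman", "verb": "waves"},
]
flat = control_caption(annotated)
# flat == "a man a woman opens the door waves"
```

Comparing a model finetuned on such flattened captions against one finetuned on ordered, relation-preserving captions would isolate what the structure contributes beyond the vocabulary.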

Circularity Check

0 steps flagged

No circularity: empirical method with external annotations and standard loss

The paper presents an empirical adaptation of CLIP using rule-based captions derived from external SRL annotations, followed by standard contrastive finetuning on 23k pairs. No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction or result back to the inputs by construction. The central claim rests on performance comparisons against external benchmarks and larger datasets, with no self-citation load-bearing the uniqueness or validity of the approach. The work is self-contained against external video understanding tasks and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that SRL annotations exist and can be converted into captions that are more informative than typical video narrations; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: SRLs capture actions, people or objects, their attributes, adverbs, and location in a structured format representing the entire video holistically.
Invoked in the abstract to justify why SRL-derived captions are superior to standard narrations.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
cs.CV · 2026-04 · unverdicted · novelty 6.0

A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video und...