OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.
hub
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
OmniRetriever-7B uses fusion-as-teacher distillation plus Tuple-InfoNCE to improve any-to-any audio-video-text retrieval over prior open and closed models.
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and跨
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
VideoSearch-R1 achieves SOTA on VCMR across three datasets via iterative retrieval, latent-space soft query refinement, and GRPO training.
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
LARE uses parallel encoding of full images and low-attention regions to improve text-image retrieval, shown on a new Dense-Set subset of COCO and Flickr30K with re-captioned overlooked areas.
Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional training.
citing papers explorer
-
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.