{"total":11,"items":[{"citing_arxiv_id":"2605.26641","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-26T07:26:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniRetriever-7B uses fusion-as-teacher distillation plus Tuple-InfoNCE to improve any-to-any audio-video-text retrieval over prior open and closed models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12145","ref_index":24,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations","primary_cat":"cs.CV","submitted_at":"2026-05-12T14:03:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and cross","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06229","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search","primary_cat":"cs.CV","submitted_at":"2026-05-07T13:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional 
training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09088","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation","primary_cat":"cs.CV","submitted_at":"2026-04-10T08:16:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"methods including Prompt [55], Adapter [40], AdaptFormer [10], NOAH [112], Convpass [46], Res-Tuning [44]; 3) Re-parameterized Tuning methods including LoRA [41], SSF [56], AdaLoRA [111]. On VTAB-1K benchmark, we additionally compare with VPT [43]. Results on VL Tasks. We employ two structures from VSE∞ [9] with BERT-base+BUTD regions and ResNeXt-101+BiGRU for ITR, and one structure from CLIP4Clip [64] with ViT-base+Text Transformer for VTR. As shown in Table 1, our method promotes accuracy of compared PETL approaches in most cases. 
Despite adopting slightly more learnable parameters, our method substantially reduces training memory overhead, by at least 59.5%, 90.4% and 64.3% with various backbones respectively, through optimizing lightweight side networks instead of backbones."},{"citing_arxiv_id":"2605.04058","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning","primary_cat":"cs.LG","submitted_at":"2026-04-10T08:00:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05418","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG","primary_cat":"cs.CV","submitted_at":"2026-04-07T04:26:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"LongVLM [20], extend visual context to enable hour-scale video reasoning via uniform sampling. 
However, this paradigm faces a major bottleneck: limited context windows necessitate sparse sampling across the timeline, which can yield semantically redundant frames while simultaneously risking the loss of fleeting, query-critical cues. In parallel, contrastive video-language models [21] (e.g., Video-CLIP [22], X-CLIP [23], and PE [24]) extend CLIP-style contrastive embeddings to video, enabling query-conditioned retrieval of salient clips and frames. Nevertheless, contrastive objectives primarily optimize for semantic similarity and may miss cues that are implicitly relevant to the query intent but lack an explicit semantic match."},{"citing_arxiv_id":"2512.13511","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adapting MLLMs for Nuanced Video Retrieval","primary_cat":"cs.CV","submitted_at":"2025-12-15T16:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.02713","ref_index":210,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","primary_cat":"cs.CV","submitted_at":"2024-10-03T17:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.07669","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels","primary_cat":"cs.CV","submitted_at":"2024-01-15T13:27:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16671","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Demystifying CLIP Data","primary_cat":"cs.CV","submitted_at":"2023-09-28T17:59:56+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.00598","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language","primary_cat":"cs.CV","submitted_at":"2022-04-01T17:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Li, and C. G. Snoek. Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 20(12):3377-3388, 2018. [92] M. Patrick, P.-Y. 
Huang, Y. Asano, F. Metze, A. Hauptmann, J. Henriques, and A. Vedaldi. Support-set bottlenecks for video-text representation learning. arXiv preprint arXiv:2010.02824, 2020. [93] H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860, 2021. [94] Q. Wang, Y. Zhang, Y. Zheng, P. Pan, and X.-S. Hua. Disentangled representation learning for text-video retrieval. arXiv preprint arXiv:2203.07111, 2022. [95] H."}],"limit":50,"offset":0}