{"total":11,"items":[{"citing_arxiv_id":"2606.04061","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-02T12:26:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IN2R rectifies inter-modal noisy correspondence by synthesizing continuous soft prototypes from intra-modal neighbor consensus using a Graph Refiner on dynamic cross-modal memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20838","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"USV: Towards Understanding the User-generated Short-form Videos","primary_cat":"cs.CV","submitted_at":"2026-05-20T07:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23950","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-27T01:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15628","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding","primary_cat":"cs.CV","submitted_at":"2026-04-17T02:09:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.06452","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains","primary_cat":"cs.LG","submitted_at":"2025-11-09T16:37:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MULTIBENCH++ is a new large-scale benchmark integrating over 30 datasets across 15 modalities and 20 tasks, accompanied by an open-source automated evaluation pipeline that establishes new performance baselines for multimodal fusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.20291","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval","primary_cat":"cs.CV","submitted_at":"2025-05-26T17:59:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisRet improves text-to-image retrieval by generating images from text queries and then retrieving within the image 
modality, reporting average nDCG@30 gains of 0.125 with CLIP and 0.121 with E5-V across four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16671","ref_index":139,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Demystifying CLIP Data","primary_cat":"cs.CV","submitted_at":"2023-09-28T17:59:56+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1910.07467","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Root Mean Square Layer Normalization","primary_cat":"cs.LG","submitted_at":"2019-10-16T16:44:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1908.03557","ref_index":83,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisualBERT: A Simple and Performant Baseline for Vision and Language","primary_cat":"cs.CV","submitted_at":"2019-08-09T17:57:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on 
captions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.10996","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval","primary_cat":"cs.IR","submitted_at":"2019-06-26T11:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Soft-attention on audio inputs increases tempo robustness in cross-modal audio-to-sheet-music retrieval on synthesized piano data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1504.00325","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Microsoft COCO Captions: Data Collection and Evaluation Server","primary_cat":"cs.CV","submitted_at":"2015-04-01T18:13:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Microsoft COCO Captions provides 1.5 million human captions across 330,000 images and a public server to evaluate captioning models with BLEU, METEOR, ROUGE, and CIDEr.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"long standing and challenging problem in artificial intelligence [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. Research in this area spans numerous domains, such as computer vision, natural language processing, and machine learning. Recently there has been a surprising resurgence of interest in this area [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], due to the renewed interest in neural network learning techniques [31], [32] and increasingly large datasets [33], [34], [35], [7], [36], [37], [38]. 
In this paper, we describe our process of collecting captions for the Microsoft COCO Caption dataset, and the evaluation server we have set up to evaluate perfor-"}],"limit":50,"offset":0}