Unifying visual-semantic embeddings with multimodal neural language models
3 Pith papers cite this work.
Fields: cs.CV

3 representative citing papers
- LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
  LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference, by correcting the attention-sink effect in vision encoders and using unbiased middle-layer attention in the LLM.
- SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
  SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
- Microsoft COCO Captions: Data Collection and Evaluation Server
  Microsoft COCO Captions provides 1.5 million human-written captions for 330,000 images, along with a public evaluation server that scores captioning models with BLEU, METEOR, ROUGE, and CIDEr.