mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
Llave: Large language and vision embedding models with hardness-weighted contrastive learning
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.
MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
citing papers explorer
-
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
-
PLUME: Latent Reasoning Based Universal Multimodal Embedding
PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval
SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.
- Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
- FreeRet: MLLMs as Training-Free Retrievers