EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
hub Mixed citations
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mixed citation behavior. Most common role is background (43%).
abstract
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
hub tools
citation-role summary
citation-polarity summary
years
2026 48representative citing papers
FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.
VSE perturbs images only to probe visual ambiguity in VLMs, clusters outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.
VidMsg is a new benchmark dataset and QA/retrieval tasks for implicit message inference in short videos, where current models perform poorly.
PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.
OmniRetriever-7B uses fusion-as-teacher distillation plus Tuple-InfoNCE to improve any-to-any audio-video-text retrieval over prior open and closed models.
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.
M2Note stores failed VLM trajectories as subject-guidance notes in an external notebook and retrieves them via multimodal RAG to avoid past errors during inference.
VideoSearch-R1 achieves SOTA on VCMR across three datasets via iterative retrieval, latent-space soft query refinement, and GRPO training.
MemShot renders local dialogue spans as structured visual memory units to improve long-term dialogue modeling in LLMs, achieving competitive benchmark performance with 70x faster memory construction.
DiagramRAG is a retrieval-augmented framework that represents diagrams as knowledge graphs, synthesizes sketch variants, trains an embedding model for structure-aware retrieval, and uses retrieved references to guide sketch-based scientific diagram generation.
IPIBench evaluates MLLMs on interactive proactive intelligence in streaming videos, identifies unstable triggering and poor coordination, and proposes the training-free IPI-Agent framework to improve performance across settings.
AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.
SMART unlocks latent multi-vector capabilities in single-vector embedding models by applying late interaction to frozen hidden states shaped by contrastive training, yielding consistent gains on MMEB-V2 and visual document retrieval.
VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.
TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.
GELATO extends frozen Jina Embeddings v5 text models with locked non-text encoders, training only connectors to produce competitive multimodal embeddings while preserving exact text performance.
MINER fuses internal transformer layer representations via probing and adaptive sparse fusion to improve dense single-vector retrieval quality on visual documents by up to 4.5% nDCG@5 while preserving efficiency.
Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.
A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
citing papers explorer
-
M2Note: Continual Evolution of Vision Language Models via Mistake Notebook Learning
M2Note stores failed VLM trajectories as subject-guidance notes in an external notebook and retrieves them via multimodal RAG to avoid past errors during inference.