VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
hub Mixed citations
Are We on the Right Way for Evaluating Large Vision-Language Models?
Mixed citation behavior. Most common role is background (50%).
abstract
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random choice baseline across six benchmarks over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLM and LVLM could still answer some visual-necessary questions without visual content, indicating the memorizing of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone with 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLM. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline, human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
MODE decomposes expert selection frequency by modality, filters redundant vision tokens, adds per-modality sensitivity, and uses ILP to assign bit-widths, limiting average loss to 2.9% at W3A16 on MoE-MLLMs.
OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
Audit of four VideoQA benchmarks reveals text-only shortcuts in VLMs; new diagnostics Blind Gap, Visual Gain, and Shortcut Score quantify and filter visual dependence.
ViQ is a new two-stage text-aligned quantization method for visual features supporting arbitrary resolutions that claims competitive multimodal performance with efficiency gains of 20-70%.
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
LSLE is a two-stage framework that constructs literacy-stratified LLM personas and benchmarks their evaluation outputs against human response distributions from VLAT and BeauVIS validation studies.
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
citing papers explorer
-
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
-
VGR: Visual Grounded Reasoning
VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
-
Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models
Fourier Compressor uses FFT to remove frequency-domain redundancy from visual tokens in VLMs, retaining over 96% accuracy with up to 83.8% FLOP reduction.
-
Training-Free Multimodal Large Language Model Orchestration
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Circle-RoPE achieves cross-modal positional disentanglement in VLMs by mapping 2D image tokens to a cone-like annulus orthogonal to the text axis, with PTD=0 eliminating RoPE geometric bias while preserving intra-image structure via alternating geometry encoding.
-
Visual Compositional Tuning
COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
-
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
-
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
-
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
-
Efficient Reasoning with Hidden Thinking
Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.