EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Canonical reference
VisuLogic: A benchmark for evaluating visual reasoning in multi-modal large language models
Canonical reference. 88% of citing Pith papers cite this work as background.
representative citing papers
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.
EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
DyCo-RL improves four RLVR algorithms on seven visual and math reasoning benchmarks by assigning tokens visual or text roles via Fisher-Rao geodesic distance on attention and reweighting advantages by role-alignment score.
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
Reformulating 53 visual reasoning tasks in polar coordinates causes frontier MLLMs to drop from 70-83% to 31-39% accuracy while preserving logical equivalence, revealing a Cartesian shortcut in current benchmarks.
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful representations.
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
Kwai Keye-VL-2.0-30B-A3B is a 30B MoE model with 3B active parameters using DSA adaptation and MOPD distillation that reports SOTA results on video understanding and agent benchmarks.
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.
citing papers explorer
- Leveraging Latent Visual Reasoning in Silence
Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
- MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
- ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.