EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Evolmm: Self-evolving large multimodal models with continuous rewards
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains upto $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
citation-role summary
citation-polarity summary
fields
cs.CV 6years
2026 6representative citing papers
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.
EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.
citing papers explorer
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution
AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.
-
EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.
- RISE: Reliable Improvement in Self-Evolving Vision-Language Models
- Agentic AI for Remote Sensing: Technical Challenges and Research Directions