EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Abdelrahman Shaker; Fahad Khan; Hisham Cholakkal; Omkar Thawakar; Rao Muhammad Anwer; Ritesh Thawkar; Salman Khan; Shravan Venkatraman

arxiv: 2511.16672 · v4 · pith:RW52MPL5new · submitted 2025-11-20 · 💻 cs.CV

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

classification 💻 cs.CV

keywords evolmm · models · multimodal · reasoning · continuous · data · fashion · large

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline, easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
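
As a rough illustration of what a continuous, consistency-based self-reward could look like, the sketch below scores each sampled Solver answer by the share of samples that agree with it. This is not taken from the paper: the abstract does not state the reward formula, and the function name `consistency_reward`, the exact-match comparison, and the lowercase/strip normalization are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence


def consistency_reward(answers: Sequence[str]) -> List[float]:
    """Hypothetical continuous reward: agreement share of each sampled answer.

    Each answer receives a value in (0, 1] equal to the fraction of samples
    (itself included) that match it after simple normalization. A binary
    majority-vote reward would instead give 1 to the majority answer and 0
    to everything else.
    """
    if not answers:
        return []
    # Assumed normalization: ignore surrounding whitespace and letter case.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    total = len(normalized)
    return [counts[a] / total for a in normalized]


# Four sampled answers to the same image-grounded question.
samples = ["12", "12", " 12 ", "15"]
print(consistency_reward(samples))  # [0.75, 0.75, 0.75, 0.25]
```

A graded score of this kind still carries signal when the samples disagree, which is one plausible reading of "continuous" in the abstract; the released code at the URL above is the authoritative reference for the actual reward.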

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV 2026-04 unverdicted novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards
cs.CV 2026-06 unverdicted novelty 6.0

A self-evolving framework with proposer-solver-generator roles, Solver Token Entropy, and multi-scale internal evaluation improves unified LMMs on understanding and generation tasks using only self-derived consistency...

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models
cs.CV 2026-06 unverdicted novelty 6.0

VISE is an unsupervised self-evolving method for LMMs that uses invariance rewards to improve visual conditioning, reporting gains on captioning and reduced hallucination across multiple models.

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.

RISE: Reliable Improvement in Self-Evolving Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

RISE is a self-evolving framework for VLMs that adds fine-grained alternation, quality supervision, and dynamic balancing to produce reliable gains on seven benchmarks from unlabeled data.

Agentic AI for Remote Sensing: Technical Challenges and Research Directions
cs.CV 2026-04 unverdicted novelty 6.0

Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

Agentic AI for Remote Sensing: Technical Challenges and Research Directions
cs.CV 2026-04 unverdicted novelty 5.0

Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.