VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Cordelia Schmid; Hilde Kuehne; Walid Bousselham

arxiv: 2510.23497 · v3 · pith:VBZ37CDRnew · submitted 2025-10-27 · 💻 cs.CV

classification 💻 cs.CV

keywords reasoning, vold, distillation, models, on-policy, teacher, alignment, student

Training vision-language models (VLMs) for complex reasoning remains challenging, in part because high-quality image-text reasoning data is scarce. Text-based reasoning resources, by contrast, are abundant and scalable, but how to leverage them for VLM reasoning is still an open question. To address this problem, we propose VOLD, a framework that transfers reasoning capabilities from text-only teacher models to VLM student models. VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model and yields a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for effective transfer during the online training phase: without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks, including MMMU-Pro, MathVision, MathVista, and LogicVista, and show that it significantly outperforms the baseline model and improves over the state of the art. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
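
The abstract does not spell out VOLD's objective, so the following is only a generic sketch of the two ingredients it names, not the authors' implementation. It assumes GRPO-style advantages computed by standardizing rewards within a group of sampled responses, and an on-policy distillation term estimated per token as the student log-probability minus the teacher log-probability on tokens the student itself sampled (a single-sample reverse-KL estimate). The function names, the weight `beta`, and the way the two signals are added together are all hypothetical choices made for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def reverse_kl_per_token(student_logprobs, teacher_logprobs):
    """Single-sample estimate of KL(student || teacher) at each sampled token.

    Both inputs are log-probabilities of the SAME tokens, which were sampled
    from the student (that is what makes the distillation on-policy).
    """
    return [s - t for s, t in zip(student_logprobs, teacher_logprobs)]


def combined_token_signal(advantage, student_logprobs, teacher_logprobs, beta=0.1):
    """Hypothetical combination: sequence-level advantage minus a weighted
    per-token distillation penalty. `beta` is an assumed hyperparameter."""
    kl = reverse_kl_per_token(student_logprobs, teacher_logprobs)
    return [advantage - beta * k for k in kl]


# Four student responses to one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Token-level signal for the first response (two tokens, made-up log-probs).
signal = combined_token_signal(advantages[0], [-0.5, -1.0], [-0.7, -0.9])
```

In this sketch a token the teacher finds less likely than the student does receives a smaller training signal, while the group-relative advantage still carries the outcome reward for the whole response.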

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
cs.CL 2026-06 unverdicted novelty 7.0
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
cs.LG 2026-05 unverdicted novelty 6.0
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
cs.AI 2026-06 unverdicted novelty 5.0
MGSD is a modality-gap-aware self-distillation method that improves visual spatial planning in 4B and 8B VLMs by 19.3% and 18.4% macro average on benchmarks by distilling from symbolic states during training only.

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
cs.LG 2026-06 unverdicted novelty 5.0
FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.

Stage-1 Controls the Entropy Regime, Not the Outcome
cs.LG 2026-06 unverdicted novelty 4.0
Stage-1 warm-starts control the entropy regime entering RL but yield small and localized effects on final in-domain and out-of-domain performance.