
arxiv: 2510.23497 · v3 · pith:VBZ37CDRnew · submitted 2025-10-27 · 💻 cs.CV

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

classification 💻 cs.CV
keywords reasoning, VOLD, distillation, models, on-policy, teacher, alignment, student

Training vision-language models (VLMs) for complex reasoning remains challenging, in part because high-quality image-text reasoning data is scarce. Text-based reasoning resources, by contrast, are abundant and scalable, but how to leverage them for VLM reasoning is still an open question. To address this problem, we propose VOLD, a framework that transfers reasoning capabilities from text-only teacher models to VLM student models. VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model and yields a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for effective transfer during the online training phase: without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks, including MMMU-Pro, MathVision, MathVista, and LogicVista, and show that it significantly outperforms the baseline model and improves over the state of the art. Our ablation confirms the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
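The abstract does not give the training objective, so the following is only a minimal sketch of how a GRPO-style reward signal might be combined with an on-policy distillation term. Everything in it is an assumption made for illustration: the function names, the weighting parameter `beta`, the plain REINFORCE-style surrogate (no clipping or importance ratio), and the choice of a sampled reverse-KL estimate computed on the student's own tokens. It should not be read as the authors' method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reverse_kl_per_token(student_logprobs, teacher_logprobs):
    """Sampled estimate of KL(student || teacher) on student-generated tokens.

    Both lists hold log-probabilities of the SAME tokens, which were sampled
    from the student (on-policy). The teacher only scores them.
    """
    diffs = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(diffs) / len(diffs)


def combined_loss(rewards, student_lps, teacher_lps, beta=0.1):
    """Hypothetical combined loss over one group of student rollouts.

    rewards:      one scalar reward per rollout
    student_lps:  per-rollout lists of student token log-probs
    teacher_lps:  per-rollout lists of teacher token log-probs
    beta:         assumed weight of the distillation term
    """
    advantages = group_relative_advantages(rewards)
    losses = []
    for adv, s_lp, t_lp in zip(advantages, student_lps, teacher_lps):
        # Simplified policy-gradient surrogate (no clipping, no ratio).
        pg_term = -adv * sum(s_lp) / len(s_lp)
        # Distillation term pulls the student toward the teacher.
        kd_term = reverse_kl_per_token(s_lp, t_lp)
        losses.append(pg_term + beta * kd_term)
    return mean(losses)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]
    student = [[-0.5, -1.0], [-2.0, -1.5], [-0.7, -0.9], [-1.8, -2.2]]
    teacher = [[-0.4, -0.8], [-3.0, -2.5], [-0.6, -1.0], [-2.5, -3.0]]
    print(group_relative_advantages(rewards))
    print(combined_loss(rewards, student, teacher))
```

In this sketch the teacher never generates text; it only assigns log-probabilities to tokens the student sampled. If the teacher's distribution is far from the student's, those log-probabilities are dominated by the mismatch rather than by reasoning quality, which is one way to read the abstract's claim that distributional alignment must come first.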



Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

    cs.CL 2026-06 unverdicted novelty 7.0

    ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...

  2. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  3. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  4. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  5. Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

    cs.AI 2026-06 unverdicted novelty 5.0

    MGSD is a modality-gap-aware self-distillation method that improves visual spatial planning in 4B and 8B VLMs by 19.3% and 18.4% macro average on benchmarks by distilling from symbolic states during training only.

  6. Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

    cs.LG 2026-06 unverdicted novelty 5.0

    FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.

  7. Stage-1 Controls the Entropy Regime, Not the Outcome

    cs.LG 2026-06 unverdicted novelty 4.0

    Stage-1 warm-starts control the entropy regime entering RL but yield small and localized effects on final in-domain and out-of-domain performance.