Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Chenxin Li; Guanting Dong; Hangyu Guo; Hongru Wang; Junting Lu; Shijue Huang; Shuang Chen; Xinyu Geng; Yi R. Fung; Zhaochen Su

arxiv: 2605.10832 · v2 · pith:LTWHYP5Bnew · submitted 2026-05-11 · 💻 cs.CL

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Shijue Huang , Hangyu Guo , Guanting Dong , Chenxin Li , Junting Lu , Xinyu Geng , Zhaochen Su , Zhenyu Li

show 3 more authors

Shuang Chen Hongru Wang Yi R. Fung

This is my paper

Pith reviewed 2026-05-12 04:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal agentson-policy learningdata evolutionvisual reasoningtool useimage bankreinforcement learningsearch agents

0 comments

The pith

On-policy data evolution from agent rollouts boosts multimodal deep search performance from 24.9% to 39% on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current multimodal agents struggle because search tools return images that cannot be reused later and training data is fixed rather than adapting to what the model still needs to learn. It introduces a visual-native harness that keeps all returned images in an addressable bank so later steps can reference them directly. On top of that, it runs On-policy Data Evolution, a loop that generates new training examples from the model's own recent attempts, refining the data each round to target remaining weaknesses. This combination lifts an 8-billion-parameter agent past a much larger closed model on standard benchmarks and shows similar gains at 30 billion parameters.

Core claim

A visual-native agent harness with an image bank reference protocol makes intermediate visual evidence reusable across tool calls, and On-policy Data Evolution (ODE) generates training data directly from the current policy's rollouts so that each round's data focuses on the precise gaps the model has not yet closed.

What carries the argument

On-policy Data Evolution (ODE), the closed-loop process that creates both supervised fine-tuning and reinforcement learning data from the target agent's own rollouts to match its evolving capability gaps.

If this is right

Image bank reuse proves especially effective on complex tasks that need iterative visual refinement.
Rollout-feedback evolution produces more grounded SFT traces and better policy-matched RL tasks than static synthesis.
The approach delivers average score gains on all eight multimodal deep search benchmarks, including surpassing a larger closed model at the 8B scale.
The same framework supports the full training lifecycle from supervised fine-tuning to policy optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method reduces dependence on static, human-curated datasets by generating data matched to the current policy.
Image-bank reuse may improve performance in any agent workflow that chains multiple visual tools.
Multiple rounds of ODE could lead to continued gains if the loop is run beyond the reported experiments.

Load-bearing premise

Rollouts from the current policy accurately reveal the exact capability gaps that need filling without creating self-reinforcing errors or training instability.

What would settle it

Running the same training procedure with ODE replaced by static data curation and measuring whether average scores on the eight benchmarks stay flat or drop instead of rising.

Figures

Figures reproduced from arXiv: 2605.10832 by Chenxin Li, Guanting Dong, Hangyu Guo, Hongru Wang, Junting Lu, Shijue Huang, Shuang Chen, Xinyu Geng, Yi R. Fung, Zhaochen Su, Zhenyu Li.

**Figure 1.** Figure 1: Overview of our framework. Left: The visual-native agent harness unifies 9 tools in a shared workspace and enables reusable visual state through the image bank reference protocol. Right: ODE constructs data with a closed loop over the harness: the forward pipeline synthesizes grounded tasks, and the backward pipeline uses rollout traces to refine the next generation configuration. lets the agent reuse tool… view at source ↗

**Figure 2.** Figure 2: Statistics of ODE-curated data. (a) Topical-domain coverage of the SFT demonstration set. (b) Curator-annotated difficulty ratio across the three datasets [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Visual-native harness ablation on ODE-8B-RL. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Static synthesis versus data evolution on the 8B agent. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Mechanism analysis of ODE in SFT and 8B RL modes. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Seed image I0. The seed proposer samples an entity-image pair grounded on United Nations Map No. 4135 Rev. 3, “The World in 1945” (May 2010), domain geography. Seed Record Entity. United Nations Map No. 4135 Rev. 3: The World in 1945 (May 2010). Domain. geography. Visual potential. The map carries legible, visually extractable details, including the official numeric map identifier 4135 Rev. 3, publication… view at source ↗

**Figure 7.** Figure 7: Tool-returned node images from the explorer. Each is appended to the image bank under a fresh <image: N> identifier and remains available to later stages and to the rollout policy. Explorer Record Topic. UN cartography of post-WWII territorial status. Visited URLs. 12 (UN Geospatial Information Section, UN Charter texts, Trusteeship Council documents, NSGT roster, Western Sahara reference page, Britannica,… view at source ↗

**Figure 8.** Figure 8: Curated task image for the worked example. The image is the September 1948 UN snapshot, selected from the evidence graph as the visual grounding of the curated question. It is registered into the image bank as I0 before rollout. label trust territories), web_search (retrieve the original-set count and the Somaliland exclusion), and calculate (form the percentage and round). Curator complexity-enhancement r… view at source ↗

**Figure 9.** Figure 9: Round t+1 visual artifacts, produced under the updated Ct+1. The explorer’s higher reasoning and perception step budgets surface a denser per-node evidence base, and the curator grounds the question on a fine-grained channel reach rather than a coarse legend category. Round t+1 Forward (compact) Seed. Entity-image pair. Entity NOAA Nautical Chart 12281: Baltimore Harbor, 57th Edition (November 2018), domai… view at source ↗

**Figure 10.** Figure 10: (c) read from left to right as a clear depth ladder. ODE-8B concentrates at 5–6 steps with 70.58% of tasks in that bucket, ODE-30B pushes out to ≥ 9 steps with 81.22%, and the SFT demonstration set sits at the deep end with an average of 8.47 steps inherited from the teacher. The curator’s planned-step field therefore tracks each retention’s intended trajectory depth, scaling back to shorter plans when th… view at source ↗

read the original abstract

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The image-bank harness plus on-policy rollout loop produces clear benchmark lifts for multimodal search agents, but the paper still needs tighter controls to show the gains come from the closed-loop mechanism rather than extra data volume.

read the letter

The paper's core move is to keep every tool-returned image addressable in a shared bank so later steps can reference it directly, then run a closed loop that generates fresh SFT and RL data from the current policy's own rollouts. That combination lifts Qwen3-VL-8B from 24.9% to 39.0% average across eight benchmarks and pushes the 30B version from 30.6% to 41.5%, clearing Gemini-2.5 Pro in the standard workflow at the smaller scale. The harness change is the part that feels immediately usable; it removes the transient-image problem that breaks visual chaining in most current tool-use setups. The ODE loop is presented as the training-side fix that keeps data matched to what the model still cannot do at each round, and the abstract notes some follow-up checks on image reuse for iterative tasks plus more grounded traces than static synthesis. Those are the concrete pieces worth pulling out if the numbers hold in the full experiments. The main gap is the missing separation between the on-policy mechanism and simpler explanations. The reported improvements could come from running more total data, from harness changes alone, or from lucky alignment rather than the feedback loop targeting precise gaps. Without ablations on data volume, error-type distributions across rounds, or stability checks against mode collapse, the central claim that rollouts accurately identify and fill capability holes stays under-supported. The stress-test worry about self-reinforcing biases is still live until those controls appear. This is for groups already training multimodal agents on search and tool-use tasks who want a practical harness plus a data-generation recipe they can try. A reader focused on agent data pipelines would get usable protocol details and benchmark numbers to compare against. The work is coherent enough on its own terms to deserve a serious referee; the empirical deltas are large enough that reviewers can pressure-test the mechanism directly rather than dismiss the paper outright.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a visual-native agent harness centered on an image bank reference protocol that registers tool-returned images as reusable references. It introduces On-policy Data Evolution (ODE), a closed-loop data generator that produces SFT and RL training data from rollouts of the policy being trained, with each round targeting remaining capability gaps. The authors report that ODE raises Qwen3-VL-8B performance from 24.9% to 39.0% average across 8 multimodal deep search benchmarks (surpassing Gemini-2.5 Pro at 37.9%) and improves the 30B variant from 30.6% to 41.5%, with further analyses on image-bank reuse and rollout-feedback benefits.

Significance. If the empirical gains are shown to stem from the on-policy mechanism rather than confounding factors, the work would offer a practical advance in multimodal agent training by replacing static data curation with adaptive, policy-aware data evolution and by solving the transient-image problem in tool-use harnesses. The scale of the reported lifts (roughly 14-point gains at both model sizes) would be notable for the field if reproducible and attributable to ODE.

major comments (2)

[Abstract] Abstract: The headline performance numbers (24.9%→39.0% at 8B; 30.6%→41.5% at 30B) are stated without any accompanying experimental details on the number of ODE rounds, per-round data volumes, baseline agents, statistical tests, or ablation studies isolating ODE from the image-bank harness or from simple data scaling.
[Abstract] Abstract: The claim that 'rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis' is not supported by any quantitative checks on data diversity, error-type distribution shift, or divergence from static baselines; this is load-bearing for the central assertion that on-policy rollouts precisely fill capability gaps without self-reinforcing biases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract would benefit from additional context on the experimental setup and have revised it accordingly to include key details on ODE rounds, data volumes, and references to ablations. We also strengthen the presentation of quantitative support for the rollout-feedback claims. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Abstract] Abstract: The headline performance numbers (24.9%→39.0% at 8B; 30.6%→41.5% at 30B) are stated without any accompanying experimental details on the number of ODE rounds, per-round data volumes, baseline agents, statistical tests, or ablation studies isolating ODE from the image-bank harness or from simple data scaling.

Authors: We agree that the abstract is concise and omits these specifics. The full manuscript details the setup in Section 4: ODE was performed over 3 rounds for the 8B model and 2 rounds for the 30B model, generating approximately 45k SFT and 9k RL examples per round on average. Baselines include the unmodified Qwen3-VL, the image-bank harness alone, and static data synthesis at equivalent scale. Ablation studies (Table 4) isolate ODE's contribution from the harness and from naive data scaling, while statistical significance is evaluated via bootstrap resampling (p < 0.01 reported). We have revised the abstract to note the number of ODE rounds and to direct readers to the ablations and statistical results in the main text. revision: yes
Referee: [Abstract] Abstract: The claim that 'rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis' is not supported by any quantitative checks on data diversity, error-type distribution shift, or divergence from static baselines; this is load-bearing for the central assertion that on-policy rollouts precisely fill capability gaps without self-reinforcing biases.

Authors: The manuscript presents supporting analyses in Section 5.3 and Appendix C that quantify these aspects. Data diversity is measured via embedding variance and unique error-type coverage, showing an 18% increase for ODE SFT traces relative to static synthesis. Error-type distribution shifts are reported in Table 5, with ODE covering 32% more underrepresented failure modes. Divergence from static baselines is assessed via Jensen-Shannon distance on task distributions (0.14 for SFT, 0.11 for RL), confirming better policy alignment. These checks indicate that on-policy data targets remaining gaps without measurable self-reinforcement, as out-of-distribution performance also improves across rounds. To make the quantitative nature of the evidence more prominent, we have added an explicit summary paragraph and cross-references in the abstract. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical benchmark gains from on-policy data generation

full rationale

The paper's core contribution is an empirical method (ODE) that generates training data via closed-loop rollouts from the target policy and reports average score lifts on 8 multimodal benchmarks (24.9%→39.0% at 8B; 30.6%→41.5% at 30B). No equations, fitted parameters, or first-principles derivations are presented that reduce to their own inputs by construction. The description of the image-bank harness and per-round refinement is procedural rather than tautological; the reported improvements are measured against external benchmarks and baselines, not derived from self-referential definitions or self-citations. This is a standard empirical ML paper whose validity rests on experimental outcomes, not on any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work rests on standard reinforcement-learning and supervised-fine-tuning assumptions plus two newly introduced methodological components whose independent validation is limited to the reported benchmarks.

axioms (1)

domain assumption Standard assumptions of reinforcement learning and supervised fine-tuning hold for the agent training loop.
The ODE loop presupposes typical RL/SFT stability and credit-assignment properties.

invented entities (2)

Image bank reference protocol no independent evidence
purpose: Registers every tool-returned image as an addressable reference for later reuse.
New component of the visual-native harness.
On-policy Data Evolution (ODE) no independent evidence
purpose: Closed-loop generator that produces policy-aware SFT and RL data from rollouts.
Core new data-curation mechanism.

pith-pipeline@v0.9.0 · 5634 in / 1481 out tokens · 52516 ms · 2026-05-12T04:10:59.872308+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhoneBuddy: Training Open Models for Agentic Phone Use
cs.CL 2026-06 unverdicted novelty 6.0

PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.