Dynamo: Dynamic Skill-Tool Evolution for Vision-Language Agents
Pith reviewed 2026-06-30 06:29 UTC · model grok-4.3
The pith
A frozen vision-language model improves visual reasoning by evolving its own reusable skills and visual tools from self-inspection on a small labeled set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dynamo lets a frozen VLM evolve two complementary capabilities on a small labeled training subset: reusable reasoning skills for cognitive bottlenecks and executable visual tools for perceptual ones. Each generated tool is paired with a skill that specifies when to invoke it, and both types accumulate in a persistent library. Across four visual reasoning benchmarks and five VLM backbones, this produces accuracy gains on all 20 model-benchmark pairs with an average improvement of 5.6 points. When tools are supplied ahead of time, the framework learns per-step invocation policies that improve over every backbone tested. Dynamo closes 65 to 99 percent of the performance gap to task-specific reinforcement learning.
What carries the argument
The Dynamo loop of self-inspection on successes and failures to generate, pair, and accumulate reasoning skills with executable visual tools in a persistent library.
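A minimal sketch of how such a persistent library might be organized, assuming a simple in-memory structure. The class names (`Skill`, `Tool`, `Library`) and the rule that a tool is registered only together with its paired skill are illustrative assumptions based on the description above, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Skill:
    """A reusable reasoning procedure, stored as text for the prompt."""
    name: str
    instructions: str
    tool_name: Optional[str] = None  # set when the skill governs a tool


@dataclass
class Tool:
    """An executable visual tool (hypothetical wrapper around a callable)."""
    name: str
    run: Callable[..., object]


@dataclass
class Library:
    """Store that accumulates skills and tools across inspection rounds."""
    skills: Dict[str, Skill] = field(default_factory=dict)
    tools: Dict[str, Tool] = field(default_factory=dict)

    def add_skill(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def add_tool(self, tool: Tool, paired_skill: Skill) -> None:
        # Assumption: a tool enters only with the skill that says when
        # to invoke it, mirroring the pairing described in the text.
        if paired_skill.tool_name != tool.name:
            raise ValueError("paired skill must reference the tool by name")
        self.tools[tool.name] = tool
        self.skills[paired_skill.name] = paired_skill

    def render_prompt(self) -> str:
        """Concatenate skill instructions for use as prompt context."""
        return "\n\n".join(
            f"[{s.name}] {s.instructions}" for s in self.skills.values()
        )
```

Keeping skills as plain text and tools as callables reflects the split between cognitive and perceptual capabilities; how the paper actually serializes either is not stated in this summary.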
If this is right
- Accuracy rises on every model-benchmark pair tested without any weight updates.
- The framework learns reliable per-step tool invocation policies when a tool set is provided in advance.
- It recovers most of the gains from task-specific reinforcement learning at a fraction of the compute cost.
- The skill-tool library combines additively with reinforcement learning when both are used.
Where Pith is reading between the lines
- The same self-inspection process could be applied to non-visual agent tasks if the base model can critique its own outputs.
- Over multiple tasks the library might accumulate into a growing shared capability set that reduces the need for per-task restarts.
- If inspection quality improves with stronger models, the framework's gains would increase rather than plateau.
- The approach offers a route to iterative agent improvement that stays within a single frozen model rather than requiring repeated fine-tuning.
Load-bearing premise
The frozen VLM can accurately and usefully review its own correct and incorrect attempts on the small training subset to produce skills and tools that generalize to new inputs.
What would settle it
Running the evolved skill-tool library on a new visual reasoning benchmark and observing no accuracy improvement or a drop relative to the original frozen model.
Original abstract
Improving vision-language models (VLMs) on visual reasoning typically requires retraining or hand-designed prompts and tools. We present Dynamo, a training-free framework that adapts a frozen VLM without any weight updates. On a small labeled training subset, the agent inspects its own correct and incorrect attempts and evolves two complementary capabilities: reusable reasoning skills for cognitive bottlenecks, and executable visual tools for perceptual ones. Each generated tool is paired with a skill that specifies when to invoke it, and both capability types accumulate in a persistent library. Across four visual reasoning benchmarks and five VLM backbones, Dynamo improves direct inference on all 20 model–benchmark settings (avg. +5.6 acc). When the tool set is given in advance, the framework learns when to call each tool, and per-step tool choice improves on every tested backbone. Against task-specific RL (VTool-R1, DeepEyes), Dynamo closes 65–99% of the RL gap at a fraction of the compute, and combines additively with RL when available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Dynamo, a training-free framework in which a frozen VLM inspects its own correct and incorrect attempts on a small labeled training subset to evolve a persistent library of reusable reasoning skills (for cognitive bottlenecks) and executable visual tools (for perceptual ones), with each tool paired to a skill that specifies invocation conditions. The central empirical claim is that this yields accuracy gains on every one of the 20 model–benchmark combinations tested (four visual-reasoning benchmarks, five VLM backbones), for an average improvement of +5.6 points over direct inference; the method also learns tool-selection policies when tools are supplied and closes 65–99 % of the gap to task-specific RL while remaining additive with RL.
Significance. If the reported gains are shown to arise from the proposed self-evolution mechanism rather than subset artifacts or prompt leakage, the work would be significant: it supplies a concrete, low-compute route to adapting frozen VLMs on visual reasoning tasks and demonstrates that skill/tool libraries can transfer beyond the inspection subset. The breadth of the evaluation (20 settings, multiple backbones) is a clear strength.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the headline claim of consistent gains on all 20 settings and the +5.6 average are presented without error bars, statistical significance tests, data-exclusion criteria, or details on how tool/skill generation quality was validated on the held-out test distribution; these omissions are load-bearing for the central empirical assertion.
- [§3] §3 (Method) and the weakest-assumption paragraph: the framework’s correctness hinges on the frozen VLM accurately diagnosing its own errors on the small labeled subset and distilling them into reusable, generalizable skills/tools; no isolated measurement of self-critique accuracy or ablation that severs the self-inspection step is reported, leaving open the possibility that observed gains are explained by subset selection or leakage rather than the claimed mechanism.
- [§4.3] §4.3 (Comparison with RL): the statement that Dynamo “closes 65–99 % of the RL gap” is presented without the precise per-backbone numbers, variance across runs, or the exact definition of the gap (absolute accuracy difference or normalized), making the quantitative claim difficult to verify or reproduce.
minor comments (2)
- [§3] Notation for the skill–tool pairing and the persistent library accumulation is introduced without a compact formal definition or pseudocode; a short algorithm box would improve clarity.
- [§4] Figure captions and axis labels in the main results plots do not indicate whether error bars represent standard deviation across seeds or across benchmarks; this should be stated explicitly.
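As a rough illustration of the kind of significance check the first major comment asks for, an exact sign test can be computed from the win count alone. The sketch below assumes the 20 settings are independent, which they are not (backbones and benchmarks are shared), so it likely overstates the evidence and is no substitute for per-run variance. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """Two-sided exact sign test under H0: P(win) = 0.5, ties excluded."""
    k = max(wins, n - wins)
    # Probability of seeing at least k wins out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 20 improvements out of 20 settings, as reported in the abstract.
print(f"{sign_test_p_value(wins=20, n=20):.2e}")  # 1.91e-06
```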
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of empirical rigor and mechanistic validation. We address each major point below and commit to revisions that strengthen the presentation without altering the core claims.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim of consistent gains on all 20 settings and the +5.6 average are presented without error bars, statistical significance tests, data-exclusion criteria, or details on how tool/skill generation quality was validated on the held-out test distribution; these omissions are load-bearing for the central empirical assertion.
  Authors: We agree these reporting elements are necessary for verifiability. In the revision we will add per-setting standard deviations (computed over 3–5 independent runs where compute permits), paired t-tests or Wilcoxon tests for significance against the direct-inference baseline, explicit data-exclusion criteria, and a new subsection quantifying tool/skill generation quality (e.g., human or automated correctness rates) on held-out examples from the test distribution. These additions will appear in §4 and be summarized in the abstract.
  Revision: yes
- Referee: [§3] §3 (Method) and the weakest-assumption paragraph: the framework’s correctness hinges on the frozen VLM accurately diagnosing its own errors on the small labeled subset and distilling them into reusable, generalizable skills/tools; no isolated measurement of self-critique accuracy or ablation that severs the self-inspection step is reported, leaving open the possibility that observed gains are explained by subset selection or leakage rather than the claimed mechanism.
  Authors: The concern is well-founded; an isolated measurement of self-critique fidelity would directly test the central assumption. While the breadth of gains across five backbones and four benchmarks provides indirect support, we will add (i) a quantitative self-critique accuracy analysis on the inspection subset (comparing VLM diagnoses to ground-truth error categories) and (ii) an ablation that replaces self-inspection with either random or oracle critiques, reporting the resulting accuracy delta. These will be placed in §4.2 or the appendix.
  Revision: yes
- Referee: [§4.3] §4.3 (Comparison with RL): the statement that Dynamo “closes 65–99 % of the RL gap” is presented without the precise per-backbone numbers, variance across runs, or the exact definition of the gap (absolute accuracy difference or normalized), making the quantitative claim difficult to verify or reproduce.
  Authors: We will expand §4.3 with a table listing, for each backbone–benchmark pair, the direct-inference accuracy, RL accuracy, Dynamo accuracy, the absolute gap closed, and the normalized percentage (defined as (Dynamo – direct) / (RL – direct)). Any available run-to-run variance will also be reported. The definition of the gap will be stated explicitly in the text and caption.
  Revision: yes
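The normalized quantity defined in the last response can be written as a small helper. The accuracies in the usage line are made up for illustration and are not results from the paper; the function name is likewise hypothetical.

```python
def gap_closed(direct: float, rl: float, dynamo: float) -> float:
    """Fraction of the RL gain over direct inference recovered by Dynamo.

    Implements (dynamo - direct) / (rl - direct). Values above 1.0 mean
    Dynamo exceeded RL; the ratio is undefined when RL gives no gain.
    """
    gap = rl - direct
    if gap == 0:
        raise ValueError("RL gap is zero; normalized ratio is undefined")
    return (dynamo - direct) / gap


# Hypothetical accuracies in points, not taken from the paper.
print(round(gap_closed(direct=60.0, rl=70.0, dynamo=68.0), 2))  # 0.8
```

Because the denominator is the RL gain, small RL gains make the percentage unstable, which is one reason the per-pair absolute numbers matter alongside the ratio.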
Circularity Check
No circularity: empirical framework with no derivations or load-bearing self-citations
Full rationale
The paper describes an empirical, training-free method (Dynamo) in which a frozen VLM inspects its own attempts on a small labeled subset to generate reusable skills and tools that are then evaluated on held-out benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. Claims rest on reported accuracy gains across 20 model-benchmark pairs rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not invoked to justify load-bearing premises. The work is therefore self-contained as an empirical study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] DeepEyesV2: Toward Agentic Multimodal Model. doi: 10.48550/ARXIV.2511.05271. URL https://doi.org/10.48550/arXiv.2511.05271.
  Guanyu Jiang, Zhaochen Su, Xiaoye Qu, and Yi R. Fung. Xskill: Continual learning from experience and skills in multimodal agents. CoRR, abs/2603.12056, 2026.
  Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision...
- [2] Identify the visual entity named by the question.
  Case study table (columns: Case study | Generated capability | Prior motif | Claim supported): Case A: Structured-image editing | A skill that instructs the solver to isolate the relevant chart/table structure, plus a tool that highlights, boxes, or masks visual evidence. | ReF...
- [3] Before computing the answer, create an edited view that marks only the relevant region and suppresses likely distractors.
- [4] Re-read the edited image and extract the required value.
- [5] Perform the requested arithmetic or comparison.
  Common Pitfalls:
  - Do not answer from the unedited image if multiple marks are visually similar.
  - Do not treat the visual edit as adding new information; it only exposes the evidence already present in the image.
  def mark_relevant_region(image_path, region, mode="highlight"):
      """Return an edited image that...
- [6] Phrase the question into a short `target_description` (e.g., "small red sign on the right side of the building").
- [7] Provide a tile scoring callback `score_tile(tile_path, desc)` that returns a higher value when the tile is more likely to contain the target; in our agent this is a fast VLM yes/no probe.
- [8] Call `coarse_to_fine_zoom(image_path, target_description, score_tile, grid=2, depth=2)` (defined in the paired tool below). The function recursively splits the image into a 2x2 grid, scores all four tiles, zooms into the best one, and recurses; after `depth=2` it returns the path of the final zoomed crop.
- [9] Answer the question from the returned zoomed crop. If the crop is still ambiguous, call `coarse_to_fine_zoom` again on the returned crop with `depth=1` to drill in one more level.
  Common Pitfalls:
  - Do not answer from the globally downsampled image when the requested evidence is small.
  - Do not crop solely by image center; rely on the per-tile score and l...
  from PIL import Image
  crop_path = image_path
  for level in range(depth):
      img = Image.open(crop_path).convert(
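The recursion described in step [8] can be sketched without image I/O by tracking only the crop box. This is a simplified illustration, not the paper's tool: `coarse_to_fine_box` and the `score_box` callback are hypothetical stand-ins for `coarse_to_fine_zoom` and `score_tile`, and the sketch returns pixel coordinates rather than the path of a saved crop.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def coarse_to_fine_box(
    width: int,
    height: int,
    score_box: Callable[[Box], float],
    grid: int = 2,
    depth: int = 2,
) -> Box:
    """Split the current box into grid x grid tiles and keep the best, depth times."""
    box: Box = (0, 0, width, height)
    for _ in range(depth):
        left, top, right, bottom = box
        # Integer division: leftover pixels on the right/bottom edge are dropped.
        tile_w = (right - left) // grid
        tile_h = (bottom - top) // grid
        if tile_w == 0 or tile_h == 0:
            break  # box is too small to split further
        tiles = [
            (left + c * tile_w, top + r * tile_h,
             left + (c + 1) * tile_w, top + (r + 1) * tile_h)
            for r in range(grid)
            for c in range(grid)
        ]
        box = max(tiles, key=score_box)
    return box


# Toy scorer standing in for the VLM probe: prefer the tile with a known point.
target = (700, 100)


def contains_target(b: Box) -> float:
    return 1.0 if b[0] <= target[0] < b[2] and b[1] <= target[1] < b[3] else 0.0


print(coarse_to_fine_box(800, 800, contains_target))  # (600, 0, 800, 200)
```

With `grid=2` and `depth=2` the scorer is called eight times in total, which is presumably why the skill text describes it as a fast yes/no probe.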
- [10] Identify both the *measurement target* (the bars / lines / points you need to read off) AND the *context* you need to interpret it (x-axis labels, y-axis ticks, legend swatches).
- [11] Compute a tight bounding box around the measurement target.
- [12] Expand the bounding box by ~30 px in every direction toward an axis or legend; never crop with margin = 0.
- [13] Render the crop and compare it against the two reference images stored alongside this skill:
  - If your crop looks like crop_bad.png (axis labels or legend on one side cut off, target visible but no longer associated with its label), the crop is too aggressive. Re-think the bounding box: expand it by another ~30 px in the direction of the missing context a...
  from PIL import Image
  img = Image.open(image_path).convert(
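Steps [11] and [12] amount to padding a tight box and clamping it to the image bounds. A minimal sketch, assuming boxes are `(left, top, right, bottom)` pixel tuples; the function name and the uniform padding on all four sides are illustrative assumptions, since the skill text pads toward the axis or legend rather than equally everywhere.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Pad a tight box by `margin` pixels per side, clamped to the image."""
    if margin <= 0:
        # The skill forbids cropping with margin = 0.
        raise ValueError("margin must be positive")
    left, top, right, bottom = box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


# Tight box near the left edge of a 640x480 chart, padded by 30 px.
print(expand_box((10, 50, 200, 300), 30, 640, 480))  # (0, 20, 230, 330)
```

The retry described in [13] would correspond to calling the function again on the expanded box, though the exact direction-aware rule is cut off in the text above.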