pith. machine review for the scientific record.

arxiv: 2604.02812 · v1 · submitted 2026-04-03 · 💻 cs.RO

Recognition: no theorem link

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:08 UTC · model grok-4.3

classification 💻 cs.RO
keywords robot policies · vision-language models · behavior trees · neuro-symbolic supervision · synthetic data · domain randomization · policy transfer · manipulator control

The pith

Vision-language models synthesize structured Behavior Tree policies for robots from self-generated synthetic data, and the resulting policies transfer to physical manipulators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that vision-language models can produce executable, interpretable robot policies in the form of Behavior Trees directly from visual observations and natural language instructions. An automated pipeline lets the same foundation model create its own training dataset of domain-randomized scenes paired with instruction-policy examples, eliminating the need for manual labeling. Real-world tests on two different robotic manipulators show that policies trained exclusively on this synthetic data execute successfully on hardware. The approach offers a route to multimodal robot control that retains the modularity and analyzability of classical structured representations rather than relying on opaque end-to-end networks.
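As an editorial illustration of the kind of structured policy at stake (not the paper's exact schema), a pick-and-place instruction can be captured by a small tree of composites over condition and action leaves. The node names and the toy interpreter below are assumptions, not the authors' implementation:

```python
# Minimal Behavior Tree sketch: Sequence/Fallback composites over stubbed leaves.
# Node names and the pick-and-place task are illustrative, not the paper's schema.
from dataclasses import dataclass, field
from typing import Callable, List

SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"

@dataclass
class Leaf:
    name: str
    tick_fn: Callable[[], str]       # returns SUCCESS / FAILURE / RUNNING
    def tick(self) -> str:
        return self.tick_fn()

@dataclass
class Sequence:                      # succeeds only if every child succeeds, in order
    name: str
    children: List = field(default_factory=list)
    def tick(self) -> str:
        for child in self.children:
            status = child.tick()
            if status != SUCCESS:
                return status
        return SUCCESS

@dataclass
class Fallback:                      # tries children until one does not fail
    name: str
    children: List = field(default_factory=list)
    def tick(self) -> str:
        for child in self.children:
            status = child.tick()
            if status != FAILURE:
                return status
        return FAILURE

# "Put the red cube in the box", expressed as a reactive, inspectable policy.
# The leaves are stubs standing in for real perception and motion skills.
policy = Sequence("place_red_cube", [
    Fallback("ensure_grasped", [
        Leaf("cube_in_gripper?", lambda: FAILURE),   # condition stub: not yet grasped
        Sequence("pick", [
            Leaf("move_to(red_cube)", lambda: SUCCESS),
            Leaf("close_gripper()",   lambda: SUCCESS),
        ]),
    ]),
    Leaf("move_to(box)",   lambda: SUCCESS),
    Leaf("open_gripper()", lambda: SUCCESS),
])
print(policy.tick())   # -> SUCCESS with these stubbed leaves
```

The point of the representation is that each leaf and composite can be inspected, unit-tested, or swapped in isolation, which is the modularity and analyzability the paper contrasts with end-to-end visuomotor networks.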

Core claim

A vision-language model can be specialized to output structured Behavior Tree policies grounded in multimodal inputs, and an automated synthetic data pipeline using domain randomization produces sufficient supervision for these policies to transfer from simulation to two physical robotic manipulators without further real-world training.

What carries the argument

A neuro-symbolic pipeline in which a VLM generates executable Behavior Tree policies from visual observations, language instructions, and system specifications, trained via self-supervised synthetic multimodal data with domain randomization.
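A minimal sketch of how the automated supervision stage might be organized, assuming hypothetical simulator and foundation-model hooks (`render_randomized_scene`, `query_foundation_model`); none of these names or return formats come from the paper:

```python
# Stage 1 of the pipeline, sketched: a foundation model labels its own
# domain-randomized scenes with (instruction, Behavior Tree) pairs.
# Every function, format, and string below is hypothetical.
import json, random

def render_randomized_scene(rng):
    # Placeholder for a simulator call (the paper renders synthetic tabletop
    # scenes) that randomizes lighting, textures, camera pose and object layout,
    # returning an image handle plus a structured scene description.
    objects = rng.sample(["red_cube", "blue_ball", "mug", "box"], k=3)
    return f"scene_{rng.randint(0, 10**6)}.png", {"objects": objects}

def query_foundation_model(image, scene_spec, system_spec):
    # Placeholder for the large foundation VLM: given the rendered image, the
    # scene description and the robot's system specification, it returns a task
    # instruction and a serialized Behavior Tree.
    target = scene_spec["objects"][0]
    instruction = f"put the {target} in the box"
    bt = {"Sequence": [f"move_to({target})", "close_gripper()",
                       "move_to(box)", "open_gripper()"]}
    return instruction, json.dumps(bt)

def build_dataset(n_examples, seed=0):
    # Collect (image, instruction, Behavior Tree) triples with no human labels.
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_examples):
        image, scene_spec = render_randomized_scene(rng)
        instruction, bt_text = query_foundation_model(
            image, scene_spec, system_spec="single-arm tabletop manipulator")
        dataset.append({"image": image, "instruction": instruction, "bt": bt_text})
    return dataset

# Stage 2 (not sketched): fine-tune the policy VLM on (image, instruction) -> bt
# pairs -- the paper fine-tunes Pixtral-12B for constrained symbolic generation --
# then execute the generated trees on the real manipulators with no further
# real-world training.
print(build_dataset(2)[0])
```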

If this is right

  • Structured Behavior Tree policies remain interpretable and modular, supporting safety analysis in robotic applications.
  • Supervision data can be scaled automatically without human annotation of real robot trajectories.
  • Multimodal decision making for manipulation becomes feasible using only synthetic data generated by the same model.
  • The method provides a concrete alternative to end-to-end visuomotor policies for tasks requiring reactive and analyzable control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic supervision loop could be applied to other structured representations such as finite-state machines or skill graphs.
  • Reducing dependence on costly real-robot data collection might accelerate deployment in new environments or tasks.
  • Adding limited human corrections on the generated trees could further close remaining gaps between synthetic and real distributions.

Load-bearing premise

The synthetic scenes and instruction-policy pairs generated by the foundation model are distributed closely enough to real visual observations and tasks that no large domain gap blocks policy transfer.

What would settle it

Deploying the learned policies on the physical robots under varied real lighting, object positions, and backgrounds would settle it: a sharp drop in success rate or frequent execution failures would show that the transfer does not hold.
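What settling it could look like in practice is a per-condition success table with uncertainty attached, so "sharp drop" becomes a number rather than an impression. The sketch below uses invented conditions and trial counts purely for illustration:

```python
# Sketch of a transfer evaluation: success rates per real-world condition with
# 95% Wilson score intervals. Trial counts and conditions are hypothetical.
from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Invented placeholder results: (condition, successes, trials) on one manipulator.
results = [
    ("nominal lighting, fixed layout", 18, 20),
    ("dim lighting",                   14, 20),
    ("cluttered background",           13, 20),
    ("novel object positions",         15, 20),
]

for condition, s, n in results:
    lo, hi = wilson_interval(s, n)
    print(f"{condition:32s} {s}/{n}  success={s/n:.2f}  95% CI=({lo:.2f}, {hi:.2f})")

# The transfer claim would be undermined if these intervals sit well below the
# success rate measured in simulation, or if failures concentrate in specific
# perturbations (lighting, layout) the synthetic distribution did not cover.
```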

Figures

Figures reproduced from arXiv: 2604.02812 by Alessandro Adami, Marco Todescato, Pietro Falco, Ruggero Carli, Tommaso Tubaldo.

Figure 1. Overview of the proposed framework. Given synthetic observations, a large foundation model is first used to automatically generate a synthetic supervision dataset composed of task instructions and corresponding Behavior Trees from visual observations and structured system specifications. This dataset is then used to fine-tune the Pixtral-12B vision-language model for constrained symbolic generation. At inf…
Figure 2. Examples of synthetic tabletop scenes used in dataset generation.
Figure 3. Representation of the target Behavior Tree.
Figure 4. Example of real-world images, representing scenarios coherent with…
Figure 5. Real-world experimental platforms used to validate hardware-agnostic…
Figure 6. Sequence of the task computed by the UR5 platform in which the…
Original abstract

Vision-language models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in safety-critical robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. Real-world experiments on two robotic manipulators show that structured policies learned entirely from synthetic supervision transfer successfully to physical systems. The results indicate that foundation models can be adapted to produce interpretable and structured robot policies, providing an alternative to opaque end-to-end approaches for multimodal robot decision making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes using a VLM to synthesize a multimodal dataset of domain-randomized scenes paired with instruction-Behavior Tree policy examples, then training structured policies on this synthetic supervision and transferring them to real robotic manipulators, offering an interpretable neuro-symbolic alternative to end-to-end visuomotor policies.

Significance. If the real-world transfer holds with quantitative support, the approach would be significant for enabling scalable, annotation-free generation of modular and analyzable robot policies from foundation models, bridging high-dimensional perception with classical symbolic control in safety-critical settings.

major comments (2)
  1. [Abstract] The central claim that 'structured policies learned entirely from synthetic supervision transfer successfully to physical systems' on two manipulators is asserted without any quantitative metrics (success rates, baselines, statistical details, or failure-case analysis), leaving the empirical evidence for transfer unevaluated.
  2. [Results] Experimental evaluation (implied results section): no quantitative measures of the synthetic-to-real domain gap (feature-space distances, coverage statistics, or ablation on randomization strength) are reported, so the load-bearing assumption that the VLM-generated distribution sufficiently matches real visual observations remains unverified.
minor comments (2)
  1. [Method] The automated pipeline description would benefit from explicit details on prompting strategy and post-processing steps used to ensure the generated Behavior Trees are executable (a hedged sketch of one such check follows the comment lists below).
  2. [Method] Notation for the neuro-symbolic supervision pipeline could be clarified with a diagram or pseudocode to distinguish VLM synthesis from downstream policy learning.
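On minor comment 1, the requested post-processing could be as light as a schema check on the generated tree before execution. A hedged sketch follows, in which the XML tags and skill vocabulary are illustrative rather than the paper's actual serialization format:

```python
# Sketch of a post-processing check that a generated Behavior Tree is executable:
# well-formed serialization, only known node types, leaves drawn from the robot's
# skill vocabulary. Tag names and skills are illustrative, not the paper's format.
import xml.etree.ElementTree as ET

COMPOSITES = {"Sequence", "Fallback"}
SKILLS = {"MoveTo", "CloseGripper", "OpenGripper", "DetectObject"}

def validate_bt(xml_text):
    """Return (ok, reason); reject malformed XML, unknown nodes, or empty composites."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as e:
        return False, f"not well-formed XML: {e}"

    def check(node):
        if node.tag in COMPOSITES:
            if len(node) == 0:
                return False, f"composite <{node.tag}> has no children"
            for child in node:
                ok, reason = check(child)
                if not ok:
                    return ok, reason
            return True, "ok"
        if node.tag in SKILLS:
            return True, "ok"
        return False, f"unknown node <{node.tag}>"

    return check(root)

generated = """
<Sequence>
  <DetectObject target="red_cube"/>
  <Fallback>
    <CloseGripper/>
    <MoveTo target="red_cube"/>
  </Fallback>
  <MoveTo target="box"/>
  <OpenGripper/>
</Sequence>
"""
print(validate_bt(generated))   # (True, 'ok'); trees failing the check would be
                                # regenerated or discarded before training or execution.
```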

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative evidence. We address each major comment below and will revise the manuscript to incorporate the requested metrics and analyses.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'structured policies learned entirely from synthetic supervision transfer successfully to physical systems' on two manipulators is asserted without any quantitative metrics (success rates, baselines, statistical details, or failure-case analysis), leaving the empirical evidence for transfer unevaluated.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the transfer claim. In the revised version we will update the abstract to report the specific success rates achieved on the two real manipulators, include brief comparisons to relevant baselines, and note statistical details. A concise failure-case analysis will also be added to the results section to provide a more complete empirical picture. revision: yes

  2. Referee: [Results] Experimental evaluation (implied results section): no quantitative measures of the synthetic-to-real domain gap (feature-space distances, coverage statistics, or ablation on randomization strength) are reported, so the load-bearing assumption that the VLM-generated distribution sufficiently matches real visual observations remains unverified.

    Authors: We acknowledge that explicit quantification of the domain gap would better substantiate the synthetic-to-real transfer. While the current evaluation centers on end-to-end policy success, we will add the suggested analyses in the revision: feature-space distance metrics between synthetic and real observations, coverage statistics of the generated distribution, and an ablation on randomization strength to directly verify the distribution match. revision: yes
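One concrete form the promised feature-space distance could take is a Fréchet distance between encoder embeddings of synthetic and real images. The sketch below is an editorial illustration, not the authors' protocol, with random vectors standing in for features from a frozen vision encoder:

```python
# Sketch of a synthetic-to-real domain-gap metric: Frechet distance between
# Gaussian fits of encoder features for synthetic vs. real images. The random
# arrays below stand in for embeddings from a frozen vision encoder.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance between N(mu_a, S_a) and N(mu_b, S_b) fit to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Placeholder features; in practice one set comes from real camera frames and
# the other from the rendered domain-randomized scenes.
rng = np.random.default_rng(0)
synthetic_feats = rng.normal(0.0, 1.0, size=(500, 64))
real_feats      = rng.normal(0.3, 1.1, size=(400, 64))
print(f"Frechet distance (synthetic vs. real): "
      f"{frechet_distance(synthetic_feats, real_feats):.3f}")

# A small distance (relative to, say, the distance between two disjoint splits of
# the real set) would support the load-bearing premise; a large one would flag a
# domain gap that end-to-end policy success alone cannot rule out.
```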

Circularity Check

0 steps flagged

No circularity: empirical pipeline with no derivations or self-referential fits

Full rationale

The manuscript presents an empirical pipeline: a VLM generates synthetic instruction-policy pairs for Behavior Trees on domain-randomized scenes, policies are trained on that data, and transfer is tested on two physical manipulators. No equations, parameter fits, uniqueness theorems, or derivation chains appear in the abstract or described text. The success claim is framed as an experimental observation rather than a quantity derived from or equivalent to its own inputs by construction. Self-citations, if present, are not load-bearing for any mathematical result. The domain-gap concern raised in the skeptic note is a validity issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters or invented entities; the approach implicitly rests on the domain assumption that Behavior Trees are sufficiently expressive for the target manipulation tasks.

axioms (1)
  • domain assumption Behavior Trees can represent the robot behaviors required by the evaluated tasks without loss of capability relative to end-to-end policies.
    The entire pipeline presupposes that the chosen symbolic representation is adequate for the manipulation scenarios considered.

pith-pipeline@v0.9.0 · 5509 in / 1224 out tokens · 51476 ms · 2026-05-13T20:08:24.831657+00:00 · methodology

