arXiv: 2606.10180 · v1 · pith:F7EOSE4R · submitted 2026-06-08 · 💻 cs.RO · cs.AI · cs.HC

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

Pith reviewed 2026-06-27 16:00 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.HC
keywords flow control · vision-language-action models · real-time steering · robot control · VLA models · human-in-the-loop · policy fine-tuning

The pith

Flow control steers vision-language-action robot models in real time with simple inputs like a keyboard without any retraining or model changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents flow control as a method to steer VLA models using generic real-time inputs such as keyboard commands. These inputs are transformed into action samples drawn from the model's learned expert distribution, preserving action quality while reflecting user intent. The technique requires no retraining and works out of the box. It produces accurate, responsive steering that remains robust even with suboptimal inputs. Users achieve higher success rates and faster task completion, and trajectories collected under flow control can be used to fine-tune the base VLA for better autonomous performance.

Core claim

Flow control converts simple user inputs into action samples from the VLA expert action distribution. The resulting robot actions are high quality because they conform to the distribution learned during training, and high fidelity because they align with the user's intent. The method works without retraining the VLA and yields accurate, responsive steering that is robust to imperfect user inputs, resulting in significantly higher task success rates and faster completion times. Collecting trajectories under flow control and fine-tuning the VLA on them improves the autonomous policy.

What carries the argument

Flow control, the mapping of generic user inputs to samples from the VLA's expert action distribution.
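
To make the mapping concrete, here is a minimal one-dimensional sketch of steering a frozen flow through its initial condition, which is the mechanism the figure captions describe (perturbing the flow's initial condition). Everything in it is an assumption for illustration: the two-mode toy target, the closed-form velocity field standing in for a trained action expert, and the names `MODES`, `MODE_STD`, `KEY_BIAS`, `velocity`, `sample_action`, and `steer`. It is not the paper's implementation, and the exact form in which the paper injects the user input is not stated on this page.

```python
import math

MODES = (-1.0, 1.0)   # toy "left" and "right" action modes (hypothetical)
MODE_STD = 0.25       # spread of each mode (hypothetical)


def velocity(x, t):
    """Marginal flow-matching velocity for the path x_t = (1 - t) * x0 + t * x1,
    with x0 ~ N(0, 1) and x1 drawn from an equal-weight mixture of the two modes.
    This closed form stands in for a trained, frozen action expert."""
    var_t = (1.0 - t) ** 2 + (t * MODE_STD) ** 2
    # Posterior weight of each mode given the current point x (stable softmax).
    logits = [-(x - t * m) ** 2 / (2.0 * var_t) for m in MODES]
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    # E[x1 - x0 | x_t = x, mode m] = m + slope * (x - t * m)
    slope = (t * MODE_STD ** 2 - (1.0 - t)) / var_t
    return sum(w / total * (m + slope * (x - t * m)) for w, m in zip(weights, MODES))


def sample_action(x0, steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += dt * velocity(x, i * dt)
    return x


# Hypothetical mapping from a key press to a shift of the initial condition.
KEY_BIAS = {"left": -1.5, "right": 1.5, None: 0.0}


def steer(key, noise):
    """Shift the initial noise by the user's input, then let the frozen flow
    transform the shifted initial condition into an action."""
    return sample_action(noise + KEY_BIAS[key])


if __name__ == "__main__":
    for key in ("left", "right"):
        print(key, round(steer(key, noise=0.3), 3))
```

In this toy the user input only selects which mode the sample lands in; the frozen flow still decides where inside that mode the action falls, which mirrors the claim that steered actions stay within the learned distribution. The sampler is deterministic, so the same initial condition always yields the same action.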

If this is right

  • Flow control accurately and responsively steers robot actions with user inputs.
  • It remains robust even when user inputs are suboptimal.
  • It produces significantly higher success rates and faster task completion.
  • Fine-tuning a VLA on trajectories generated under flow control improves the autonomous policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same input-to-distribution mapping could be applied to other low-dimensional interfaces such as joysticks or touchscreens.
  • Flow control offers a lightweight way to collect high-quality human-corrected data for policy improvement without changing model architecture.
  • Real-time steering may reduce the need for full autonomy in safety-critical or novel environments by letting users supply corrections on the fly.
  • If the mapping step generalizes across tasks, flow control could serve as a standard interface layer between humans and any pretrained VLA.

Load-bearing premise

Simple user inputs can be transformed into samples from the VLA expert distribution without degrading action quality and while matching user intent.

What would settle it

A controlled experiment in which steering with flow control produces no measurable increase in task success rate or reduction in completion time compared with the unsteered VLA.
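
One way such a comparison could be scored is a one-sided Fisher exact test on success counts from the two arms. The sketch below is an illustration only: the function name `fisher_one_sided` and the trial counts are hypothetical and are not taken from the paper, and the test covers success rate only, not completion time.

```python
from math import comb


def fisher_one_sided(success_a, n_a, success_b, n_b):
    """Probability that arm B shows at least `success_b` successes if both arms
    share one success rate, conditioning on the total number of successes."""
    total_success = success_a + success_b
    denom = comb(n_a + n_b, total_success)
    upper = min(n_b, total_success)
    numer = sum(
        comb(n_b, k) * comb(n_a, total_success - k)
        for k in range(success_b, upper + 1)
    )
    return numer / denom


if __name__ == "__main__":
    # Hypothetical counts: arm A = unsteered VLA, arm B = VLA with flow control.
    p_value = fisher_one_sided(success_a=12, n_a=20, success_b=19, n_b=20)
    print(f"one-sided p-value: {p_value:.4f}")
```

A large p-value on adequately sized arms would be the "no measurable increase" outcome described above; a small one would count against it.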

Figures

Figures reproduced from arXiv: 2606.10180 by Andy Wang, Jason Chan, Jonathan C. Kao.

Figure 1: Overview. We freeze a VLA with a flow matching action expert. Our method (red) steers the VLA by injecting user inputs into an action expert. A critical observation we leverage is that several state-of-the-art VLAs use generative action experts, such as a flow matching head in π0.5 [2]. These generative policies transform noise – conditioned on camera inputs, language, state, and attention tokens derived …
Figure 2: a, Diffusion injects noise at each time step, and therefore information in the initial condition (IC) is lost over successive iterations. b, Flow matching is a diffeomorphism and is therefore invertible, preserving information about the IC in x1. c, The flow trajectory integrates an ODE directly to the left mode; diffusion integrates a noisy SDE that can arrive at the right mode. In contrast, a diffusion m…
Figure 3: a, Task, “put the block in the hole.” 2 cm blocks are equally spaced 10 cm apart from the hole. b, Joint 1 controls left-right movement. c, We perturb joint 1’s initial condition from between τ = 0 to τ = 8 time steps over the horizon of 16 time steps in the π0.5 action chunk. d, As τ increases, the proportion of left trials increases. e, Perturbing the IC does not adversely affect pick or place performanc…
Figure 4: a, In a multimodal setting, flow control steers the policy towards the left block. b, c, In a unimodal setting, the flow IC is transformed into an on policy action. d, In experiments, the IC perturbation only affects the trajectory early (red) and not when the task is unambiguous (black and blue). Flow control steering can therefore be thought of as injecting user intent when action distributions are mu…
Figure 5: a, Marker-in-Bowl task. b, Cup-Stacking task. c, Example failure of π0.5-DROID, not lifting a cup high enough and knocking over another. d, The user gives a flow control input “up” to avoid this failure. e, Trial time distribution for Cup-Stacking for autonomous, flow control, and flow control fine tuned VLA. … to 99.4% ( …
Figure 6: 5-goal task. a, Task where blocks are separated 0 cm, 2 cm, or 4 cm apart. A correct block is randomly specified on each trial. b, Success rate of picking up the correct block. What precision can flow control achieve in a pick-and-place task? To answer this question, we performed the Five-Block pick-and-place task at varying difficulties (0 cm, 2 cm, and 4 cm separation, Figure 6a). On each trial, a random …
Figure 7: Perturbing the flow ODE integration. a, We perturb the flow process after 1 s, 2 s, or 3 s into the trajectory by modifying the flow. b, As more of the flow steps are perturbed, the policy more easily “changes” its mind to reach towards the left block. The experiments in Appendix C.1 led us to ask: can perturbation of the flow ODE integration induce “corrective” or “change of mind” behavior in the VLA? To…
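
The contrast drawn in the Figure 2 caption can be written compactly. The notation below is an assumption chosen for illustration ($v_\theta$ for the learned velocity field, $o$ for the conditioning observations, $\Phi_o$ for the resulting flow map, $f$ and $g$ for the diffusion drift and noise scale). These are the standard textbook forms, not necessarily the paper's own equations.

```latex
% Flow matching: a deterministic ODE carries the initial condition x_0 to the action x_1.
\frac{dx_t}{dt} = v_\theta(x_t, t \mid o), \qquad
x_0 \sim \mathcal{N}(0, I), \qquad
x_1 = \Phi_o(x_0)

% Diffusion sampling: a reverse-time SDE that adds fresh noise at every step.
dx_t = \left[ f(x_t, t) - g(t)^2 \, \nabla_x \log p_t(x_t \mid o) \right] dt
       + g(t) \, d\bar{w}_t
```

When $v_\theta$ is Lipschitz in $x$, the map $\Phi_o$ is invertible, so a shift applied to $x_0$ is carried deterministically into $x_1$. In the SDE, the $g(t)\,d\bar{w}_t$ term keeps injecting randomness, which is why information about the initial condition fades over successive steps. The two processes use different time conventions (the ODE runs forward from $t = 0$ to $t = 1$, the reverse-time SDE runs backward from its terminal time).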
Original abstract

We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conformity to the action expert distribution) and high fidelity (reflecting the user's intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces flow control, an inference-time procedure for steering pretrained vision-language-action (VLA) models using generic real-time inputs such as keyboard commands. The approach injects the user signal directly into the flow-matching or diffusion sampling process to produce actions drawn from the VLA's expert distribution while remaining faithful to user intent. Experiments across multiple robot tasks report improved success rates, faster completion times, robustness to noisy or suboptimal inputs, and additional gains when the VLA is subsequently fine-tuned on trajectories generated under flow control.

Significance. If the empirical claims hold, the work supplies a practical, model-agnostic mechanism for incorporating human guidance into deployed VLAs without architectural modification or retraining. The reported performance improvements and the downstream fine-tuning benefit constitute concrete, falsifiable contributions. The emphasis on preserving conformity to the expert distribution while accommodating crude inputs addresses a recurring deployment challenge in robotics.

minor comments (3)
  1. [§3] The abstract and §3 would benefit from an explicit statement of the precise mathematical form in which the user input is injected into the flow or diffusion step (e.g., the modified velocity field or noise schedule).
  2. [Figure 4] Figure 4 caption should clarify whether the plotted trajectories are single-rollout examples or aggregated statistics; the current wording leaves the reader uncertain about variance across seeds.
  3. [§5.3] The fine-tuning experiment in §5.3 reports success-rate gains but does not state the number of flow-control trajectories used or the number of gradient steps; these details are needed to assess data efficiency.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. No specific major comments were provided in the report, so we have no points to address point-by-point at this time. We remain available to incorporate any minor suggestions during revision.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript introduces an inference-time steering procedure for VLAs using generic real-time inputs (e.g., keyboard) mapped into the model's expert action distribution. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. Claims rest on experimental results across robot tasks (success rates, task completion time, robustness to noisy inputs, and downstream fine-tuning gains). The approach is presented as model-agnostic at inference time with ablations addressing robustness, rendering the central claims self-contained against external benchmarks rather than reducing to definitional or fitted inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters, axioms, or new entities; method described at high level.

pith-pipeline@v0.9.1-grok · 5729 in / 1134 out tokens · 24177 ms · 2026-06-27T16:00:33.671719+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 1 canonical work pages

  [1] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control. arXiv [cs.LG], Oct. 2024

  [2] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

  [3] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pe...

  [4] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. arXiv [cs.LG], Mar. 2023

  [5] A. Szot, B. Mazoure, O. Attia, A. Timofeev, H. Agrawal, D. Hjelm, Z. Gan, Z. Kira, and A. Toshev. From multimodal LLMs to generalist embodied agents: Methods and lessons. arXiv [cs.LG], Dec. 2024

  [6] Gemini Robotics Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, S. Bohez, K. Bousmalis, A. Brohan, T. Buschmann, A. Byravan, S. Cabi, K. Caluwaerts, F. Casarini, O. Chang, J. E. Chen, X. Chen, H.-T. L. Chiang, K. Choromanski, D. D’Ambrosio, S. Dasari, T. Davchev, C. Devin, N....

  [7] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. arXiv [cs.RO], June 2024

  [8] P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, V. Choudhary, F. Collins, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, M. Dhaka, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Habeeb, H. Hancock, K. Hausman, G...

  [9] Z. Chen, A. Tian, L. Wang, B. Joffe, Y. C. Lin, Y. Chen, S. Karamcheti, and D. Xu. ReSteer: Quantifying and refining the steerability of multitask robot policies. arXiv [cs.RO], Mar. 2026

  [10] J. Gao, S. Belkhale, S. Dasari, A. Balakrishna, D. Shah, and D. Sadigh. A taxonomy for evaluating generalist robot manipulation policies. arXiv [cs.RO], Mar. 2025

  [11] J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. MolmoAct: Action reasoning models that can reason in space. arXiv [cs.RO], Aug. 2025

  [12] S. Ross, G. J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. arXiv [cs.LG], Nov. 2010

  [13] L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv [cs.RO], Feb. 2025

  [14] S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh. RT-H: Action hierarchies using language. In Robotics: Science and Systems XX. Robotics: Science and Systems Foundation, July 2024

  [15] Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. arXiv [cs.CV], Mar. 2025

  [16] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  [17] B. Chen, D. M. Monso, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. arXiv [cs.LG], July 2024

  [18] L. X. Shi, Z. Hu, T. Z. Zhao, A. Sharma, K. Pertsch, J. Luo, S. Levine, and C. Finn. Yell at your robot: Improving on-the-fly from language corrections. arXiv [cs.RO], Mar. 2024

  [19] Y. Cui, S. Karamcheti, R. Palleti, N. Shivakumar, P. Liang, and D. Sadigh. No, to the right: Online language corrections for robotic manipulation via shared autonomy. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, pages 93–101, 2023

  [20] C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence. Interactive language: Talking to robots in real time. IEEE Robot. Autom. Lett., pages 1–8, 2024

  [21] E. Chisari, T. Welschehold, J. Boedecker, W. Burgard, and A. Valada. Correct me if i am wrong: Interactive learning for robotic manipulation. IEEE Robotics and Automation Letters, 7:3695–3702, Apr. 2022. doi:10.1109/LRA.2022.3145516.

  [22] P. Sundaresan, Q. Vuong, J. Gu, P. Xu, T. Xiao, S. Kirmani, T. Yu, M. Stark, A. Jain, K. Hausman, D. Sadigh, J. Bohg, and S. Schaal. RT-sketch: Goal-conditioned imitation learning from hand-drawn sketches. arXiv [cs.RO], Mar. 2024

  [23] J. Gu, S. Kirmani, P. Wohlhart, Y. Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu, P. Sundaresan, P. Xu, H. Su, K. Hausman, C. Finn, Q. Vuong, and T. Xiao. RT-trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv [cs.RO], Nov. 2023

  [24] K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv [cs.RO], Oct. 2023

  [25] Y. Wang, L. Wang, Y. Du, B. Sundaralingam, X. Yang, Y.-W. Chao, C. Perez-D’Arpino, D. Fox, and J. Shah. Inference-time policy steering through human interactions, 2025. URL https://arxiv.org/abs/2411.16627

  [26] M. Du and S. Song. Dynaguide: Steering diffusion polices with active dynamic guidance,

  [27] URL https://arxiv.org/abs/2506.13922

  [28] S. Park, H. Bharadhwaj, and S. Tulsiani. Demodiffusion: One-shot human imitation using pre-trained diffusion policy, 2025. URL https://arxiv.org/abs/2506.20668

  [29] A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. arXiv [cs.RO], June 2025

  [30] T. Yoneda, L. Sun, B. Stadie, M. Walter, et al. To the noise and back: Diffusion for shared autonomy. arXiv preprint arXiv:2302.12244, 2023

  [31] A. Wang, X. Yan, B. McMahan, M. Zhou, Y. Yuan, J. Y. Lee, A. Shreif, M. Li, Z. Peng, B. Zhou, Y. Cui, and J. C. Kao. DiSCo: Diffusion sequence copilots for shared autonomy. In Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction, pages 982–990, New York, NY, USA, Mar. 2026. ACM

  [32] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bosnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcerv...

  [33] Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, P. G. Sessa, A. Chowdhery, A. Roberts, A. Barua, A. Botev, A. Castro-Ros, A. Slone, A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo, C. Crepy, D. Cer, D. ...

  [34] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

  [35] Thomas M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Nashville, TN, 2nd edition, Nov. 2012

  [36] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv [cs.LG], Oct. 2022

  [37] P. Holderrieth and E. Erives. An introduction to flow matching and diffusion models. arXiv [cs.LG], July 2025.

A Geometry of initial condition steering

This appendix expands on intuition for why modifying the initial condition can steer the flow (Section 3.3). We emphasize that while this section provides more intuition for the steering properties, th...