pith. machine review for the scientific record.

arxiv: 2605.02037 · v1 · submitted 2026-05-03 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

· Lean Theorem

VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 19:29 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords low-cost robotics · vision-language-action · soft grasping · kirigami gripper · robotic manipulation · VLA policy learning · modular hardware

The pith

Low-cost modular robots with a kirigami soft gripper can train and run vision-language-action policies for delicate grasping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VILAS, a complete low-cost platform that combines an affordable collaborative arm, an electric gripper upgraded with a soft kirigami extension, dual cameras, and a unified communication layer. This setup supports teleoperation for data collection and then fine-tunes existing vision-language-action models on the collected demonstrations. The authors test the resulting policies on a grape-grasping task that requires gentle contact without force sensors. If the approach holds, it shows that advanced manipulation learning does not need expensive specialized hardware, making such systems reachable for more users and labs.

Core claim

VILAS integrates a Fairino FR5 arm, Jodell RG52-50 gripper with kirigami soft extension, and dual-camera module through a ZMQ architecture to handle teleoperation, data collection, and policy deployment in one framework. Fine-tuning of pi_0, pi_0.5, and GR00T N1.6 models on the same teleoperation dataset enables successful deployment on grape grasping, confirming that capable manipulation policies can be trained and run on low-cost modular hardware without explicit force sensing.
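
The ZMQ layer is the part of this claim a reader can picture most concretely, but the excerpt does not specify its message schema or topology. The sketch below is only one plausible shape for such a layer: a camera publisher and a subscriber that acts as a data logger during collection and would be swapped for a policy client at deployment. Topic names, the port, the 30 Hz rate, and the JSON payloads are illustrative assumptions, not the paper's protocol.

    # Hedged sketch of a pub/sub bus like the one VILAS is described as using.
    # Everything concrete here (topics, port, rate, payload layout) is assumed.
    import json
    import time

    import zmq


    def camera_node(port: int = 5555) -> None:
        """Publish dummy wrist-camera frames on a named topic."""
        ctx = zmq.Context.instance()
        pub = ctx.socket(zmq.PUB)
        pub.bind(f"tcp://*:{port}")
        while True:
            msg = {"t": time.time(), "frame": "<jpeg bytes would go here>"}
            pub.send_multipart([b"camera/wrist", json.dumps(msg).encode()])
            time.sleep(1 / 30)  # ~30 Hz


    def logger_node(port: int = 5555) -> None:
        """Subscribe to every topic (cameras, teleop, gripper) and log demos."""
        ctx = zmq.Context.instance()
        sub = ctx.socket(zmq.SUB)
        sub.connect(f"tcp://localhost:{port}")
        sub.setsockopt(zmq.SUBSCRIBE, b"")  # no filter: record the whole bus
        with open("demo_log.jsonl", "a") as log:
            while True:
                topic, payload = sub.recv_multipart()
                log.write(json.dumps({"topic": topic.decode(),
                                      "msg": json.loads(payload)}) + "\n")

At deployment, the same bus would feed observations to whichever fine-tuned checkpoint (pi_0, pi_0.5, or GR00T N1.6) sits behind the subscriber, which is what makes the single-framework claim plausible.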

What carries the argument

The kirigami-based soft compliant gripper extension, which induces predictable deformation under compressive loading and thereby provides gentle, repeatable contact with delicate objects without force sensing.
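
The mechanism doing that work is mechanical, and the excerpt gives no model for it, but the cited kirigami literature (Isobe and Okumura; Yang et al.) describes the behavior being relied on: an initially stiff, roughly linear response that buckles into a much softer regime, passively capping the contact force. A hedged way to write the assumed force-displacement behavior, with every symbol hypothetical rather than a reported value:

    % Assumed piecewise response of the kirigami extension under compression
    F(\delta) \approx
    \begin{cases}
      k\,\delta, & \delta < \delta_c \quad \text{(pre-buckling, stiff)} \\
      F_c + k_s\,(\delta - \delta_c), & \delta \ge \delta_c,\; k_s \ll k \quad \text{(post-buckling, soft)}
    \end{cases}

Here $\delta_c$ is the buckling onset and the small post-buckling stiffness $k_s$ is what keeps the grasp force on a grape nearly constant without a force sensor; $k$, $k_s$, and $\delta_c$ are properties of the fabricated pattern that the excerpt does not report.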

If this is right

  • Capable VLA policies can be successfully fine-tuned and deployed using only low-cost modular components and standard teleoperation demonstrations.
  • The soft gripper enables safe handling of fragile items like grapes without dedicated force or tactile sensors.
  • A single ZMQ-based framework can coordinate perception, control, data logging, and policy execution on accessible hardware.
  • Multiple pretrained VLA models show comparable real-world performance when adapted to the same platform and dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar kirigami extensions could be adapted for other delicate tasks such as handling produce or laboratory samples.
  • The modular ZMQ architecture might simplify integration of new arms or sensors across different robot setups.
  • If the soft gripper pattern proves reliable, it offers a low-cost route to compliant grasping that avoids specialized end-effectors.

Load-bearing premise

The kirigami gripper produces consistent gentle deformation on contact and the teleoperation data supplies enough variety for the VLA models to fine-tune effectively.

What would settle it

Run repeated grape-grasping trials with the fine-tuned models and compare fruit-damage and pick-failure rates against an untrained baseline; if the fine-tuned policies do no better than that baseline, the claim fails.
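
A concrete version of that test: log each trial as picked-intact, dropped, or damaged for every fine-tuned policy and for a baseline grasp, then compare success proportions. The sketch below uses Fisher's exact test on hypothetical counts purely to show the computation; none of the numbers come from the paper.

    # Hedged evaluation sketch: compare per-policy grasp success against a
    # baseline with Fisher's exact test. All counts are placeholders.
    from scipy.stats import fisher_exact

    # (intact picks, failures or damage) out of 20 trials each -- hypothetical
    policy_trials = {"pi_0": (17, 3), "pi_0.5": (18, 2), "GR00T N1.6": (15, 5)}
    baseline_trials = (6, 14)  # e.g., a scripted open-loop grasp

    for name, (succ, fail) in policy_trials.items():
        table = [[succ, fail], list(baseline_trials)]
        _, p = fisher_exact(table, alternative="greater")
        print(f"{name}: success {succ / (succ + fail):.0%}, p={p:.3f} vs baseline")

If fruit damage is the failure mode of interest, the same comparison applies with damaged-fruit counts in place of failures.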

Figures

Figures reproduced from arXiv: 2605.02037 by Bill Cai, Hadi Khezam, Lifeng Zhou, Ran Yang, Shijie Geng, Yiming Feng, Yue (Luna) Zheng, Zijian An.

Figure 1: Overview of the VILAS system. (Left) The physical robotic platform.
Figure 2: Kirigami structure design and experimental demonstration. (a) Photograph of the fabricated kirigami structure buckled within a gripper during testing, (b) top view in Fusion 360, (c) isometric view in Fusion 360. A low-cost kirigami-based pattern was developed to be used as a soft extension grabber for a safe and effective method of handling delicate objects.
Figure 3: Communication architecture of the VILAS system during data collection.
Figure 4: Policy deployment overview and representative execution sequence.
Figure 5: Execution sequence of the cherry grasping task.
Original abstract

We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VILAS, a low-cost modular robotic manipulation platform that integrates a Fairino FR5 collaborative arm, Jodell RG52-50 electric gripper, dual-camera perception, and a ZMQ-based communication architecture to support teleoperation, data collection, and end-to-end VLA policy deployment. It introduces a kirigami-based soft compliant gripper extension for safe handling of fragile objects without explicit force sensing. Three state-of-the-art VLA models (pi_0, pi_0.5, and GR00T N1.6) are fine-tuned from public checkpoints on an identical teleoperation dataset and evaluated on a grape grasping task, with the abstract claiming that the experiments validate effective training and deployment on low-cost hardware.

Significance. If the empirical validation holds with quantitative support, the work offers a practical contribution to accessible robotics by lowering hardware barriers for VLA research and addressing safe manipulation of delicate items via the soft gripper design. The unified ZMQ framework and identical-dataset fine-tuning across models provide a useful engineering demonstration and deployment insights, though the absence of metrics limits its value as a reproducible benchmark.

major comments (2)
  1. [Abstract] The claim that 'Experiments on a grape grasping task validate the effectiveness of the proposed system' is unsupported by any reported success rates, failure analysis, baselines, or trial counts; this is load-bearing for the central empirical assertion that capable policies can be trained and deployed on the hardware.
  2. [Experiments] No details are provided on demonstration dataset size, data collection protocol, fine-tuning hyperparameters, or quantitative performance metrics for the three models, preventing assessment of whether the teleoperation data suffices for effective fine-tuning or whether the kirigami gripper performs predictably under load.
minor comments (2)
  1. The ZMQ architecture description would benefit from a system diagram or pseudocode to clarify coordination between teleoperation, data logging, and policy inference.
  2. [Abstract] Consider adding a brief comparison table of the three VLA models' deployment characteristics (e.g., inference latency or success patterns) to strengthen the practical insights claimed in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The points raised highlight important areas for strengthening the empirical claims and reproducibility. We will revise the manuscript to incorporate quantitative details and metrics as outlined below.

point-by-point responses
  1. Referee: [Abstract] The claim that 'Experiments on a grape grasping task validate the effectiveness of the proposed system' is unsupported by any reported success rates, failure analysis, baselines, or trial counts; this is load-bearing for the central empirical assertion that capable policies can be trained and deployed on the hardware.

    Authors: We agree that the abstract's validation claim would benefit from explicit quantitative support to stand on its own. In the revised manuscript, we will update the abstract to include key metrics such as success rates (e.g., X/Y trials for each model), trial counts, and a concise failure analysis summary. This will directly substantiate the assertion that capable policies can be trained and deployed on the low-cost hardware without relying solely on the experiments section. revision: yes

  2. Referee: [Experiments] No details are provided on demonstration dataset size, data collection protocol, fine-tuning hyperparameters, or quantitative performance metrics for the three models, preventing assessment of whether the teleoperation data suffices for effective fine-tuning or whether the kirigami gripper performs predictably under load.

    Authors: We acknowledge that the current Experiments section lacks these specifics, which are necessary for full reproducibility and evaluation. We will expand this section in the revision to report: the demonstration dataset size (number of trajectories collected), the teleoperation data collection protocol, fine-tuning hyperparameters for pi_0, pi_0.5, and GR00T N1.6, and quantitative performance metrics including success rates, failure modes, and any observations on the kirigami gripper's behavior under load during the grape grasping task. These additions will enable readers to assess the sufficiency of the data and the gripper's predictability. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper is an engineering demonstration of a low-cost robotic manipulation platform (VILAS) that integrates hardware components, a kirigami soft gripper, ZMQ architecture, and fine-tuning of three publicly released VLA models (pi_0, pi_0.5, GR00T N1.6) on teleoperation data for a grape-grasping task. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The central claim rests on empirical system performance and experimental validation rather than any reduction of outputs to inputs by construction, self-citation chains, or ansatz smuggling. The validation uses external pretrained checkpoints and real-world task results, rendering the presentation self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The paper is an engineering systems contribution focused on hardware integration and empirical evaluation. No free parameters, mathematical axioms, or invented physical entities are introduced in the central claim.

invented entities (1)
  • kirigami-based soft compliant gripper extension · no independent evidence
    purpose: To enable safe manipulation of fragile objects without relying on explicit force sensing by inducing predictable deformation under compressive loading
    A custom design choice for the platform; abstract provides no independent evidence or external validation beyond the grape grasping experiment.

pith-pipeline@v0.9.0 · 5547 in / 1273 out tokens · 72352 ms · 2026-05-08T19:29:26.733353+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    arXiv preprint arXiv:2509.14143 (2025)

    An, Z., Yang, R., Feng, Y., Zhou, L.: Claw: A vision-language-action framework for weight-aware robotic grasping. arXiv preprint arXiv:2509.14143 (2025)

  2. [2]

    Agronomy 15(11), 2650 (2025)

    Ao, J., Ji, W., Yu, X., Ruan, C., Xu, B.: End-effectors for fruit and vegetable harvesting robots: A review of key technologies, challenges, and future prospects. Agronomy 15(11), 2650 (2025)

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  4. [4]

    In: Proceedings of The 7th Conference on Robot Learning

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Finn, C., Florence, P., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Proceedings of The 7th Conference on Robot Learning. pp. 2165–2183 (2023)

  5. [5]

    The International Journal of Robotics Research (2023)

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research (2023)

  6. [6]

    In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2024)

    Collaboration, O.X.E., O’Neill, A., Rehman, A., Gupta, A., Maddukuri, A., Gupta, A., et al.: Open X-Embodiment: Robotic learning datasets and RT-X models. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2024)

  7. [7]

    Telemoma: A modular and versatile teleoperation system for mobile manipulation. arXiv preprint arXiv:2403.07869 (2024)

    Dass, S., Ai, W., Jiang, Y., Singh, S., Hu, J., Zhang, R., Stone, P., Abbatematteo, B., Martín-Martín, R.: Telemoma: A modular and versatile teleoperation system for mobile manipulation. arXiv preprint arXiv:2403.07869 (2024)

  8. [8]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    Fang, H., Fang, H.S., Wang, Y., Ren, J., Chen, J., Zhang, R., Wang, W., Lu, C.: Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 15031–15038. IEEE (2024)

  9. [9]

    In: Conference on Robot Learning (CoRL) (2024)

    Fu, Z., Zhao, T.Z., Finn, C.: Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In: Conference on Robot Learning (CoRL) (2024)

  10. [10]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  11. [11]

    Scientific Reports 6 (2016)

    Isobe, M., Okumura, K.: Initial rigid response and softening transition of highly stretchable kirigami sheet materials. Scientific Reports 6 (2016). https://doi.org/10.1038/srep24758

  12. [12]

    IEEE Access (2025)

    Kawaharazuka, K., Oh, J., Yamada, J., Posner, I., Zhu, Y.: Vision-language-action models for robotics: A review towards real-world applications. IEEE Access (2025)

  13. [13]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

  14. [14]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  15. [15]

    Journal of Agricultural Engineering 52(1) (2021)

    Kultongkham, A., Kumnon, S., Thintawornkul, T., Chanthasopeephan, T., et al.: The design of a force feedback soft gripper for tomato harvesting. Journal of Agricultural Engineering 52(1) (2021)

  16. [16]

    International Journal of Advanced Robotic Systems 20(6), 17298806231213442 (2023)

    Li, Z., Yuan, X., Yang, Z.: Design, simulation, and experiment for the end effector of a spherical fruit picking robot. International Journal of Advanced Robotic Systems 20(6), 17298806231213442 (2023)

  17. [17]

    Advanced Intelligent Systems 5(12), 2300233 (2023)

    Liu, Y., Hou, J., Li, C., Wang, X.: Intelligent soft robotic grippers for agricultural and food product handling: A brief review with a focus on design and control. Advanced Intelligent Systems 5(12), 2300233 (2023). https://doi.org/10.1002/aisy.202300233

  18. [18]

    In: Conference on Robot Learning

    Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., et al.: Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In: Conference on Robot Learning. pp. 879–893. PMLR (2018)

  19. [19]

    Frontiers in Robotics and AI 10, 1330496 (2024)

    Navas, E., Shamshiri, R.R., Dworak, V., Weltzien, C., Fernández, R.: Soft gripper for small fruits harvesting and pick and place operations. Frontiers in Robotics and AI 10, 1330496 (2024)

  20. [20]

    NVIDIA: GR00T N1.6: An improved foundation model for generalist humanoid robots. https://research.nvidia.com/labs/gear/gr00t-n1_6/ (2025)

  21. [21]

    NVIDIA, Bjorck, J., et al.: GR00T N1: An open foundation model for generalist humanoid robots (2025)

  22. [22]

    In: Actuators

    Ochoa, E., Mo, C.: Design and field evaluation of an end effector for robotic strawberry harvesting. In: Actuators. vol. 14, p. 42. MDPI (2025)

  23. [23]

    In: Proceedings of Robotics: Science and Systems (2024)

    Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., Luo, J., Tan, Y.L., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., Levine, S.: Octo: An open-source generalist robot policy. In: Proceedings of Robotics: Science and Systems (2024)

  24. [24]

    Soft robotic grippers

    Shintake, J., Cacucciolo, V., Floreano, D., Shea, H.: Soft robotic grippers. Advanced Materials 30(29), 1707035 (2018). https://doi.org/10.1002/adma.201707035

  25. [25]

    CAAI Transactions on Intelligence Technology doi:10.1049/cit2.70022

    Wang, S., Nikolić, M.N., Lam, T.L., Gao, Q., Ding, R., Zhang, T.: Robot manipulation based on embodied visual perception: A survey. CAAI Transactions on Intelligence Technology 10, 945–958 (2025). https://doi.org/10.1049/cit2.70022

  26. [26]

    IEEE Robotics and Automation Letters 5(2), 2762–2769 (2020)

    Wen, R., Yuan, K., Wang, Q., Heng, S., Li, Z.: Force-guided high-precision grasping control of fragile and deformable objects using sEMG-based force prediction. IEEE Robotics and Automation Letters 5(2), 2762–2769 (2020)

  27. [27]

    International Journal of Solids and Structures 317, 113410 (2025)

    Wu, J., Yang, B., Li, Z., He, S., Bu, Y., Su, B., Wang, Y.: Surface curvature regulation of 3D kirigami soft gripper. International Journal of Solids and Structures 317, 113410 (2025). https://doi.org/10.1016/j.ijsolstr.2025.113410

  28. [28]

    In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Wu, P., Shentu, Y., Yi, Z., Lin, X., Abbeel, P.: Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 12156–12163. IEEE (2024)

  29. [29]

    arXiv preprint arXiv:2509.14138 (2025)

    Yang, R., An, Z., Zhou, L., Feng, Y.: Seqvla: Sequential task execution for long-horizon manipulation with completion-aware vision-language-action model. arXiv preprint arXiv:2509.14138 (2025)

  30. [30]

    Science Robotics 6(54), eabd6426 (2021)

    Yang, Y., Vella, K., Holmes, D.P.: Grasping with kirigami shells. Science Robotics 6(54), eabd6426 (2021). https://doi.org/10.1126/scirobotics.abd6426

  31. [31]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)