pith. machine review for the scientific record.

arxiv: 2604.19643 · v1 · submitted 2026-04-21 · 💻 cs.RO

Recognition: unknown

A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:36 UTC · model grok-4.3

classification 💻 cs.RO
keywords: gesture recognition · visual learning model · AcoustoBots · human-swarm interaction · acoustophoretic robots · mid-air haptics · OpenCLIP · multimodal control

The pith

A vision-language model classifies hand gestures to switch AcoustoBot modalities with 87.8% accuracy in swarm experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a contactless control system for AcoustoBots, which are robots that produce mid-air haptics, directional audio, and acoustic levitation. It uses an ESP32-CAM camera to capture gestures, a PhaseSpace tracker for robot positions, and an OpenCLIP visual learning model with linear probing to recognize three hand gestures and route them to the correct robot function. Validation accuracy rises from roughly 67% on small datasets to nearly 98% on larger ones. Integrated tests with two robots across 90 trials yield 87.8% correct modality switches and 3.95 seconds average latency. This matters because it replaces scripted commands with a natural gesture interface, making multimodal acoustic robots more usable for human-swarm tasks.

Core claim

The paper establishes that an OpenCLIP-based visual learning model with linear probing can reliably classify three hand gestures captured by an ESP32-CAM and map them to haptics, audio, or levitation commands on a swarm of AcoustoBots. In controlled experiments with two robots, the system reaches 87.8% end-to-end switching accuracy with 3.95-second average latency, and validation accuracy improves from about 67% to nearly 98% as the training dataset grows.

What carries the argument

The OpenCLIP-based visual learning model with linear probing, which takes ESP32-CAM gesture images as input, classifies them into one of three gestures, and directs the centralized processor to activate the matching modality on the AcoustoBots.
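As a concrete illustration of this pipeline, the sketch below shows how a frozen OpenCLIP image encoder plus a linear head could classify a single camera frame and route the result to a modality. The backbone name, class ordering, checkpoint file, and the send_command() dispatch are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: frozen OpenCLIP encoder + linear probe -> gesture class -> modality.
# Backbone, class order, file names, and the dispatch helper are assumptions.
import torch
import open_clip
from PIL import Image

GESTURE_TO_MODALITY = {0: "haptics", 1: "audio", 2: "levitation"}  # assumed ordering

# create_model_and_transforms returns (model, train_preprocess, eval_preprocess)
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

probe = torch.nn.Linear(512, 3)  # 512-d ViT-B/32 image embedding -> 3 gesture logits
probe.load_state_dict(torch.load("gesture_probe.pt"))  # hypothetical trained head
probe.eval()

def classify_frame(path: str) -> str:
    """Map one ESP32-CAM frame to the modality implied by the recognized gesture."""
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # CLIP-style L2 normalization
        gesture = int(probe(emb).argmax(dim=-1))
    return GESTURE_TO_MODALITY[gesture]

# e.g. send_command(robot_id, classify_frame("frame_0421.jpg"))  # hypothetical dispatch
```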

If this is right

  • Larger gesture datasets directly improve classification accuracy up to nearly 98%.
  • The mapping of gestures to modalities enables real-time contactless switching between haptics, audio, and levitation on the robots.
  • Centralized processing supports the reported 3.95-second end-to-end latency in small-swarm trials (a timing sketch follows this list).
  • The approach provides a working foundation for replacing scripted commands with intuitive visual interfaces in acoustophoretic swarms.
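The latency bullet above refers to an end-to-end figure; one minimal way to obtain a per-stage breakdown like the one in Figure 10 is to wrap each pipeline stage in a timer. The stage names and the grab/classify/dispatch helpers below are placeholders, not the authors' implementation.

```python
# Hedged sketch: per-stage latency instrumentation for a capture -> inference ->
# dispatch pipeline. Stage names and helper functions are illustrative placeholders.
import time

def timed(stage: str, fn, *args, timings: dict, **kwargs):
    """Run fn, record its wall-clock duration under `stage`, and return its result."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[stage] = time.perf_counter() - t0
    return result

timings = {}
# frame_path = timed("capture",   grab_esp32_frame, timings=timings)          # hypothetical
# gesture    = timed("inference", classify_frame, frame_path, timings=timings)  # hypothetical
# timed("dispatch", send_command, robot_id, gesture, timings=timings)          # hypothetical
# print(timings, "total:", round(sum(timings.values()), 2), "s")
```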

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the gesture vocabulary beyond three classes could support more complex swarm behaviors without retraining the entire model from scratch.
  • Decentralizing the image processing across multiple ESP32-CAM units might lower latency when scaling to larger swarms.
  • The same linear-probing technique could be tested on other pre-trained vision models to compare robustness for robotic gesture tasks.
  • Real-world deployment would require explicit trials measuring accuracy under simultaneous multi-user input or outdoor lighting changes.

Load-bearing premise

The linear-probed OpenCLIP classifier trained on a small controlled dataset will keep high accuracy when gestures vary in speed, lighting, background, or when multiple users interact at once.

What would settle it

Running the gesture classifier under changing background clutter, or with users performing the same gestures at different speeds, and observing accuracy fall below 70% would show that the model does not maintain reliable performance outside its training conditions.
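A minimal sketch of that settling experiment, assuming a classify_frame() predictor like the one sketched earlier and a test set tagged with the condition each sample was recorded under; the condition names and the 70% threshold mirror the criterion above but are otherwise assumptions.

```python
# Hedged sketch: accuracy per recording condition, flagging any condition below 70%.
# `test_samples` and `classify_frame` are assumed inputs, not artifacts from the paper.
from collections import defaultdict

def accuracy_by_condition(samples, classify_frame):
    """samples: iterable of (image_path, true_modality, condition) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for path, true_modality, condition in samples:
        totals[condition] += 1
        hits[condition] += classify_frame(path) == true_modality
    return {c: hits[c] / totals[c] for c in totals}

# for condition, acc in accuracy_by_condition(test_samples, classify_frame).items():
#     print(f"{condition:>20s}  accuracy={acc:.3f}  {'FAIL' if acc < 0.70 else 'ok'}")
```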

Figures

Figures reproduced from arXiv: 2604.19643 by Alex Lin, Lei Gao, Narsimlu Kemsaram, Sriram Subramanian.

Figure 1: Conceptual illustration of the proposed gesture-based AcoustoBot swarm system, where recognized hand gestures are …
Figure 2: System architecture and data flow of the proposed …
Figure 4: Comparison of training loss over epochs for differ…
Figure 7: Validation accuracy curve for the first training run …
Figure 8: Validation accuracy curve for the final training …
Figure 9: Experimental setup for the AcoustoBot gesture-interaction evaluation: A) schematic of the test arena showing the …
Figure 10: Average response latency across the main stages …
Figure 11: Distribution of end-to-end command execution …
read the original abstract

AcoustoBots are mobile acoustophoretic robots capable of delivering mid-air haptics, directional audio, and acoustic levitation, but existing implementations rely on scripted commands and lack an intuitive interface for real-time human control. This work presents a gesture-based visual learning framework for contactless human-swarm interaction with a multimodal AcoustoBot platform. The system combines ESP32-CAM gesture capture, PhaseSpace motion tracking, centralized processing, and an OpenCLIP-based visual learning model (VLM) with linear probing to classify three hand gestures and map them to haptics, audio, and levitation modalities. Validation accuracy improved from about 67% with a small dataset to nearly 98% with the largest dataset. In integrated experiments with two AcoustoBots, the system achieved an overall gesture-to-modality switching accuracy of 87.8% across 90 trials, with an average end-to-end latency of 3.95 seconds. These results demonstrate the feasibility of using a vision-language-model-based gesture interface for multimodal human-swarm interaction. While the current system is limited by centralized processing, a static gesture set, and controlled-environment evaluation, it establishes a foundation for more expressive, scalable, and accessible swarm robotic interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a gesture-based visual learning framework for contactless human-swarm interaction with AcoustoBots, which are mobile acoustophoretic robots. It uses an ESP32-CAM for gesture capture, PhaseSpace tracking, and a linear-probed OpenCLIP visual learning model to classify three hand gestures and map them to haptics, audio, and levitation modalities. The work reports validation accuracy improving to nearly 98% with larger datasets and an integrated experiment result of 87.8% gesture-to-modality switching accuracy across 90 trials with two AcoustoBots, along with 3.95s average end-to-end latency.

Significance. If the reported accuracies and low latency generalize, the work would establish a practical foundation for intuitive, vision-language-model-based interfaces in multimodal robotic swarms, advancing contactless human-swarm interaction for applications involving mid-air haptics and acoustic levitation. The use of linear probing on OpenCLIP is a lightweight and accessible technique that could be extended, though the current evaluation under controlled conditions limits broader claims.

major comments (3)
  1. [Abstract] Abstract: The headline claims of 87.8% overall accuracy across 90 trials and 3.95s latency are presented without any accompanying dataset sizes, number of samples per gesture class, training hyperparameters for the linear probe, baseline comparisons (e.g., to other classifiers or end-to-end models), error bars, or statistical tests, preventing verification of whether these numbers support the feasibility conclusion.
  2. [Experiments] Experiments section (integrated trials): The 87.8% gesture-to-modality accuracy is obtained under the same controlled-environment conditions used for training the OpenCLIP linear probe (reaching ~98% validation accuracy); no ablation, cross-validation, or test sets are described that vary gesture speed, lighting, background clutter, camera angle, or simultaneous multi-user input, so the result does not yet demonstrate robustness required for the claimed real-world human-swarm interface.
  3. [Methods] Methods (visual learning model): The linear-probed OpenCLIP classifier is the load-bearing component for mapping gestures to modalities, yet the manuscript provides no details on the size or diversity of the collected gesture image dataset, the exact probing procedure, or any regularization to mitigate overfitting to the small, controlled training distribution.
minor comments (2)
  1. [Abstract] The abstract and conclusion mention limitations (centralized processing, static gesture set, controlled evaluation) but do not quantify their impact or outline concrete next steps for addressing generalization.
  2. [Figures] Figure captions and system diagrams could more clearly label data flow between ESP32-CAM, PhaseSpace, centralized processor, and the VLM to improve readability for readers unfamiliar with the hardware.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims of 87.8% overall accuracy across 90 trials and 3.95s latency are presented without any accompanying dataset sizes, number of samples per gesture class, training hyperparameters for the linear probe, baseline comparisons (e.g., to other classifiers or end-to-end models), error bars, or statistical tests, preventing verification of whether these numbers support the feasibility conclusion.

    Authors: We agree that the abstract would benefit from additional context to support the reported figures. In the revised manuscript, we will incorporate the total size of the gesture dataset, the number of samples per gesture class, and the training hyperparameters used for the linear probe (such as the optimizer, learning rate, and number of epochs). For baseline comparisons, error bars, and statistical tests, these were not part of the original analysis; we will add a statement acknowledging this and include basic error bars derived from the trial data where possible. Full comparative baselines would require new experiments beyond the scope of this initial study. revision: partial

  2. Referee: [Experiments] Experiments section (integrated trials): The 87.8% gesture-to-modality accuracy is obtained under the same controlled-environment conditions used for training the OpenCLIP linear probe (reaching ~98% validation accuracy); no ablation, cross-validation, or test sets are described that vary gesture speed, lighting, background clutter, camera angle, or simultaneous multi-user input, so the result does not yet demonstrate robustness required for the claimed real-world human-swarm interface.

    Authors: The evaluation was indeed performed in a controlled laboratory setting to validate the core functionality of the gesture-to-modality mapping. We will revise the Experiments section to more clearly state the controlled conditions and add a new subsection on limitations that explicitly discusses the absence of ablations and tests under varying conditions such as lighting changes or multi-user scenarios. While we recognize that robustness testing would enhance the work, the current results establish initial feasibility for the proposed interface. We do not plan to conduct additional experiments for this revision but will temper the claims accordingly. revision: partial

  3. Referee: [Methods] Methods (visual learning model): The linear-probed OpenCLIP classifier is the load-bearing component for mapping gestures to modalities, yet the manuscript provides no details on the size or diversity of the collected gesture image dataset, the exact probing procedure, or any regularization to mitigate overfitting to the small, controlled training distribution.

    Authors: We acknowledge the need for greater transparency in the Methods section. The revised manuscript will specify the exact size and composition of the gesture image dataset, including the number of images per class and any measures of diversity (e.g., variations in hand poses within the controlled setup). We will detail the linear probing procedure, including how the OpenCLIP vision encoder was frozen and the classification head trained, along with any regularization applied (e.g., L2 regularization or early stopping) to address potential overfitting. revision: yes
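To make responses 1 and 3 concrete, two hedged sketches follow. The first shows one plausible form of the promised probing details from response 3: the OpenCLIP image encoder frozen, a linear head trained with decoupled weight decay and early stopping. The dataset layout, backbone, and hyperparameters are assumptions, not values reported by the authors.

```python
# Hedged sketch of a linear-probing setup: frozen OpenCLIP image encoder, trainable
# linear head, weight decay and early stopping as regularizers. Paths and
# hyperparameters are illustrative assumptions.
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()
for p in model.parameters():               # freeze the pretrained backbone
    p.requires_grad = False

# Hypothetical directory layout: gestures/{train,val}/{class_0,class_1,class_2}/*.jpg
train_loader = DataLoader(ImageFolder("gestures/train", transform=preprocess),
                          batch_size=32, shuffle=True)
val_loader = DataLoader(ImageFolder("gestures/val", transform=preprocess),
                        batch_size=32)

probe = torch.nn.Linear(512, 3)             # 3 gesture classes
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

best_acc, patience, bad_epochs = 0.0, 5, 0
for epoch in range(100):
    probe.train()
    for images, labels in train_loader:
        with torch.no_grad():
            emb = model.encode_image(images)  # embeddings only; no backbone gradients
        loss = loss_fn(probe(emb), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

    probe.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = probe(model.encode_image(images)).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    val_acc = correct / total

    if val_acc > best_acc:                   # keep the best head seen so far
        best_acc, bad_epochs = val_acc, 0
        torch.save(probe.state_dict(), "gesture_probe.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # early stopping
            break
```

The second sketches the kind of error bar mentioned in response 1: a 95% Wilson score interval on the 87.8% switching accuracy, taking a success count of 79 out of 90 trials as an inference from the reported percentage rather than a figure stated in the paper.

```python
# Hedged sketch: Wilson score interval for the reported switching accuracy.
# 79/90 ≈ 87.8% is inferred from the paper's percentage, not stated directly.
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

print(wilson_interval(79, 90))  # roughly (0.79, 0.93)
```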

Circularity Check

0 steps flagged

No circularity: empirical accuracies are measured outcomes, not derived by construction

full rationale

The paper presents a system description followed by empirical results: a linear probe on OpenCLIP is trained on the authors' collected gesture images, validation accuracy is reported on held-out portions of that data, and end-to-end accuracy is measured in 90 integrated trials with two physical AcoustoBots. No equations, derivations, or self-citations are invoked that reduce the reported 87.8% or 98% figures to fitted parameters or prior author work by definition. The performance numbers are direct experimental measurements under the stated conditions; they do not loop back to the inputs via any of the enumerated circular patterns. Generalization concerns exist but are separate from circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical success of a fine-tuned vision model in a controlled setting; no new physical entities or mathematical axioms are introduced beyond standard assumptions of computer vision and supervised learning.

free parameters (1)
  • linear probing weights
    The final classification layer is trained on the authors' gesture images; its parameters are fitted to data and not derived from first principles.
axioms (1)
  • domain assumption: OpenCLIP image embeddings contain sufficient information to distinguish the three chosen hand gestures under controlled lighting and pose conditions.
    This assumption is required for the linear probe to achieve the reported validation accuracies and for the subsequent mapping to robot modalities.
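One hedged way to probe this assumption without training anything is OpenCLIP's zero-shot scoring: embed candidate gesture prompts and a captured frame, and inspect how cleanly the similarities separate. The paper does not name its three gestures, so the prompt texts and image path below are placeholders.

```python
# Hedged sketch: zero-shot similarity check of whether OpenCLIP embeddings separate
# the gesture classes. Prompt wording and the sample image path are assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

prompts = ["a photo of a hand making a fist",        # placeholder gesture names
           "a photo of an open palm",
           "a photo of a hand pointing with one finger"]

with torch.no_grad():
    text = model.encode_text(tokenizer(prompts))
    text = text / text.norm(dim=-1, keepdim=True)
    image = model.encode_image(preprocess(Image.open("sample_gesture.jpg")).unsqueeze(0))
    image = image / image.norm(dim=-1, keepdim=True)
    sims = (image @ text.T).squeeze(0)                # cosine similarity per prompt

print({p: round(float(s), 3) for p, s in zip(prompts, sims)})
```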

pith-pipeline@v0.9.0 · 5539 in / 1525 out tokens · 85158 ms · 2026-05-10T01:36:20.434279+00:00 · methodology

