Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Erik Visser; Phanidhar Chinchili; Rehana Mahfuz; Yinyi Guo

arxiv: 2602.15707 · v2 · pith:BZZ465UFnew · submitted 2026-02-17 · 💻 cs.MM · cs.CL · cs.LG

classification 💻 cs.MM cs.CL cs.LG

keywords assistant, procedural, task, user, conversational, manual, questions, ability

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight, privacy-preserving modalities, such as audio and inertial measurement unit (IMU) inputs from a user's wearable device, to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task and answers user questions. We illustrate the data generation method and the system design used to achieve such an assistant. Observing that an off-the-shelf language model is a talkative assistant that is not always able to answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogue, with a 50% increase in precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VisionClaw: Always-On AI Agents through Smart Glasses
cs.HC 2026-04 unverdicted novelty 5.0

VisionClaw couples continuous egocentric vision on smart glasses with speech-driven AI agents to enable hands-free real-world tasks, with lab and field studies showing faster completion and a shift toward opportunisti...