pith. machine review for the scientific record.

arxiv: 2602.03059 · v2 · submitted 2026-02-03 · 💻 cs.HC · cs.CL · cs.ET · cs.IR

Recognition: unknown

From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

Authors on Pith: no claims yet
classification 💻 cs.HC · cs.CL · cs.ET · cs.IR
keywords guidance · speech-to-spatial · cues · disambiguation · live · referent · relational · remote
Original abstract

We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VisionClaw: Always-On AI Agents through Smart Glasses

    cs.HC · 2026-04 · unverdicted · novelty 5.0

    VisionClaw couples continuous egocentric vision on smart glasses with speech-driven AI agents to enable hands-free real-world tasks, with lab and field studies showing faster completion and a shift toward opportunisti...