pith. machine review for the scientific record.

arxiv: 2602.03059 · v2 · submitted 2026-02-03 · 💻 cs.HC · cs.CL · cs.ET · cs.IR

Recognition: unknown

From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

Authors on Pith: no claims yet
classification 💻 cs.HC · cs.CL · cs.ET · cs.IR
keywords guidance · speech-to-spatial · cues · disambiguation · live · referent · relational · remote
Original abstract

We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VisionClaw: Always-On AI Agents through Smart Glasses

    cs.HC · 2026-04 · unverdicted · novelty 5.0

    VisionClaw couples continuous egocentric vision on smart glasses with speech-driven AI agents to enable hands-free real-world tasks, with lab and field studies showing faster completion and a shift toward opportunisti...