Learning Individual Styles of Conversational Gesture

Amir Bar; Andrew Owens; Caroline Chan; Gefen Kohavi; Jitendra Malik; Shiry Ginosar

arxiv: 1906.04160 · v1 · pith:6DGTY4XFnew · submitted 2019-06-10 · 💻 cs.CV · cs.LG · eess.AS

Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik

classification 💻 cs.CV cs.LG eess.AS

keywords: speech, gestures, gesture, hand, video, accompanied, along, audio

Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures. The project website with video, code and data can be found at http://people.eecs.berkeley.edu/~shiry/speech2gesture.
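To make "quantitative comparison against noisy pseudo ground truth" concrete, here is a minimal sketch of how a predicted pose sequence might be scored against keypoints produced by an automatic pose detector. It is not the authors' code or their exact evaluation protocol. It assumes poses are lists of 2D (x, y) keypoints per frame and uses a mean L1 coordinate error. The function names `l1_pose_error` and `median_pose_baseline` are hypothetical, and the baseline (repeat the per-keypoint median training pose, ignoring the audio) is an illustrative audio-agnostic reference point.

```python
from statistics import median

# Hypothetical representation: one frame = list of (x, y) keypoints.
Pose = list[tuple[float, float]]


def l1_pose_error(pred: list[Pose], target: list[Pose]) -> float:
    """Mean absolute coordinate error over all frames, keypoints and both axes."""
    assert len(pred) == len(target), "sequences must have the same number of frames"
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        assert len(p_frame) == len(t_frame), "frames must have the same keypoints"
        for (px, py), (tx, ty) in zip(p_frame, t_frame):
            total += abs(px - tx) + abs(py - ty)
            count += 2  # one x term and one y term per keypoint
    return total / count


def median_pose_baseline(train: list[Pose], n_frames: int) -> list[Pose]:
    """Hypothetical audio-agnostic baseline: repeat the per-keypoint median training pose."""
    n_kp = len(train[0])
    med = [
        (median(f[k][0] for f in train), median(f[k][1] for f in train))
        for k in range(n_kp)
    ]
    return [list(med) for _ in range(n_frames)]


# Usage: pseudo ground truth here stands in for detector output on held-out frames.
train = [[(0, 0), (2, 2)], [(1, 1), (3, 3)], [(2, 2), (4, 4)]]
target = [[(1, 1), (3, 3)], [(2, 1), (3, 5)]]
baseline = median_pose_baseline(train, n_frames=len(target))
print(l1_pose_error(baseline, target))  # 0.375
```

A speech-driven model would be compared by passing its predicted sequence to the same error function; because the targets come from a pose detector, the score measures agreement with noisy pseudo labels rather than with true motion.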

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning
cs.CL 2026-06 unverdicted novelty 6.0

DAIN reframes multimodal fusion as dynamic agent collaboration with sparse activation, claiming state-of-the-art results across five benchmarks, including a 2.6% accuracy gain on ADNI.