B1 – Speech-gesture alignment
Stefan Kopp, Hannes Rieser
Humans intuitively combine language with spontaneous gesture to form multimodal utterances. In
such utterances, words and gestures are highly coordinated and closely intertwined; in other
words, they are aligned to each other by the human speaker. These alignments concern the meaning
that the verbal and non-verbal behaviors convey, the form they take in doing so, the manner in
which they are performed, their relative temporal arrangement, and their coordinated organization
in the phrasal structure of the utterance. Their effects are essential for how meaning is
communicated jointly by both modalities. The resulting confluence of language and gesture has led many
researchers (e.g. McNeill, 2005) to believe that speech and gesture are produced by one and the
same generative process. Still, exactly how language and gesture interact in producing a
coherent multimodal utterance is an open question. The goal of subproject B1 is to investigate
the intra-personal mechanisms that underlie the composition of a multimodal utterance in
dialogue. Concretely, we investigate the following research questions:
- What kind(s) of meaning do people convey in concurrent speech and gesture in order to
pursue their communicative intentions? This is the level of meaning construction and
concerns the selection, composition, representation, and distribution of meaning as it comes
to be expressed in speech and gesture.
- What forms do speech and gesture take to convey this meaning in context? For deictic
gestures, we study the pointers' "pointing cones", i.e. the domains singled out by
pointing gestures. For iconic gestures, we study the gestural forms that speakers use to
depict aspects of a referent and the verbal constructions with which these gestures
combine.
- How are speech and gesture organized into multimodal deliveries? Here we focus on the role
of self-monitoring, regarded as a special case of self-alignment, for the portioning of
communicative intentions and content into idea units.
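The notion of a pointing cone from the second question can be made concrete with a small geometric sketch. It treats the cone as all points whose direction from the pointing origin deviates from the pointing ray by at most a fixed half-angle. The function name, the coordinates, and the 10-degree half-angle are assumptions chosen for illustration; the cone sizes actually measured in the studies are not stated here.

```python
import math

def in_pointing_cone(origin, direction, target, half_angle_deg):
    """Return True if `target` lies inside the cone around the pointing ray.

    origin, direction, target: 3D coordinates as sequences of floats.
    half_angle_deg: hypothetical half-opening angle of the cone in degrees.
    """
    to_target = [t - o for t, o in zip(target, origin)]
    dist = math.sqrt(sum(c * c for c in to_target))
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    if dist == 0.0:
        return True  # the target coincides with the apex of the cone
    # Cosine of the angle between the pointing ray and the line to the target.
    cos_angle = sum(a * b for a, b in zip(to_target, direction)) / (dist * norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= half_angle_deg

# Invented example: pointing along the x-axis with a 10-degree half-angle.
near_axis = in_pointing_cone((0, 0, 0), (1, 0, 0), (1.0, 0.1, 0.0), 10.0)
off_axis = in_pointing_cone((0, 0, 0), (1, 0, 0), (1.0, 1.0, 0.0), 10.0)
```

In this toy setup the first object lies roughly 6 degrees off the pointing ray and falls inside the cone, while the second lies 45 degrees off and falls outside it.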
B1 investigates these topics through empirical study of human multimodal behavior and through the
design and simulation of computational models in virtual humans. Empirical studies elicit sets of
dialogue games. Video and VR tracking data are annotated in order to extract statistically
significant patterns. Based on the behavioral units found in the data, the generation processes
that turn content representations and communicative intentions into verbal and gestural behavior
are modeled both theoretically and computationally, informing the implementation of a prototype
simulation system with our virtual human Max.
Computational Model
Based on an empirical study on spatial descriptions of landmarks in direction-giving, our model
allows virtual humans to automatically generate coordinated language and iconic gestures. The
model is characterized by a close interplay between these two modes of expressiveness: We
utilize two different kinds of content representation, visuo-spatial imagery and
propositional-linguistic knowledge. Further, specific planners carry out the formulation of
concrete verbal and gestural behavior. Both content planning and formulation processes run in
parallel and interact via a multimodal working memory. In gesture formulation we apply a novel
probabilistic approach which not only incorporates systematic factors constraining the mapping
of visuo-spatial referent properties onto gesture morphology, but also accounts for the role of
idiosyncratic patterns in multimodal behavior.
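To illustrate how systematic and idiosyncratic factors could both enter a probabilistic choice of gesture form, the following is a minimal sketch and not the model used in the project. It estimates P(gesture technique | referent shape) from counts, once over a whole corpus and once for a single speaker, and blends the two distributions with a weight. The function names, the shape and technique labels, the data, and the blending weight are all invented for illustration.

```python
from collections import Counter, defaultdict

def estimate(observations):
    """Estimate P(technique | shape) from (shape, technique) pairs by counting."""
    counts = defaultdict(Counter)
    for shape, technique in observations:
        counts[shape][technique] += 1
    return {
        shape: {t: n / sum(c.values()) for t, n in c.items()}
        for shape, c in counts.items()
    }

def blend(general, speaker, weight):
    """Mix a general and a speaker-specific distribution for one referent shape.

    weight = 0 uses only the general pattern, weight = 1 only the speaker's.
    """
    techniques = set(general) | set(speaker)
    return {
        t: (1 - weight) * general.get(t, 0.0) + weight * speaker.get(t, 0.0)
        for t in techniques
    }

# Invented toy data: (referent shape, gesture technique) observations.
corpus = [
    ("round", "shaping"), ("round", "shaping"), ("round", "shaping"),
    ("round", "drawing"), ("tall", "posturing"), ("tall", "drawing"),
]
speaker_data = [("round", "drawing"), ("round", "drawing")]

general = estimate(corpus)          # systematic pattern across speakers
speaker = estimate(speaker_data)    # idiosyncratic pattern of one speaker

dist = blend(general["round"], speaker["round"], weight=0.5)
best = max(dist, key=dist.get)
```

With these toy numbers, the corpus as a whole favors "shaping" for round referents, but the blended distribution for this speaker favors "drawing", which is the kind of speaker-specific deviation the paragraph above refers to.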
A video is available demonstrating the simulated gesturing behavior of a particular speaker
from our empirical data.