SFB 673 - B1
B1 - Speech-gesture alignment
Humans intuitively combine language with spontaneous gesture to form multimodal utterances. In such utterances, words and gestures are highly coordinated and closely intertwined; in other words, the speaker aligns them with each other. These alignments concern the meaning that the verbal and non-verbal behaviors convey, the form they take in doing so, the manner in which they are performed, their relative temporal arrangement, and their coordinated organization in the phrasal structure of the utterance. Their effects are essential for how meaning is communicated by both modalities in concert. The resulting confluence of language and gesture has led many researchers (e.g. McNeill, 2005) to believe that speech and gesture are produced by one and the same generative process. Still, exactly how language and gesture interact in producing a coherent multimodal utterance is an open question. The goal of subproject B1 is to investigate the intra-personal mechanisms that underlie the composition of a multimodal utterance in dialogue. Concretely, we investigate the following research questions:

  1. What kind(s) of meaning do people convey in concurrent speech and gesture in order to pursue their communicative intentions? This is the level of meaning construction and concerns the selection, composition, representation, and distribution of meaning as it comes to be expressed in speech and gesture.
  2. What forms do speech and gesture take to convey this meaning in context? For deictic gestures, we study the pointers' "pointing cones", i.e. the domains singled out by pointing gestures. For iconic gestures, we study the gestural forms that speakers use to depict aspects of a referent and the verbal constructions with which they combine.
  3. How are speech and gesture organized into multimodal deliveries? Here we focus on the role of self-monitoring, regarded as a special case of self-alignment, for the portioning of communicative intentions and content into idea units.
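
The notion of a pointing cone in question 2 can be illustrated with a minimal geometric sketch. It assumes that the cone is a right circular cone with its apex at the fingertip and its axis along the pointing direction; the function name, the coordinate convention, and the opening angle used in the usage note are hypothetical and are not taken from the project's data or models.

```python
import math


def in_pointing_cone(apex, direction, target, half_angle_deg):
    """Return True if `target` lies inside a cone around a pointing ray.

    apex           -- (x, y, z) position of the fingertip (assumed cone apex)
    direction      -- (x, y, z) pointing direction, need not be normalized
    target         -- (x, y, z) position of a candidate referent
    half_angle_deg -- half of the cone's opening angle, in degrees
                      (a free parameter here, not an empirical value)
    """
    # Vector from the fingertip to the candidate referent.
    to_target = [t - a for t, a in zip(target, apex)]
    target_len = math.sqrt(sum(c * c for c in to_target))
    direction_len = math.sqrt(sum(c * c for c in direction))

    if direction_len == 0.0:
        raise ValueError("direction must be a non-zero vector")
    if target_len == 0.0:
        # The referent coincides with the fingertip: treat it as inside.
        return True

    # Angle between the pointing axis and the fingertip-to-referent vector.
    cos_angle = sum(a * b for a, b in zip(to_target, direction))
    cos_angle /= target_len * direction_len
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    angle_deg = math.degrees(math.acos(cos_angle))

    return angle_deg <= half_angle_deg
```

With a hypothetical half-angle of 10 degrees, `in_pointing_cone((0, 0, 0), (1, 0, 0), (10, 1, 0), 10)` returns `True`, whereas the more lateral target `(10, 5, 0)` falls outside the cone. A wider cone singles out a larger domain and therefore leaves more candidate referents for speech to disambiguate.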

B1 investigates these topics by empirical study of human multimodal behavior and the conception and simulation of computational models in virtual humans. Empirical studies elicit sets of dialogue games. Video and VR tracking data are annotated in order to extract statistically significant patterns. Based on the behavioral units found in the data, the generation processes that turn content representations and communicative intentions into verbal and gestural behavior are modeled both theoretically and computationally, informing the implementation of a prototype simulation system with our virtual human Max.

Computational Model

Based on an empirical study on spatial descriptions of landmarks in direction-giving, our model allows virtual humans to automatically generate coordinated language and iconic gestures. The model is characterized by a close interplay between these two modes of expressiveness: we utilize two different kinds of content representation, visuo-spatial imagery and propositional-linguistic knowledge. Further, specific planners carry out the formulation of concrete verbal and gestural behavior. Both content planning and formulation processes run in parallel and interact on a multimodal working memory. In gesture formulation we apply a novel probabilistic approach which not only incorporates systematic factors constraining the mapping of visuo-spatial referent properties onto gesture morphology, but also accounts for the role of idiosyncratic patterns in multimodal behavior.
Click here for a video demonstrating the simulated gesturing behavior of a particular speaker from our empirical data.
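
The idea of combining systematic and idiosyncratic factors in gesture formulation can be illustrated with a deliberately simple sketch. It is not the project's model: it assumes a plain count-based estimate of how often a gesture technique accompanies a referent shape, computed once over all speakers (systematic) and once for a single speaker (idiosyncratic), and blends the two with a fixed weight. All function names, the shape and technique labels, and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict


def estimate(observations):
    """Estimate P(technique | shape) from (shape, technique) annotations."""
    counts = defaultdict(Counter)
    for shape, technique in observations:
        counts[shape][technique] += 1
    return {
        shape: {t: n / sum(c.values()) for t, n in c.items()}
        for shape, c in counts.items()
    }


def blended_distribution(pooled, speaker, shape, speaker_weight=0.5):
    """Mix the pooled and the speaker-specific distribution for one shape.

    speaker_weight = 0 uses only the pooled (systematic) estimate,
    speaker_weight = 1 only the speaker-specific (idiosyncratic) one.
    If one of the two has no data for the shape, the other is used alone.
    """
    p = pooled.get(shape, {})
    s = speaker.get(shape, {})
    if not s:
        return dict(p)
    if not p:
        return dict(s)
    return {
        t: (1 - speaker_weight) * p.get(t, 0.0) + speaker_weight * s.get(t, 0.0)
        for t in set(p) | set(s)
    }


def most_probable(distribution):
    """Return the technique with the highest probability."""
    return max(distribution, key=distribution.get)


# Invented toy annotations, for illustration only.
pooled = estimate([
    ("round", "shaping"), ("round", "shaping"),
    ("round", "shaping"), ("round", "drawing"),
])
speaker = estimate([("round", "drawing"), ("round", "drawing")])

dist = blended_distribution(pooled, speaker, "round", speaker_weight=0.5)
print(most_probable(dist))
```

In this toy data the pooled estimate favors "shaping" for round referents, while the individual speaker always uses "drawing"; with equal weighting the blend follows the speaker's habit, and with `speaker_weight=0` it falls back to the pooled preference.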

B1 production architecture.