Kirsten Bergmann & Stefan Kopp
On the production and perception of gestures: Insights from computational simulation and empirical evaluation
We aim to understand the processes that underlie verbal and gestural behavior by building computational simulation models that embody empirically derived or otherwise conceived hypotheses. Such models help us to test these hypotheses by, first, comparing the simulated behavior against real human behavior and, second, having naive human observers evaluate it. In our project we have applied this approach to co-verbal iconic gestures.
Based on psycholinguistic findings from the literature, we have devised an integrated speech-gesture production model [3] whose architecture allows the two modalities to interface via levels of modality-specific content representations (propositional and imagistic, respectively) and behavioral forms. In addition to this "horizontal" interaction, our model enables bi-directional "vertical" interactions between adjacent stages, e.g., of planning content and form. This model has been implemented and enables our virtual human "Max" to generate speech and gesture in real time from high-level communicative goals such as "introduce church-3".
One major part of this model is the stage of "Gesture Formulation". It is based on the empirical observation that the production of speech-accompanying iconic gestures in humans is, on the one hand, characterized by commonalities that account for an agreed sign system, and on the other hand, by idiosyncrasies that make for a coherent individual style. In our computational model, we simulate the production of iconic gestures accounting for both idiosyncrasies and commonalities using a Bayesian decision network, GNetIc, to automatically derive novel gestures from contextual demands [1].
These networks are obtained automatically from the empirical data by means of machine learning algorithms. Analyzing the models enables us to gain novel insights into the production process of iconic gestures: differences in the resulting network structures reveal that individual differences are present not only in the overt gestures, but also in the production process they originate from. Whereas gesture production in some individuals is, e.g., predominantly influenced by visuo-spatial referent features, other speakers mostly comply with the discourse context. A cross-validation of the generated gestures against the empirically observed gestures shows that the system can, to some extent, successfully predict the iconic gesture a speaker is going to produce. In another evaluation study, we have analyzed how human users perceive a virtual agent endowed with such gestural expressiveness. Results show that gestures generated automatically with GNetIc help to increase the perceived quality of object descriptions given by a virtual human. Moreover, gesturing behavior generated with individual speaker networks is rated more positively in terms of likeability, competence and human-likeness [2].
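To make the idea of learning gesture choices from data and checking them by cross-validation more concrete, the following is a minimal sketch under strong simplifying assumptions; it is not the GNetIc implementation. A single conditional probability table, estimated by counting, stands in for the full decision network, and the toy corpus, the variable names (referent shape, information state, gesture technique) and their values are hypothetical illustrations rather than the annotated data.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (referent_shape, information_state, gesture_technique).
# All names and values are illustrative assumptions, not the empirical data.
CORPUS = [
    ("round", "new", "shaping"),
    ("round", "new", "shaping"),
    ("round", "given", "indexing"),
    ("longish", "new", "drawing"),
    ("longish", "new", "drawing"),
    ("longish", "given", "indexing"),
    ("round", "given", "indexing"),
    ("longish", "new", "drawing"),
]

# The set of possible outcomes, in a fixed order so ties resolve deterministically.
TECHNIQUES = sorted({row[2] for row in CORPUS})


def learn_cpt(rows):
    """Count how often each technique occurs in each (shape, state) context."""
    counts = defaultdict(Counter)
    for shape, state, technique in rows:
        counts[(shape, state)][technique] += 1
    return counts


def predict(cpt, shape, state, alpha=1.0):
    """Return the most probable technique and the smoothed distribution.

    Laplace smoothing (alpha) keeps unseen contexts from yielding zero
    probabilities; an unseen context falls back to a uniform distribution.
    """
    seen = cpt.get((shape, state), Counter())
    total = sum(seen.values()) + alpha * len(TECHNIQUES)
    probs = {t: (seen[t] + alpha) / total for t in TECHNIQUES}
    best = max(TECHNIQUES, key=lambda t: probs[t])
    return best, probs


def leave_one_out_accuracy(rows):
    """Predict each observation from a table learned on all the others."""
    hits = 0
    for i, (shape, state, technique) in enumerate(rows):
        training = rows[:i] + rows[i + 1:]
        guess, _ = predict(learn_cpt(training), shape, state)
        hits += guess == technique
    return hits / len(rows)


if __name__ == "__main__":
    print(predict(learn_cpt(CORPUS), "round", "new")[0])
    print(leave_one_out_accuracy(CORPUS))
```

Learning one such table per speaker, rather than one from the pooled data, would be the analogue of the individual speaker networks in this simplified setting.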
Finally, we will point out ways to exploit and extend this approach to model inter-personal coordination and alignment between two agents (or a human user and a virtual agent), and we will briefly discuss possible future extensions of the model, including the incorporation and combined production of representational (iconic) and interactive gestures.
References:
[1] K. Bergmann and S. Kopp (2009). GNetIc – Using Bayesian decision networks for iconic gesture generation. In Proceedings of IVA, pages 76–89. Berlin/Heidelberg: Springer.
[2] K. Bergmann and S. Kopp (submitted). Individualized Gesturing Outperforms Average Gesturing – Evaluating Gesture Production in Virtual Humans.
[3] S. Kopp, K. Bergmann, and I. Wachsmuth (2008). Multimodal communication from multimodal thinking – Towards an integrated model of speech and gesture production. Semantic Computing 2(1):115–136.