CRC Colloquium
A Random Text Model for the Generation of Statistical Language Invariants
Monday, 11 June 2007, 16:00 - 18:00
Christian Biemann
Universität Leipzig


http://www.informatik.uni-leipzig.de/personal/CBiemann.html


It has been known since the first half of the 20th century that many quantitative aspects of natural language can be approximated by power laws. In the 1950s, B. Mandelbrot and H. Simon proposed two random text models that reproduce Zipf's law with very simple mechanisms. However, these models fail to explain other quantitative characteristics of language.
First, I will revisit power laws in natural language by presenting data on rank-frequency distributions and on the degree distributions of word co-occurrence graphs.
Then, I propose a new random text model that approximates not only the Zipf-Mandelbrot law for rank-frequency distributions, but also the word length distribution, the sentence length distribution and the word order restrictions of natural language. This is achieved by a two-level process that generates words from letters and sentences from words; the model's behaviour is emergent rather than initialised with a priori distributions.
An outlook on possible enhancements and applications in communication simulation concludes the talk.
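As a rough illustration of how a very simple random mechanism can yield a steeply decaying rank-frequency curve, the following Python sketch generates text by drawing characters independently at random (the classic "random typing" idea) and lists word frequencies by rank. It is not the model proposed in the talk, nor is it claimed to match Mandelbrot's or Simon's exact formulations; the function names, the five-letter alphabet, the space probability and the text length are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def random_typing_text(n_chars, alphabet="abcde", p_space=0.2, seed=0):
    """Draw n_chars characters independently: a space with probability
    p_space, otherwise a uniformly chosen letter (illustrative parameters)."""
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    return "".join(chars)


def rank_frequency(text):
    """Return word frequencies sorted from most to least frequent."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)


text = random_typing_text(200_000)
freqs = rank_frequency(text)

# Print the frequency at a few ranks; on a log-log plot such values
# fall roughly along a (stepped) straight line.
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

Because every word of a given length is equally likely here, the curve is stepped rather than smooth, and word lengths follow a purely geometric distribution. This is the kind of limitation the abstract refers to when it notes that such models do not capture other characteristics of language.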
Contact
Contact person: Olga Pustylnikov
homepage: ariadne.coli.uni-bielefeld.de