Modeling Statistical Properties of Written Text

Open Access

29 April 2009

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 4 (4), e5372
https://doi.org/10.1371/journal.pone.0005372

Abstract

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts, and we identify dynamic word ranking and memory across documents as key mechanisms explaining the non-trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
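For readers who want to see two of the regularities named in the abstract, the following is a minimal sketch of how they can be measured on any tokenized text. It is not the generative model introduced in the paper. The function names `rank_frequency` and `vocabulary_growth` and the toy sentence are hypothetical, chosen only for illustration. Zipf's law concerns the first quantity (word frequency as a function of rank), and Heaps' law concerns the second (number of distinct words as a function of text length).

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted in decreasing order (rank 1 first).

    Zipf's law is a statement about how these counts decay with rank.
    """
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after each token.

    Heaps' law is a statement about how slowly this curve grows
    (sublinearly) as the text gets longer.
    """
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


# Toy input, assumed for illustration; far too short to show a power law.
tokens = "the cat sat on the mat and the dog sat on the log".split()
freqs = rank_frequency(tokens)
growth = vocabulary_growth(tokens)
print(freqs)
print(growth)
```

On a real corpus, one would typically plot both lists on log-log axes. An approximately straight line in the rank-frequency plot is the usual signature of Zipf's law, and a vocabulary-growth curve with slope below one is the signature of Heaps' law.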
