Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution

Open Access

9 March 2010

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 5 (3), e9411
https://doi.org/10.1371/journal.pone.0009411

Abstract

Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2, …) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts - constructed by concatenating random characters including blanks behaving as word delimiters - exhibit a Zipf's law-like word rank distribution. In this article, we examine the flaws of such putative good fits of random texts. We demonstrate - by means of three different statistical tests - that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid for both the simplest random texts composed of equally likely characters as well as more elaborate and realistic versions where character probabilities are borrowed from a real text. The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages.
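
The simplest random-text construction mentioned in the abstract (equally likely characters, with the blank acting as a word delimiter) can be illustrated with a minimal sketch. This is not the authors' code and it does not reproduce their three statistical tests; the five-letter alphabet, the text length, the seed, and the function names `random_text` and `rank_frequency` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def random_text(n_chars, alphabet="abcde", seed=0):
    """Concatenate n_chars symbols drawn uniformly from alphabet plus a blank.

    The alphabet and seed are illustrative assumptions, not values from the article.
    """
    rng = random.Random(seed)
    symbols = alphabet + " "  # the blank behaves as a word delimiter
    return "".join(rng.choice(symbols) for _ in range(n_chars))


def rank_frequency(text):
    """Return word frequencies in decreasing order; index 0 holds rank 1."""
    counts = Counter(text.split())  # split() drops empty words between repeated blanks
    return sorted(counts.values(), reverse=True)


text = random_text(200_000)
freqs = rank_frequency(text)

# Print a few (rank, frequency) pairs; on a double logarithmic plot these
# are the points whose approximate linearity Zipf's law describes.
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

Plotting `freqs` against rank on log-log axes gives the rank distribution of the random text. Whether that curve is statistically consistent with the rank distribution of a real text is the question the article addresses, and it requires formal tests rather than visual inspection.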
