Poisson mixtures
- 1 June 1995
- research article
- Published by Cambridge University Press (CUP) in Natural Language Engineering
- Vol. 1 (2), 163-190
- https://doi.org/10.1017/s1351324900000139
Abstract
Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a "bag-of-words" assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
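To make the mixture concrete, the sketch below simulates it: each document draws its own Poisson rate θ from a Gamma density φ, so the resulting counts follow a negative binomial, and the variance, IDF, and adaptation are compared against a plain Poisson fitted to the same mean. This is a minimal illustration assuming Python with NumPy and SciPy; the Gamma parameters are illustrative and are not taken from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Each document d draws its own rate theta_d from a Gamma density phi
# (the mixing distribution), then a count k_d ~ Poisson(theta_d).
shape, scale = 0.3, 2.0            # Gamma parameters for phi (illustrative)
n_docs = 100_000
theta = rng.gamma(shape, scale, size=n_docs)
counts = rng.poisson(theta)

mu = counts.mean()

def report(label, var, p0, p1):
    """Derive IDF and adaptation from a model's P(k=0) and P(k=1)."""
    p_ge1 = 1.0 - p0                     # Pr(k >= 1): document frequency
    adapt = (p_ge1 - p1) / p_ge1         # Pr(k >= 2 | k >= 1)
    idf = -np.log2(p_ge1)                # inverse document frequency
    print(f"{label:9s} var={var:7.3f}  IDF={idf:5.2f}  Pr(>=2|>=1)={adapt:.3f}")

# Empirical statistics over the simulated corpus.
report("data", counts.var(),
       np.mean(counts == 0), np.mean(counts == 1))

# Plain Poisson fitted by the mean: its variance is forced to equal the mean.
report("Poisson", mu,
       stats.poisson.pmf(0, mu), stats.poisson.pmf(1, mu))

# Gamma-Poisson mixture = negative binomial with n = shape, p = 1/(1+scale);
# its variance mu + mu**2/shape can exceed the mean, matching the data.
p = 1.0 / (1.0 + scale)
report("NegBin", mu + mu**2 / shape,
       stats.nbinom.pmf(0, shape, p), stats.nbinom.pmf(1, shape, p))
```

On this simulated corpus the plain Poisson understates both the variance over documents and the adaptation probability, while the negative binomial recovers them, mirroring the mismatch the abstract reports for real documents.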
References
- Elements of Information Theory. Wiley, 2001.
- Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. Springer Nature, 1994.
- A probabilistic approach to automatic keyword indexing. Part I: On the distribution of specialty words in a technical literature. Journal of the American Society for Information Science, 1975.
- Probabilistic models for automatic indexing. Journal of the American Society for Information Science, 1974.
- A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 1972.
- A Mathematical Theory of Communication. Bell System Technical Journal, 1948.