Storing text retrieval systems on CD-ROM: compression and encryption considerations

Abstract
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of how large a database this medium can hold. As an example, this paper examines the problem of storing the Trésor de la Langue Française on a CD-ROM. The text of this database alone is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compressing the various files are reviewed, and the compression of the text is related to the problem of data encryption. Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit string indistinguishable from a representation of coin flips.
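The coin-flip claim can be illustrated with a small experiment. The sketch below is not the paper's method or model, which the abstract does not spell out. It assumes one particular simple model: symbols drawn independently from a fixed distribution whose probabilities are all powers of 1/2. The four-letter alphabet, its probabilities, and the function names are hypothetical. Under that assumption each Huffman code length equals -log2 of the symbol's probability, so every output bit behaves like a fair coin flip, and the fraction of 1 bits should come out close to one half.

```python
import heapq
import random
from itertools import count


def huffman_code(probs):
    """Build a Huffman code; returns {symbol: bit string}."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing 0 and 1.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in left.items()}
        merged.update({s: "1" + bits for s, bits in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical source: probabilities are powers of 1/2 (an assumption
# made for this illustration, not data from the paper).
PROBS = {"e": 0.5, "t": 0.25, "a": 0.125, "o": 0.125}


def encode_sample(n_symbols, seed=0):
    """Draw independent symbols from PROBS and Huffman-encode them."""
    rng = random.Random(seed)
    code = huffman_code(PROBS)
    symbols = rng.choices(list(PROBS), weights=list(PROBS.values()), k=n_symbols)
    return "".join(code[s] for s in symbols)


def ones_fraction(bits):
    """Fraction of 1 bits in a bit string."""
    return bits.count("1") / len(bits)


if __name__ == "__main__":
    bits = encode_sample(20000)
    print(len(bits), round(ones_fraction(bits), 3))
```

A balanced count of 0s and 1s is only a necessary condition for looking like coin flips, not a proof of it. For real text, where probabilities are not exact powers of 1/2 and successive characters are dependent, the output can be expected to deviate from this ideal.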
