Storing text retrieval systems on CD-ROM: compression and encryption considerations
- 1 July 1989
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Information Systems
- Vol. 7 (3), 230-245
- https://doi.org/10.1145/65943.65946
Abstract
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption: Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.
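The coin-flip claim can be illustrated with a small experiment. The following is a minimal sketch, not the paper's own method or model: it assumes a memoryless source whose symbol probabilities are all powers of 1/2, and the four-symbol alphabet `PROBS` and the helper names `huffman_code` and `encode_random_text` are hypothetical, chosen only for illustration. Under that assumption every bit of the Huffman output is an independent fair coin flip, so the fraction of 1 bits should be close to one half.

```python
import heapq
import random
from itertools import count


def huffman_code(probs):
    """Return {symbol: bitstring} for a dict of symbol probabilities."""
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing 0 and 1.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical source (an assumption, not data from the paper):
# every probability is a power of 1/2.
PROBS = {"e": 0.5, "t": 0.25, "a": 0.125, "s": 0.125}


def encode_random_text(n_symbols, seed=0):
    """Draw independent symbols from PROBS and Huffman-encode them."""
    rng = random.Random(seed)
    code = huffman_code(PROBS)
    symbols = rng.choices(list(PROBS), weights=list(PROBS.values()), k=n_symbols)
    return "".join(code[s] for s in symbols)


bits = encode_random_text(100_000)
ones_fraction = bits.count("1") / len(bits)
print(f"{len(bits)} bits, fraction of ones = {ones_fraction:.4f}")
```

Real text is neither memoryless nor dyadic, so this sketch only shows the idealized case; how closely actual compressed text approaches it depends on the model of text generation assumed.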