PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary

Abstract
Motivation: The construction and integration of databases for molecular-level data have progressed steadily since such databases were first developed. Although biological molecules are related to each other and form a complex system, the information about them is scattered across the vast archives of the literature and diverse databases. There is no unified naming convention for biological objects, and biological terms may be ambiguous or polysemic. This makes the integration and interoperation of databases difficult. Machine-readable natural language resources appear promising for addressing these problems. We have therefore developed a workbench for building a protein name abbreviation dictionary (PNAD).

Results: We have developed the PNAD Construction Support System (PNAD-CSS), which offers facilities that reduce the cost of constructing a protein name abbreviation dictionary whose entries are collected from the abstracts of biomedical papers. The system lets users concentrate on higher-level interpretation by taking over tedious tasks such as managing abstracts and extracting protein names and their abbreviations. To extract pairs of protein names and abbreviations, we have developed a hybrid system composed of the PROPER System and the PNAD System. The PNAD System extracts the pairs from parenthetical paraphrases involving protein names; the PROPER System identified these pairs with 98.95% precision, 95.56% recall and 97.58% complete precision.

Availability: The PROPER System is freely available from http://www.hgc.ims.u-tokyo.ac.jp/service/tooldoc/KeX/intro.html. The other software is available from the authors on request.

Contact: mikio@ims.u-tokyo.ac.jp
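
The idea of extracting name and abbreviation pairs from parenthetical paraphrases can be illustrated with a minimal sketch. This is not the algorithm of the PNAD System or the PROPER System, whose details are not given here. It is a simple heuristic that assumes an abbreviation appears in parentheses directly after its long form, and that its characters occur in order within the preceding words. The names `PAREN`, `is_subsequence` and `extract_pairs` are hypothetical.

```python
import re

# Assumption: an abbreviation is a short alphanumeric token in parentheses,
# e.g. "(EGFR)" or "(IL-2)". Real abstracts contain many other parentheticals.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def is_subsequence(short, text):
    """Return True if the characters of `short` occur in order in `text`."""
    remaining = iter(text.lower())
    return all(ch in remaining for ch in short.lower())


def extract_pairs(sentence):
    """Return (long form, abbreviation) pairs found in one sentence.

    Heuristic sketch only: for each parenthesised token, grow a window of
    preceding words until its first word starts with the abbreviation's
    first character and the abbreviation is a subsequence of the window.
    """
    pairs = []
    for match in PAREN.finditer(sentence):
        abbr = match.group(1)
        # Ignore hyphens when matching characters (e.g. "IL-2" -> "IL2").
        key = re.sub(r"[^A-Za-z0-9]", "", abbr)
        words = sentence[:match.start()].split()
        # Assumption: the long form is at most a few words longer than the key.
        limit = min(len(words), len(key) + 3)
        for n in range(1, limit + 1):
            candidate = words[-n:]
            if candidate[0][0].lower() != key[0].lower():
                continue
            if is_subsequence(key, " ".join(candidate)):
                pairs.append((" ".join(candidate), abbr))
                break
    return pairs


if __name__ == "__main__":
    text = "We studied epidermal growth factor receptor (EGFR) signalling."
    print(extract_pairs(text))
```

A heuristic of this kind produces false positives and misses irregular abbreviations, so its output would still need the manual review that a workbench such as PNAD-CSS is meant to support.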