ParPEST: a pipeline for EST data analysis based on parallel computing

Open Access

1 December 2005

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 6 (S4), S9
https://doi.org/10.1186/1471-2105-6-s4-s9

Abstract

Expressed Sequence Tags (ESTs) are short and error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They provide an important resource for comparative and functional genomic studies and are also a reliable source of information for the annotation of genomic sequences. Because of advances in biotechnology, ESTs are produced daily in the form of large datasets. Suitable and efficient bioinformatic approaches are therefore needed to organize the data and their information content for further investigation. We implemented ParPEST (Parallel Processing of ESTs), a pipeline based on parallel computing for EST analysis. The results are organized in a data warehouse that provides a starting point for mining expressed sequence datasets. The collected information supports investigations of data quality and information content, and is enriched by a preliminary functional annotation. The pipeline was developed to perform an exhaustive and reliable analysis of EST data and to provide a curated set of information based on a relational database. It is also designed to reduce the execution time of the specific steps required for a complete analysis by using distributed processes and parallelized software. It is conceived to run on hardware with modest requirements, so that it can meet the growing demand typical of this kind of data and scale at affordable cost.
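The idea of cutting execution time by distributing an analysis step over workers can be illustrated with a minimal sketch. It is not ParPEST's implementation: the function names (`parse_fasta`, `split_chunks`, `summarize_chunk`, `run_pipeline_step`) and the toy per-sequence summary are hypothetical, and a local thread pool stands in for the distributed processes the abstract mentions.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA-formatted text."""
    records, name, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            header = line[1:].split()
            name, parts = (header[0] if header else ""), []
        else:
            parts.append(line.upper())
    if name is not None:
        records.append((name, "".join(parts)))
    return records


def split_chunks(records, n_chunks):
    """Split records into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(records)))
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def summarize_chunk(chunk):
    """Hypothetical per-chunk step: length and fraction of ambiguous bases (N)."""
    rows = []
    for name, seq in chunk:
        n_fraction = seq.count("N") / len(seq) if seq else 0.0
        rows.append({"id": name, "length": len(seq), "n_fraction": n_fraction})
    return rows


def run_pipeline_step(fasta_text, n_workers=4):
    """Run the toy step on all records, one chunk per worker.

    Assumption: a thread pool is used only for illustration; a real
    deployment would dispatch chunks to separate processes or nodes.
    """
    chunks = split_chunks(parse_fasta(fasta_text), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in the order the chunks were submitted.
        results = list(pool.map(summarize_chunk, chunks))
    return [row for chunk_rows in results for row in chunk_rows]


if __name__ == "__main__":
    example = ">est1 clone A\nACGTNN\nAC\n>est2\nNNNN\n>est3\nACGT\n"
    for row in run_pipeline_step(example, n_workers=2):
        print(row)
```

Because each chunk is processed independently and results are gathered in submission order, the output could be loaded into a relational table without any further reordering.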

