ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS)

Otto, T. D., Gomes, L. H.F., Alves-Ferreira, M., de Miranda, A. B. and Degrave, W. M. (2008) ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS). BMC Bioinformatics, 9, 366. (doi: 10.1186/1471-2105-9-366) (PMID:18782453) (PMCID:PMC2559850)

148096.pdf - Published Version
Available under License Creative Commons Attribution.

774kB

Abstract

Background: Genome survey sequences (GSS) offer a preliminary global view of a genome since, unlike ESTs, they cover coding as well as non-coding DNA and include repetitive regions of the genome. A more precise estimation of the nature, quantity and variability of repetitive sequences very early in a genome sequencing project is of considerable importance, as such data strongly influence the estimation of genome coverage, library quality and progress in scaffold construction. Also, the elimination of repetitive sequences from the initial assembly process is important to avoid errors and unnecessary complexity. Repetitive sequences are also of interest in a variety of other studies, for instance as molecular markers.

Results: We designed and implemented a straightforward pipeline called ReRep, which combines bioinformatics tools for identifying repetitive structures in a GSS dataset. In a case study, we first applied the pipeline to a set of 970 GSSs, sequenced in our laboratory from the human pathogen Leishmania braziliensis, the causative agent of leishmaniosis, an important public health problem in Brazil. We also verified the applicability of ReRep to new sequencing technologies using a set of 454 reads from Escherichia coli. The behaviour of several parameters in the algorithm is evaluated and suggestions are made for tuning the analysis.

Conclusion: The ReRep approach for identification of repetitive elements in GSS datasets proved to be straightforward and efficient. Several potential repetitive sequences were found in a L. braziliensis GSS dataset generated in our laboratory, and further validated by the analysis of a more complete genomic dataset from the EMBL and Sanger Centre databases. ReRep also identified most of the E. coli K12 repeats prior to assembly in an example dataset obtained by automated sequencing using 454 technology. The parameters controlling the algorithm behaved consistently and may be tuned to the properties of the dataset, in particular to the length of sequencing reads and the genome coverage.

Item Type:Articles
Status:Published
Refereed:Yes
Glasgow Author(s) Enlighten ID:Otto, Professor Thomas
Authors: Otto, T. D., Gomes, L. H.F., Alves-Ferreira, M., de Miranda, A. B., and Degrave, W. M.
College/School:College of Medical Veterinary and Life Sciences > School of Infection & Immunity
Journal Name:BMC Bioinformatics
Publisher:BioMed Central
ISSN:1471-2105
ISSN (Online):1471-2105
Copyright Holders:Copyright ©2008 Otto et al.
First Published:First published in BMC Bioinformatics 9: 366
Publisher Policy:Reproduced under a Creative Commons License
