From contigs towards chromosomes: automatic improvement of long read assemblies (ILRA)

Ruiz, J. L., Reimering, S., Escobar-Prieto, J. D., Brancucci, N. N.M., Echeverry, D. F., Abdi, A. I., Marti, M., Gómez-Díaz, E. and Otto, T. D. (2023) From contigs towards chromosomes: automatic improvement of long read assemblies (ILRA). Briefings in Bioinformatics, 24(4), bbad248. (doi: 10.1093/bib/bbad248) (PMID:37406192) (PMCID:PMC10359078)

303013.pdf - Published Version (1MB)
Available under License Creative Commons Attribution Non-commercial.

Abstract

Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow individual laboratories to sequence their species of interest with relatively low investment. Long read technologies embody the promise of overcoming scaffolding problems associated with repeats and low complexity sequences, but the number of contigs often far exceeds the number of chromosomes and they may contain many insertion and deletion errors around homopolymer tracts. To overcome these issues, we have implemented the ILRA pipeline to correct long read-based assemblies. Contigs are first reordered, renamed, merged, circularized, or filtered if erroneous or contaminated. Illumina short reads are used subsequently to correct homopolymer errors. We successfully tested our approach by improving the genome sequences of Homo sapiens, Trypanosoma brucei, and Leptosphaeria spp., and by generating four novel Plasmodium falciparum assemblies from field samples. We found that correcting homopolymer tracts reduced the number of genes incorrectly annotated as pseudogenes, but an iterative approach seems to be required to correct more sequencing errors. In summary, we describe and benchmark the performance of our new tool, which improved the quality of novel long read assemblies up to 1 Gbp. The pipeline is available at GitHub: https://github.com/ThomasDOtto/ILRA.

Item Type: Articles
Additional Information: This work was supported by the Wellcome Trust [098051, 104111/Z/14/ZR]. E.G.-D. is funded by the Spanish Ministry of Science and Innovation (grant no. PID2019-111109RB-I00) and by the La Caixa Foundation Health Research Program (grant no. HR20-00635). J.L.R. is funded by a Severo Ochoa Fellowship (BES-2016-076276). D.F.E. and J.D.E.-P. are funded by Colciencias, call 656–2014 ‘Es Tiempo de Volver’ award FP44842-503-2014 and ‘Programa Jovenes Investigadores’ special cooperation 552–2015, respectively. M. Marti and N. M. B. Brancucci are funded by WT Investigator Award 110166.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Otto, Professor Thomas and Marti, Professor Matthias and Brancucci, Dr Nicolas
Authors: Ruiz, J. L., Reimering, S., Escobar-Prieto, J. D., Brancucci, N. N.M., Echeverry, D. F., Abdi, A. I., Marti, M., Gómez-Díaz, E., and Otto, T. D.
College/School: College of Medical Veterinary and Life Sciences > School of Infection & Immunity
Journal Name: Briefings in Bioinformatics
Publisher: Oxford University Press
ISSN: 1467-5463
ISSN (Online): 1477-4054
Published Online: 05 July 2023
First Published: First published in Briefings in Bioinformatics 24(4):bbad248
Publisher Policy: Reproduced under a Creative Commons license
