Quantifying and cataloguing unknown sequences within human microbiomes

Modha, S. , Robertson, D. L. , Hughes, J. and Orton, R. J. (2022) Quantifying and cataloguing unknown sequences within human microbiomes. mSystems, 7(2), e01468-21. (doi: 10.1128/msystems.01468-21) (PMID:35258340) (PMCID:PMC9052204)

265129.pdf - Published Version
Available under License Creative Commons Attribution.

265129Suppl1.pdf - Supplemental Material



Advances in genome sequencing technologies and lower costs have enabled the exploration of a multitude of known and novel environments and microbiomes. This has led to an exponential growth in the raw sequence data that are deposited in online repositories. Metagenomic and metatranscriptomic data sets are typically analysed with regard to a specific biological question. However, it is widely acknowledged that these data sets contain a proportion of sequences that bear no similarity to any currently known biological sequence, and this so-called “dark matter” is often excluded from downstream analyses. In this study, a systematic framework was developed to assemble, identify, and measure the proportion of unknown sequences present in distinct human microbiomes. This framework was applied to 40 distinct studies, comprising 963 samples, and covering 10 different human microbiomes including fecal, oral, lung, skin, and circulatory system microbiomes. We found that although the human microbiome is one of the most extensively studied, on average 2% of assembled sequences have not yet been taxonomically defined. However, this proportion varied extensively among different microbiomes and was as high as 25% for skin and oral microbiomes that have more interactions with the environment. From the taxonomically unknown sequences discovered in this study, we calculated a rate of taxonomic characterization of 1.64% of unknown sequences per month. A cross-study comparison led to the identification of similar unknown sequences in different samples and/or microbiomes. Both our computational framework and the novel unknown sequences produced are publicly available for future cross-referencing. Our approach led to the discovery of several novel viral genomes that bear no similarity to sequences in the public databases. Some of these are widespread, as they have been found in different microbiomes and studies.
Hence, our study illustrates how the systematic characterization of unknown sequences can help the discovery of novel microbes, and we call on the research community to systematically collate and share the unknown sequences from metagenomic studies to increase the rate at which the unknown sequence space can be classified.

Item Type: Articles
Glasgow Author(s) Enlighten ID: Modha, Ms Sejal and Hughes, Dr Joseph and Robertson, Professor David and Orton, Dr Richard
Creator Roles:
Modha, S.: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review and editing
Robertson, D. L.: Funding acquisition, Supervision
Hughes, J.: Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review and editing
Orton, R. J.: Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review and editing
Authors: Modha, S., Robertson, D. L., Hughes, J., and Orton, R. J.
College/School: College of Medical Veterinary and Life Sciences > School of Infection & Immunity
College of Medical Veterinary and Life Sciences > School of Infection & Immunity > Centre for Virus Research
Journal Name: mSystems
Publisher: American Society for Microbiology
ISSN (Online): 2379-5077
Published Online: 08 March 2022
Copyright Holders: Copyright © 2022 Modha et al.
First Published: First published in mSystems 7(2): e01468-21
Publisher Policy: Reproduced under a Creative Commons License

Project Code | Award No | Project Name | Principal Investigator | Funder's Name | Funder Ref | Lead Dept
313948 | | MRC Industrial Strategy Studentships 2018: Precision Medicine | George Baillie | Medical Research Council (MRC) | MR/S502479/1 | MVLS - Graduate School
172630 | 014 | Cross-Cutting Programme – Viral Genomics and Bioinformatics (Programme 9) | David Robertson | Medical Research Council (MRC) | MC_UU_12014/12 | III - Centre for Virus Research