Young, F., Rogers, S. and Robertson, D. L. (2020) Predicting host taxonomic information from viral genomes: a comparison of feature representations. PLoS Computational Biology, 16(5), e1007894. (doi: 10.1371/journal.pcbi.1007894) (PMID:32453718) (PMCID:PMC7307784)
Text: 217218.pdf - Published Version. Available under License Creative Commons Attribution. 3MB
Abstract
The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction from k-mer composition improves with increasing k-mer length for all k-mer-based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between the viruses infecting related hosts and host mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information.
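As an illustration of the k-mer composition features mentioned in the abstract, the following is a minimal sketch and not the authors' pipeline. The function name `kmer_composition`, the example sequence, and the choice to skip windows containing non-alphabet symbols are all assumptions made for this example. It computes normalised k-mer frequencies for a nucleotide sequence; passing a 20-letter amino acid alphabet would give the analogous amino acid features.

```python
from collections import Counter
from itertools import product


def kmer_composition(sequence, k, alphabet="ACGT"):
    """Return normalised k-mer frequencies over a fixed alphabet.

    Hypothetical helper for illustration only; the published study may
    compute and normalise its features differently.
    """
    sequence = sequence.upper()
    # Enumerate every possible k-mer so the feature vector has a fixed length.
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows with symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


# Toy sequence, not real viral data.
features = kmer_composition("ATGCGATATG", k=2)
```

A vector of such frequencies, one per genome, could then be passed to a classifier such as an SVM. Longer k increases the number of features as the alphabet size raised to the power k.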
Item Type: | Articles |
---|---|
Additional Information: | Funding: FY is supported by a studentship from the Medical Research Council (MRC). DLR is funded by the MRC (MC_UU_1201412). |
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | Young, Francesca and Robertson, Professor David and Rogers, Dr Simon |
Creator Roles: | Young, F.: Formal analysis, Investigation, Software, Visualization, Writing – original draft, Writing – review and editing; Rogers, S.: Conceptualization, Supervision, Writing – review and editing; Robertson, D. L.: Conceptualization, Supervision, Writing – review and editing |
Authors: | Young, F., Rogers, S., and Robertson, D. L. |
College/School: | College of Medical Veterinary and Life Sciences > School of Infection & Immunity; College of Science and Engineering > School of Computing Science; College of Medical Veterinary and Life Sciences > School of Infection & Immunity > Centre for Virus Research |
Journal Name: | PLoS Computational Biology |
Publisher: | Public Library of Science |
ISSN: | 1553-734X |
ISSN (Online): | 1553-7358 |
Copyright Holders: | Copyright: © 2020 Young et al. |
First Published: | First published in PLoS Computational Biology 16(5):e1007894 |
Publisher Policy: | Reproduced under a Creative Commons license |