Predicting host taxonomic information from viral genomes: a comparison of feature representations

Young, F., Rogers, S. and Robertson, D. L. (2020) Predicting host taxonomic information from viral genomes: a comparison of feature representations. PLoS Computational Biology, 16(5), e1007894. (doi: 10.1371/journal.pcbi.1007894) (PMID:32453718) (PMCID:PMC7307784)

217218.pdf - Published Version
Available under License Creative Commons Attribution.



The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction improves with increasing k-mer length for all k-mer-based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between the viruses infecting related hosts and host mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information.
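
The k-mer composition features mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `kmer_composition`, the example sequence and the choice to normalise counts to frequencies are assumptions made only for illustration. It shows how a nucleotide sequence could be turned into a fixed-length feature vector of the kind a classifier such as an SVM takes as input.

```python
from collections import Counter
from itertools import product


def kmer_composition(sequence, k, alphabet="ACGT"):
    """Return normalised k-mer frequencies over a fixed alphabet.

    Hypothetical helper for illustration; not taken from the paper.
    """
    sequence = sequence.upper()
    # Fixed feature order: every possible k-mer over the alphabet.
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers built from alphabet symbols contribute (skips e.g. 'N').
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


# Made-up example sequence, dinucleotide (k=2) composition.
features = kmer_composition("ATGCGATACGCTTGA", k=2)
```

The same function could be applied to a translated protein sequence by passing the amino acid alphabet, which is one way to picture the other 'levels' of representation. The number of features grows as the alphabet size raised to the power k.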

Item Type:Articles
Additional Information:Funding: FY is supported by a studentship from the Medical Research Council (MRC). DLR is funded by the MRC (MC_UU_1201412).
Glasgow Author(s) Enlighten ID:Young, Francesca and Robertson, Professor David and Rogers, Dr Simon
Creator Roles:
Young, F.: Formal analysis, Investigation, Software, Visualization, Writing – original draft, Writing – review and editing
Rogers, S.: Conceptualization, Supervision, Writing – review and editing
Robertson, D. L.: Conceptualization, Supervision, Writing – review and editing
Authors: Young, F., Rogers, S., and Robertson, D. L.
College/School:College of Medical Veterinary and Life Sciences > Institute of Infection Immunity and Inflammation
College of Science and Engineering > School of Computing Science
Journal Name:PLoS Computational Biology
Publisher:Public Library of Science
ISSN (Online):1553-7358
Copyright Holders:Copyright: © 2020 Young et al.
First Published:First published in PLoS Computational Biology 16(5):e1007894
Publisher Policy:Reproduced under a Creative Commons license


Project Name:Viral Genomics and Bioinformatics
Principal Investigator:Andrew Davison
Funder's Name:Medical Research Council (MRC)
Funder Ref:MC_UU_12014/12
Lead Dept:III-MRC-GU Centre for Virus Research