Seqenv: linking sequences to environments through text mining

Sinclair, L. et al. (2016) Seqenv: linking sequences to environments through text mining. PeerJ, 4, e2690. (doi:10.7717/peerj.2690)

Text: 133062.pdf - Published Version (2MB)
Available under License Creative Commons Attribution.

Abstract

Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes, leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, out of every hit, extracts, if it is available, the textual “isolation source” metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environment Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia-oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS data and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.
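
The final steps described in the abstract (matching isolation-source text against a controlled vocabulary, then summarizing a sample by weighted summation) can be illustrated with a minimal Python sketch. This is not seqenv's implementation: the toy vocabulary, the naive substring matching, the function names, and the example data are all hypothetical, and abundance is assumed here as the weight for each sequence.

```python
from collections import Counter, defaultdict

# Hypothetical toy vocabulary standing in for EnvO terms (assumption,
# not the real ontology): lower-case phrase -> environment label.
TOY_VOCABULARY = {
    "soil": "soil",
    "sediment": "sediment",
    "sea water": "sea water",
    "lake": "lake",
}


def match_terms(isolation_source, vocabulary=TOY_VOCABULARY):
    """Return labels whose phrase occurs in the free text (naive substring match)."""
    text = isolation_source.lower()
    return [label for phrase, label in vocabulary.items() if phrase in text]


def sequence_profile(isolation_sources):
    """Normalized term frequencies for one query sequence across all its hits."""
    counts = Counter()
    for source in isolation_sources:
        counts.update(match_terms(source))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def sample_profile(abundances, profiles):
    """Weighted sum of per-sequence profiles; the weight is the sequence abundance."""
    summary = defaultdict(float)
    for seq_id, abundance in abundances.items():
        for term, freq in profiles.get(seq_id, {}).items():
            summary[term] += abundance * freq
    return dict(summary)


# Made-up isolation sources for the hits of two query sequences.
hits = {
    "otu1": ["forest soil", "agricultural soil", "lake sediment"],
    "otu2": ["Black Sea water column", "sea water"],
}
profiles = {seq_id: sequence_profile(sources) for seq_id, sources in hits.items()}

# Made-up read counts of each sequence in one sample.
summary = sample_profile({"otu1": 10, "otu2": 30}, profiles)
print(summary)
```

In this toy example the sample summary is dominated by "sea water" because the more abundant sequence matches only that term. A real text-mining step would need proper tokenization and ontology-aware matching rather than substring tests.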

Item Type: Articles
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Quince, Dr Christopher and Ijaz, Dr Umer Zeeshan
Authors: Sinclair, L., Ijaz, U. Z., Jensen, L. J., Coolen, M. J.L., Gubry-Rangin, C., Chroňáková, A., Oulas, A., Pavloudi, C., Schnetzer, J., Weimann, A., Ijaz, A., Eiler, A., Quince, C., and Pafilis, E.
College/School: College of Science and Engineering > School of Engineering
College of Science and Engineering > School of Engineering > Infrastructure and Environment
Journal Name: PeerJ
Publisher: PeerJ
ISSN: 2167-8359
ISSN (Online): 2167-8359
Copyright Holders: Copyright © 2016 Sinclair et al.
First Published: First published in PeerJ 4:e2690
Publisher Policy: Reproduced under a Creative Commons License


Project Code | Award No | Project Name | Principal Investigator | Funder's Name | Funder Ref | Lead Dept
652771 | | Understanding microbial community through in situ environmental 'omic data synthesis | Umer Ijaz | Natural Environment Research Council (NERC) | NE/L011956/1 | ENG - ENGINEERING INFRASTRUCTURE & ENVIR
652772 | | Understanding microbial community through in situ environmental 'omic data synthesis | Umer Ijaz | Natural Environment Research Council (NERC) | NE/L011956/1 | ENG - ENGINEERING INFRASTRUCTURE & ENVIR