Protein encoder: an autoencoder-based ensemble feature selection scheme to predict protein secondary structure

Uzma, Manzoor, U. and Halim, Z. (2023) Protein encoder: an autoencoder-based ensemble feature selection scheme to predict protein secondary structure. Expert Systems with Applications, 213(Part B), 119081. (doi: 10.1016/j.eswa.2022.119081)

Text: 306717.pdf - Accepted Version (5MB)
Available under License Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Proteins play a vital role in the human body because they perform important metabolic tasks. Experimental determination of protein structure is expensive and time-consuming. Predicting protein secondary structure is an important step toward identifying a protein's tertiary structure and its folds. Selecting a relevant feature subset from the high-dimensional protein primary sequence is therefore essential to improving the accuracy of Protein Secondary Structure Prediction (PSSP). This work presents a novel method for the PSSP problem based on a two-stage feature selection technique. The first stage uses an unsupervised autoencoder for feature extraction. The second stage is an ensemble of three feature selection methods, namely generic univariate select, recursive feature elimination, and Pearson's correlation; it combines the resulting feature subsets using mutual information to select the optimum feature subset. For classification, three classifiers are applied to the resulting feature subsets: random forest, decision tree, and multilayer perceptron. Two sets of experiments are performed on five datasets to assess the proposed work. The proposed solution is compared with three state-of-the-art methods on Q3 accuracy, Q8 accuracy, and segment overlap score. The results show that the proposed framework outperforms these earlier methods in the majority of cases. The proposed framework achieves Q8 accuracies of 82%, 80%, 79%, 73%, and 74% and Q3 accuracies of 90%, 90%, 92%, 79%, and 74% on the CB6133, CB6133-filtered, CB513, CASP10, and CASP11 datasets, respectively.
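
For readers unfamiliar with the Q3 and Q8 figures quoted in the abstract, the following is a minimal sketch of how per-residue accuracy is usually computed for eight-state and three-state secondary structure labels. It is illustrative only and is not taken from the paper: the function names are hypothetical, and the eight-to-three state reduction shown (H, G, I to helix; E, B to strand; everything else to coil) is one commonly used convention that the authors may or may not have followed.

```python
# Assumed 8-state -> 3-state reduction (a common convention, not confirmed
# by the abstract): helices -> H, strands/bridges -> E, the rest -> C.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "L": "C",
}


def per_residue_accuracy(true_labels: str, predicted_labels: str) -> float:
    """Return the fraction of residues whose predicted state equals the true state."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must be non-empty")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def reduce_to_q3(labels: str) -> str:
    """Map a string of 8-state labels to 3-state labels using Q8_TO_Q3."""
    return "".join(Q8_TO_Q3[state] for state in labels)


def q8_accuracy(true_labels: str, predicted_labels: str) -> float:
    """Per-residue accuracy over the eight-state labels."""
    return per_residue_accuracy(true_labels, predicted_labels)


def q3_accuracy(true_labels: str, predicted_labels: str) -> float:
    """Per-residue accuracy after reducing both sequences to three states."""
    return per_residue_accuracy(
        reduce_to_q3(true_labels), reduce_to_q3(predicted_labels)
    )


if __name__ == "__main__":
    true_q8 = "HHGEEBTL"  # made-up labels for illustration only
    pred_q8 = "HHHEEETS"
    print(q8_accuracy(true_q8, pred_q8))  # 5 of 8 residues match
    print(q3_accuracy(true_q8, pred_q8))  # all residues match after reduction
```

Because several eight-state classes collapse into one three-state class, a prediction that confuses, say, G with H counts as an error for Q8 but as correct for Q3, which is consistent with Q3 figures generally being at least as high as the corresponding Q8 figures.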

Item Type: Articles
Additional Information: This work was supported by the GIK Institute graduate program research fund under GA-1 scheme.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Uzma, Dr Uzma
Creator Roles:
Uzma: Conceptualization, Methodology, Software, Supervision
Authors: Uzma, Manzoor, U., and Halim, Z.
College/School: College of Science and Engineering > School of Engineering > Infrastructure and Environment
Journal Name: Expert Systems with Applications
Publisher: Elsevier
ISSN: 0957-4174
ISSN (Online): 0957-4174
Published Online: 20 October 2022
Copyright Holders: Copyright © 2022 Elsevier Ltd.
First Published: First published in Expert Systems with Applications 213(Part B): 119081
Publisher Policy: Reproduced in accordance with the publisher copyright policy
