Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction

Shi, P., Ray, S., Zhu, Q. and Kon, M.A. (2011) Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction. BMC Bioinformatics, 12(1), p. 375. (doi: 10.1186/1471-2105-12-375)

Full text not currently available from Enlighten.

Abstract

Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with two other feature selection methods, a univariate method similar to Fisher's discriminant criterion (Fisher) and recursive feature elimination embedded in SVM (RFE), TSP becomes increasingly more effective as the informative genes become progressively more correlated, both in terms of classification performance and in the ability to recover the true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves performance comparable or superior to that of SVM alone. Consistent with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.

Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, the k-TSP ranking algorithm is better tuned to certain data characteristics, i.e. correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.
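The pair-ranking idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the commonly used top scoring pair score, |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|, a greedy choice of k gene-disjoint pairs, binary labels 0 and 1, and no tie-breaking rule. The function names and the toy expression matrix are hypothetical.

```python
from itertools import combinations


def tsp_scores(X, y):
    """Score each gene pair (i, j) by how strongly the ordering
    x_i < x_j differs between class 0 and class 1 (assumed score)."""
    n_genes = len(X[0])
    class0 = [row for row, label in zip(X, y) if label == 0]
    class1 = [row for row, label in zip(X, y) if label == 1]
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        p0 = sum(row[i] < row[j] for row in class0) / len(class0)
        p1 = sum(row[i] < row[j] for row in class1) / len(class1)
        scores[(i, j)] = abs(p0 - p1)
    return scores


def top_k_disjoint_pairs(scores, k):
    """Greedily keep the k best-scoring pairs that share no gene."""
    chosen, used = [], set()
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


def selected_genes(pairs):
    """Flatten the chosen pairs into a sorted list of gene indices."""
    return sorted({g for pair in pairs for g in pair})


# Hypothetical toy data: rows are samples, columns are genes.
# Gene 0 is below gene 1 in class 0 and above it in class 1.
X = [
    [1.0, 2.0, 5.0, 5.5],
    [0.5, 1.5, 4.0, 3.0],
    [2.0, 3.0, 6.0, 7.0],
    [2.0, 1.0, 5.0, 4.0],
    [3.0, 2.5, 4.5, 5.0],
    [1.5, 0.5, 6.5, 6.0],
]
y = [0, 0, 0, 1, 1, 1]

scores = tsp_scores(X, y)
pairs = top_k_disjoint_pairs(scores, k=1)
genes = selected_genes(pairs)
```

In a hybrid scheme of the kind the abstract describes, the columns listed in `genes` would form the reduced feature subspace handed to a downstream classifier such as an SVM. How the paper constructs those features in detail is not stated in this record.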

Item Type: Articles
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Ray, Professor Surajit
Authors: Shi, P., Ray, S., Zhu, Q., and Kon, M.A.
College/School: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Journal Name: BMC Bioinformatics
ISSN: 1471-2105
ISSN (Online): 1471-2105
