Abstract
Researchers and practitioners in bioinformatics often encounter the problem of high dimensionality (a very large feature space). This large number of features severely hinders gene discovery and makes subsequent analysis computationally expensive. Thus, practitioners apply feature selection to reduce the number of features and simplify computation. One family of feature selection techniques, called filter-based feature selection (filters), employs statistical methods, without the use of a classifier, to find the most important features. Within the domain of filters, filter-based subset selection techniques evaluate a group of features simultaneously rather than each feature individually (as filter-based feature ranking does). Thus, filter-based subset selection techniques have the advantage of selecting unique features (not highly correlated with other features in the selected set), whereas ranker-based feature selection techniques can only target relevant features and cannot remove redundant ones. Unfortunately, noise (incorrect and missing values in the data) is a prevalent problem in bioinformatics. Noise can confuse data mining techniques and render datasets more difficult to learn from. Relatively little work has focused on filter-based subset selection techniques, let alone investigated their effectiveness in the context of learning difficulty due to noise, especially in bioinformatics. In the present study, we compare two filter-based subset selection techniques using real-world, high-dimensional bioinformatics datasets with varying levels of learning difficulty due to noise. We apply Correlation-Based Feature Selection (CFS) and Consistency, followed by classification with four commonly used classifiers. We found that CFS consistently outperforms Consistency across all difficulty levels and learners. Thus, we recommend CFS as the subset evaluation technique, as it not only identifies important genes but also removes redundant ones.
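To illustrate how subset evaluation rewards relevance while penalizing redundancy, the following is a minimal sketch of Hall's standard CFS merit score (merit = k·mean|r_cf| / sqrt(k + k(k−1)·mean|r_ff|)); the function and variable names are our own, and the correlation values below are hypothetical.

```python
import math

def cfs_merit(feat_class_corr, feat_feat_corr):
    """Hall's CFS merit for a candidate subset of k features.

    feat_class_corr: |correlations| between each feature and the class (length k).
    feat_feat_corr:  |correlations| between each pair of features (length k*(k-1)/2).
    High feature-class correlation raises the merit; high inter-feature
    correlation (redundancy) lowers it.
    """
    k = len(feat_class_corr)
    mean_r_cf = sum(abs(r) for r in feat_class_corr) / k
    mean_r_ff = (sum(abs(r) for r in feat_feat_corr) / len(feat_feat_corr)
                 if feat_feat_corr else 0.0)
    return k * mean_r_cf / math.sqrt(k + k * (k - 1) * mean_r_ff)

# Two relevant but highly redundant genes score lower than two equally
# relevant, nearly independent genes (hypothetical correlation values).
redundant = cfs_merit([0.8, 0.8], [0.9])
complementary = cfs_merit([0.8, 0.8], [0.1])
```

Here `complementary` exceeds `redundant`, showing why a subset evaluator can discard genes that carry the same signal, something a per-feature ranker cannot do.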