Abstract
Researchers and practitioners in bioinformatics often encounter the problem of high dimensionality (a very large feature space). This large number of features severely hinders gene discovery and makes subsequent analysis computationally expensive. Thus, practitioners apply feature selection to reduce the number of features and simplify computation. One family of feature selection techniques, called filter-based feature selection (filters), employs statistical methods, without the use of a classifier, to find the most important features. Within the domain of filters, filter-based subset selection techniques evaluate a group of features simultaneously rather than each feature individually (as filter-based feature ranking does). Thus, filter-based subset selection techniques have the advantage of selecting unique features (not highly correlated with other features in the selected set), whereas ranker-based feature selection techniques can only target relevant features and cannot remove redundant ones. Unfortunately, noise (incorrect and missing values in the data) is a prevalent problem in bioinformatics. Noise can confuse data mining techniques and render datasets more difficult to learn from. Relatively little work has focused on filter-based subset selection techniques, let alone investigated their effectiveness in the context of learning difficulty due to noise, especially in bioinformatics. In the present study, we compare two filter-based subset selection techniques using real-world, high-dimensional bioinformatics datasets with varying levels of learning difficulty due to noise. We apply Correlation-Based Feature Selection (CFS) and Consistency, followed by classification with four commonly used classifiers. We found that CFS consistently outperforms Consistency across all difficulty levels and learners. Thus, we recommend CFS as the subset evaluation technique, as it not only identifies important genes but also removes redundant ones.
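To illustrate how subset evaluation rewards relevance while penalizing redundancy, the following is a minimal sketch of Hall's standard CFS merit score (merit = k·mean|r_cf| / sqrt(k + k(k−1)·mean|r_ff|)); the function and variable names are our own, and the correlation values below are hypothetical.

```python
import math

def cfs_merit(feat_class_corr, feat_feat_corr):
    """Hall's CFS merit for a candidate subset of k features.

    feat_class_corr: |correlations| between each feature and the class (length k).
    feat_feat_corr:  |correlations| between each pair of features (length k*(k-1)/2).
    High feature-class correlation raises the merit; high inter-feature
    correlation (redundancy) lowers it.
    """
    k = len(feat_class_corr)
    mean_r_cf = sum(abs(r) for r in feat_class_corr) / k
    mean_r_ff = (sum(abs(r) for r in feat_feat_corr) / len(feat_feat_corr)
                 if feat_feat_corr else 0.0)
    return k * mean_r_cf / math.sqrt(k + k * (k - 1) * mean_r_ff)

# Two relevant but highly redundant genes score lower than two equally
# relevant, nearly independent genes (hypothetical correlation values).
redundant = cfs_merit([0.8, 0.8], [0.9])
complementary = cfs_merit([0.8, 0.8], [0.1])
```

Here `complementary` exceeds `redundant`, showing why a subset evaluator can discard genes that carry the same signal, something a per-feature ranker cannot do.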