Is Gene Selection Enough for Imbalanced Bioinformatics Data?

Ahmad Abu Shanab; Taghi M Khoshgoftaar

doi:10.1109/IRI.2018.00059

Conference proceeding

2018 IEEE International Conference on Information Reuse and Integration (IRI), pp.346-355

07/2018

Abstract

Bioinformatics

Biological system modeling

Class imbalance

Data integrity

Data mining

Data models

data quality

data sampling

Feature extraction

feature ranking

filter-based subset evaluation

Gene expression

wrapper-based subset selection

Many classification problems in bioinformatics use datasets that are characterized by class imbalance. This unequal class distribution can adversely affect classification performance on the minority class (producing a very high rate of false negatives), which is usually the class of interest. While this challenge is prevalent among bioinformatics datasets, a majority of practitioners and researchers have focused their efforts on a different problem, namely high dimensionality (too many independent variables). As a result, class imbalance has been almost completely neglected. In this work, we investigate the importance of alleviating class imbalance (by applying data sampling) for classification problems on bioinformatics datasets. To do so, we compare the classification performance obtained after applying both data sampling and feature selection with the performance obtained when using feature selection alone. We employ six widely used classification algorithms as well as three major forms of feature selection. Our results show that classification models built with feature selection alone perform worse than those built when data sampling is combined with feature selection. Statistical analysis shows that the increase in performance from performing data sampling along with feature selection is significant. Therefore, it is essential to place special focus on the problem of class imbalance in bioinformatics, and this experiment shows why it is important to apply techniques (e.g., data sampling) to alleviate class imbalance.
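
The idea of pairing data sampling with feature selection can be illustrated with a minimal sketch. This is not the authors' method. The abstract does not name the sampling technique, the feature rankers, or the order of the two steps, so random undersampling, a mean-gap filter ranker, and the sampling-then-selection order are all assumptions made for illustration. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def random_undersample(X, y, minority_label=1, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    Assumption: random undersampling stands in for the unspecified
    data sampling technique mentioned in the abstract.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    n_keep = min(len(minority), len(majority))
    kept = sorted(minority + rng.sample(majority, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


def rank_features(X, y, minority_label=1, top_k=1):
    """Filter-style ranking: score each feature by the absolute gap
    between its class means and return the indices of the top_k features.

    Assumption: this simple score stands in for a real feature ranker.
    """
    pos = [row for row, label in zip(X, y) if label == minority_label]
    neg = [row for row, label in zip(X, y) if label != minority_label]
    scores = []
    for j in range(len(X[0])):
        gap = abs(mean(r[j] for r in pos) - mean(r[j] for r in neg))
        scores.append((gap, j))
    # Largest gap first; ties broken by feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:top_k]]


# Hypothetical toy data: 2 minority samples, 6 majority samples, 3 "genes".
X = [
    [5.0, 0.1, 1.0], [5.2, 0.3, 1.1],
    [1.0, 0.2, 1.0], [1.1, 0.1, 0.9], [0.9, 0.3, 1.2],
    [1.2, 0.2, 1.0], [1.0, 0.1, 1.1], [0.8, 0.3, 0.9],
]
y = [1, 1, 0, 0, 0, 0, 0, 0]

# Step 1 (assumed order): balance the classes.
X_bal, y_bal = random_undersample(X, y)
# Step 2: select features on the balanced data.
selected = rank_features(X_bal, y_bal, top_k=1)
# Reduced dataset that a classifier would then be trained on.
X_reduced = [[row[j] for j in selected] for row in X_bal]
```

A comparison in the spirit of the abstract would train the same classifier once on features selected from the original data and once on features selected from the balanced data, then compare minority-class performance.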

Details

Title: Is Gene Selection Enough for Imbalanced Bioinformatics Data?
Creators: Ahmad Abu Shanab - Florida Atlantic University; Taghi M Khoshgoftaar - Florida Atlantic University
Publication Details: 2018 IEEE International Conference on Information Reuse and Integration (IRI), pp.346-355
Publisher: IEEE
Identifiers: 991004106073106311
Academic Unit: Computer Science
Language: English
Resource Type: Conference proceeding
