Abstract
Many classification problems in bioinformatics involve datasets characterized by class imbalance. This unequal class distribution can degrade classification performance on the minority class, typically producing a very high false negative rate, and the minority class is usually the class of interest. Although this challenge is prevalent among bioinformatics datasets, most practitioners and researchers have focused their efforts on a different problem, namely high dimensionality (too many independent variables). As a result, class imbalance has been almost completely neglected. In this work, we investigate the importance of alleviating class imbalance, by applying data sampling, for classification problems on bioinformatics datasets. To do so, we compare the classification performance obtained when data sampling is applied together with feature selection against the performance obtained with feature selection alone. We employ six widely used classification algorithms and three major forms of feature selection. Our results show that classification models built with feature selection alone perform worse than those built when data sampling is combined with feature selection. Statistical analysis shows that the performance increase from combining data sampling with feature selection is significant. It is therefore essential to give special attention to class imbalance in bioinformatics, and this experiment shows why techniques that alleviate it (e.g., data sampling) should be applied.
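
To make the idea of data sampling concrete, the following is a minimal sketch of one common form, random undersampling, in which majority-class instances are discarded at random until the classes are the same size. The abstract does not state which sampling technique was used, so the choice of random undersampling here is an assumption made purely for illustration; the function name `random_undersample` and the toy data are likewise hypothetical.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly discard instances so every class matches the smallest class.

    Illustrative assumption: the sampling method actually used in the
    study may differ from random undersampling.
    """
    rng = random.Random(seed)

    # Group instances by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is reduced to the size of the minority class.
    target = min(len(xs) for xs in by_class.values())

    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            balanced.append((x, y))

    # Shuffle so the classes are not grouped together in the output.
    rng.shuffle(balanced)
    new_samples = [x for x, _ in balanced]
    new_labels = [y for _, y in balanced]
    return new_samples, new_labels


# Toy imbalanced dataset: 10 minority (label 1) and 90 majority (label 0).
samples = [[float(i), float(i % 7)] for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]

balanced_samples, balanced_labels = random_undersample(samples, labels)
print("before:", Counter(labels))
print("after: ", Counter(balanced_labels))
```

In a comparison like the one described, a step of this kind would plausibly be applied to the training data only, so that the test data keep their original class distribution.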