Comparison of Data Sampling Approaches for Imbalanced Bioinformatics Data

Date Issued:
2014
Summary:
Class imbalance is a frequent problem in bioinformatics datasets. Unfortunately, the minority class is usually also the class of interest. One way to improve this situation is data sampling. There are a number of different data sampling methods, each with its own strengths and weaknesses, which makes choosing one difficult. In our work we compare three data sampling techniques (Random Undersampling, Random Oversampling, and SMOTE) on six bioinformatics datasets with varying levels of class imbalance. Additionally, we apply two different classifiers to the problem, 5-NN and SVM, and use feature selection to reduce our datasets to 25 features prior to applying sampling. Our results show that there is very little difference between the data sampling techniques, although Random Undersampling is most frequently the top-performing technique for both of our classifiers. We also performed statistical analysis, which confirms that there is no statistically significant difference between the techniques. Therefore, our recommendation is to use Random Undersampling, because it is less computationally expensive than SMOTE and it also reduces the size of the dataset, which lowers subsequent computational costs without sacrificing classification performance.
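For readers unfamiliar with the three techniques named in the summary, the following is a minimal, standard-library Python sketch of how each one rebalances a dataset. It is not the authors' implementation. The function names, the choice to balance every class to a 1:1 ratio, and the SMOTE neighbour count `k` are assumptions made for illustration; the abstract does not state the target class ratio or the SMOTE parameters, and the feature selection and classification steps are omitted here.

```python
import random
from collections import Counter, defaultdict


def _group(X, y):
    """Collect feature rows under their class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def _unzip(pairs, rng):
    """Shuffle (row, label) pairs and split them back into X and y."""
    rng.shuffle(pairs)
    return [row for row, _ in pairs], [label for _, label in pairs]


def random_undersample(X, y, seed=0):
    """Randomly discard rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    groups = _group(X, y)
    target = min(len(rows) for rows in groups.values())
    pairs = [(row, label)
             for label, rows in groups.items()
             for row in rng.sample(rows, target)]
    return _unzip(pairs, rng)


def random_oversample(X, y, seed=0):
    """Randomly duplicate rows until every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _group(X, y)
    target = max(len(rows) for rows in groups.values())
    pairs = []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        pairs.extend((row, label) for row in rows + extra)
    return _unzip(pairs, rng)


def smote(X, y, k=5, seed=0):
    """Add synthetic rows to smaller classes by interpolating between a row
    and one of its k nearest same-class neighbours (assumed k; Euclidean distance)."""
    rng = random.Random(seed)
    groups = _group(X, y)
    target = max(len(rows) for rows in groups.values())
    pairs = []
    for label, rows in groups.items():
        synthetic = []
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            if not others:
                # A single-row class has no neighbour to interpolate toward.
                synthetic.append(list(base))
                continue
            nearest = sorted(
                others,
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            neighbour = rng.choice(nearest)
            gap = rng.random()  # position along the segment base -> neighbour
            synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
        pairs.extend((row, label) for row in rows + synthetic)
    return _unzip(pairs, rng)


if __name__ == "__main__":
    # Toy data (hypothetical): 8 majority rows (label 0), 3 minority rows (label 1).
    X = [[float(i), float(i % 3)] for i in range(11)]
    y = [0] * 8 + [1] * 3
    for sampler in (random_undersample, random_oversample, smote):
        _, y_new = sampler(X, y)
        print(sampler.__name__, sorted(Counter(y_new).items()))
```

The sketch also shows the cost argument made in the summary: undersampling only draws a random subset and returns a smaller dataset, whereas the SMOTE sketch sorts same-class rows by distance for every synthetic row it creates and returns a larger dataset.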
Name(s): Dittman, David
Wald, Randall
Napolitano, Amri E.
Graduate College
Khoshgoftaar, Taghi M.
Type of Resource: text
Genre: Abstract
Date Created: 2014
Publisher: Florida Atlantic University
Place of Publication: Boca Raton, Fla.
Physical Form: application/pdf
Extent: 1 p.
Language(s): English
Identifier: FA00005811 (IID)
Collection: FAU Student Research Digital Collection
Note(s): The Fifth Annual Graduate Research Day was organized by Florida Atlantic University’s Graduate Student Association. Graduate students from FAU colleges present abstracts of original research and posters in a competition for monetary prizes, awards, and recognition.
Held by: Florida Atlantic University Libraries
Sublocation: Digital Library
Persistent Link to This Record: http://purl.flvc.org/fau/fd/FA00005811
Use and Reproduction: Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Host Institution: FAU
Is Part of Series: Florida Atlantic University Digital Library Collections.