You are here

Alleviating class imbalance using data sampling: Examining the effects on classification algorithms

Download pdf | Full Screen View

Date Issued:
2006
Summary:
Imbalanced class distributions typically cause poor classifier performance on the minority class, which also tends to be the class with the highest cost of mis-classification. Data sampling is a common solution to this problem, and numerous sampling techniques have been proposed to address it. Prior research examining the performance of these techniques has been narrow and limited. This work uses thorough empirical experimentation to compare the performance of seven existing data sampling techniques using five different classifiers and four different datasets. The work addresses which sampling techniques produce the best performance in the presence of class unbalance, which classifiers are most robust to the problem, as well as which sampling techniques perform better or worse with each classifier. Extensive statistical analysis of these results is provided, in addition to an examination of the qualitative effects of the sampling techniques on the types of predictions made by the C4.5 classifier.
Title: Alleviating class imbalance using data sampling: Examining the effects on classification algorithms.
543 views
467 downloads
Name(s): Napolitano, Amri E.
Florida Atlantic University, Degree grantor
Khoshgoftaar, Taghi M., Thesis advisor
College of Engineering and Computer Science
Department of Computer and Electrical Engineering and Computer Science
Type of Resource: text
Genre: Electronic Thesis Or Dissertation
Issuance: monographic
Date Issued: 2006
Publisher: Florida Atlantic University
Place of Publication: Boca Raton, Fla.
Physical Form: application/pdf
Extent: 100 p.
Language(s): English
Summary: Imbalanced class distributions typically cause poor classifier performance on the minority class, which also tends to be the class with the highest cost of mis-classification. Data sampling is a common solution to this problem, and numerous sampling techniques have been proposed to address it. Prior research examining the performance of these techniques has been narrow and limited. This work uses thorough empirical experimentation to compare the performance of seven existing data sampling techniques using five different classifiers and four different datasets. The work addresses which sampling techniques produce the best performance in the presence of class unbalance, which classifiers are most robust to the problem, as well as which sampling techniques perform better or worse with each classifier. Extensive statistical analysis of these results is provided, in addition to an examination of the qualitative effects of the sampling techniques on the types of predictions made by the C4.5 classifier.
Identifier: 9780542931291 (isbn), 13413 (digitool), FADT13413 (IID), fau:10263 (fedora)
Collection: FAU Electronic Theses and Dissertations Collection
Note(s): College of Engineering and Computer Science
Thesis (M.S.)--Florida Atlantic University, 2006.
Subject(s): Combinatorial group theory
Data mining
Decision trees
Machine learning
Held by: Florida Atlantic University Libraries
Persistent Link to This Record: http://purl.flvc.org/fcla/dt/13413
Sublocation: Digital Library
Use and Reproduction: Copyright © is held by the author, with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Use and Reproduction: http://rightsstatements.org/vocab/InC/1.0/
Host Institution: FAU
Is Part of Series: Florida Atlantic University Digital Library Collections.