Generalized Feature Embedding Learning for Clustering and Classification

Date Issued:
2018
Abstract/Description:
Data comes in many different shapes and sizes. In real life applications it is common that data we are studying has features that are of varied data types. These may include numerical, categorical, and text. In order to be able to model this data with machine learning algorithms, it is required that the data is typically in numeric form. Therefore, for data that is not originally numerical, it must be transformed to be able to be used as input into these algorithms. Along with this transformation it is common that data we study has many features relative to the number of samples in the data. It is often desirable to reduce the number of features that are being trained in a model to eliminate noise and reduce time in training. This problem of high dimensionality can be approached through feature selection, feature extraction, or feature embedding. Feature selection seeks to identify the most essential variables in a dataset that will lead to a parsimonious model and high performing results, while feature extraction and embedding are techniques that utilize a mathematical transformation of the data into a represented space. As a byproduct of using a new representation, we are able to reduce the dimension greatly without sacrificing performance. Oftentimes, by using embedded features we observe a gain in performance. Though extraction and embedding methods may be powerful for isolated machine learning problems, they do not always generalize well. Therefore, we are motivated to illustrate a methodology that can be applied to any data type with little pre-processing. The methods we develop can be applied in unsupervised, supervised, incremental, and deep learning contexts. Using 28 benchmark datasets as examples which include different data types, we construct a framework that can be applied for general machine learning tasks. The techniques we develop contribute to the field of dimension reduction and feature embedding.
Using this framework, we make additional contributions to eigendecomposition by creating an objective matrix that includes three main vital components. The first is a class partitioned row and feature product representation of one-hot encoded data. The second is the derivation of a weighted adjacency matrix based on class label relationships. Finally, by the inner product of these aforementioned values, we are able to condition the one-hot encoded data generated from the original data prior to eigenvector decomposition. The use of class partitioning and adjacency enables subsequent projections of the data to be trained more effectively when compared side by side to baseline algorithm performance. Along with this improved performance, we can adjust the dimension of the subsequent data arbitrarily. In addition, we also show how these dense vectors may be used in applications to order the features of generic data for deep learning. In this dissertation, we examine a general approach to dimension reduction and feature embedding that utilizes a class partitioned row and feature representation, a weighted approach to instance similarity, and an adjacency representation. This general approach has application to unsupervised, supervised, online, and deep learning. In our experiments on 28 benchmark datasets, we show significant performance gains in clustering, classification, and training time.
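Illustrative sketch (not taken from the dissertation): the abstract describes the pipeline only at a high level, so the Python example below shows one generic way that one-hot encoding, a class-label adjacency matrix, and eigendecomposition can be combined. The function names, the 1/n_c weighting, and the objective matrix X^T W X are all assumptions made for illustration and should not be read as the author's actual formulation.

```python
import numpy as np

def one_hot(columns):
    """One-hot encode a 2-D array of categorical values, column by column."""
    blocks = []
    for j in range(columns.shape[1]):
        values = np.unique(columns[:, j])
        blocks.append((columns[:, j][:, None] == values[None, :]).astype(float))
    return np.hstack(blocks)

def class_adjacency(labels):
    """Assumed weighting: W[i, j] = 1 / n_c if rows i and j share class c, else 0."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    class_sizes = same.sum(axis=1)          # size of each row's class
    return same / class_sizes[:, None]

def embed(X, labels, k):
    """Project one-hot data X onto the top-k eigenvectors of X^T W X (assumed objective)."""
    W = class_adjacency(labels)
    M = X.T @ W @ X                         # feature-by-feature objective matrix
    M = (M + M.T) / 2.0                     # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)    # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return X @ top                          # dense embedding with k columns

# Toy data: four rows, two categorical features, two classes.
data = np.array([["red", "small"],
                 ["red", "large"],
                 ["blue", "small"],
                 ["blue", "large"]])
X = one_hot(data)
Z = embed(X, labels=[0, 0, 1, 1], k=2)
```

Here k sets the output dimension, which mirrors the abstract's remark that the dimension of the projected data can be adjusted; everything else in the sketch is a stand-in for details the record does not give.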
Title: Generalized Feature Embedding Learning for Clustering and Classification.
Name(s): Golinko, Eric David, author
Zhu, Xingquan, Thesis advisor
Florida Atlantic University, Degree grantor
College of Engineering and Computer Science
Department of Computer and Electrical Engineering and Computer Science
Type of Resource: text
Genre: Electronic Thesis Or Dissertation
Date Created: 2018
Date Issued: 2018
Publisher: Florida Atlantic University
Place of Publication: Boca Raton, Fla.
Physical Form: application/pdf
Extent: 128 p.
Language(s): English
Identifier: FA00013063 (IID)
Degree granted: Dissertation (Ph.D.)--Florida Atlantic University, 2018.
Collection: FAU Electronic Theses and Dissertations Collection
Note(s): Includes bibliography.
Subject(s): Eigenvectors--Data processing.
Algorithms.
Cluster analysis.
Held by: Florida Atlantic University Libraries
Sublocation: Digital Library
Persistent Link to This Record: http://purl.flvc.org/fau/fd/FA00013063
Use and Reproduction: Copyright © is held by the author, with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Use and Reproduction: http://rightsstatements.org/vocab/InC/1.0/
Host Institution: FAU
Is Part of Series: Florida Atlantic University Digital Library Collections.