Enhancing classification performance by handling noise and imbalanced data with fuzzy classification techniques

Pruengkarn, Ratchakoon (2018) Enhancing classification performance by handling noise and imbalanced data with fuzzy classification techniques. PhD thesis, Murdoch University.

PDF - Whole Thesis (2MB)

Abstract

This thesis studied methodologies for improving the quality of training data in order to enhance classification performance. Noise and imbalance are two significant factors affecting data quality. Class noise is considered the most harmful type of noise for a classifier's performance, since incorrectly labelled examples may severely bias the learning method and result in inaccurate models. Removing mislabelled instances is more efficient than repairing and relabelling them; however, excessive removal of instances can cause serious and irremediable loss of information. Even so, retaining noisy instances is worse than over-eliminating them, so filtering must conserve instances without removing them excessively. Therefore, in the first part of this study, a noise removal technique combining the Complementary technique with the Fuzzy Support Vector Machine (CMTFSVM) is proposed. It applies misclassification analysis to eliminate high-uncertainty instances, which are the most likely to be mislabelled samples in the training data. The results indicated that CMTFSVM can reduce class noise and enhance classification accuracy across different learning algorithms: Neural Network (NN), Support Vector Machine (SVM) and Fuzzy Support Vector Machine (FSVM).
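To make the misclassification-analysis idea concrete, the sketch below flags and removes suspected class noise using cross-validated predictions. Since scikit-learn provides no Fuzzy SVM, a standard SVC stands in for the FSVM here; this is a minimal illustration of the general approach under those assumptions, not the thesis's exact CMTFSVM procedure.

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import SVC

    def filter_class_noise(X, y, n_splits=5):
        """Drop instances whose out-of-fold prediction disagrees with their given label."""
        y = np.asarray(y)
        clf = SVC(kernel="rbf", gamma="scale", C=1.0)   # stand-in for the FSVM
        # Out-of-fold predictions: each instance is judged by a model that never saw it.
        y_pred = cross_val_predict(clf, X, y, cv=n_splits)
        keep = y_pred == y
        # Cleaned data plus a mask of the suspected noisy instances that were removed.
        return X[keep], y[keep], ~keep

The cleaned training set produced this way can then be passed to any downstream learner, such as the NN, SVM or FSVM classifiers mentioned above.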

The second factor is the class imbalance problem, which causes poor performance in existing learning algorithms. In such situations, some classes have far more instances than others, and traditional learning algorithms tend to be overwhelmed by the majority classes and ignore the minority classes. Despite their rarity, the minority classes are just as important, as they can contain useful information, yet they are difficult to recognise precisely because of their infrequency. The second part of this study is designed to overcome this bias by combining the CMTFSVM undersampling technique with the Synthetic Minority Over-sampling Technique (SMOTE); the combined technique is called CMTSMT. CMTSMT is proposed for handling the binary imbalance problem: it filters uncertain instances out of the training datasets and promotes the importance of the minority classes. The results revealed that CMTSMT can improve classification performance across various imbalance ratios by approximately 96% and 40% in terms of the Geometric Mean (G-mean) and the Area Under the Receiver Operating Characteristic curve (AUC), respectively.
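A rough sketch of this undersample-then-oversample pipeline is given below. The misclassification filter sketched earlier stands in for the CMTFSVM undersampling step, and SMOTE is taken from the imbalanced-learn package; the actual CMTSMT procedure in the thesis may differ in its details.

    from imblearn.over_sampling import SMOTE

    def cmtsmt_like_resample(X, y, noise_filter, random_state=0):
        # Step 1: undersampling effect -- remove high-uncertainty (likely mislabelled) instances.
        X_clean, y_clean, _ = noise_filter(X, y)
        # Step 2: oversampling -- synthesise minority-class instances until the classes are balanced.
        X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_clean, y_clean)
        return X_bal, y_bal

    # Example usage with the filter sketched earlier:
    # X_bal, y_bal = cmtsmt_like_resample(X, y, filter_class_noise)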

Another type of imbalance problem is multiclass imbalanced classification. Multiclass learning is regarded as a difficult task for classification algorithms, as it often yields significantly lower performance than binary cases, and most existing techniques designed for binary class imbalance cannot be applied directly to multiclass problems. In addition, between-class and within-class imbalance are the two main factors causing issues for learning algorithms. Decomposition techniques such as One-vs-One and One-vs-All are commonly used to deal with multiclass imbalanced data; however, they lose the ability to balance performance across all classes and require more memory and more classifiers. Thus, a hybrid of Fuzzy C-Means clustering (FCM) and SMOTE, called FCMSMT, is proposed. The results showed that the FCMSMT technique can reduce the between-class and within-class problems by balancing all classes to a similar number of instances and by randomly selecting at least one instance from each cluster. Moreover, the total number of instances after applying FCMSMT remains similar to that of the original dataset, which prevents excessive undersampling or oversampling of class instances. The performance improvement of FCMSMT over the original data on highly imbalanced data is approximately 10% and 5% in terms of G-mean and AUC, respectively. Thus, the FCMSMT technique can be an alternative way of dealing with the multiclass imbalance classification problem.
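The cluster-guided rebalancing described above can be sketched as follows. Scikit-learn has no Fuzzy C-Means implementation, so KMeans stands in for FCM, and the mean class size is assumed as the per-class target so that the resampled set stays close to the original size; these choices are illustrative assumptions rather than the thesis's exact FCMSMT algorithm.

    import numpy as np
    from sklearn.cluster import KMeans
    from imblearn.over_sampling import SMOTE

    def fcmsmt_like_resample(X, y, n_clusters=5, random_state=0):
        rng = np.random.default_rng(random_state)
        classes, counts = np.unique(y, return_counts=True)
        target = int(counts.mean())              # every class is pushed towards this size
        parts_X, parts_y = [], []
        for cls, count in zip(classes, counts):
            Xc = X[y == cls]
            if count > target:
                # Majority class: cluster it and keep at least one instance per cluster,
                # so the undersampled class still covers its whole distribution.
                k = min(n_clusters, len(Xc))
                labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(Xc)
                chosen = []
                for c in range(k):
                    idx = np.flatnonzero(labels == c)
                    quota = max(1, round(target * len(idx) / count))
                    chosen.extend(rng.choice(idx, size=min(quota, len(idx)), replace=False))
                Xc = Xc[np.array(chosen)]
            parts_X.append(Xc)
            parts_y.append(np.full(len(Xc), cls))
        X_mid, y_mid = np.vstack(parts_X), np.concatenate(parts_y)
        # Minority classes are then oversampled with SMOTE up to the same target size
        # (assumes each minority class has more samples than SMOTE's default k_neighbors).
        mid_classes, mid_counts = np.unique(y_mid, return_counts=True)
        strategy = {c: max(target, int(n)) for c, n in zip(mid_classes, mid_counts)}
        return SMOTE(sampling_strategy=strategy, random_state=random_state).fit_resample(X_mid, y_mid)

Undersampling proportionally within clusters while guaranteeing each cluster at least one instance addresses within-class imbalance, and pushing every class towards a common target addresses between-class imbalance, in the spirit of the approach described above.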

Publication Type: Thesis (PhD)
Murdoch Affiliation: School of Engineering and Information Technology
Supervisors: Wong, Kevin and Fung, Lance
URI: http://researchrepository.murdoch.edu.au/id/eprint/42505