Enhancing classification performance over noise and imbalanced data problems
Jeatrakul, Piyasak (2012) Enhancing classification performance over noise and imbalanced data problems. PhD thesis, Murdoch University.
This research presents the development of techniques to handle two issues in data classification: the noise and imbalanced data problems. Noise is a significant problem that can degrade the quality of training data in any learning algorithm. Learning algorithms trained on noisy instances generally misclassify more often, so classification performance tends to decrease. The imbalanced data problem likewise affects the performance of learning algorithms. If some classes have a much larger number of instances than the others, the learning algorithms tend to be dominated by the features of the majority classes, and the features of the minority classes are difficult to recognise. As a result, the classification performance on the minority classes can be significantly lower than on the majority classes. It is therefore important to implement techniques that better handle the negative effects of the noise and imbalanced data problems.
Although several approaches attempt to handle the noise and imbalanced data problems, each has shortcomings. Among the noise handling techniques, the noise-tolerant approach requires no data preprocessing but can tolerate only a limited amount of noise: a classifier developed from noisy data tends to be less predictive when the training data contain a large number of noisy instances. The noise elimination approach, although easily applied across problem domains, can degrade the quality of the training data if it cannot distinguish noise from rare cases (exceptions). For the imbalanced data problem, the available techniques also present limitations. The algorithm-level approach performs effectively only on specific problem domains or with specific learning algorithms. The data-level approach can either eliminate necessary information from the training set or over-fit the minority class. Moreover, when the imbalanced data problem becomes more complex, as in multi-class classification, it is difficult to apply the re-sampling techniques (the data-level approach), which perform effectively for imbalanced binary classification, to the multi-class setting. These limitations motivate this research to propose and investigate techniques that handle the noise and imbalanced data problems more effectively.
This thesis has developed three new techniques to overcome the identified problems. Firstly, a cleaning technique called the Complementary Neural Network (CMTNN) data cleaning technique has been developed in order to remove noise (misclassified instances) from the training set. The results show that the new noise detection and removal technique can eliminate noise with confidence. Furthermore, the CMTNN cleaning technique can increase the classification accuracy across different learning algorithms: Artificial Neural Network (ANN), Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and Decision Tree (DT). It provides higher classification performance than other cleaning methods such as Tomek links, majority voting filtering, and consensus voting filtering.
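Broadly, the CMTNN formulation trains a pair of complementary models — a "truth" network on the given labels and a "falsity" network on their complements — and treats an instance as noise when both models contradict its label. The sketch below illustrates that cleaning idea for binary data; the hand-rolled logistic model, the toy clusters, and the injected noise rate are illustrative stand-ins chosen for brevity, not the thesis's actual neural-network configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated Gaussian clusters.
n = 100
X = np.vstack([rng.normal(-2, 0.5, (n, 2)), rng.normal(2, 0.5, (n, 2))])
y = np.array([0] * n + [1] * n)

# Inject label noise: flip 10 labels at random.
noisy_idx = rng.choice(2 * n, 10, replace=False)
y_noisy = y.copy()
y_noisy[noisy_idx] ^= 1

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression, standing in for the
    thesis's neural networks (an assumption for brevity). Returns a
    hard 0/1 predictor."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return lambda Z: (
        1 / (1 + np.exp(-np.hstack([Z, np.ones((len(Z), 1))]) @ w)) > 0.5
    ).astype(int)

truth = fit_logistic(X, y_noisy)        # trained on the given labels
falsity = fit_logistic(X, 1 - y_noisy)  # trained on the complemented labels

# Flag an instance as noise only when BOTH models contradict its label.
flagged = (truth(X) != y_noisy) & (falsity(X) != 1 - y_noisy)
X_clean, y_clean = X[~flagged], y_noisy[~flagged]
```

Requiring agreement between the two complementary models is what distinguishes this from a single-classifier filter: a borderline instance that only one model disputes is kept, which helps preserve rare-but-valid cases.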
Secondly, the CMTNN re-sampling technique, a new under-sampling technique, has been developed to handle the imbalanced data problem in binary classification. The results show that the combination of the CMTNN re-sampling technique and the Synthetic Minority Over-sampling Technique (SMOTE) performs effectively, improving the classification performance on the minority class in terms of Geometric Mean (G-Mean) and the area under the Receiver Operating Characteristic (ROC) curve. It generally provides higher performance than other re-sampling techniques such as Tomek links, Wilson's Edited Nearest Neighbor Rule (ENN), SMOTE, the combined technique of SMOTE and ENN, and the combined technique of SMOTE and Tomek links.
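SMOTE, the over-sampling half of this combination, synthesises new minority instances by interpolating between a minority instance and one of its nearest minority neighbours. A minimal sketch of that interpolation step follows; the CMTNN under-sampling side is omitted for brevity, and the toy data and parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def smote(X_min, n_new, k=5, rng=rng):
    """Minimal SMOTE sketch: each synthetic point is a random convex
    interpolation between a minority instance and one of its k nearest
    minority neighbours."""
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances within class
        nn = np.argsort(d)[1:k + 1]                   # skip the point itself
        j = rng.choice(nn)
        gap = rng.random()                            # interpolation fraction
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

# Hypothetical imbalanced toy set: 50 majority vs 10 minority instances.
X_maj = rng.normal(0, 1, (50, 2))
X_min = rng.normal(3, 1, (10, 2))
X_new = smote(X_min, n_new=40)  # bring the minority class up to 50
```

Because each synthetic point lies on a segment between two real minority instances, SMOTE densifies the minority region rather than duplicating points, which is why it tends to over-fit less than naive replication.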
For the third proposed technique, an algorithm named One-Against-All with Data Balancing (OAA-DB) has been developed to deal with the imbalanced data problem in multi-class classification. This algorithm not only improves the performance on the minority class but also maintains the overall accuracy, which other techniques normally reduce. The OAA-DB algorithm increases performance in terms of classification accuracy and F-measure when compared with other multi-class classification approaches, including the One-Against-All (OAA), One-Against-One (OAO), All and One (A&O), and One Against Higher Order (OAHO) approaches. Furthermore, this algorithm has shown that re-sampling techniques are not only effective for the class imbalance problem in binary classification but can also be applied successfully to the imbalanced data problem in multi-class classification.
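The structural idea — decompose the multi-class problem one-against-all, re-balance each binary subproblem, then combine the binary scores — can be sketched as follows. Naive random over-sampling and a plain logistic model stand in for the thesis's re-sampling combination and neural networks (assumptions for brevity); only the OAA-with-balancing skeleton is the point:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logistic(X, y, lr=0.1, steps=400):
    """Gradient-descent logistic regression; returns the weight vector."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def score(w, X):
    """Probability that each row belongs to the positive class."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1 / (1 + np.exp(-Xb @ w))

# Hypothetical 3-class toy data with a small (imbalanced) third class.
X = np.vstack([rng.normal([-3, 0], 0.5, (40, 2)),
               rng.normal([3, 0], 0.5, (40, 2)),
               rng.normal([0, 3], 0.5, (8, 2))])
y = np.array([0] * 40 + [1] * 40 + [2] * 8)

models = []
for c in np.unique(y):
    yb = (y == c).astype(float)  # binary subproblem: class c vs the rest
    # Data balancing step: naive random over-sampling of the positive
    # class, standing in for the thesis's re-sampling combination.
    pos = np.where(yb == 1)[0]
    extra = rng.choice(pos, max(0, int((yb == 0).sum()) - len(pos)),
                       replace=True)
    Xc = np.vstack([X, X[extra]])
    yc = np.concatenate([yb, np.ones(len(extra))])
    models.append(fit_logistic(Xc, yc))

# Combine: assign each instance to the class with the highest binary score.
pred = np.argmax([score(w, X) for w in models], axis=0)
accuracy = (pred == y).mean()
```

Balancing each binary subproblem separately is what lets the minority class compete: without it, the "class 2 vs rest" model sees 8 positives against 80 negatives and its scores are systematically depressed at the combination step.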
Publication Type: Thesis (PhD)
Murdoch Affiliation: School of Information Technology
Supervisors: Wong, Kevin and Fung, Lance