Murdoch University Research Repository

Classification of imbalanced data streams with an adaptive window re-balancing with retaining knowledge framework

Mohammed, Rafiq A. (2020) Classification of imbalanced data streams with an adaptive window re-balancing with retaining knowledge framework. PhD thesis, Murdoch University.

Abstract

Imbalanced data is ubiquitous in many real-world domains such as bioinformatics, call logs, cancer detection, finance, heart rates, and weather prediction. A dataset in which one class considerably dominates the other is said to have a class imbalance problem. The Imbalance Ratio (IR) of a dataset is computed as the ratio of the number of majority class instances to the number of minority class instances. An IR of over 50:1 (majority:minority) for a binary class problem is considered highly imbalanced. Learning from highly imbalanced data and improving classification predictions is a challenging task, especially in data streams. In practice, a class imbalance with high IR is dynamic, making the classification task progressively more difficult. When facing classes with high IR disproportions, one must develop an intelligent Classification System (CS) to learn from an imbalanced data stream. Also, with the growth of large-scale transactions, the volume of imbalanced data is immense, which raises the challenge of how well Machine Learning (ML) techniques can scale to learn from imbalanced data efficiently. Handling imbalanced data with ML techniques focuses on detecting the minority class, or class of interest, in the incoming data stream and responding faster with high prediction accuracy and reduced misclassification costs. The results in chapter 3 indicated that Random Forest is a suitable ML technique for working with incoming imbalanced data streams.
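
The IR definition and the 50:1 cut-off above can be illustrated with a short Python sketch. The function names and the example labels are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority count / minority count for a binary label sequence."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    majority = max(counts.values())
    minority = min(counts.values())
    return majority / minority


def is_highly_imbalanced(labels, threshold=50.0):
    """Apply the 'over 50:1' rule from the abstract (threshold is adjustable)."""
    return imbalance_ratio(labels) > threshold


# Hypothetical example: 510 majority instances and 10 minority instances.
labels = [0] * 510 + [1] * 10
ratio = imbalance_ratio(labels)
highly_imbalanced = is_highly_imbalanced(labels)
```

With these example labels the IR is 510 / 10 = 51, which is just above the 50:1 cut-off.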

Moreover, the performance of ML classification algorithms on highly imbalanced streaming data depends upon an efficient re-balancing strategy. Therefore, this thesis investigated different possible re-balancing strategies for resolving high class imbalance in data streams. This thesis also investigated the impact of the different evaluation criteria used to determine how to balance the data streams for the ML classification technique, which should itself scale to highly imbalanced datasets.

For highly imbalanced data streams, an automated re-balancing strategy is preferable: as an active learning approach, it can handle imbalanced data and provide a way to select an appropriate re-balancing technique for an imbalanced data stream. This thesis proposed an automated re-balancing strategy that uses the Racing Algorithm (RA) to handle imbalanced data streams; to the best of the author's knowledge, the RA has not previously been applied to data streams. To deal with the challenges of dynamic, highly imbalanced classes in data streams, a framework that combines an automated re-balancing strategy using the RA with an incremental data re-balancing technique, named the Piece-Wise Incremental Data re-Balancing (PWIDB) framework, is proposed. The results in chapter 4 reveal that the proposed PWIDB framework is capable of achieving stable predictions. Moreover, the IR in a stream is lower after applying the incremental re-balancing framework. The results also imply that incremental re-balancing with the PWIDB framework lowers the IR in the stream more quickly and can address the underlying behaviour of the class imbalance problem when classifying highly imbalanced streaming data. In summary, for the PWIDB framework, this thesis focuses on an incremental data re-balancing technique that uses previously acquired re-balanced data as knowledge (capturing, preserving and incrementally relaying the re-balanced data to adapt to the real-time IR) and on deriving the best approximation classification model for improving prediction accuracy at time j. The results in chapter 4 indicated that the proposed PWIDB framework is more adaptive and stable with highly imbalanced data streams than the benchmark technique.
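
The idea of carrying previously seen data forward as knowledge can be sketched in Python as follows. This is a deliberately simplified illustration and not the PWIDB framework itself: it retains only minority instances from earlier chunks, and it does not use the Racing Algorithm to select a re-balancing technique. The class name, the `max_retained` parameter and the `(features, label)` chunk layout are assumptions made for this example.

```python
from collections import deque


class RetainingRebalancer:
    """Hypothetical helper: re-uses minority instances from earlier chunks."""

    def __init__(self, minority_label=1, max_retained=1000):
        self.minority_label = minority_label
        # Bounded buffer so retained knowledge cannot grow without limit.
        self.retained = deque(maxlen=max_retained)

    def rebalance(self, chunk):
        """Return the chunk augmented with previously retained minority instances.

        chunk is a list of (features, label) pairs.
        """
        augmented = list(chunk) + list(self.retained)
        # Retain this chunk's minority instances for subsequent chunks.
        self.retained.extend(
            item for item in chunk if item[1] == self.minority_label
        )
        return augmented


rebalancer = RetainingRebalancer()
chunk_1 = [((0.0,), 0)] * 9 + [((1.0,), 1)]
chunk_2 = [((0.0,), 0)] * 9 + [((2.0,), 1)]
out_1 = rebalancer.rebalance(chunk_1)  # nothing retained yet
out_2 = rebalancer.rebalance(chunk_2)  # includes the minority instance from chunk_1
```

In this toy run the second chunk goes from an IR of 9:1 to 9:2 because one retained minority instance is added to it.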

Although the incremental re-balancing strategy in the PWIDB framework adapts to varying IR distributions and provides stable classification results, it was realized that in data streams the concept of the class distribution also drifts frequently, in addition to the IR. To address both concept drift and class imbalance, a practical solution is to use adaptive window sizes, which dynamically adjust streaming data windows of varying IR to stay within the required IR cut-off threshold. On top of the incremental re-balancing strategy of the PWIDB framework, this thesis adopted adaptive window sizes based on an IR threshold value to adjust the window sizes and capture minority class concept drift, thereby overcoming the IR and concept drift challenges. The thesis proposes an Adaptive Window Re-balancing with Retain Knowledge (AWBReK) framework for binary class imbalanced data streams. The AWBReK framework adopts an adaptive window size and enhances knowledge retention capability by re-balancing each data stream chunk. By adjusting window sizes dynamically based on the IR cut-off threshold, the AWBReK framework lowers the datasets' imbalance. The AWBReK framework also adjusts window sizes dynamically to accept suitable instances from the data stream when comparing prediction results of current instances with those of previous and subsequent instances in the stream. Furthermore, in the AWBReK framework, the retained re-balanced data from the current stream can be re-used in the subsequent stream to enhance imbalanced learning and solve binary classification problems in imbalanced data streams. The results in chapter 5 indicated that the proposed AWBReK framework can provide consistent and stable classification accuracies with highly imbalanced binary class data streams, and demonstrated that concept drift can also be captured appropriately.
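
One way to picture window sizing driven by an IR cut-off is the minimal Python sketch below. It is an assumption-laden illustration rather than the AWBReK algorithm: a window keeps growing past a base size until its IR falls within the cut-off or a maximum size is reached. The function name, all parameter names and their default values are hypothetical.

```python
def adaptive_windows(stream, base_size=100, ir_cutoff=50.0,
                     max_size=1000, minority_label=1):
    """Yield windows of (features, label) pairs whose size adapts to the IR.

    A window is emitted once it holds at least base_size instances and its
    IR (majority / minority) is within ir_cutoff, or once it reaches max_size.
    """
    window = []
    minority = 0
    for item in stream:
        window.append(item)
        if item[1] == minority_label:
            minority += 1
        if len(window) < base_size:
            continue
        majority = len(window) - minority
        within_cutoff = minority > 0 and majority / minority <= ir_cutoff
        if within_cutoff or len(window) >= max_size:
            yield window
            window, minority = [], 0
    if window:
        # Emit whatever remains when the stream ends.
        yield window


# Hypothetical stream: 300 instances with a single minority instance at index 150.
stream = [((i,), 1 if i == 150 else 0) for i in range(300)]
sizes = [len(w) for w in adaptive_windows(stream, ir_cutoff=200.0)]
```

In this example the first window grows beyond the base size of 100 until the minority instance arrives, so the windows have 151 and 149 instances.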

Lastly, the investigation of the above problems of classifying highly imbalanced data streams was extended to multi-class imbalanced data streams. The thesis proposed the One-Vs-All Adaptive Window Re-balancing with Retain Knowledge (OVA-AWBReK) framework, a dynamic framework that adopts an adaptive window size and enhances knowledge retention capability by re-balancing each data stream chunk. By adjusting window sizes dynamically based on the IR cut-off threshold, the OVA-AWBReK framework lowers the datasets' imbalance. The OVA-AWBReK framework also adjusts window sizes dynamically to accept suitable instances from the data stream when comparing prediction results of current instances with those of previous and subsequent instances in the stream. Moreover, in the OVA-AWBReK framework, the retained re-balanced data from the current stream can be re-used in the subsequent stream to enhance imbalanced learning and solve multi-class imbalanced data stream problems. In addition, the OVA-AWBReK framework uses the OVA binarization technique to decompose multi-class data stream problems into binary classification problems. The results in chapter 6 indicated that the adaptive measures in the proposed framework can provide consistent and stable classification accuracies for multi-class imbalanced data streams. The resulting OVA-AWBReK framework can deal with class imbalance and concept drift problems in multi-class imbalanced data streams.
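
The OVA binarization step mentioned above can be sketched generically in Python. This shows only the standard one-vs-all label decomposition, not the OVA-AWBReK framework; the function name and the example labels are hypothetical.

```python
def one_vs_all(labels):
    """Decompose multi-class labels into one binary label vector per class.

    For each class c, the instances of c are labelled 1 and all others 0.
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical three-class example.
binary_problems = one_vs_all(["a", "b", "c", "a"])
```

Each resulting binary problem can then be handled with a binary imbalanced-stream method, and each is typically more imbalanced than the original multi-class problem because all other classes are merged into one.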

Item Type: Thesis (PhD)
Murdoch Affiliation(s): Information Technology, Mathematics and Statistics
Supervisor(s): Wong, Kevin, Shiratuddin, Fairuz and Wang, Alex
URI: http://researchrepository.murdoch.edu.au/id/eprint/61260