Performance of Random Oversampling, Random Undersampling, and SMOTE-NC Methods in Handling Imbalanced Class in Classification Models

Abstract: One common challenge in classification modeling is the presence of imbalanced classes in the data. If the analysis proceeds with imbalanced classes, the resulting model is likely to perform poorly when predicting new data. Several approaches exist to address this class imbalance, including random oversampling, random undersampling, and the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC). Each of these methods uses a distinct technique to achieve a balanced class distribution in the dataset. A comparison of classification performance on imbalanced classes handled by these three methods has not been carried out in previous research. Therefore, this study evaluates classification models (specifically Gradient Boosting, Random Forest, and Extremely Randomized Trees) on imbalanced class data. The results show that the random undersampling method used to balance the class distribution gives the best performance on two of the classification models (Random Forest and Gradient Boosting).


Introduction
Machine learning methods are divided into two categories, namely supervised learning and unsupervised learning. Supervised learning involves building a statistical model to predict or estimate an output based on one or more inputs, while unsupervised learning aims to understand the relationships and structure of the data [1]. Classification modeling belongs to supervised learning, where an algorithm uses one or more inputs to build a model that is then used to predict an output. The classification methods used in this research are Gradient Boosting (GB), Random Forest (RF), and Extremely Randomized Trees (Extra Trees). Gradient Boosting is a supervised learning technique based on decision trees. The GB algorithm builds classification trees sequentially by minimizing a loss function [2]. Previous research has applied GB to health data; the results show that GB performs better and is easier to interpret than neural networks and linear models [3].
In contrast to the GB algorithm, which builds classification trees sequentially, tree construction in RF and Extra Trees is done individually: the construction of each tree is unrelated to the trees built before it. Both of these methods use majority voting to determine the prediction. All three methods select the best splitting criterion from a random subset of explanatory variables, so that the resulting classification trees are uncorrelated with each other. In addition to random variable selection, Extra Trees also selects cut points at random when determining the best split, which makes its computation faster [4].
One of the problems often encountered in classification modeling is imbalanced data. Imbalanced data is data with an unbalanced distribution of the response variable's classes: the number of observations in one class is smaller or larger than in the other classes [5]. Class imbalance that is not addressed can degrade the performance of the model [6]. The data balancing methods used in this research are Random Oversampling, Random Undersampling, and the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC).
Random Oversampling randomly replicates minority-class samples to balance the class distribution [7], while Random Undersampling balances the distribution of each class by randomly removing majority-class samples [6]. SMOTE-NC is an oversampling technique that uses K-nearest-neighbor characteristics of the explanatory variables to generate synthetic data in the minority class [8]. Previous research has used the oversampling approach for classification: the SMOTE-NC method was applied to balance the class distribution in data on heart failure patients. The results of that study show that, after class balancing with SMOTE-NC, the F1 score of the classification model improved from 69.39% to 81.90% [9].
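The two random resampling schemes can be illustrated with a minimal, self-contained sketch. This is pure Python on made-up toy data, not the implementation used in the study; production work would typically use a library such as imbalanced-learn:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Balance classes by randomly duplicating minority-class rows."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):        # replicate until this class reaches `target`
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(cls)
    return X_out, y_out

def random_undersample(X, y, seed=42):
    """Balance classes by randomly dropping majority-class rows."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for cls in counts:
        idx = [i for i, label in enumerate(y) if label == cls]
        keep = rng.sample(idx, target)     # keep only `target` rows per class
        for i in keep:
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out

# Toy imbalanced data: 6 "No" rows vs 2 "Yes" rows
X = [[i] for i in range(8)]
y = ["No"] * 6 + ["Yes"] * 2
Xo, yo = random_oversample(X, y)
Xu, yu = random_undersample(X, y)
print(Counter(yo))  # both classes now have 6 rows
print(Counter(yu))  # both classes now have 2 rows
```

Oversampling grows the data to twice the majority count in the worst case, while undersampling discards information from the majority class; this trade-off is exactly what the comparison in this study measures.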
Based on previous research, classification modeling with balanced classes can improve model performance. Therefore, this research compares the Random Oversampling, Random Undersampling, and SMOTE-NC methods on data with imbalanced classes. The data used are Telco customer churn data. The performance of the classification models (Gradient Boosting (GB), Random Forest (RF), and Extremely Randomized Trees (Extra Trees)) on the churn data with balanced class distribution is compared with their performance on the original data using accuracy, sensitivity, and specificity.

Research Method

Data
The data used in this research are Telco customer churn data downloaded from Kaggle. There are 7043 observations in the data, consisting of 19 variables: 18 explanatory variables and one response variable. The response variable is Churn, with two categories: customers who stop using Telco services and customers who continue using them. A description of the explanatory and response variables in this study is shown in Table 1.

Random Forest
The ensemble method is a learning approach that combines the predictions of several individual models to obtain better performance (accuracy) [10]. Random Forest is an ensemble method developed by Leo Breiman in 2001 [11]. The individual model used in RF is a classification or regression tree. The method is a development of Bagging that aims to build trees that are more distinct and uncorrelated with each other [12]. The random selection of explanatory variables in RF reduces the correlation between the trees, thereby increasing predictive ability and efficiency. Among the advantages of RF are that it can overcome overfitting, is not sensitive to outliers, and can produce good accuracy [13]. The classification stages using Random Forest are as follows [14].
1. Perform bootstrapping on the training data.
2. Build a classification tree using the bootstrapped data.
3. Choose the best split at node t using m randomly selected explanatory variables, where m = √p for classification (or m = p/3 for regression) and p is the total number of explanatory variables in the data. The split selection process is repeated until the stopping criterion is reached.
4. Determine the prediction of the classification tree.
5. Repeat steps 1-4 until b classification trees are obtained.
6. Determine the RF prediction by combining the predictions of the individual trees using majority vote.
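The stages above can be sketched in miniature. In this hedged illustration the "tree" grown in step 2 is reduced to a single-split decision stump, the data are made up, and binary labels "0"/"1" are assumed; a real RF grows full trees:

```python
import math
import random
from collections import Counter

def fit_stump(X, y, feat_idx):
    """Pick the best (feature, threshold, labels) among the given features by accuracy."""
    best = None
    for j in feat_idx:
        for t in sorted({row[j] for row in X}):
            for left, right in (("0", "1"), ("1", "0")):  # binary labels "0"/"1" assumed
                pred = [left if row[j] <= t else right for row in X]
                acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, j, t, left, right)
    return best[1:]  # (feature, threshold, label_left, label_right)

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    m = max(1, round(math.sqrt(p)))          # step 3: m = sqrt(p) variables per split
    forest = []
    for _ in range(n_trees):                 # step 5: repeat for b trees
        idx = [rng.randrange(len(X)) for _ in X]     # step 1: bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(p), m)              # random variable subset
        forest.append(fit_stump(Xb, yb, feats))      # step 2: grow a (stump) tree
    return forest

def predict(forest, row):
    votes = [l if row[j] <= t else r for j, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]       # step 6: majority vote

# Made-up, separable toy data
X = [[0.1, 1], [0.2, 2], [0.3, 1], [0.7, 8], [0.8, 9], [0.9, 7]]
y = ["0", "0", "0", "1", "1", "1"]
forest = fit_forest(X, y)
print([predict(forest, r) for r in X])
```

The per-tree randomness (bootstrap rows plus a random variable subset) is what decorrelates the trees, so the majority vote averages away individual-tree errors.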

Gradient Boosting Machine
Gradient Boosting Machine (GBM) is a supervised learning method based on decision trees that can be used for classification modeling [15]. The method was first introduced by Jerome H. Friedman in 2001. The learning procedure in GBM works sequentially to provide increasingly accurate predictions of the response variable [2]. The learning procedure for GBM is as follows [16].
Input:
1. Data consisting of independent variables (X) and a response variable (Y) with N observations.
2. The number of iterations M.
At each of the M iterations the model (a) computes the pseudo-residuals from the gradient of the loss function, (b) fits a classification tree to these residuals, and (c) adds the fitted tree to the ensemble with a shrinkage step. Steps a to c are repeated until the stopping criteria are reached, refining the prediction sequentially.
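The sequential (a)-(c) loop can be illustrated with a minimal sketch. For clarity it implements gradient boosting for a squared-error regression problem with one-dimensional stumps on made-up data; a classification GBM would replace the residual computation with the gradient of a classification loss (e.g. deviance):

```python
def fit_stump(x, r):
    """Best 1-D regression stump (threshold + two leaf means) for residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        pred = [lmean if xi <= t else rmean for xi in x]
        sse = sum((ri - pi) ** 2 for ri, pi in zip(r, pred))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1], best[2], best[3]

def gradient_boost(x, y, M=50, lr=0.1):
    """Return the in-sample fit after M boosting iterations."""
    F = [sum(y) / len(y)] * len(x)               # initialize with the mean response
    for _ in range(M):
        r = [yi - Fi for yi, Fi in zip(y, F)]    # (a) pseudo-residuals (squared loss)
        t, lmean, rmean = fit_stump(x, r)        # (b) fit a small tree to residuals
        F = [Fi + lr * (lmean if xi <= t else rmean)  # (c) shrunken additive update
             for Fi, xi in zip(F, x)]
    return F

# Made-up data with a step around x = 3
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.2]
F = gradient_boost(x, y)
mse = sum((yi - Fi) ** 2 for yi, Fi in zip(y, F)) / len(y)
print(mse)  # training error shrinks as trees are added
```

Unlike RF and Extra Trees, which average independent trees by majority vote, each tree here corrects the errors of the ensemble built so far, which is the sequential behavior described above.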

Measures of Model Performance
Measures of model performance in classification are used to assess the accuracy of a model in predicting the classes in the data. In classification modeling, model performance is calculated from a confusion matrix, a matrix that contrasts predicted and actual classifications. Table 2 shows the confusion matrix for two-class classification [17]. From the confusion matrix, performance measures such as accuracy, sensitivity, and specificity can be calculated. The definitions and formulas of these measures are as follows [18].
a. Accuracy. Accuracy is the proportion of observations that are predicted correctly. Accuracy can be calculated using equation 5.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)
b. Sensitivity. Sensitivity measures the performance of the classification algorithm in classifying data in the positive class. Sensitivity can be calculated using equation 6.
Sensitivity = TP / (TP + FN) (6)
c. Specificity. Specificity measures the performance of the classification algorithm in classifying data in the negative class. Specificity can be calculated using equation 7.
Specificity = TN / (TN + FP) (7)
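Using the TP, TN, FP, and FN counts from Table 2, equations 5-7 translate directly into code; the confusion-matrix counts in the example below are hypothetical:

```python
def accuracy(tp, tn, fp, fn):
    # Equation 5: proportion of all observations predicted correctly
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    # Equation 6: proportion of actual positives predicted positive
    return tp / (tp + fn)

def specificity(tn, fp):
    # Equation 7: proportion of actual negatives predicted negative
    return tn / (tn + fp)

# Hypothetical confusion matrix: TP=40, TN=50, FP=5, FN=10
print(accuracy(40, 50, 5, 10))   # 90 / 105 ≈ 0.857
print(sensitivity(40, 10))       # 40 / 50 = 0.8
print(specificity(50, 5))        # 50 / 55 ≈ 0.909
```

Reporting sensitivity and specificity alongside accuracy matters here: on imbalanced data a model can reach high accuracy by always predicting the majority class, which the two class-specific measures expose.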

Data Analysis Stages
Data analysis in this research was carried out using R software. The stages of the analysis were as follows.
1. Exploring the Telco customer churn data.
2. Dividing the data into training data and test data with proportions of 85% and 15%. The training data are used to build the model, while the test data are used to evaluate model performance.
3. Handling the class imbalance in the Telco customer churn data using the Random Oversampling, Random Undersampling, and SMOTE-NC techniques.
4. Carrying out the training process using both the training data that has not been balanced and the training data that has been balanced, with the GB, RF, and Extra Trees methods.
5. Evaluating the best model from each method obtained in stage 4 using accuracy, sensitivity, and specificity.
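The 85%/15% train-test split used in this study can be sketched as follows. This is a hedged pure-Python illustration (the study itself used R), and the class counts in the toy label vector are hypothetical:

```python
import random

def train_test_split(X, y, test_frac=0.15, seed=42):
    """Shuffle indices and carve off `test_frac` of the rows as the test set."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = round(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])

# With 7043 observations, an 85/15 split gives 5987 training and 1056 test rows.
X = [[i] for i in range(7043)]
y = ["No"] * 5000 + ["Yes"] * 2043   # hypothetical label counts, total 7043
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
print(len(X_tr), len(X_te))  # 5987 1056
```

Note that balancing (stage 3) is applied only to the training data; the test set keeps its original class distribution so that the evaluation reflects the data the model will face in practice.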

Data Exploration
The Telco Customer Churn data have 18 explanatory variables and one response variable with two categories, namely Yes (the customer stops using the service) and No (the customer does not stop using the service). Figure 1 provides information on the percentages of discrete (categorical) and continuous variables in the data. Figure 1 also shows that there are no missing observations in the data, so missing data did not need to be handled in this study.

Figure 1: Information related to Variables in Telco Customer Churn Data

Extremely Randomized Trees

Extra Trees was developed by Pierre Geurts, Damien Ernst, and Louis Wehenkel in 2006 [4]. As its full name, Extremely Randomized Trees, suggests, the method applies extreme randomization: randomization in Extra Trees is carried out not only when selecting explanatory variables but also when selecting cut points. In addition, Extra Trees does not use bootstrapped data to build each classification tree; each tree is built from the entire training data. Extra Trees also does not perform pruning when building a classification tree. The algorithm of Extra Trees is as follows [4].
1. Build a classification tree using all of the training data.
2. Choose the best split using randomly selected explanatory variables and randomly selected cut points. The split selection process is repeated until the stopping criterion is reached, so that the prediction of one classification tree is obtained.
3. Repeat steps 1-2 until b classification trees are formed.
4. Determine the Extra Trees prediction by combining the predictions of the individual trees using majority vote.
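The key difference from RF's split search, drawing the cut point at random instead of scoring every candidate threshold, can be shown in a few lines (the feature values are made up):

```python
import random

def random_cut_point(values, rng):
    """Extra-Trees-style split: draw the threshold uniformly at random
    between the feature's min and max, instead of searching all candidates."""
    lo, hi = min(values), max(values)
    return rng.uniform(lo, hi)

def exhaustive_cut_points(values):
    """RF-style split search: every distinct value is a candidate threshold."""
    return sorted(set(values))

rng = random.Random(0)
feature = [3.1, 0.5, 2.2, 4.8, 1.9]
t = random_cut_point(feature, rng)
print(t)                               # one random threshold in [0.5, 4.8]
print(exhaustive_cut_points(feature))  # all 5 candidates RF would score
```

Because only one random threshold per candidate variable is evaluated rather than every distinct value, each split costs O(1) threshold evaluations instead of O(n), which is why Extra Trees is faster to train, as noted above.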

Table 2: Confusion Matrix