Churn Analysis with Feature Selection: Genetic Algorithm

Andriyan Saputra
7 min read · Mar 20, 2022


Financial Data Challenge 2022

Organized by: PT Sharing Vision Indonesia in collaboration with PT Bank Rakyat Indonesia, Tbk

Goal: Build a classification model that can distinguish churned customers from non-churned ones.

The modelling workflow should include at least:

  1. Exploratory Data Analysis (EDA)
  2. Feature Engineering
  3. Modelling
  4. Model Evaluation

Data

The data set consists of:

  1. fin_data_challenge_train.csv: 100,000 rows with 126 columns (125 feature columns and 1 target column ‘y’)
  2. fin_data_challenge_test.csv: 25,000 rows with 125 columns (without the target column ‘y’)

Column Description

Customer data: x0–x124 (125 columns): customer data that have already been normalized, with column names anonymized for confidentiality.

Dependent variable:

y: did the customer churn? (1: yes, 0: no)

Note: there are some missing values in the data set.

Environment and Library

Jupyter-notebook: 6.0.3

Python 3.7.6

Libraries:

  1. Matplotlib
  2. Numpy
  3. Pandas
  4. Seaborn
  5. Sklearn (train_test_split, Pipeline, make_classification, RandomForestClassifier, RandomizedSearchCV)
  6. imblearn (random oversampling)
  7. Jcopml (pipeline, utils, plot, feature_importance)

In this project, we use the Jcopml library to reduce processing time through its pipeline mechanism for trial-and-error experiments. It lets us plug different constraints and settings into the modelling workflow more efficiently.
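Since jcopml mainly wraps scikit-learn pipelines, a minimal sketch of the equivalent setup in plain scikit-learn looks roughly like the following (the column lists here are placeholders, not the actual competition columns):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column lists, for illustration only
numerical_columns = ["x0", "x1", "x2"]
categorical_columns = ["x12", "x21", "x79"]

# Numerical and categorical sub-pipelines, similar in spirit to
# jcopml's num_pipe / cat_pipe helpers
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", num_pipe, numerical_columns),
    ("cat", cat_pipe, categorical_columns),
])

model = Pipeline([
    ("prep", preprocessor),
    ("clf", RandomForestClassifier(random_state=42)),
])
```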

EDA (Exploratory Data Analysis)

Load the raw data: fin_data_challenge_train.csv

Customer data x0–x124 (125 columns): customer data that have been normalized and anonymized for confidentiality.

Dependent variable (y): did the customer churn or not? (1: yes, 0: no)

Note: There are several missing values in the data

Based on the basic information above, the data consists of 121 numerical columns and 5 categorical columns.
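As a minimal sketch, assuming the training file sits in the working directory and the columns are named x0–x124 plus y:

```python
import pandas as pd

df = pd.read_csv("fin_data_challenge_train.csv")
print(df.shape)  # (100000, 126)

# Dtype breakdown: numerical columns vs categorical (object) columns
print(df.dtypes.value_counts())
categorical_columns = df.select_dtypes(include="object").columns.tolist()
numerical_columns = df.select_dtypes(exclude="object").columns.drop("y").tolist()
```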

Looking more closely, two columns actually contain years: x80 and x93. (At this point we have no information about what these years refer to; they could be the customer’s year of birth or year of registration.) Based on that, the writer decided to treat these year columns as categorical data.

Now we have the unique values of every categorical column. The values in x21 and x12 are month names, while x80 and x93 are years. x79 holds gender information. One particularly interesting column is x89, which contains spatial information (province names of Indonesia); it could indicate the customer’s home region, but it has so many distinct values that it could burden the modelling time. Finally, x108 contains the customer’s educational background.
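Continuing the sketch above, recasting the year columns and checking the cardinality of each categorical column could look like this:

```python
# Treat the two year columns as categorical instead of numerical
for col in ["x80", "x93"]:
    df[col] = df[col].astype(str)
    numerical_columns.remove(col)
    categorical_columns.append(col)

# Inspect how many distinct values each categorical column holds;
# x89 (province names) stands out with a high cardinality
for col in categorical_columns:
    print(col, df[col].nunique(), df[col].unique()[:5])
```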

One more important point in this table: several empty values still exist in the rows. We will deal with them later.

According to the chart above, the dependent variable (y) has a very uneven class ratio. This shows an imbalanced-data condition.
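A quick way to see this imbalance, as a sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Class proportions of the target
print(df["y"].value_counts(normalize=True))

sns.countplot(x="y", data=df)
plt.title("Class distribution of the dependent variable (y)")
plt.show()
```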

Looking at the correlation values, every column has a low correlation with the dependent variable (y); overall, the correlation values are below 0.5.
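For instance, the correlations can be checked like this (a sketch over the numerical columns):

```python
# Absolute correlation of every numerical feature with the target;
# in this data set all of them stay below 0.5
corr_with_y = df[numerical_columns + ["y"]].corr()["y"].drop("y").abs()
print(corr_with_y.sort_values(ascending=False).head(10))
```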

Next, we drop rows containing empty values (N/A). This reduces the data by around 16,000 rows.

Based on the correlation values and the categorical data, x89 will be dropped. The writer’s hypothesis is that x89 has no significant impact on the dependent variable and has too many category levels.
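Both steps in a short sketch:

```python
# Drop rows with missing values (about 100,000 -> 84,328 rows),
# then drop the high-cardinality x89 column
df = df.dropna()
df = df.drop(columns="x89")
categorical_columns.remove("x89")
```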

Feature Engineering

In this part, the writer uses the dummy-variable method to convert the categorical data into numerical form.
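With pandas this is a one-liner, sketched here with drop_first to avoid the dummy-variable trap:

```python
# One-hot encode the remaining categorical columns as dummy variables
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
```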

As noted above, the data is imbalanced: 83% of the customers are churn and only 17% are non-churn.

To handle the imbalanced data, several methods can be used: SMOTE, random oversampling, undersampling, or a combination of them.

This time, the writer chose random oversampling to bring the minority non-churn class up to the same level.

After random oversampling, we have 139,836 rows of data, compared with 84,328 rows before.
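A minimal sketch with imblearn’s RandomOverSampler:

```python
from imblearn.over_sampling import RandomOverSampler

X = df.drop(columns="y")
y = df["y"]

# Randomly duplicate minority-class rows until both classes are balanced
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled.shape)  # grows from 84,328 to about 139,836 rows
```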

Modelling

We started the modelling process by splitting the data into training and test sets with a 75:25 ratio.

Classification model: Random Forest Classifier

Data set ratio (Train : Test): 75:25

Model selection & tuning: RandomizedSearchCV

The processing time: 2 hours 27 minutes 42 seconds
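A minimal sketch of this setup; the hyperparameter search space below is hypothetical, since the original grid was not published:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.25, random_state=42)

# Hypothetical search space, for illustration only
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 50],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20, cv=3, scoring="roc_auc", n_jobs=-1, random_state=42)
search.fit(X_train, y_train)

print(search.best_score_)              # best cross-validation score
print(search.score(X_train, y_train))  # training score
print(search.score(X_test, y_test))    # test score
```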

Based on the results, the model scores are very high: 1.00 on the training set, 0.986 on the test set, and 0.964 for the best cross-validation score.

Model Evaluation

Model evaluation consists of a classification report and a confusion matrix. The extremely high values indicate that the model is overfitting.
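Sketched with scikit-learn’s standard metrics:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```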

Parameter Tuning

There are several options for parameter tuning. They basically fall into two categories:

  1. Data tuning: we decide which columns of data to use for modelling. Data judged to contribute little to the model should be dropped, for instance by removing columns with low correlation values, feature selection, outlier handling, more feature engineering, a different imbalanced-data strategy, or bringing in relevant external data.
  2. Model parameter tuning: modifying the parameter values used in modelling, for instance using grid search instead of random search, a stratified split for the training and test sets, or different hyperparameters for the classification model.

This time, the writer uses a feature-selection method to tune the model: GeneticSelectionCV from the sklearn-genetic package, an extension built on top of scikit-learn.

Genetic algorithms mimic the process of natural selection to search for optimal values of a function. They handle both constrained and unconstrained optimization problems.

So, basically, we use GeneticSelectionCV to determine which variables remain in the data for modelling. Beforehand, we should decide how many variables we want to keep; the genetic algorithm then chooses the best combination of variables for the model.
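A minimal sketch with GeneticSelectionCV; the population and generation settings below are illustrative, not the values used in the original run:

```python
from genetic_selection import GeneticSelectionCV
from sklearn.ensemble import RandomForestClassifier

selector = GeneticSelectionCV(
    RandomForestClassifier(random_state=42),
    cv=3,
    scoring="roc_auc",
    max_features=50,     # keep at most 50 variables
    n_population=50,     # illustrative setting
    n_generations=40,    # illustrative setting
    n_jobs=-1,
)
selector.fit(X_train, y_train)

# Boolean mask of the selected variables
selected_columns = X_train.columns[selector.support_]
print(len(selected_columns), "variables selected")
```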

The processing time: 15 hours 28 minutes 14 seconds

The outcome of this feature selection is the combination of variables chosen for the optimal model; here, we get a combination of 50 variables. Note the long processing time of this step.

Modelling (after tuning)

After obtaining the combination of 50 variables, we repeat the modelling with only the selected variables.
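Continuing the sketch, the search is simply refitted on the selected columns:

```python
# Restrict both splits to the 50 selected variables and refit the search
X_train_sel = X_train[selected_columns]
X_test_sel = X_test[selected_columns]

search.fit(X_train_sel, y_train)
print(search.score(X_train_sel, y_train))
print(search.score(X_test_sel, y_test))
```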

The output shows almost the same results as before tuning. Most importantly, the processing time is reduced significantly, from 2 hours 27 minutes 42 seconds to 38 minutes 33 seconds. The difference comes from the number of variables used for modelling: at the start we used 124 variables, and after tuning we focus on only 50.

Conclusion

Based on the modelling activities above, the outputs can be summarized as follows:

Modelling Classification: Random Forest Classifier

ROC-AUC

Train dataset: 1.000 | Test dataset: 0.993

The processing time: 2 hours 27 minutes 42 seconds

Tuning: GeneticSelectionCV

ROC-AUC

Train dataset: 1.000 | Test dataset: 0.994

Total processing time: 38 minutes 33 seconds

GeneticSelectionCV from the sklearn-genetic package successfully produced a better alternative for modelling. It reduced the processing time significantly while keeping the model output near its optimum values. Going forward, we could apply this approach to larger data sets and process the model more efficiently.


Closing Remark

This publication is produced for educational and informational purposes only. If there are any mistakes in the data, judgement, or methodology used in this publication, please contact the writer using the contact information on my Profile. I would be glad to discuss and share more about the topic. Thank you.

Best Regards,

Andriyan Saputra
