A Comprehensive Overview of Data Mining Approaches for Image Classification

Andriyan Saputra
11 min read · Nov 20, 2023

In this project, a detailed comparative study of data mining techniques for multi-class image classification will be presented.

Image by Freepik.com

INTRODUCTION

Multiclass image classification is a common task in computer vision that aims to categorize the images in a data set into their respective categories or labels. The main objective of this project is to develop a classifier capable of accurately assigning unseen data points to one of ten classes.

THE APPROACH FOR PRE-PROCESSING DATASET

In this project, the dataset used is based on the CIFAR-10 dataset. It has been deliberately modified to include various realistic challenges, such as missing values, a diverse range of data scales, and outliers. The dataset consists of 2180 data points; the first 100 columns contain numerical features, while the subsequent 28 columns contain nominal features. The last column is the target variable, indicating the corresponding label for each data point. The dataset includes a total of 10 classes, denoted by the numbers 0 to 9 and representing the following categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck, respectively.

Based on the Exploratory Data Analysis (EDA), each column of the dataset (numeric and nominal features alike) contains a varying number of null values, roughly 200 per column.

The nominal feature columns (Cat_Col) are divided into two categories:

  1. The nominal features Cat_Col_100 to Cat_Col_109 take values from the set {0, 1, 2, 3, 4}.
  2. The nominal features Cat_Col_110 to Cat_Col_127 take binary values (0 or 1).

No outlier values are detected among these features.

The numeric feature columns (Num_Col) contain values between 0 and 1. Outliers are detected as values that contrast sharply with the mean, minimum, and maximum values of a column. The numeric features Num_Col_22, 27, 33, and 56 present an interesting condition: each contains values that are quite large compared to the other numerical values. Since removing these values would likely result in a loss of information, a normalization approach will be used to retain it.

Another technique applied in this section is data distribution analysis via histograms. The analysis shows that the measures of central tendency reveal asymmetric distributions that vary from column to column; in other words, the dataset does not follow a Gaussian distribution.

NORMALIZATION

Based on the EDA results, several numerical features (Num_Col_22, 27, 33, and 56) need to be normalized. Because the data distribution does not follow a Gaussian distribution, min-max normalization will be chosen [3]; a sketch follows.
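To make this step concrete, here is a minimal sketch using MinMaxScaler from Scikit-learn. The data frame and values below are toy stand-ins for the actual project data; min-max normalization rescales each feature to [0, 1] via x' = (x - min(x)) / (max(x) - min(x)).

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Toy stand-in for the real dataset: 9.30 mimics the unusually
    # large values noted in the EDA for Num_Col_22, 27, 33, and 56.
    df = pd.DataFrame({
        "Num_Col_22": [0.10, 0.52, 0.87, 9.30],
        "Num_Col_27": [0.33, 0.41, 0.05, 0.78],
    })

    # Rescale every column to the [0, 1] range without dropping any rows,
    # so the information in the large values is retained.
    scaler = MinMaxScaler()
    df[df.columns] = scaler.fit_transform(df)
    print(df)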

Histogram analysis shows that in general each column of data does not follow a Gaussian distribution.

OUTLIER DETECTION

The range of outliers varies across the numeric feature columns in the dataset. The outlier strategy needs to be chosen carefully, as an aggressive one can discard a lot of information. Here, outlier detection is performed on all numeric columns of the dataset using a fixed threshold: values above 0.00 and below 1.00 are accepted, and all values outside that range are converted to null values, as sketched below.

Outlier data can be detected by looking at the mean, min, and max values of each column
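A minimal sketch of this thresholding step, assuming the numeric columns are held in a pandas DataFrame (the values here are illustrative only): every value outside the accepted range becomes NaN and is left for the imputation step.

    import numpy as np
    import pandas as pd

    # Illustrative numeric columns; in the project this would cover all Num_Col_* columns.
    num = pd.DataFrame({
        "Num_Col_22": [0.10, 0.52, 9.30, -0.40],
        "Num_Col_56": [0.33, 1.80, 0.05, 0.78],
    })

    # Keep values strictly above 0.00 and below 1.00; convert the rest to null.
    num = num.where((num > 0.0) & (num < 1.0), other=np.nan)
    print(num)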

MISSING DATA IMPUTATION

After carrying out the outlier strategy, many null values will be spread across the rows and columns of the dataset. Removing every row that contains a null value would cause the loss of a large amount of information.

The missing data imputation strategy will be performed per class. For numeric features, missing data can be imputed with the mean value; since outlier removal has already been performed, the mean is no longer distorted. For nominal features, missing data can be imputed with the most common value [3]. The imputation will be done with SimpleImputer from Scikit-learn, as sketched below.
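A minimal sketch of class-specific imputation with SimpleImputer, on toy data. Grouping by the Label column is an assumption about how "class-specific" is implemented here.

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Toy frame: one numeric feature, one nominal feature, and the class label.
    df = pd.DataFrame({
        "Num_Col_0":   [0.2, np.nan, 0.6, 0.4],
        "Cat_Col_100": [1.0, 2.0, np.nan, 2.0],
        "Label":       [3.0, 3.0, 3.0, 3.0],
    })

    num_imputer = SimpleImputer(strategy="mean")           # numeric: mean value
    cat_imputer = SimpleImputer(strategy="most_frequent")  # nominal: most common value

    # Fit a separate imputer within each class so the fills are class-specific.
    for _, idx in df.groupby("Label").groups.items():
        df.loc[idx, ["Num_Col_0"]] = num_imputer.fit_transform(df.loc[idx, ["Num_Col_0"]])
        df.loc[idx, ["Cat_Col_100"]] = cat_imputer.fit_transform(df.loc[idx, ["Cat_Col_100"]])
    print(df)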

SPLIT DATASET

Before splitting the dataset, there are several factors that need to be considered.

Calculation of each class value distribution

Based on the analysis of the target column (Label), the output classes are fairly balanced, except for class 5.0 (dog), which accounts for only 3.7% of the data (a ratio of roughly 1:7 against the largest class). This means class 5.0 needs to be considered in the data split strategy and in the model's class weight parameter. K-fold sampling with shuffle and random state parameters will be chosen to keep each split representative of the entire target population. In this study, we will use the KFold algorithm from Scikit-learn to split the dataset.

This data splitting strategy provides training and testing indices for splitting the data into train and test sets: the dataset is split into k consecutive folds, and each fold is used once as a validation set while the remaining k-1 folds form the training set [2]. The number of folds k is 10 or more. The split will be carried out under a cross-validation scheme that produces training and validation datasets at a ratio of 75:25; a minimal sketch follows.
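A minimal sketch of the split, with synthetic placeholder data standing in for the project's feature matrix:

    import numpy as np
    from sklearn.model_selection import KFold

    # Synthetic stand-in: 20 samples, 5 features, 10 classes.
    rng = np.random.default_rng(0)
    X = rng.random((20, 5))
    y = rng.integers(0, 10, size=20)

    # 10 consecutive folds with shuffling; random_state keeps the split reproducible.
    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        # ... fit and evaluate a model on this fold ...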

THE PROPOSED DATA MINING APPROACH FOR COMPARATIVE STUDY

In this study, an experiment is carried out using four data mining techniques (Decision Tree, Random Forest, K-Nearest Neighbor, and Naïve Bayes) for multi-class image classification. In addition, this study implements an ensemble method that combines these techniques.

K-Nearest Neighbor

In this model, the KNN algorithm classifies each data point based on its distance to other points. Given the proposed pre-processing, the normalization and outlier strategies will handle unusual distances between data points and should contribute positively to the model results. The KNN algorithm offers several hyperparameter choices. According to [1], the Lk norm degrades faster with increasing dimensionality for higher values of k (here k is the order of the norm, not the number of neighbors), so on a high-dimensional dataset it may be preferable to use lower-order norms; this makes the L1 (Manhattan) distance metric the preferred choice for high-dimensional applications. A separate consideration is the number of neighbors. Hyperparameter tuning will be conducted over a range of candidate values, listed in GridSearchCV as a grid search across the specified parameter space, as sketched below.
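A minimal sketch of that grid search, with synthetic data standing in for the preprocessed training set and an illustrative parameter grid:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic stand-in for the preprocessed training data.
    X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                               n_informative=10, random_state=0)

    # Candidate metrics (Manhattan vs. Euclidean) and a range of neighbor counts.
    param_grid = {
        "metric": ["manhattan", "euclidean"],
        "n_neighbors": [3, 5, 7, 9, 11],
    }
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="f1_macro")
    search.fit(X, y)
    print(search.best_params_)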

Combining cross-validation with the grid search method yields the best hyperparameter selection:

{ metric: 'euclidean', n_neighbors: 11, CV: 10 }

Model evaluation:

Model evaluation for KNN model from each label class: Accuracy, F1-score, Confusion matrix

Naïve Bayes

The Multinomial Naive Bayes classifier from Scikit-learn will be selected as the classification model for this section. The multinomial Naive Bayes classifier is suitable for classification with discrete features [2]. A Laplacian correction will be applied through the Laplace smoothing parameter; a minimal sketch follows.
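A minimal sketch with toy count-like features, since MultinomialNB expects non-negative discrete inputs; alpha=1.0 is the Laplace smoothing parameter mentioned above.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Toy non-negative discrete features, as MultinomialNB expects.
    X = np.array([[0, 1, 2], [1, 0, 4], [3, 1, 0], [2, 2, 1]])
    y = np.array([0, 1, 0, 1])

    # alpha=1.0 applies the Laplacian correction (Laplace smoothing).
    nb = MultinomialNB(alpha=1.0)
    nb.fit(X, y)
    print(nb.predict(X))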

Model evaluation:

Model evaluation for Multinomial NB model from each label class: Accuracy, F1-score, Confusion matrix

Decision tree

Decision tree classification will be carried out using the DecisionTreeClassifier provided by Scikit-learn. The function used to measure the quality of a split is the Gini impurity. The default values for the parameters controlling the size of the trees lead to fully grown, unpruned trees, which can potentially be very large on some data sets [2]. Hyperparameter tuning with several experiments on tree depth and class weight will be performed and evaluated to avoid overfitting; a sketch of this tuning follows.
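A minimal sketch of that tuning, on synthetic placeholder data with an illustrative grid over depth and class weight:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                               n_informative=10, random_state=0)

    # Limiting depth and weighting classes counters the unpruned-tree defaults
    # and the imbalance in class 5.0.
    param_grid = {
        "criterion": ["gini", "entropy"],
        "max_depth": [5, 10, 15],
        "class_weight": [None, "balanced"],
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
    search.fit(X, y)
    print(search.best_params_)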

Combining cross-validation with the grid search method yields the best hyperparameter selection:

{ criterion: 'entropy', max_depth: 10, CV: 10 }

Model evaluation:

Model evaluation for Decision Tree model from each label class: Accuracy, F1-score, Confusion matrix

Random Forest

A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting [2]. In this model, the RandomForestClassifier algorithm from Scikit-learn will be chosen as the classifier.

As with the decision tree classifier, the random forest classifier's default parameter values lead to fully grown, unpruned trees that can become very large on some data sets. To control the complexity of the resulting model, the hyperparameters will be tuned.

We will conduct multiple experiments to fine-tune hyperparameters, including the number of trees in the forest, the split criterion, the maximum tree depth, and the class weight for each class, and subsequently evaluate their impact, as sketched below.
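A minimal sketch of the random forest tuning, again on synthetic stand-in data; the grid below is illustrative, and its size multiplied by 10-fold cross-validation is exactly what makes the search slow.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                               n_informative=10, random_state=0)

    param_grid = {
        "n_estimators": [10, 50, 100],
        "criterion": ["gini", "entropy"],
        "max_depth": [8, 12, 16],
        "max_features": ["sqrt", "log2"],
        "class_weight": [None, "balanced"],
    }
    # n_jobs=-1 parallelizes the fits, which helps with the long runtime noted below.
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          cv=10, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_)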

Combining cross-validation with the grid search method yields the best hyperparameter selection:

{ criterion: 'gini', max_depth: 12, max_features: 'sqrt', n_estimators: 10, CV: 10 }

Note: the grid search on the Random Forest model takes quite a long time to run; in this case, it took about 5 hours.

Model evaluation:

Model evaluation for Random Forest model from each label class: Accuracy, F1-score, Confusion matrix

Cross Validation

In this study, multiclass image classification using data mining techniques is studied and compared across the KNN, Naïve Bayes, Decision Tree, and Random Forest classifiers. All classifiers are processed with different combinations of shuffled K-fold dataset splits and hyperparameter selections in cross-validation.

Cross-validation will be performed with 10 or more re-shuffling and splitting iterations. At each iteration, the data splitting strategy produces a training and a validation dataset. Dataset preprocessing will be applied only to the training dataset, leaving the validation dataset untouched. For hyperparameter tuning, the candidate model parameters will be listed and evaluated by the GridSearchCV algorithm. After the entire process, an evaluation table will be provided for each combination; a minimal sketch of the scoring loop follows.
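A minimal sketch of the shuffled cross-validation loop, using cross_val_score to collect the per-fold scores whose mean and standard deviation feed the evaluation tables (synthetic data, illustrative model):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                               n_informative=10, random_state=0)

    # Re-shuffling K-fold scheme: each iteration yields a fresh train/validation split.
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=11), X, y,
                             cv=cv, scoring="f1_macro")
    print(f"F1 (macro): {scores.mean():.3f} +/- {scores.std():.3f}")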

Ensemble model

In this study, the VotingClassifier algorithm from the Scikit-learn library will be selected for the ensemble method. The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models to balance out their individual weaknesses [2].

In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier. The results of this approach will be used for comparison against the other models' evaluations.

(1) Random Forest & KNN

Based on the analysis of all previous model results, we decided to choose RF and KNN for the ensemble model. Each model shows quite promising results on its own, and the two are able to cover each other's shortcomings; a sketch follows.
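A minimal sketch of that ensemble, wiring the tuned RF and KNN settings reported above into a hard-voting VotingClassifier (synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                               n_informative=10, random_state=0)

    # Hard (majority) voting over the two best-performing base models.
    ensemble = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=10, max_depth=12,
                                          max_features="sqrt", random_state=0)),
            ("knn", KNeighborsClassifier(n_neighbors=11, metric="euclidean")),
        ],
        voting="hard",
    )
    ensemble.fit(X, y)
    print(ensemble.predict(X[:5]))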

Model evaluation:

Model evaluation for Random Forest and KNN model from each label class: Accuracy, F1-score, Confusion matrix

MODEL EVALUATION

Multiclass classifiers can be tailored to the one-class situation by fitting a boundary around the target data and deeming instances that fall outside it to be outliers. These methods rely heavily on a parameter that determines how much of the target data is likely to be classified as outliers. If it is chosen too conservatively, data in the target class will erroneously be rejected. If it is chosen too liberally, the model will overfit and reject too much legitimate data [5].

The model performance is evaluated using the following four measurements: 1) accuracy, 2) sensitivity, 3) specificity, and 4) F1 score. The confusion matrix is also considered for image classification; it is a cross-tabulation of predicted values against observed (i.e., true) values, and accuracy depends on the empirical rate of correct predictions. Because the target dataset is imbalanced (class 5.0), the most appropriate evaluation metric is probably the F1 score.

Model evaluation will be carried out both on the entire dataset and per class. Each model produced by the cross-validation scheme will be evaluated. The mean and standard deviation over the different combinations of GridSearch parameters and cross-validation folds are then used as the model evaluation criteria. The objective of this part is not only to find the highest model performance but the most stable performance across all classes; this determines the best model parameters for the dataset [4]. A per-class evaluation sketch follows.
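A minimal sketch of the per-class evaluation, using classification_report for per-class F1 and overall accuracy, and confusion_matrix for the cross-tabulation (synthetic data, illustrative model):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                               n_informative=10, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    y_pred = model.predict(X_val)

    # Per-class precision/recall/F1 plus overall accuracy, and the confusion matrix.
    print(classification_report(y_val, y_pred))
    print(confusion_matrix(y_val, y_pred))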

Model evaluation summary: Accuracy & F1-score from each class

Based on the per-class summary of the model implementations above, we can see that Random Forest performs better than the other classifiers. However, KNN and NB show slightly better performance on classes 1.0 and 8.0.

MODEL TESTING

The test data set will be provided in Week 9 and may require the same pre-processing strategy as the training data set. The best model classifier will be applied to the test data, bearing in mind the unknown condition of the test dataset. If the model's results differ greatly from those on the training dataset, the model is likely overfitting.

At this point, the model parameters will be adjusted based on the test data set information. Model training and evaluation will be repeated until the optimal generalized model is achieved, determined by comparing the training and testing errors.

After submission against the unknown test data set, the selected model achieved an F1 score of 0.7567.

POTENTIAL FOR FURTHER RESEARCH

  • Use stratified cross-validation in the data splitting strategy
  • Assign a higher class weight to label '5.0' during model training

REFERENCES

[1] Aggarwal, C., Hinneburg, A., and Keim, D., "On the Surprising Behavior of Distance Metrics in High Dimensional Space," in Van den Bussche, J., Vianu, V. (Eds.): ICDT 2001, LNCS 1973, pp. 420–434, 2001.

[2] Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research 12, pp. 2825–2830, 2011.

[3] Miao Xu, "Lecture 2: Introduction to Data Classification," pp. 49 & 58, 2023.

[4] Rao, S. A., Ramana, V. A., Ramakrishna, S., "Implementing the Data Mining Approaches to Classify the Images with Visual Words," International Journal of Recent Technology and Engineering (IJRTE), vol. 7, issue 6S2, 2019.

[5] Witten, I. H., Frank, E., Hall, M., "Data Mining: Practical Machine Learning Tools and Techniques," 3rd edition, Elsevier, p. 336, 2012.

Closing Remark

This publication is produced for educational and informational purposes only. If there are any mistakes in the data, judgement, or methodology used to produce it, please contact the writer using the contact information on my Profile. I would be glad to discuss and share more about the topic. Thank you.

Best Regards,

Andriyan Saputra
