FEATURE-IMPORTANCE ANALYSIS OF SOCIO-DEMOGRAPHIC DATA IN THE HOSPITALITY SECTOR

Andriyan Saputra
6 min read · Nov 30, 2021
Tableau Dashboard Output: Feature Importance Analysis

BACKGROUND

The hotel sector is one of the sectors most affected by the Covid-19 pandemic. Movement restriction policies imposed throughout the province of West Java have broken the economic chain of hotel sector activities. The hotel sector, along with the tourism sector, is one of the main contributors to local revenue. Local revenue (PAD) from the hotel sector declined from 2018 to 2020. In addition, the differing conditions and characteristics of each district/city in West Java province make it challenging to set policies and make the right decisions. To improve understanding of the condition of the hospitality sector in West Java, the author models the sector using socio-demographic data.

OBJECTIVES

  1. Build a model of the hospitality sector based on socio-demographic data sources for the province of West Java
  2. Determine the supporting variables that contribute positively and negatively to the development of the hospitality sector in West Java Province

SCOPE OF PROBLEMS

  1. The data are general socio-demographic information for the province of West Java
  2. Data are available spatially at the district/city level
  3. Data are available annually for the period 2014–2020
  4. There are many information gaps caused by incomplete reports
  5. Target information (Dependent Variable) is limited to the amount of Regional Original Income (PAD) from the hospitality sector
  6. Modeling information (Independent Variables) uses 104 types of supporting features

RESOURCE DATA & INFORMATION

The data come from various sources, both those provided by the 2021 National Data Science Tournament committee and other external sources, with details as shown in the table below. The selection of data sources follows the problem boundaries.

A total of 105 features are used in the modeling. The Total Regional Original Income of the Hospitality Sector (GDP2) is the Dependent Variable, and the other 104 features are the Independent Variables.

WORKFLOW & ENVIRONMENT

Application environment used:

• Python 3.7.6
• Tableau Desktop
• IDE: Jupyter Notebook

Libraries:

• NumPy
• Pandas
• scikit-learn (sklearn)
• XGBoost
• Matplotlib
• Seaborn
• Jupyter Nbextensions
• J.COp Snippets
• Shapley Additive Explanations (SHAP)

EDA (Exploratory Data Analysis)

To address the gaps in the data, we fill blank values with the value from the previous or following year. The blank values are caused by incomplete reporting.
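The gap-filling step described above can be sketched with pandas. This is a minimal illustration, not the author's notebook code: the column names (`region`, `year`, `visitors`) and the toy values are hypothetical, and filling from the previous year first and only then from the following year is an assumption, since the article does not state the order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: one row per district/city and year.
df = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020],
    "visitors": [100.0, np.nan, 120.0, np.nan, 50.0, np.nan],
})

# Sort so that "previous" and "following" refer to adjacent years within a region.
df = df.sort_values(["region", "year"]).reset_index(drop=True)

def fill_from_neighbouring_years(series):
    # Assumption: take the previous year's value first (forward fill),
    # then fall back to the following year's value (backward fill).
    return series.ffill().bfill()

# Fill within each region so values never leak across districts/cities.
df["visitors_filled"] = (
    df.groupby("region")["visitors"].transform(fill_from_neighbouring_years)
)

print(df)
```

Grouping by region before filling matters here: without it, a missing first-year value in one region would be filled from the last year of the region listed before it.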

Train/Test Model

The data are split into a Training data set and a Test data set at a ratio of 80:20 with a set random state. The modeling uses regression models wrapped in a pipeline to facilitate the trial-and-error process.

Regression models:

  1. XGBoost
  2. Random Forest
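The 80:20 split and the pipeline idea described above might look like the following sketch. It uses synthetic data from `make_regression` rather than the article's dataset, and the pipeline steps (median imputation followed by a Random Forest) and all parameter values are assumptions chosen for illustration. XGBoost's `XGBRegressor` follows the same scikit-learn interface and could be placed in the `model` step instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the socio-demographic features and the PAD target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

# 80:20 split; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline bundles preprocessing and the regressor, so swapping a step
# during trial and error does not require changing the rest of the code.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)

train_r2 = pipeline.score(X_train, y_train)  # R^2 on the training data
test_r2 = pipeline.score(X_test, y_test)     # R^2 on the held-out data
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```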

The regression modeling with the machine learning algorithms XGBoost and Random Forest is first carried out with all features in the dataset. Fitting the model on the Train data set and evaluating it on the Test data set produces the initial output that serves as the basis for further analysis. Feature importance is then determined through the Mean Score Decrease value. Once the 20 most important features are obtained, the modeling is repeated on the Train and Test data sets using only those 20 features. This procedure is repeated (trial and error) until it produces the optimal model with the best feature composition.

The best model uses 22 independent variable features. This general model is then used to interpret the results for each district/city in the province of West Java.
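One way to sketch this select-and-refit loop is shown below. It assumes that "Mean Score Decrease" refers to permutation importance (the drop in model score when one feature's values are shuffled), which the article does not confirm. The synthetic data, the Random Forest settings, and `top_k = 10` are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30 features, only a few of which carry signal.
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 1: fit on all features.
full_model = RandomForestRegressor(n_estimators=100, random_state=42)
full_model.fit(X_train, y_train)

# Step 2: permutation importance = mean decrease in score when a feature is shuffled.
result = permutation_importance(
    full_model, X_train, y_train, n_repeats=5, random_state=42
)

# Step 3: keep the top_k features with the largest mean score decrease.
top_k = 10
top_idx = np.argsort(result.importances_mean)[::-1][:top_k]

# Step 4: refit using only the selected features and compare test scores.
reduced_model = RandomForestRegressor(n_estimators=100, random_state=42)
reduced_model.fit(X_train[:, top_idx], y_train)

full_r2 = full_model.score(X_test, y_test)
reduced_r2 = reduced_model.score(X_test[:, top_idx], y_test)
print(f"all features R^2: {full_r2:.3f}, top {top_k} R^2: {reduced_r2:.3f}")
```

Repeating steps 2 to 4 with different values of `top_k` and comparing the test scores is one way to arrive at a final feature count.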

Interpretation Result

To interpret the results, we use SHAP values to see the contribution of the 22 selected features to the total Regional Original Income (PAD) from the hotel sector in each district/city of West Java province. Shapley Additive Explanations (SHAP) is a method introduced by Lundberg and Lee in 2017 for interpreting machine learning model predictions through Shapley values.

The figure above shows an example of how each feature moves the prediction from the base value (the average model output over the training data set) to the final model output. The JML_HT feature makes a very dominant positive contribution relative to the base value, while other features such as KDR_U3 and KDR_U5 contribute negatively, lowering the prediction below the base value.
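The relationship between the base value, the per-feature contributions, and the model output can be checked on a tiny example. The sketch below computes exact Shapley values by enumerating all feature coalitions for a hypothetical three-feature model. The model, the feature names, and the background rows are invented for illustration and are unrelated to the article's data. The SHAP library uses faster algorithms in practice, but the additive property is the same.

```python
from itertools import combinations
from math import factorial

def model(row):
    # Hypothetical model (not the article's): linear terms plus one interaction.
    hotels, restaurants, vehicles = row
    return 2.0 * hotels + 1.0 * restaurants - 0.5 * vehicles + 0.1 * hotels * restaurants

# Hypothetical background (training) rows and the instance to explain.
background = [(1.0, 2.0, 4.0), (3.0, 0.0, 2.0), (2.0, 4.0, 6.0)]
x = (4.0, 3.0, 2.0)
n = len(x)

def coalition_value(subset):
    # Average prediction when features in `subset` take the values from x
    # and the remaining features take the values of each background row.
    total = 0.0
    for b in background:
        mixed = tuple(x[i] if i in subset else b[i] for i in range(n))
        total += model(mixed)
    return total / len(background)

# Base value: average model output over the background data.
base_value = coalition_value(frozenset())

def shapley_value(i):
    # Weighted average of feature i's marginal contribution over all coalitions.
    others = [j for j in range(n) if j != i]
    phi_i = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            s = frozenset(subset)
            phi_i += weight * (coalition_value(s | {i}) - coalition_value(s))
    return phi_i

phi = [shapley_value(i) for i in range(n)]

print("base value:", base_value)
print("contributions:", phi)
print("base + contributions:", base_value + sum(phi))
print("model output:", model(x))
```

The last two printed numbers agree up to floating-point rounding: positive contributions push the prediction above the base value and negative ones pull it below, which is what the force-plot style figure visualizes.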

Reading the Dashboard Output

Using the information above, we analyze the contribution of the SHAP values to the PAD value in Kuningan Regency. Based on the distribution of SHAP values, local revenue from the hospitality sector is dominated by several factors:

The first factor is the number of restaurants; statistically, the area has quite good culinary potential. In the face of a pandemic, the hospitality sector has good potential to move into the culinary business. The culinary and restaurant sectors are the ones that have survived best during the pandemic, so to keep hotels operating, the hotel sector should also take part in the culinary business by taking advantage of channels that are already available, such as delivery systems, logistics, ordering, and drive-thru.

The second factor is the number of museum visitors. Statistically, this output shows that museums are among the tourist attractions that make a positive contribution in the region.

The third factor is the number of private vehicles of the sedan, jeep, and minibus types. This output suggests that meeting the demand for parking lots and road access could boost the hotel sector in the region.

On the opposite side, the JML_HT (Number of Star Hotels) feature makes a very dominant negative contribution. This can be interpreted to mean that the number of star hotels does not make a positive contribution to the hotel sector in this region.

To analyze this further, we can look at other factors that might explain it, such as inappropriate price ranges, poor reviews, and others. In addition, information on population mobility in each region is also provided from several external sources. This information shows that an increase in the intensity of population movement indicates a positive condition for the recovery of the hospitality sector from the pandemic. However, it is necessary to anticipate that a pandemic or similar conditions may occur again in the future.

CONCLUSION

The diversity of socio-demographic conditions across the districts/cities of West Java Province calls for region-specific policies and decisions. With the Feature-Importance Analysis Dashboard of Socio-demographic Data in the Province of West Java, the distribution of SHAP values from the resulting model shows which components are the most dominant and influential in each district/city. This assessment identifies the components that deserve attention in further analysis, policy setting, and decision-making aimed at increasing the potential of the hospitality sector in each district/city of West Java province.
