fast.ai Tabular Data — Classification with Entity Embedding

Andriyan Saputra
Sep 19, 2022 · 7 min read

This time we try a different approach to representing categorical variables in a data set. We usually use options such as one-hot encoding or dummy variables. Here we will compare those results with the entity embedding approach to representing categorical data.

According to Cheng Guo and Felix Berkhahn’s paper, Entity Embeddings of Categorical Variables, “The embedding obtained from trained neural networks improves the performance of all machine learning methods tested significantly when used as an input feature instead.” Basically, they found that trained embedding matrices could be used as inputs for other models to improve their performance compared to training on the raw inputs themselves. The results of their paper are shown below in Figure 1, where Accuracy refers to the comparison of the model’s predictions against the validation set and (with EE) refers to the entity embedding approach.

Figure 1. Neural network embeddings as input to other machine learning algorithms (Entity Embeddings of Categorical Variables).

At this point, I’ll try to replicate it with my own data set, analyzing churn data from the latest competition of the Financial Data Challenge 2022. The code for training the embedding matrices can be found on my GitHub.

Entity Embedding

An embedding is an approach to handling categorical variables. Compare it with one-hot encoding, where a column such as “Gender” containing string categories like “male” and “female” is converted into two columns named after the categories, with a ‘1’ if the row matches that category and a ‘0’ otherwise. The embedding approach, on the other hand, represents each category as a real-valued vector that encodes its meaning in such a way that similar categories end up closer to each other in the vector space. This definition makes it sound more complicated than it really is; for me, it helps to see what it actually looks like. You can picture an embedding matrix as having one row per category and as many columns as you like (more columns mean higher complexity and the potential to learn additional latent factors), where each row is the vector that represents a particular category. Figure 2 below shows what a column looks like when expressed in one-hot encoding format versus as an embedding.

Figure 2. Example of representing a categorical column using either a one-hot encoding or embedding approach. (Source: Christopher McBride)

The embedding matrix in this example would have a shape of 2 x N. For this project, we select N using the rule of thumb defined by the fastai package: N = min(600, round(1.6 * n_categories^0.56)).
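As a quick sketch of this rule (fastai exposes it as emb_sz_rule; the plain-Python version below simply restates the same formula):

def emb_sz_rule(n_cat):
    # fastai's rule of thumb: the embedding size grows slowly with the number of categories
    return min(600, round(1.6 * n_cat ** 0.56))

print(emb_sz_rule(2))      # a 2-level column such as Gender -> 2
print(emb_sz_rule(1000))   # a 1000-level column -> 77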

Embedding layers have been widely used, especially in word embedding. The most popular example is probably Word2vec, which is simply a two-layer network that exploits an Embedding layer to convert words into a numeric format that can be used as input for other networks. So while most online tutorials and papers mention Embedding layers in relation to text processing, this is not the only area where they can be used. Almost any categorical variable can be encoded with this powerful technique.

Differences between the One-Hot Encoding and Entity Embedding approaches for categorical variables

In this section I will cover what you need to know in practice about the differences between the two layers.

The two main advantages of Embedding over Dense layers are reduced input size and reduced computational complexity, which results in accelerated training times.

  1. Reduced input size

Because Embedding layers are most commonly used in text processing, let’s take a sentence as a concrete example:

‘I am who I am’

Let’s first of all integer-encode the input.
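A minimal sketch of this step (the index assignment is arbitrary and only for illustration):

vocab = {'I': 0, 'am': 1, 'who': 2}                 # 3 unique words in our toy corpus
sentence = ['I', 'am', 'who', 'I', 'am']
int_encoded = [vocab[w] for w in sentence]          # [0, 1, 2, 0, 1] -- length 5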

A Dense layer requires the input to be strictly numerical. Thus, we need to one-hot encode it:
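A minimal continuation of the sketch above (one row per word in the sentence, one column per unique word in the vocabulary):

import numpy as np

one_hot = np.eye(len(vocab))[int_encoded]
# array([[1., 0., 0.],        # 'I'
#        [0., 1., 0.],        # 'am'
#        [0., 0., 1.],        # 'who'
#        [1., 0., 0.],        # 'I'
#        [0., 1., 0.]])       # 'am'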

Notice that the 1st dimension of this matrix represents the sentence length, while the 2nd dimension represents the number of unique words in our corpus. The Embedding layer, instead, accepts the former input, the integer-encoded vector, which is much smaller. For this example, we imagined that the corpus is made up of only the 3 words that appear in our sentence: ‘I’, ‘am’, ‘who’. However, in real-life scenarios the corpus would be the whole English dictionary, made up of about 180,000 words! Imagine having to keep in memory a 180,000-long vector for every single instance in our data set, of which 99% would be zeros that contribute nothing to the final output!

  2. Reduced computational complexity

To get input to the Hidden Dense Layer, the network needs to perform a matrix multiplication between the one-hot-coded sparse matrix of the Input layer and the weight matrix. The Embedding layer, instead, considers the weight matrix as a mere lookup table, where the nth row represents the embedding vector of the nth integer encoding level of the categorical input variable. In other words, the Dense Layer performs the dot product operation, which is computationally more expensive than the selection operation performed by the Embedding layer. This makes the training process much faster.
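A small PyTorch sketch of this equivalence (the tensor names here are illustrative, not from the article): multiplying a one-hot row by the weight matrix selects exactly the same row that the embedding lookup returns.

import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=3, embedding_dim=4)     # 3 categories, 4-dim vectors
idx = torch.tensor([2])                                    # integer-encoded category

lookup = emb(idx)                                          # Embedding layer: a row selection
one_hot = F.one_hot(idx, num_classes=3).float()
dense_style = one_hot @ emb.weight                         # Dense-layer style: a dot product

assert torch.allclose(lookup, dense_style)                 # same result, cheaper lookup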

However, the Embedding layer lacks some of the components of a Dense layer, namely the bias term and the activation function.

Can you use an Embedding layer to encode numeric features? No, that doesn’t make sense. If the advantage of the Embedding layer is to skip the one-hot encoding step, then clearly we don’t need an Embedding layer when one-hot encoding isn’t needed in the first place. So if your data set includes a set of high-dimensional categorical variables alongside some numeric variables, you should use an Embedding layer to encode the former and a Dense layer for the latter. Their outputs can then be combined and passed on to the next layer.
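A minimal sketch of that pattern (one hypothetical categorical column with 5 levels plus two numeric columns; the names are made up for illustration):

import torch
import torch.nn as nn

cat_emb = nn.Embedding(5, 3)                   # categorical column -> 3-dim embedding
dense = nn.Linear(3 + 2, 16)                   # next layer sees embeddings + numerics

x_cat = torch.tensor([[1], [4]])               # batch of 2 rows, integer-encoded category
x_cont = torch.randn(2, 2)                     # batch of 2 rows, 2 numeric features

emb_out = cat_emb(x_cat).flatten(1)            # shape (2, 3)
combined = torch.cat([emb_out, x_cont], dim=1) # shape (2, 5)
hidden = dense(combined)                       # passed on to the rest of the network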

How does it work?

We start by building our TabularPandas object:

from fastai.tabular.all import *
import pandas as pd

df = pd.read_csv(path/'.csv')
cont_nn, cat_nn = cont_cat_split(df, max_card=9000, dep_var='y')
procs = [Categorify, FillMissing, Normalize]
y_names = 'y'
y_block = CategoryBlock()
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs=procs, cat_names=cat_nn, cont_names=cont_nn,
                   y_names=y_names, y_block=y_block, splits=splits)
dls = to.dataloaders()

Stochastic gradient descent will then learn the values in these embedding matrices during training. In the code, you’ll see that this is quite straightforward, as we let fastai’s tabular_learner build and train the neural network for us:

learn = tabular_learner(dls, layers=[500,250], metrics=accuracy)   # classification: CategoryBlock target
learn.fit_one_cycle(5, 1e-2)

And then we can simply access the embedding matrices from the learner as shown below, where emb_idx selects a specific categorical variable’s embedding and the weight attribute accesses the tensor:

learn.model.embeds[emb_idx].weight
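As a quick follow-up, you can list the shape of every embedding matrix at once (rows = category levels plus one for missing values, columns = embedding size):

for i, e in enumerate(learn.model.embeds):
    print(i, e.weight.shape)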

Building the Entity Embedding Dataset

After we have our embedding matrices, the challenge becomes recreating the dataset using the embeddings instead of the categorical columns themselves. For this we use the code shown in Figure 3. It can be read as: for each row in the dataset, and for each item in that row, retrieve that value’s entity embedding vector (averaging the embedding matrix if the value is missing, or just returning the original data for the non-categorical and dependent variables), append it to the row, and repeat until we have a dataset with the same number of rows as the original.

Figure 3. Jupyter Notebook Python excerpt for building the embedding dataset.
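Figure 3 is an image in the original post, so here is a minimal sketch of the same idea, assuming the to and learn objects built above; the helper name embed_features and the mean-vector handling of missing values follow the description in the text, not necessarily the exact excerpt.

import numpy as np
import pandas as pd

def embed_features(learn, df_proc, cat_names):
    # df_proc: the processed dataframe (e.g. to.items), where Categorify has already
    # replaced each categorical level with an integer code (0 = missing / #na#).
    df_out = df_proc.copy()
    for i, col in enumerate(cat_names):
        emb = learn.model.embeds[i].weight.detach().cpu().numpy()   # (n_levels + 1, emb_dim)
        codes = df_out[col].values.astype(int)
        vecs = emb[codes]                                           # row lookup per code
        vecs[codes == 0] = emb.mean(axis=0)                         # missing value -> mean vector
        emb_cols = [f'{col}_{j}' for j in range(emb.shape[1])]
        emb_df = pd.DataFrame(vecs, columns=emb_cols, index=df_out.index)
        df_out = pd.concat([df_out.drop(columns=[col]), emb_df], axis=1)
    return df_out

emb_df = embed_features(learn, to.items, to.cat_names)   # same rows, embeddings instead of categories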

Comparing the output for categorical data under the two approaches: One-Hot Encoding / Dummy variables versus Entity Embeddings

Imbalanced Dataset

After building the dataset, we define an oversampling strategy with:

from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')

For the churn dataset, we apply this oversampling strategy to deal with the class imbalance, as in the sketch below.
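Applied to the embedding dataset built earlier, the resampling step might look like this (X and y are a hypothetical split of emb_df into features and target, not code from the article):

X = emb_df.drop(columns=['y'])                  # hypothetical feature matrix
y = emb_df['y']                                 # hypothetical churn target
X_res, y_res = oversample.fit_resample(X, y)    # minority class duplicated until balanced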

Testing Different Models

We continued by running a couple of models with identical hyper-parameters on the two datasets and found the results presented in Figure 4.

Figure 4. Comparison of the two datasets across a number of models. Image by author.

As we can see, in some of the cases tested the entity embedding dataset contributes to an increase in accuracy.

Conclusion

Embedding layers are a powerful tool that should be understood and used by any Data Scientist in order to create dense and meaningful representations of high-dimensional categorical data. Their advantage over Dense layers lies in the smaller input size required and in the reduced computational complexity, which speeds up the training process.

Further Step

  1. Feature Importance analysis
  2. Genetic Selection CV
  3. Redundant Features
  4. Partial Dependence
  5. Tree Interpreter

References:

  1. https://medium.com/analytics-vidhya/neural-network-entity-embeddings-as-model-inputs-5b5f635af313
  2. https://towardsdatascience.com/churn-prediction-using-neural-networks-and-ml-models-c817aadb7057
  3. https://walkwithfastai.com/Ensembling
  4. Cheng Guo, Felix Berkhahn. Entity Embeddings of Categorical Variables. 22 Apr 2016. https://doi.org/10.48550/arXiv.1604.06737
  5. https://andriyan-saputra78.medium.com/churn-analysis-with-feature-extraction-genetic-algorithm-988645d9f343
  6. https://docs.fast.ai/tabular.data.html

Closing Remark

This publication is produced for educational and informational purposes only. If there are any mistakes in the data, judgement, or methodology used to produce this publication, please consider contacting the writer using the contact information on my Profile. I would be glad to discuss and share more about the topic. Thank you.

Best Regards,

Andriyan Saputra
