fastai Tabular Model: Crop Yield Production in the World

Andriyan Saputra
9 min read · Nov 26, 2023

In this project, we build a regression model for crop yield using the fastai tabular model, based on historical climate and environmental data.

Image is edited from Freepik.com

DATA SOURCE

  • Food and Agriculture Organisation (FAO): A comprehensive repository of agricultural and climate-related data by country, including historical crop yield data and information on crop types, climate variables, pesticides, insecticides, and geography. (http://www.fao.org/home/en/)
  • World Bank — Country Climate and Development Report (CCDR): A wealth of data related to climate, development, population growth, and human food consumption by country. The greenhouse gas emissions data come from this source and required additional processing. (https://databank.worldbank.org/)
List of data source information

Our final dataset was merged from these data sources, and it contains 50,591 rows and 26 columns. These columns cover everything from country names to yields of different crops, as well as environmentally relevant indicators. In order to maintain consistency in the data, we focus on the 1990–2013 period, as this is the common time period for multiple datasets.

When dealing with missing values, we found that only the column “Temperature_Change(Degree Celsius)” has missing entries for 1990–2013 (365 of them). Given the time-series nature of the data, we decided to use forward and backward filling. This approach exploits the autocorrelation of the time series, i.e., the continuity between previous and subsequent observations, which justifies using those neighbouring observations to fill in the missing values.
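To make the filling step concrete, here is a minimal sketch on toy data (the column names are assumptions) that forward- and backward-fills within each country, so fills never cross country boundaries:

```python
import numpy as np
import pandas as pd

# Toy sketch of the gap-filling step; column names are assumptions.
df = pd.DataFrame({
    "Country_Code": ["AFG"] * 4 + ["ALB"] * 4,
    "Year": [1990, 1991, 1992, 1993] * 2,
    "TempChange_DegC": [0.1, np.nan, np.nan, 0.4, np.nan, 0.2, 0.3, np.nan],
})

# Sort by time, then forward-fill and backward-fill within each country so
# the fill exploits the autocorrelation of neighbouring observations.
df = df.sort_values(["Country_Code", "Year"])
df["TempChange_DegC"] = (
    df.groupby("Country_Code")["TempChange_DegC"]
      .transform(lambda s: s.ffill().bfill())
)
print(df)
```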

Further, to obtain a comprehensive and integrated dataset, we merged the different datasets on country names or country codes. For data conversion, we used one-hot encoding to turn textual data into numerical data, which facilitates modelling and analysis.

Finally, to improve the readability of the dataset, we changed some of the column names. For example, we changed Value to Yield_hg/ha and changed Pesticides (total) | 00001357 || Use per area of cropland | 005159 || Kilograms per hectare to PesticidesTotal_kg/ha. This simplifies the column names while keeping them descriptive and complete.
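As a rough sketch of these merge, encoding, and renaming steps (the dataframes and most column names below are illustrative, not the project's exact ones):

```python
import pandas as pd

# Illustrative fragments of the two sources, keyed by country code and year.
crops = pd.DataFrame({"Country_Code": ["AFG", "ALB"], "Item": ["Rice", "Wheat"],
                      "Year": [1990, 1990], "Value": [15000, 12000]})
climate = pd.DataFrame({"Country_Code": ["AFG", "ALB"], "Year": [1990, 1990],
                        "AvgTemp_DegC": [13.1, 11.4]})

# Merge on the shared keys to build one integrated dataset.
merged = crops.merge(climate, on=["Country_Code", "Year"], how="inner")

# One-hot encode a textual column into numerical indicator columns.
merged = pd.get_dummies(merged, columns=["Item"])

# Shorten the verbose FAO column names, as described above.
merged = merged.rename(columns={
    "Value": "Yield_hg/ha",
    "Pesticides (total) | 00001357 || Use per area of cropland | 005159 "
    "|| Kilograms per hectare": "PesticidesTotal_kg/ha",
})
```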

After this series of data cleaning and preprocessing steps, we now have a complete, consistent, and accurate dataset suitable for in-depth machine learning modelling and analysis.

Subsample of final dataset

PARAMETER DESCRIPTION

The data consists of 14 parameters (11 numerical data types, 3 categorical data types) and 1 target parameter. After applying Pearson correlation analysis, the highly correlated parameters were dropped. The final dataset parameters used in the model consist of:

{ ‘Country_Code’, ‘Item_Code’, ‘Year’, ‘Yield_hg/ha’, ‘CO2_kt’, ‘AvgPrecipitation_mm/year’, ‘AvgTemp_DegC’, ‘PesticidesTotal_kg/ha’, ‘TempChange_DegC’, ‘NutrientNitrogenTotal_kg/ha’ }

Target parameter: { ‘Yield_hg/ha’ }

Preprocessing: parameter selection by removing highly correlated parameters from the dataset. Original dataset (above) and after dropping highly correlated parameters (below).
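A hedged sketch of this correlation-based selection; the 0.9 cutoff is an assumption, since the article does not state the exact threshold:

```python
import numpy as np
import pandas as pd

def drop_high_corr(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from every pair whose |Pearson r| exceeds threshold."""
    corr = df.select_dtypes("number").corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```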

NEURAL NETWORK

In this model, the fastai library was used to build a neural network on the project dataset (fastai 2023). fastai is a deep learning library that provides cutting-edge results while letting users mix and match components to build new approaches efficiently, which makes it flexible and easy to work with.

fastai has several features that help with tabular data. A series of structuring steps was applied in the dataset preparation pipeline, including:

  • Determine which parameters are classified as continuous or categorical datatypes.
  • Determine the dependent variable (y value): ‘Yield_hg/ha’.
  • Define the set of transformations (procs) to apply to the tabular dataset. Normalize: normalize the continuous variables (subtract the mean and divide by the standard deviation). Categorify: build a mapping between each categorical variable's unique categories and integers, then replace the values with the corresponding index.

Next, a TabularDataLoaders object was defined to describe what the data will look like, using all the predefined information. It specifies how much data should be fed to the model at once (the batch size) and how many worker processes should be used to load the data.

valid_idx: creates a list of randomly selected indices from the train_df dataset. This is typically done to split the data in machine learning tasks, for example to create a validation set for model training and evaluation. By holding out a random subset of data points as the validation set, you can assess how well the model performs on unseen data; see the sketch below.
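Putting these pieces together, a minimal sketch of the dataloader setup might look like this (the file path, exact column lists, and the 80/20 split fraction are assumptions):

```python
import numpy as np
import pandas as pd
from fastai.tabular.all import *

df = pd.read_csv("final_dataset.csv")  # hypothetical path to the merged data

cat_names = ["Country_Code", "Item_Code", "Year"]
cont_names = ["CO2_kt", "AvgPrecipitation_mm/year", "AvgTemp_DegC",
              "PesticidesTotal_kg/ha", "TempChange_DegC",
              "NutrientNitrogenTotal_kg/ha"]
procs = [Categorify, Normalize]

# valid_idx: randomly hold out ~20% of the rows for validation.
valid_idx = np.random.choice(len(df), int(0.2 * len(df)), replace=False)

dls = TabularDataLoaders.from_df(
    df, procs=procs, cat_names=cat_names, cont_names=cont_names,
    y_names="Yield_hg/ha", y_block=RegressionBlock(),
    valid_idx=list(valid_idx), bs=64)

dls.show_batch()  # view a sample batch of x and y values
```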

The show_batch() function was used to inspect a sample batch, including the x and y values, as shown below:

Snapshot of sample batch from TabularDataLoaders

After that, a model can be defined using the tabular_learner method. Once the model is defined, fastai tries to infer the loss function from the y_names specified earlier.

The summary function gives us a way to see in detail the layers that fastai generates for a model.
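Continuing the sketch above, the learner might be defined as follows; the layer sizes match the summary described below, while the y_range bound and the metric are assumptions:

```python
learn = tabular_learner(
    dls,
    layers=[200, 100],                           # two hidden layers
    y_range=(0, df["Yield_hg/ha"].max() * 1.2),  # drives the SigmoidRange output
    metrics=rmse)

learn.summary()  # print the generated layers in detail
```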

The model summary details the layers generated for the model.
  • TabularModel consists of 3 Embedding layers, where each layer represents the transformation of one categorical variable into numerical form. The distinct value counts of the categorical variables are { ‘Country_Code’: 111, ‘Item_Code’: 11, ‘Year’: 23 }. Note: depending on the batch size, not every distinct value necessarily appears in a given batch.
  • The first Linear layer consists of 200 nodes with a ReLU activation function, BatchNorm1d, and Dropout.
  • The second Linear layer consists of 100 nodes with a ReLU activation function, BatchNorm1d, and Dropout.

Dropout: attempts to deal with overfitting by randomly disabling small parts of the model during training runs. This forces the model not to rely too heavily on individual neural pathways, and thus to form more complex associations.

  • The last Linear layer produces a single output node with the SigmoidRange function.
  • Optimizer: Adam

Adam (adaptive moment estimation): an optimization algorithm that can serve as an alternative to stochastic gradient descent. It uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network.
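For intuition, here is a minimal sketch of a single Adam update in plain NumPy (not fastai's internal implementation):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at step t."""
    m = b1 * m + (1 - b1) * g        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * g ** 2   # second moment: running mean of squares
    m_hat = m / (1 - b1 ** t)        # bias-corrected moment estimates
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight adaptive step
    return w, m, v
```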

Weight decay: a regularization term that penalizes large weights. When the weight decay coefficient is large, the penalty for large weights is also large; when it is small, weights can grow freely. As a rule of thumb, the more training examples you have, the weaker this term should be, and the more parameters you have, the higher it should be.

  • Alternative optimizer: RMSProp (Root Mean Square Propagation)

RMSProp: an adaptive optimization algorithm. Instead of taking the cumulative sum of squared gradients, RMSProp takes an exponential moving average of them.

  • Callbacks

Once the model design is complete, we train the model with the chosen parameters: the number of epochs, the learning rate, and the weight decay. The learning rate can be determined with the learn.lr_find() function, which plots the loss against a range of candidate learning rates; select a value where the loss is decreasing most steeply.
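Continuing the sketch, the training step with the hyperparameters reported below might look like this; the article does not state which fit method was used, so fit_one_cycle is an assumption:

```python
learn.lr_find()  # plot loss against candidate learning rates; pick the
                 # value where the loss curve descends most steeply

learn.fit_one_cycle(200, lr_max=3e-2, wd=0.2)  # epochs, learning rate, and
                                               # weight decay from the text
```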

Snapshot of loss values over the training iterations

Hyperparameter Tuning

In this step, several experiments were conducted to determine the optimal model. Different combinations of parameters, numbers of epochs, and learning rates were tested.

Comparison of loss function in training models based on different hyperparameter selection.

The comparison of the loss curves across training runs shows that no further improvement was obtained after 100–200 epochs. Moreover, the training and validation losses follow the same pattern, so there is no need for further training. The run with learning_rate: 0.03, number_of_epochs: 200, weight_decay: 0.2, batch_size: 64 produced less smooth results, visible as spikier curves in the graph. Comparing the training and testing datasets shows that 200 epochs with this hyperparameter selection is the optimal result achieved by the model.

Model evaluation is carried out by plotting the predicted and actual values of the dependent variable ‘Yield_hg/ha’.

Model evaluation for the neural network model: actual vs. predicted yield (left) and residual plot (right).

For further analysis, we looked at the data in more detail and evaluated the model separately for each crop type (‘Item_Code’):

Model evaluation results from in-depth analysis based on Crop type.

The results show that potatoes, sweet potatoes, cassava, and yams produce quite high error values. After several rounds of outlier detection and in-depth analysis of several countries, we still could not determine which parameters contribute to these high error rates.
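One way such a per-crop error breakdown could be computed, continuing the sketch above (note that Item_Code appears here in its Categorify-encoded form):

```python
# Predictions and targets on the validation set defined earlier.
preds, targs = learn.get_preds()

valid = dls.valid_ds.items.copy()  # processed validation rows
valid["sq_err"] = (preds.squeeze().numpy() - targs.squeeze().numpy()) ** 2

# RMSE per crop type, largest errors first.
per_crop_rmse = valid.groupby("Item_Code")["sq_err"].mean() ** 0.5
print(per_crop_rmse.sort_values(ascending=False))
```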

Example comparison between actual and predicted yield values for the rice crop

Parameter Importance

Permutation importance is a technique in which we shuffle each column in a data frame and analyze how changing that column affects the target values. The more the target is affected, the more “important” we can (generally) call the variable in the model.

Permutation importance values can differ depending on the model selected. First, the MSE of the model is calculated with the original variables. Then the values of a single column are permuted and the MSE is calculated again. For example, if a column (Col1) takes the values 1, 2, 3, 4, a random permutation might yield 4, 3, 1, 2; this produces MSE1. An increase in the MSE (MSE1 - MSE) then signifies the importance of the variable.
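A hedged sketch of this procedure on the raw validation rows (the function and variable names here are mine, not the article's):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def permutation_importance(learn, raw_valid_df, col, y_col="Yield_hg/ha"):
    """Return MSE1 - MSE after shuffling one column; positive => important."""
    base_dl = learn.dls.test_dl(raw_valid_df)
    base_preds, _ = learn.get_preds(dl=base_dl)
    base_mse = mean_squared_error(raw_valid_df[y_col], base_preds.numpy())

    # Shuffle a single column, rebuild the dataloader, and re-score.
    shuffled = raw_valid_df.copy()
    shuffled[col] = np.random.permutation(shuffled[col].values)
    preds, _ = learn.get_preds(dl=learn.dls.test_dl(shuffled))
    mse1 = mean_squared_error(shuffled[y_col], preds.numpy())
    return mse1 - base_mse
```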

Based on the results above, when we fit the model with all parameters, including crop type (Item_Code) and Country_Code, the model's predictions vary greatly with the Item_Code and Country_Code parameters. This reflects the fact that each type of crop in a country can face very different environmental conditions; the model is highly dependent on these parameters.

If we look in more detail by excluding the dominant parameters (Item_Code, Country_Code, Year), we can see that other important contributions come from TotalGHG_MtCO2eq and NutrientNitrogenTotal_kg/ha.

Examples of wheat harvest data from various countries

Furthermore, negative numbers indicate that the random permutation works better than the original column. We can conclude that such a variable plays no role in prediction, i.e., it is not important.

FURTHER STUDY

  • Implement a stratified train-test split of the dataset
  • Apply weights to the different crop types

REFERENCE

Closing Remark

This publication is produced for educational and informational purposes only. If there are any mistakes in the data, judgement, or methodology used to produce this publication, please consider contacting the writer via the contact information on my Profile. I would be glad to discuss and share more about the topic. Thank you.

Best Regards,

Andriyan Saputra
