Highlights

  • When using machine-learning methods with spatial data, we need to account for issues such as spatial autocorrelation, as well as extrapolation when predicting to regions far away from the training data. Several methods have been developed to deal with these issues. In this document, we show how to combine the machine-learning workflow of caret with packages designed for machine learning with spatial data. Specifically, we use blockCV::cv_spatial() and CAST::knndm() for spatial cross-validation, and CAST::aoa() to mask areas of extrapolation. We use sf and terra for processing vector and raster data, respectively.
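A minimal setup for such a workflow might look as follows. The file names (`stations.gpkg`, `predictors.tif`) and column names are placeholders, not from the original post:

```r
# Packages used throughout the workflow
library(caret)    # model training and CV orchestration
library(blockCV)  # spatial blocking for CV
library(CAST)     # kNNDM CV, forward feature selection, AOA
library(sf)       # vector data handling
library(terra)    # raster data handling

# Hypothetical inputs: point measurements and a predictor raster stack
train_points <- sf::st_read("stations.gpkg")   # placeholder file name
predictors   <- terra::rast("predictors.tif")  # placeholder file name

# Attach predictor values to the training points
train_data <- cbind(
  sf::st_drop_geometry(train_points),
  terra::extract(predictors, terra::vect(train_points), ID = FALSE)
)
```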
  • In spatial machine learning, a spatial CV is often needed to prevent very similar data from ending up in the training and testing folds at the same time, which is common when training data are clustered and leads to overly optimistic CV error estimates. R packages that implement spatial CV include, e.g., blockCV and CAST. Here, we explore the integration of these two with caret.
  • The blockCV package implements different blocking methods for spatial CV. The object returned by its main function, blockCV::cv_spatial(), contains a nested list of the k folds with the training-data rows that belong to each fold, as well as a list of the test data left out in each of the k iterations. These lists can be extracted using lapply() and then passed to caret::trainControl(), which defines the CV strategy used in caret::train(). The grid of hyperparameter values tested during CV is defined via the tuneGrid argument of caret::train(). Here, we test mtry values from 2 to 12 and min.node.size values between 5 and 15. The combination of mtry and min.node.size that minimizes the RMSE is then automatically used to re-train a final model on the complete training data set.
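A sketch of this step, assuming an sf object `train_points`, a data frame `train_data` with a response column `temp`, and a block size in map units (all hypothetical names):

```r
library(blockCV)
library(caret)

# Assign points to k = 5 spatially blocked folds
set.seed(42)
sb <- blockCV::cv_spatial(x = train_points, k = 5, size = 50000)

# Each element of folds_list holds the training (1) and test (2) row indices
train_idx <- lapply(sb$folds_list, function(f) f[[1]])
test_idx  <- lapply(sb$folds_list, function(f) f[[2]])

ctrl <- caret::trainControl(method = "cv",
                            index = train_idx,
                            indexOut = test_idx,
                            savePredictions = "final")

# Hyperparameter grid as described: mtry 2-12, min.node.size 5-15
tg <- expand.grid(mtry = 2:12,
                  splitrule = "variance",
                  min.node.size = 5:15)

model <- caret::train(temp ~ ., data = train_data,
                      method = "ranger",
                      trControl = ctrl,
                      tuneGrid = tg)
```

Note that caret's "ranger" method expects all three columns (mtry, splitrule, min.node.size) in the tuning grid.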
  • Another spatial CV method is kNNDM, which is implemented in the CAST package and aims at emulating the prediction situation encountered by the model during CV. In this case, the prediction situation is to predict from the temperature measurement stations to the whole area of Spain. Since the temperature measurement stations are rather randomly distributed over the area of Spain, no spatial blocking is needed and kNNDM randomly assigns training points to CV folds. The output of kNNDM contains a list of the row indices of training data points used in each CV iteration (indx_train), as well as a list of the indices left out in each iteration (indx_test). These lists can easily be passed to caret::trainControl(), which defines the CV used in caret::train().
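In code, this might look as follows; `train_points` (sf points for the stations) and `study_area` (an sf polygon of the prediction domain) are hypothetical names:

```r
library(CAST)
library(caret)

# kNNDM: match the CV geometry to the prediction situation
knndm_folds <- CAST::knndm(tpoints = train_points,
                           modeldomain = study_area,
                           k = 5)

# Use the fold indices directly in caret's CV definition
ctrl_knndm <- caret::trainControl(method = "cv",
                                  index = knndm_folds$indx_train,
                                  indexOut = knndm_folds$indx_test)
```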
  • To reduce the number of environmental predictors, and thus enhance the generalizability of the model, feature selection is commonly applied in machine-learning workflows. CAST implements forward feature selection via CAST::ffs(), which can be used with spatial CV. Here, we use the results of the hyperparameter tuning above and kNNDM CV to select the most relevant features. Plotting the results of ffs() shows that the variables DEM, Y, EDF5 and primaryroads were selected.
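A sketch of this step, assuming `ctrl_knndm` is a trainControl object built from the kNNDM fold indices, `predictor_names` lists the candidate predictors, and `best_tune` holds the hyperparameters selected during tuning (all hypothetical names):

```r
library(CAST)

# Forward feature selection with spatial (kNNDM) CV
ffs_model <- CAST::ffs(predictors = train_data[, predictor_names],
                       response   = train_data$temp,
                       method     = "ranger",
                       trControl  = ctrl_knndm,
                       tuneGrid   = best_tune)

# Plot the performance of each tested variable combination
plot(ffs_model)
```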
  • Lastly, the area that is too dissimilar from the training data for the model to make reliable predictions (the area of applicability, AOA) is delineated using CAST::aoa(). The function takes the predictor stack and the trained caret model as inputs. The resulting object contains the dissimilarity values, the threshold used to delineate the AOA (every dissimilarity value above this threshold is considered outside the AOA), and the final AOA raster. Since our training data are randomly distributed across the study area, most of the area falls within the AOA.
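A minimal sketch, where `predictors` is the predictor SpatRaster and `model` the fitted caret model (hypothetical names):

```r
library(CAST)
library(terra)

# Compute the dissimilarity index and delineate the AOA
aoa_result <- CAST::aoa(newdata = predictors, model = model)

# The threshold separating inside/outside the AOA
aoa_result$parameters$threshold

# AOA raster: cells inside the AOA vs. areas of extrapolation
plot(aoa_result$AOA)
```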