For a single input dataset with N observations, each observation is removed one at a time, creating N sub-datasets. Using each sub-dataset, we build a random forest (RF) model and predict the one observation that was left out. We also record which features were deemed important in each round. At the end, we get a prediction for every observation (because every observation was left out exactly once) and a list of the variables that were considered important.
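As an illustration, here is a minimal sketch of that LOOCV loop written directly against the randomForest package. It is not the internals of this function (loocv_sketch, its arguments, and the ranking step are hypothetical); it only shows the leave-one-out structure described above.

library(randomForest)

# loocv_sketch() is a hypothetical helper, not part of this package:
# leave each observation out once, fit a random forest on the rest,
# predict the held-out row, and record the features ranked by mean
# decrease in Gini index for that round.
loocv_sketch <- function(data, predictor.cols, target.col) {
  n <- nrow(data)
  preds <- character(n)
  rankings <- vector("list", n)
  for (i in seq_len(n)) {
    train <- data[-i, ]
    test <- data[i, , drop = FALSE]
    fit <- randomForest(x = train[, predictor.cols], y = train[[target.col]])
    preds[i] <- as.character(predict(fit, test[, predictor.cols]))
    gini <- importance(fit)[, "MeanDecreaseGini"]
    rankings[[i]] <- names(sort(gini, decreasing = TRUE))
  }
  list(running.pred = preds, importance.rankings = rankings)
}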

LOOCVPredictionsRandomForestAutomaticMtryAndNtree(
  inputted.data,
  predictors.that.PCA.can.be.done.on,
  predictors.that.should.not.PCA,
  should.PCA.be.used,
  target.column.name,
  seed,
  should.mtry.and.ntree.be.optimized = FALSE,
  percentile.threshold.to.keep = 0.8
)

Arguments

inputted.data

A dataframe

predictors.that.PCA.can.be.done.on

A vector of strings specifying the names of the columns whose data should undergo PCA.

predictors.that.should.not.PCA

A vector of strings specifying the names of the columns whose data should not undergo PCA.

should.PCA.be.used

A boolean indicating whether PCA should be used. If this is FALSE, the split of predictors between predictors.that.PCA.can.be.done.on and predictors.that.should.not.PCA does not matter.

target.column.name

A string specifying the column containing the values we want to predict. This column should be a factor.

seed

An integer that specifies the seed to use for random number generation.

should.mtry.and.ntree.be.optimized

A boolean indicating whether RandomForestAutomaticMtryAndNtree() should be used to optimize ntree and mtry. Default is FALSE.

percentile.threshold.to.keep

A number from 0 to 1 indicating the percentile to use for feature selection. A feature is counted in the var.tally of the output for each LOOCV round in which its importance value is at or above the percentile.threshold.to.keep percentile. Example: if there are 4 features (x, y, a, b) whose mean decreases in Gini index in one LOOCV round are 4, 3, 2, and 1 respectively, then feature x is at the 100th percentile (percentile value of 1), y is at the 75th percentile (percentile value of 0.75), and so on. If the threshold is set at 0.75, both x and y are tallied for that round.
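To make the arithmetic concrete, here is that example's percentile calculation, assuming a feature's percentile value is its rank divided by the number of features and that the threshold is inclusive (both assumptions are drawn from the example above, not from the package source):

gini <- c(x = 4, y = 3, a = 2, b = 1)  # mean decrease in Gini for one round
pct <- rank(gini) / length(gini)       # percentile value per feature
pct
#>    x    y    a    b
#> 1.00 0.75 0.50 0.25
names(pct)[pct >= 0.75]                # tallied this round
#> [1] "x" "y"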

Value

A list with two objects:

  1. running.pred: a vector of predicted values, one per observation.

  2. var.tally: a table giving, for each feature, the percentage of LOOCV rounds in which its importance value was at or above the percentile.threshold.to.keep percentile.

Details

This function uses RandomForestAutomaticMtryAndNtree() for each round of LOOCV. In each round, the mean decrease in Gini index is used to determine which features (predictors) are important for predicting the classes. The function can optionally perform PCA to reduce dimensionality and use the PCs as predictors, and it can optionally optimize mtry and ntree.
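When should.PCA.be.used is TRUE, the preprocessing is conceptually like the following sketch (assuming prcomp() with centering and scaling; the function's actual PCA settings may differ). The key point is that the held-out observation must be projected with the rotation fitted on the training rows, never re-fit:

# Hypothetical helper illustrating the PCA step; not the package's code.
pca.predictors <- function(train, test, pca.cols, passthrough.cols) {
  pca <- prcomp(train[, pca.cols], center = TRUE, scale. = TRUE)
  list(
    train = cbind(as.data.frame(pca$x),
                  train[, passthrough.cols, drop = FALSE]),
    # Project the held-out row using the training rotation.
    test = cbind(as.data.frame(predict(pca, test[, pca.cols])),
                 test[, passthrough.cols, drop = FALSE])
  )
}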

Cross-validation allows you to do feature selection and to assess whether the model is overfit.

Examples

id = c("1a", "1b", "1c", "1d", "1e", "1f", "1g", "2a", "2b", "2c", "2d", "2e", "2f",
       "3a", "3b", "3c", "3d", "3e", "3f", "3g", "3h", "3i")
x = c(18, 21, 22, 24, 26, 26, 27, 30, 31, 35, 39, 35, 30, 40, 41, 42, 44, 46, 47, 48, 49, 54)
y = c(10, 11, 22, 15, 12, 13, 14, 33, 39, 37, 44, 40, 45, 27, 29, 20, 28, 21, 30, 31, 23, 24)
a = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
actual = as.factor(c("1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2",
                     "3", "3", "3", "3", "3", "3", "3", "3", "3"))
example.data <- data.frame(id, x, y, a, b, actual)
result <- LOOCVPredictionsRandomForestAutomaticMtryAndNtree(
  example.data,
  predictors.that.PCA.can.be.done.on = c("x", "y", "a", "b"),
  predictors.that.should.not.PCA = NULL,
  should.PCA.be.used = FALSE,
  target.column.name = "actual",
  seed = 2,
  percentile.threshold.to.keep = 0.5
)
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
#> [1] 11
#> [1] 12
#> [1] 13
#> [1] 14
#> [1] 15
#> [1] 16
#> [1] 17
#> [1] 18
#> [1] 19
#> [1] 20
#> [1] 21
#> [1] 22
predicted <- result[[1]]
actual <- example.data[, "actual"]
eval.classification.results(as.character(actual), as.character(predicted), "Example")
#> [[1]]
#> [1] "Example"
#>
#> [[2]]
#>       predicted
#> actual 1 2 3
#>      1 7 0 0
#>      2 0 5 1
#>      3 0 0 9
#>
#> [[3]]
#> [[3]]$accuracy
#> [1] 0.9545455
#>
#> [[3]]$macro_prf
#> # A tibble: 3 x 3
#>   precision recall    f1
#>       <dbl>  <dbl> <dbl>
#> 1       1    1     1
#> 2       1    0.833 0.909
#> 3       0.9  1     0.947
#>
#> [[3]]$macro_avg
#> # A tibble: 1 x 3
#>   avg_precision avg_recall avg_f1
#>           <dbl>      <dbl>  <dbl>
#> 1         0.967      0.944  0.952
#>
#> [[3]]$ova
#> [[3]]$ova$`1`
#>         classified
#> actual   1 others
#>   1      7      0
#>   others 0     15
#>
#> [[3]]$ova$`2`
#>         classified
#> actual   2 others
#>   2      5      1
#>   others 0     16
#>
#> [[3]]$ova$`3`
#>         classified
#> actual   3 others
#>   3      9      0
#>   others 1     12
#>
#>
#> [[3]]$ova_sum
#>           classified
#> actual     relevant others
#>   relevant       21      1
#>   others          1     43
#>
#> [[3]]$kappa
#> [1] 0.9301587
#>
#>
#> [[4]]
#> [1] 0.9331967
# Feature selection
# As expected, only features x and y are indicated as important.
result[[2]]
#> variables.with.sig.contributions
#>   y   x
#> 100 100