For a single input dataset, the rows are divided sequentially into the number of subsets specified by number.of.folds. Each subset is held out in turn as test data while the remaining subsets are used to train a random forest model with the default mtry and ntree. The held-out data are then predicted by that model.
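The sequential division into folds can be sketched with base R's cut() (an illustration of the scheme described above, not the function's internal code; the variable names are illustrative):

```r
# Assign each row of a data frame to one of k consecutive folds.
number.of.folds <- 4
example.data <- data.frame(x = 1:12)

# cut() splits the row indices 1..n into k consecutive intervals
fold.id <- cut(seq_len(nrow(example.data)),
               breaks = number.of.folds,
               labels = FALSE)
fold.id
#>  [1] 1 1 1 2 2 2 3 3 3 4 4 4

# Rows in fold i form the test set; the remaining rows train the model
split(seq_len(nrow(example.data)), fold.id)
```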

CVPredictionsRandomForest(
  inputted.data,
  name.of.predictors.to.use,
  target.column.name,
  seed,
  percentile.threshold.to.keep = 0.8,
  number.of.folds
)

Arguments

inputted.data

A dataframe that should already have the rows randomly shuffled.

name.of.predictors.to.use

A vector of strings specifying the names of the columns to use as predictors. Each column should be numerical.

target.column.name

A string specifying the column whose values are to be predicted. This column should be a factor.

seed

An integer specifying the seed to use for random number generation.

percentile.threshold.to.keep

A number from 0 to 1 indicating the percentile to use for feature selection. In each CV round, the features with importance values at or above this percentile are counted toward the var.tally of the output. Example: if there are 4 features (x, y, a, b) whose mean decreases in Gini index in one LOOCV round are 4, 3, 2, and 1 respectively, then feature x is at the 100th percentile (percentile value of 1), y is at the 75th percentile (percentile value of 0.75), and so on. If the threshold is set at 0.75, then both x and y are tallied for that CV round.

number.of.folds

An integer from 1 to nrow(inputted.data) specifying the number of folds for CV. If this number is set to nrow(inputted.data), the function performs LOOCV.
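The percentile calculation behind percentile.threshold.to.keep can be sketched as follows, using the worked example above (a minimal sketch with illustrative variable names, not the function's internal code):

```r
# Worked example: four features with mean decreases in Gini index
# of 4, 3, 2, and 1 in one CV round.
importance.values <- c(x = 4, y = 3, a = 2, b = 1)

# Percentile of each feature = rank / number of features, so the
# largest importance sits at percentile 1 (the 100th percentile).
percentiles <- rank(importance.values) / length(importance.values)
percentiles
#>    x    y    a    b
#> 1.00 0.75 0.50 0.25

# With a threshold of 0.75, x and y are tallied for this round
names(percentiles)[percentiles >= 0.75]
#> [1] "x" "y"
```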

Value

A list with two objects:

running.pred: A vector of the predicted values for each observation.

var.tally: A table of the percentage of CV rounds in which each feature had an importance value above the percentile.threshold.to.keep percentile.

Details

This function assumes that the data are already randomly shuffled. It is based on the LOOCVPredictionsRandomForestAutomaticMtryAndNtree() function, but it does not perform PCA or optimize mtry and ntree. If the target classes are imbalanced, stratified cross-validation should be used; stratification is not performed automatically by this function. To achieve stratification, shuffle the samples so that each fold has proportional representation from each class before passing the data to this function.
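One way to arrange rows so that each sequential fold receives proportional class representation is to shuffle within each class and then interleave the classes (a sketch assuming equally sized classes; the column name actual and all variable names are illustrative):

```r
# Sketch: reorder rows so classes alternate, giving each sequential
# fold a balanced mix of classes. Assumes equal class sizes; with
# unequal sizes, rbind() would recycle the shorter index vector.
set.seed(1)
example.data <- data.frame(x = rnorm(12),
                           actual = factor(rep(c("A", "B"), each = 6)))

# Shuffle the row indices within each class, then interleave them
indices.by.class <- split(seq_len(nrow(example.data)), example.data$actual)
shuffled.by.class <- lapply(indices.by.class, sample)
interleaved <- as.vector(do.call(rbind, shuffled.by.class))

stratified.data <- example.data[interleaved, ]
# Consecutive blocks of rows (the sequential folds) now contain a
# balanced mix of classes A and B.
```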


Examples

example.data <- GenerateExampleDataMachinelearnr()
set.seed(1)
example.data.shuffled <- example.data[sample(nrow(example.data)), ]

result.CV <- CVPredictionsRandomForest(
  inputted.data = example.data.shuffled,
  name.of.predictors.to.use = c("x", "y", "a", "b"),
  target.column.name = "actual",
  seed = 2,
  percentile.threshold.to.keep = 0.5,
  number.of.folds = nrow(example.data)
)
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
#> [1] 11
#> [1] 12
#> [1] 13
#> [1] 14
#> [1] 15
#> [1] 16
#> [1] 17
#> [1] 18
#> [1] 19
#> [1] 20
#> [1] 21
#> [1] 22
#> [1] 23
#> [1] 24
#> [1] 25
#> [1] 26
#> [1] 27
#> [1] 28
#> [1] 29
#> [1] 30
#> [1] 31
#> [1] 32
#> [1] 33
#> [1] 34
#> [1] 35
#> [1] 36
# Predicted
result.CV[[1]]
#>  [1] "1" "1" "5" "4" "3" "3" "3" "4" "3" "5" "3" "4" "2" "3" "5" "4" "1" "2" "3"
#> [20] "1" "4" "3" "5" "3" "2" "1" "5" "5" "5" "3" "4" "2" "1" "5" "2" "1"

# Feature importance
result.CV[[2]]
#> variables.with.sig.contributions
#>      x      y      a
#> 100.00  97.22   2.78