Random Forest Classification in R

The R code for this tutorial can be found on GitHub here: https://github.com/statswithrdotcom/Random-Forest-Classification

In one of my books, I give an example of performing multi-class classification on the “iris” data set (a data set built into base R). In the book, I used a deep learning model to perform the classification, which resulted in about 91% accuracy, although the randomized starting weights made this fluctuate. While this is impressive, it is worth acknowledging that we can achieve better performance with a much simpler model in this case. In this tutorial, I am going to show you how to create a random forest classification model and how to assess its performance. First, I am going to write some preliminary code loading the random forest package we are going to use and importing the “iris” data set.

# library the random forest package
library(randomForest)

# load the iris data set
data(iris)
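
If the randomForest package is not already installed on your machine, library() will fail; a quick sketch of a guarded install from CRAN (assuming an internet connection is available):

# Install the randomForest package from CRAN if it is not already present
if (!requireNamespace("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
}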

Next, I am going to specify some hyper-parameters I want to be able to change. I want the split between the training and test sets to be alterable, so I can determine how many rows I actually need to train the model. I am not going to do that here, but it is interesting to tinker with. For now, I am going to just use half of the 150 rows for training. We also need to set the number of trees the random forest model will use; there are no hard rules for this, but 64 works fine for this small data set. You may find that larger data sets need more trees to reach an appropriate level of accuracy.

# Choose Size of training data and trees used
Train_N <- 75 # 50% split
Num_Trees <- 64 # 64 trees used

Now, I am going to use the sample() function to randomly select the rows that will be in the training set. Notice that I set replace equal to FALSE; this is required for this method to work, since sampling with replacement could select the same row more than once. Next, I simply assign the selected rows to “Train” and the remaining rows to “Test”. I also used the set.seed() function to make sure the split happens the same way for you, although in practice this is not necessary.

# Split the data set into training and test
set.seed(123) #makes it repeatable
Ind <- sample(1:nrow(iris), Train_N, replace = FALSE)
Train <- iris[Ind,]
Test <- iris[-Ind,]

It is time for the actual model! Although it is a little anti-climactic being a single line of code. The first argument of the “randomForest” function is the model formula, which indicates that we want to predict “Species” from all of the other variables; the “.” in the formula is what denotes this. Next, we tell it what data set to use; of course, we are giving it the training data. Finally, we specify the number of trees to use, which determines the size of the model. We chose that earlier to be 64 trees, and we simply feed it into the “ntree” argument.

# The random forest model created using the training data
model <- randomForest(Species ~ ., data = Train, ntree = Num_Trees)
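
As a side note, if you want a quick sanity check before touching the test set, printing the fitted model shows randomForest's out-of-bag (OOB) error estimate and a per-class confusion matrix computed from the training data alone; the exact numbers will depend on your seed and split:

# Print the fitted model; randomForest reports an out-of-bag (OOB)
# error estimate and a confusion matrix for the training data
print(model)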

Now, we should actually use the model to make predictions on the test set. Below, I go ahead and combine the predictions and actual values into a data frame I call “Results”. Note that Test[,-5] drops the fifth column, “Species”, so the model only sees the predictors.

# A data frame containing the predicted and actual flower species
Results <- data.frame(predict(model,Test[,-5]), Test[,5])
names(Results) <- c("Predicted","Actual")

You could simply visually compare the cases where the predicted and actual values match; however, I want to go a bit further and have R do that for me. Right now, we are only dealing with a small data set, but if we had 10,000 rows we would need something like this. First, I am going to initialize a vector called “Correct”, where a 1 indicates that the predicted and actual values match. Next, I am going to make some Group_# variables that count up every time we have a correct identification of a specific species. Finally, I am making C# variables that simply count how often each actual species occurs. Note that, due to random sampling, I do not know beforehand how many of each species are in the test set. The for loop goes through each row of the “Results” data frame and accumulates these values.

# Initializing values for the loop
Correct <- rep(0,(150-Train_N))
Group_1 <- Group_2 <- Group_3 <- 0
C1 <- C2 <- C3 <- 0


# For loop that iterates through the row indexes of the "Results" data frame
for(i in 1:(150-Train_N)){
  # Assigns a 1 to "Correct" if the prediction matches the actual species
  if(Results$Predicted[i] == Results$Actual[i]){
    Correct[i] = 1
  }
  # Counts up C1 and accumulates Group_1 if correct
  if(Results$Actual[i] == "setosa"){
    C1 = C1 + 1
    if(Correct[i] == 1){Group_1 = Group_1 + 1}
  }
  # Counts up C2 and accumulates Group_2 if correct
  if(Results$Actual[i] == "versicolor"){
    C2 = C2 + 1
    if(Correct[i] == 1){Group_2 = Group_2 + 1}
  }
  # Counts up C3 and accumulates Group_3 if correct
  if(Results$Actual[i] == "virginica"){
    C3 = C3 + 1
    if(Correct[i] == 1){Group_3 = Group_3 + 1}
  }
}
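
As a side note, the same counts can be read off a confusion matrix built with base R's table() function; this is a compact alternative sketch, not the approach used above:

# Cross-tabulate predicted vs. actual species; the diagonal holds
# the correct identifications for each species
Conf_Mat <- table(Results$Predicted, Results$Actual)
print(Conf_Mat)

# Overall accuracy is the diagonal sum divided by the total count
sum(diag(Conf_Mat)) / sum(Conf_Mat)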

We can now determine how accurate we were overall and in the identification of each species. The total accuracy is the sum of the “Correct” vector divided by the number of rows in the test set. Additionally, the species-based accuracy is calculated as the corresponding Group_# count divided by the C# count; in other words, the number of correct identifications over the number of times each species occurred. The additional code multiplies by 100 to create a percentage and rounds to two decimal places to limit the length of the output.

# Calculating the percent correct overall and by species
Correct_Total = round(sum(Correct)*100/(150-Train_N),2)
Correct_Seto = round(Group_1*100/C1,2)
Correct_Vers = round(Group_2*100/C2,2)
Correct_Virg = round(Group_3*100/C3,2)

I am adding some code to print the results in the console and represent them in a bar plot. You can see that the overall accuracy is 94.67%, which is superior to the 91% I received from my deep learning model. You can also see that the model correctly identified every “setosa”, but had more trouble distinguishing between “versicolor” and “virginica”.

# Printing percent correct in the console
print(paste("Total accuracy: ", Correct_Total, "%",
            " Setosa accuracy: ", Correct_Seto, "%",
            " Versicolor accuracy: ", Correct_Vers, "%",
            " Virginica accuracy: ", Correct_Virg, "%", sep = ""))


# Visualizing percent correct as a bar plot
barplot(c(Correct_Total, Correct_Seto, Correct_Vers, Correct_Virg),
        names.arg = c("Total Acc", "Seto Acc", "Vers Acc", "Virg Acc"),
        main = paste("Accuracy for", Num_Trees, "Trees"), xlab = "Different Metrics",
        ylab = "Accuracy %", col = c("green", "blue", "orange", "yellow"))


Accuracy for random forest classification overall and by species.

