Predictive Modeling with H2O: Comparing H2O AutoML and Traditional Regression
Predictive modeling is a core technique in data science, and using machine learning frameworks can greatly improve both the accuracy and speed of model development. In this blog post, we explore how to use the h2o package in R to automate the model building process with H2O's AutoML, and compare it with traditional regression models. We'll also calculate performance metrics using the Root Mean Squared Error (RMSE).
1. Setting up H2O and the RMSE Function
The h2o package provides a robust framework for distributed machine learning. Before we dive into the modeling process, we define a custom RMSE function, which will be used to evaluate the predictions made by both H2O models and traditional regression models.
# Load necessary libraries
library(h2o)
library(psych)
# Function for calculating RMSE
RMSE <- function(Predictions, Test_Data){
  RMSE_Results <- rep(0, ncol(Predictions))
  for(i in 1:ncol(Predictions)){
    # Only numeric columns are scored; any other column keeps the default of 0
    if(is.numeric(Predictions[,i])){
      RMSE_Results[i] <- sqrt(mean((Predictions[,i] - Test_Data[,i])^2))
    }
  }
  RMSE_Results <- data.frame(matrix(RMSE_Results, nrow = 1))
  names(RMSE_Results) <- names(Test_Data)
  return(RMSE_Results)
}
The RMSE function calculates the square root of the mean squared error between predictions and actual values. It iterates over each column of predictions and computes the RMSE only for numeric columns; any non-numeric column keeps the placeholder value of 0.
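As a quick check of the arithmetic, here is a minimal sketch of the same calculation on three made-up values. The vectors predicted and actual are illustrative only and are not part of the bfi data.

```r
# Made-up predictions and actual values, for illustration only
predicted <- c(2, 4, 6)
actual <- c(1, 4, 8)

# Errors are 1, 0, and -2, so the squared errors are 1, 0, and 4
errors <- predicted - actual

# RMSE is the square root of the mean squared error: sqrt((1 + 0 + 4) / 3)
rmse_value <- sqrt(mean(errors^2))
print(rmse_value)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1, which is why RMSE penalizes large misses more heavily than small ones.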
2. Initializing H2O and Preparing Data
We initialize an H2O cluster to take advantage of its distributed processing power. The dataset is derived from the bfi dataset in the psych package, which contains personality test data. We prepare the data by keeping only complete cases, encoding the gender variable as 0 (female) and 1 (male), and splitting the dataset into equal-sized training and testing sets.
# Initialize H2O cluster with 8 cores
h2o.init(nthreads = 8)
# Load and prepare the dataset
Data <- data.frame(bfi[complete.cases(bfi),])
Data$gender <- factor(ifelse(Data$gender == 1,1,0))
# Split data into training and testing sets
set.seed(123)
In_train <- sample(1:nrow(Data), round(nrow(Data)/2,0), replace = FALSE)
train <- as.h2o(Data[In_train,])
test <- as.h2o(Data[-In_train,])
# Convert gender back to numeric in the test set to avoid issues with regression
test$gender = as.numeric(test$gender)
The data is loaded, cleaned, and transformed for modeling. By converting the data into an H2O frame, we can take advantage of H2O's machine learning capabilities. The set.seed() call ensures reproducibility when splitting the data into training and test sets.
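The effect of set.seed() on the split can be sketched on a small toy data frame. The names toy, first_draw, and second_draw are hypothetical and used only for this illustration; the split logic mirrors the one used for the bfi data.

```r
# Toy data frame standing in for the real dataset
toy <- data.frame(id = 1:10, score = (1:10)^2)

# Drawing with the same seed twice yields the same row indices
set.seed(123)
first_draw <- sample(seq_len(nrow(toy)), size = round(nrow(toy) / 2))
set.seed(123)
second_draw <- sample(seq_len(nrow(toy)), size = round(nrow(toy) / 2))
print(identical(first_draw, second_draw))

# Negative indexing gives the complement, so the two halves never overlap
toy_train <- toy[first_draw, ]
toy_test <- toy[-first_draw, ]
```

Without the seed, each run would produce a different split, and the RMSE values reported later would change from run to run.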
3. Automated Machine Learning with H2O AutoML
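Next, we use H2O’s h2o.automl function to automatically train and evaluate multiple machine learning models for three target variables: gender, education, and age. Because gender is stored as a factor in the training frame, AutoML treats it as a classification problem, while education and age are modeled as regression targets. For each target, AutoML tries several algorithms and ranks the resulting models by performance.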
# Perform automated machine learning
aml_1 <- h2o.automl(x = 1:25, y = 26, training_frame = train, max_models = 10, seed = 1)
aml_2 <- h2o.automl(x = 1:25, y = 27, training_frame = train, max_models = 10, seed = 1)
aml_3 <- h2o.automl(x = 1:25, y = 28, training_frame = train, max_models = 10, seed = 1)
# View the leaderboard of models
lb_1 <- aml_1@leaderboard
lb_2 <- aml_2@leaderboard
lb_3 <- aml_3@leaderboard
# Print leaderboards
print(lb_1, n = nrow(lb_1))
print(lb_2, n = nrow(lb_2))
print(lb_3, n = nrow(lb_3))
The h2o.automl function trains up to 10 models for each target variable; the max_models limit does not count the stacked ensembles that AutoML may add on top. Each leaderboard lists the trained models ranked by a default metric, so the top row is the model used for prediction in the next step. This automates model selection, which makes it easy to find a strong model with little manual effort.
4. Making Predictions and Evaluating RMSE
We now use the best models from the AutoML runs to make predictions on the test set. We calculate the RMSE of the predictions for each target variable to evaluate the accuracy of the H2O models.
# Make predictions using the best models
pred_1 <- as.matrix(h2o.predict(aml_1, test))[,3]
pred_2 <- as.vector(h2o.predict(aml_2, test))
pred_3 <- as.vector(h2o.predict(aml_3, test))
# Combine predictions into a data frame
h2o_predictions <- data.frame(as.numeric(pred_1), pred_2, pred_3)
names(h2o_predictions) <- c("gender","education","age")
# Calculate RMSE
rmse_h2o <- RMSE(Predictions = h2o_predictions, Test_Data = as.data.frame(test)[,26:28])
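Here, we use the h2o.predict function to generate predictions from the leading model of each AutoML run. For gender, which is a classification target, we keep the third column of the prediction frame, the predicted probability of class 1 (male); for education and age the prediction is a single numeric column. The RMSE for each target variable is then calculated using the custom RMSE function, which gives us a performance metric for the H2O models.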
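The column selection above relies on position. A slightly more defensive variant picks the probability column by name. The sketch below uses a small mock data frame with invented values, not real model output, laid out the way H2O returns binomial predictions (columns predict, p0, and p1), assuming the factor levels are 0 and 1 as in this post.

```r
# Mock stand-in for a binomial prediction frame converted to a data frame.
# The values are invented purely to illustrate the layout.
mock_pred <- data.frame(
  predict = factor(c("1", "0", "1")),
  p0 = c(0.2, 0.7, 0.4),
  p1 = c(0.8, 0.3, 0.6)
)

# Select the class-1 (male) probability by name instead of by position
p_male <- mock_pred[["p1"]]

# Round to a hard 0/1 label, as the distribution comparison does later
label <- round(p_male)
print(label)
```

Selecting by name keeps the code correct even if the column order ever differs, and it makes clear that the RMSE for gender is computed on probabilities rather than on hard labels.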
5. Comparing Predicted vs. Actual Distributions
We also compare the predicted and actual distributions of the gender and education variables to evaluate how well the models are performing.
# Gender comparison
h2o_predictions$gender <- factor(round(h2o_predictions$gender, 0), levels = c(1,0), labels = c("Male","Female"))
Predicted_Gender <- table(h2o_predictions$gender)
Actual_Gender <- table(factor(as.vector(ifelse(test$gender == 0,"Female","Male")), levels = c("Male","Female")))
rbind(c("Predicted", Predicted_Gender), c("Actual", Actual_Gender))
# Education comparison
h2o_predictions$education <- factor(round(h2o_predictions$education, 0), levels = 1:5, labels = c("HS","HS complete","Some College","BS/BA","MS/MA"))
Predicted_Edu <- table(h2o_predictions$education)
Actual_Edu <- table(factor(as.vector(test$education), levels = 1:5, labels = c("HS","HS complete","Some College","BS/BA","MS/MA")))
rbind(c("Predicted", Predicted_Edu), c("Actual", Actual_Edu))
This code compares the predicted and actual distributions of gender and education. The two can differ noticeably: predictions for education are rounded from a continuous regression output, so some categories may be predicted rarely or not at all.
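Matching distributions do not by themselves show that individual cases are predicted correctly, so a cross-tabulation is a useful companion to the counts. The following is a minimal sketch on made-up values with three hypothetical education levels; the vectors predicted and actual are illustrative only.

```r
# Made-up continuous predictions rounded to levels 1..3, plus actual levels
levels_edu <- 1:3
predicted <- factor(round(c(1.2, 2.6, 2.9, 3.1)), levels = levels_edu)
actual <- factor(c(1, 2, 3, 3), levels = levels_edu)

# Side-by-side counts per level, with labeled rows
comparison <- rbind(Predicted = table(predicted), Actual = table(actual))
print(comparison)

# Cross-tabulation: the diagonal holds the correctly predicted cases
confusion <- table(Predicted = predicted, Actual = actual)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

Fixing the factor levels up front keeps empty categories in the tables, so a level that is never predicted shows up as a zero instead of silently disappearing.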
6. Traditional Regression Models for Comparison
To provide a benchmark for the H2O models, we build three traditional regression models: a logistic regression for gender, a Poisson regression for education, and a linear regression for age. We compare the RMSE of these models against the H2O models.
# Pull the train and test data out of the H2O frames
train <- as.data.frame(train)
test <- as.data.frame(test)
# Recode gender as numeric 0/1 so it matches the test set
train$gender <- as.numeric(as.character(train$gender))
# Train traditional regression models; naming columns via data = lets predict() use newdata
model_gender <- glm(gender ~ ., data = train[, c(1:25, 26)], family = binomial())
model_education <- glm(education ~ ., data = train[, c(1:25, 27)], family = poisson())
model_age <- lm(age ~ ., data = train[, c(1:25, 28)])
# Make predictions on the response scale
Reg_Predictions <- data.frame(predict(model_gender, test[,1:25], type = "response"),
predict(model_education, test[,1:25], type = "response"),
predict(model_age, test[,1:25]))
# Add column names
names(Reg_Predictions) <- c("gender","education","age")
# Calculate RMSE for regression models
rmse_reg <- RMSE(Predictions = Reg_Predictions, Test_Data = test[,26:28])
We calculate the RMSE for each of the traditional models to compare them against the H2O models. This allows us to see if the automated models offer improved accuracy over more manual methods.
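One detail worth checking when computing RMSE for a Poisson model is the scale of the predictions: predict() on a glm object returns values on the link (log) scale by default, and type = "response" is needed to get values on the scale of the outcome. The sketch below shows the difference on simulated data; toy, fit, and new_obs are hypothetical names used only here.

```r
# Simulated count data with a known log-linear relationship
set.seed(1)
toy <- data.frame(x = runif(50))
toy$y <- rpois(50, lambda = exp(0.5 + toy$x))

# Fit with a formula and data = so that newdata is matched by column name
fit <- glm(y ~ x, data = toy, family = poisson())
new_obs <- data.frame(x = c(0.1, 0.9))

# Default predictions are on the log scale
link_scale <- predict(fit, newdata = new_obs)

# type = "response" returns expected counts
response_scale <- predict(fit, newdata = new_obs, type = "response")

# The two are related through the inverse link, exp()
print(all.equal(exp(link_scale), response_scale))
```

Comparing link-scale predictions against observed counts would inflate the RMSE, so the response scale is the one to use when benchmarking against other models.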
7. RMSE Comparison: H2O AutoML vs. Traditional Regression
Finally, we create a side-by-side comparison of the RMSE values from the H2O AutoML models and the traditional regression models.
# Compare RMSE between H2O and traditional regression models
cbind(Model = c("H2O", "Reg"), rbind(rmse_h2o, rmse_reg))
This table summarizes the performance of the two modeling approaches. Depending on the problem and the data, either method may outperform the other, but H2O AutoML provides a streamlined way to obtain competitive models without manual tuning.
Conclusion
In this post, we demonstrated how to use H2O AutoML to automatically generate predictive models and compared them with traditional regression models. The H2O package offers powerful tools for building accurate models quickly, and the RMSE values help to objectively evaluate the performance of different approaches. By leveraging automated machine learning, data scientists can achieve high-quality models without extensive manual effort.