Neural Networks with R: Predictive Modeling using nnet and Regression Comparisons

Neural networks provide a powerful tool for predictive modeling, capable of capturing complex relationships in data. In this blog post, we will explore how to implement a neural network using the nnet package in R. We will build a neural network model to predict gender, education, and age based on personality test data. Furthermore, we will compare the neural network's performance to traditional regression models using Root Mean Squared Error (RMSE).

1. Setting up the RMSE Function

Before training any models, we define a custom RMSE function to evaluate their performance. For each numeric column of the predictions, the function computes the square root of the mean squared difference between the predicted and actual values.

# Load necessary libraries
library(nnet)
library(psych)

# Function for calculating RMSE, column by column
RMSE <- function(Predictions, Test_Data){
    # NA (rather than 0) marks columns that are skipped because they are not numeric
    RMSE_Results <- rep(NA_real_, ncol(Predictions))
    
    for (i in seq_len(ncol(Predictions))) {
        if (is.numeric(Predictions[, i])) {
            RMSE_Results[i] <- sqrt(mean((Predictions[, i] - Test_Data[, i])^2))
        }
    }
    
    # Return a one-row data frame named after the target columns
    RMSE_Results <- data.frame(matrix(RMSE_Results, nrow=1))
    names(RMSE_Results) <- names(Test_Data)
    return(RMSE_Results)
}
    

The RMSE function calculates how far off our predictions are from the true values, providing a key metric to assess model accuracy.
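To make the calculation concrete, here is a minimal worked example for a single column. The numbers are invented for illustration and do not come from the bfi data.

```r
# Toy values, invented for illustration only
predicted <- c(2, 4, 6)
actual    <- c(1, 4, 8)

errors  <- predicted - actual    # 1, 0, -2
squared <- errors^2              # 1, 0, 4
rmse    <- sqrt(mean(squared))   # square root of 5/3

print(rmse)
```

Because the errors are squared before averaging, the one large miss (off by 2) contributes four times as much as the small one (off by 1).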

2. Preparing the Data

We will be using the bfi dataset from the psych package, which contains data from a personality test. The gender variable is recoded as 0 for females and 1 for males. We split the dataset into training and testing sets for model evaluation.

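# Load and prepare the dataset, keeping only rows without missing values
# (if bfi is not found, your psych version may ship it in the psychTools package)
Data <- data.frame(bfi[complete.cases(bfi),])
# bfi codes gender as 1 = male, 2 = female; recode to 1 = male, 0 = female
Data$gender <- ifelse(Data$gender == 1, 1, 0)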

# Splitting the data into train and test sets
set.seed(123)
In_train <- sample(1:nrow(Data), round(nrow(Data) / 2, 0), replace = FALSE)
train <- Data[In_train,]
test <- Data[-In_train,]
    

This code drops every respondent with a missing value (complete.cases), recodes gender, and assigns a random half of the remaining rows to the training set and the rest to the test set. Setting a seed makes the random split reproducible.
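A quick sanity check on a split like this is that the two halves share no rows and together cover the whole dataset. The sketch below uses a small made-up data frame (toy, with hypothetical columns x and y) in place of bfi so that it runs on its own.

```r
# Hypothetical toy data standing in for bfi; the row containing NA is dropped first
toy <- data.frame(x = c(1, 2, NA, 4, 5, 6, 7),
                  y = c(10, 20, 30, 40, 50, 60, 70))
toy <- toy[complete.cases(toy), ]   # 6 rows remain

set.seed(123)
in_train  <- sample(seq_len(nrow(toy)), round(nrow(toy) / 2), replace = FALSE)
toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

# The two halves should not share rows and should add up to the full data
shared_rows <- intersect(rownames(toy_train), rownames(toy_test))
n_total     <- nrow(toy_train) + nrow(toy_test)

print(length(shared_rows))
print(n_total == nrow(toy))
```

The same two checks can be run on train and test from the post.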

3. Building a Neural Network

We use the nnet package to build a neural network with a single hidden layer of 10 units. One model is trained to predict all three target variables at once: gender, education, and age.

# Train the neural network
set.seed(123)
model <- nnet(x = train[,1:25], y = train[,26:28], 
              size = 10, maxit = 10^6, MaxNWts = 10^6, linout = TRUE)
    

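The maxit parameter caps the number of optimizer iterations, so a large value lets training run until the convergence criterion is met instead of stopping early; it does not by itself guarantee convergence. MaxNWts caps the number of weights nnet will accept and has no effect on convergence; raising it only matters for networks larger than the package default allows. The linout = TRUE argument requests linear output units, which suits regression tasks and means that gender and education are also treated as numeric targets here.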
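To see how large this network actually is, the weight count can be worked out by hand and checked against a fitted model. The sketch below is an illustration on synthetic random data with the same shape as our problem (25 inputs, 10 hidden units, 3 outputs); the variable names are made up for this example, and maxit is kept small only to make it run quickly.

```r
library(nnet)

n_inputs  <- 25
n_hidden  <- 10
n_outputs <- 3

# Each hidden unit has one weight per input plus a bias;
# each output unit has one weight per hidden unit plus a bias
n_weights <- (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
print(n_weights)

# Synthetic data, invented for illustration only
set.seed(1)
x <- matrix(rnorm(50 * n_inputs), ncol = n_inputs)
y <- matrix(rnorm(50 * n_outputs), ncol = n_outputs)

fit <- nnet(x, y, size = n_hidden, linout = TRUE, maxit = 20, trace = FALSE)

print(length(fit$wts))    # number of weights in the fitted network
print(fit$convergence)    # 1 if the iteration cap was reached, otherwise 0
```

With 293 weights, the network fits well within nnet's default MaxNWts of 1000. Checking model$convergence after training is a direct way to see whether the optimizer stopped on its own or ran into maxit.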

4. Making Predictions and Evaluating RMSE

After training the neural network, we use it to make predictions on the test set. The RMSE function is then applied to evaluate the model's accuracy.

# Make predictions using the neural network
NN_Predictions <- data.frame(predict(model, test[,1:25]))

# Calculate RMSE for the neural network
rmse_nnet <- RMSE(Predictions = NN_Predictions, Test_Data = test[,26:28])
    

This gives us the RMSE for the neural network model, helping us assess how well it performs compared to traditional models.

5. Comparing Predicted and Actual Distributions

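Gender and education are categorical, but the network returns numeric values for them. We round each prediction to the nearest integer code, convert it to a labeled factor, and compare the predicted and actual distributions. Rounded values that fall outside the valid codes become NA and are left out of the counts.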

# Predicted distribution of gender
NN_Predictions$gender <- factor(round(NN_Predictions$gender,0), levels = c(1,0), labels = c("Male","Female"))
Predicted_Gender <- table(NN_Predictions$gender)

# Actual distribution of gender
Actual_Gender <- table(factor(ifelse(test$gender == 0, "Female", "Male"), levels = c("Male", "Female")))

# Summary of predicted and actual gender (counts stay numeric, rows are labeled)
rbind(Predicted = Predicted_Gender, Actual = Actual_Gender)
    

This code snippet shows how to compare the neural network's predictions with the actual gender values. A similar process is used for the education variable.

# Predicted distribution of education
NN_Predictions$education <- factor(round(NN_Predictions$education, 0), levels = 1:5, 
                                   labels = c("HS", "HS complete", "Some College", "BS/BA", "MS/MA"))
Predicted_Edu <- table(NN_Predictions$education)

# Actual distribution of education
Actual_Edu <- table(factor(test$education, levels = 1:5, 
                labels = c("HS", "HS complete", "Some College", "BS/BA", "MS/MA")))

# Summary of predicted and actual education (counts stay numeric, rows are labeled)
rbind(Predicted = Predicted_Edu, Actual = Actual_Edu)
    

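Education has five ordered levels, yet the network treats it as a continuous target, so it may struggle here. A poor match between the predicted and actual distributions, for example few or no predictions in the lowest and highest categories, is a sign of that. Note also that matching distributions do not show whether individual respondents were classified correctly.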
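Two follow-up checks can help when reading these tables: how many predictions were lost because they rounded to a code outside 1 to 5, and how often the predicted category matches the actual one case by case. The sketch below uses invented numbers, and clamping out-of-range values to the nearest valid code is an assumption made for this example, not something the code above does.

```r
edu_labels <- c("HS", "HS complete", "Some College", "BS/BA", "MS/MA")

# Invented raw network outputs and true codes, for illustration only
raw_pred <- c(0.4, 2.6, 5.7, 3.2, 3.9)
actual   <- c(1, 3, 5, 4, 4)

# Without clamping, rounded values of 0 and 6 are not valid levels and become NA
unclamped <- factor(round(raw_pred), levels = 1:5, labels = edu_labels)
n_lost    <- sum(is.na(unclamped))

# Assumption: pull out-of-range predictions back to the nearest valid code
clamped  <- pmin(pmax(round(raw_pred), 1), 5)
pred_f   <- factor(clamped, levels = 1:5, labels = edu_labels)
actual_f <- factor(actual,  levels = 1:5, labels = edu_labels)

# Cross-tabulation: the diagonal holds the correctly classified cases
confusion   <- table(Predicted = pred_f, Actual = actual_f)
exact_match <- mean(clamped == actual)

print(n_lost)
print(confusion)
print(exact_match)
```

Unlike the side-by-side distribution table, the cross-tabulation shows which categories are being confused with which.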

6. Visualizing Predicted vs Actual for Age

Next, we plot the predicted vs. actual values for the age variable to visualize how well the neural network captures this continuous target variable.

# Plot predicted vs actual for age
plot(test[,28], NN_Predictions[,3], main = "Predicted vs Actual Neural Network",
     xlab = "Actual", ylab = "Predicted")
abline(0,1)
    

This plot helps to visualize the model's performance on a continuous outcome like age. Ideally, points should lie along the diagonal line (indicating perfect predictions).

7. Benchmarking with Traditional Regression Models

To compare the neural network's performance, we also fit three traditional regression models on the training set: a logistic regression for gender, a Poisson regression for education, and a linear regression for age. The models are specified with named columns and a data argument so that predict() can score the test set, and predictions are requested on the response scale (probabilities for gender, expected values for education). We then calculate their RMSE for comparison.

# Traditional regression models, fit by formula so predict() can use new data
predictors <- names(train)[1:25]
model_gender <- glm(reformulate(predictors, response = "gender"), data = train, family = binomial())
model_education <- glm(reformulate(predictors, response = "education"), data = train, family = poisson())
model_age <- lm(reformulate(predictors, response = "age"), data = train)

# Predictions for the test set, on the response scale
Reg_Predictions <- data.frame(predict(model_gender, newdata = test, type = "response"),
                              predict(model_education, newdata = test, type = "response"),
                              predict(model_age, newdata = test))

names(Reg_Predictions) <- c("gender", "education", "age")

# Calculate RMSE for regression models
rmse_reg <- RMSE(Predictions = Reg_Predictions, Test_Data = test[,26:28])
    

These traditional models serve as benchmarks for comparing the performance of the neural network. By comparing RMSE, we can see which method performs better for each variable.
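When benchmarking on held-out data, it is worth confirming that predict() really scores the test rows. The sketch below, on invented toy data with hypothetical column names, contrasts two ways of specifying a model. A formula built from as.matrix(...) expressions keeps referring to the training object, so newdata is effectively ignored; a formula with named columns plus a data argument picks up newdata as intended.

```r
# Toy data, invented for illustration only
set.seed(42)
toy_train   <- data.frame(x1 = rnorm(6), x2 = rnorm(6))
toy_train$y <- 1 + 2 * toy_train$x1 - toy_train$x2 + rnorm(6, sd = 0.1)
toy_test    <- data.frame(x1 = rnorm(3), x2 = rnorm(3))

# Matrix term: the formula points at toy_train itself, so newdata cannot replace it.
# predict() warns about the row mismatch here; with equal-sized train and test sets
# there would be no warning at all.
fit_matrix  <- lm(toy_train$y ~ as.matrix(toy_train[, c("x1", "x2")]))
pred_matrix <- suppressWarnings(predict(fit_matrix, newdata = toy_test))

# Named columns plus data =: predict() looks up x1 and x2 in newdata
fit_formula  <- lm(y ~ x1 + x2, data = toy_train)
pred_formula <- predict(fit_formula, newdata = toy_test)

print(length(pred_matrix))    # one value per training row
print(length(pred_formula))   # one value per test row
```

A one-line check such as comparing the number of predictions with nrow(test) does not catch this when the two sets happen to be the same size, so the model specification itself is the safer thing to inspect.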

8. RMSE Comparison: Neural Network vs. Regression

We now summarize the RMSE results for both the neural network and the traditional regression models.

# RMSE comparison (one row per model, RMSE values kept numeric)
data.frame(Model = c("NNet", "Reg"), rbind(rmse_nnet, rmse_reg))
    

The table provides a direct comparison of the two approaches, allowing us to see how well the neural network performs against traditional methods for each target variable.
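RMSE values are easier to judge against a naive reference point. One common choice, added here as a suggestion rather than as part of the original analysis, is a baseline that predicts the training-set mean for every test case. The sketch uses invented ages so it runs on its own.

```r
# Invented ages, for illustration only
toy_train_age <- c(20, 25, 30, 35, 40)
toy_test_age  <- c(22, 28, 34, 46)

# Baseline: predict the training mean for every test case
baseline_pred <- rep(mean(toy_train_age), length(toy_test_age))
baseline_rmse <- sqrt(mean((baseline_pred - toy_test_age)^2))

print(baseline_rmse)
```

Applied to the post's data, the equivalent would be mean(train$age) scored against test$age. A model whose RMSE is not clearly below this baseline has learned little about that target from the personality items.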

9. Visualizing Predicted vs Actual for Regression

Finally, we plot the predicted vs. actual values for age using the linear regression model to compare its performance with the neural network.

# Plot predicted vs actual for regression
plot(test[,28], Reg_Predictions[,3], main = "Predicted vs Actual Regression",
     xlab = "Actual", ylab = "Predicted")
abline(0,1)
    

This plot shows the regression model's age predictions in the same layout as the neural network plot in section 6, so the two can be compared side by side; points closer to the diagonal indicate better predictions.

Conclusion

In this post, we implemented a neural network using the nnet package in R and compared its performance to traditional regression models using RMSE. While neural networks offer the potential to capture complex relationships, their performance may vary depending on the dataset and tuning. Comparing RMSE values allows us to objectively evaluate the strengths and weaknesses of each approach for specific tasks like predicting gender, education, and age.
