Neural Networks with R: Predictive Modeling using nnet and Regression Comparisons
Neural networks provide a powerful tool for predictive modeling, capable of capturing complex relationships in data. In this blog post, we will explore how to implement a neural network using the nnet package in R. We will build a neural network model to predict gender, education, and age based on personality test data. Furthermore, we will compare the neural network's performance to traditional regression models using Root Mean Squared Error (RMSE).
1. Setting up the RMSE Function
Before training any models, we define a custom RMSE function to evaluate their performance. The RMSE function computes the square root of the mean squared difference between predictions and actual values for each column in the dataset.
# Load necessary libraries
library(nnet)
library(psych)
# Function for calculating RMSE, one value per column
RMSE <- function(Predictions, Test_Data) {
  # NA marks columns that are not numeric and therefore have no RMSE
  RMSE_Results <- rep(NA_real_, ncol(Predictions))
  for (i in seq_len(ncol(Predictions))) {
    if (is.numeric(Predictions[, i])) {
      RMSE_Results[i] <- sqrt(mean((Predictions[, i] - Test_Data[, i])^2))
    }
  }
  RMSE_Results <- data.frame(matrix(RMSE_Results, nrow = 1))
  names(RMSE_Results) <- names(Test_Data)
  return(RMSE_Results)
}
The RMSE function calculates how far off our predictions are from the true values, providing a key metric to assess model accuracy.
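As a quick sanity check of the formula, the sketch below computes RMSE on a few hand-made numbers chosen so the result is easy to verify by hand. It also shows a vectorized, column-wise variant. The helper name rmse_by_column and all values are hypothetical and are not part of the code above.

```r
# Hand-made example: only the last prediction is off, by 2
actual    <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 6)
rmse_single <- sqrt(mean((predicted - actual)^2))
rmse_single  # 1, since the squared errors are 0, 0, 0, 4 and their mean is 1

# Hypothetical vectorized helper: one RMSE per column, no explicit loop
rmse_by_column <- function(predictions, observed) {
  sqrt(colMeans((as.matrix(predictions) - as.matrix(observed))^2))
}

obs  <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
pred <- data.frame(a = c(1, 2, 3, 6), b = c(13, 16, 30, 45))
rmse_cols <- rmse_by_column(pred, obs)
rmse_cols
```

The loop-based version and the vectorized one should agree on numeric columns; the vectorized form simply assumes every column is numeric.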
2. Preparing the Data
We will be using the bfi dataset from the psych package, which contains data from a personality test. The gender variable is recoded as 0 for females and 1 for males. We split the dataset into training and testing sets for model evaluation.
# Load and prepare the dataset
Data <- data.frame(bfi[complete.cases(bfi),])
Data$gender <- ifelse(Data$gender == 1, 1, 0)
# Splitting the data into train and test sets
set.seed(123)
In_train <- sample(1:nrow(Data), round(nrow(Data) / 2, 0), replace = FALSE)
train <- Data[In_train,]
test <- Data[-In_train,]
This code prepares the data by dropping rows with missing values (via complete.cases) and recoding gender. Setting a seed makes the random split into two halves reproducible.
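The split logic can be checked on a small stand-in data frame. This is a minimal sketch, assuming a made-up 10-row table named toy in place of the real data: it confirms that the two halves do not overlap, that together they cover every row, and that the same seed reproduces the same indices.

```r
# Toy stand-in for the real data: 10 rows with an id column (hypothetical)
toy <- data.frame(id = 1:10, x = (1:10) / 10)

set.seed(123)  # same seed, same split
in_train <- sample(seq_len(nrow(toy)), round(nrow(toy) / 2), replace = FALSE)
toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

# The two halves should not overlap and should cover every row
overlap <- intersect(toy_train$id, toy_test$id)
covered <- sort(c(toy_train$id, toy_test$id))

# Re-running with the same seed reproduces the same indices
set.seed(123)
in_train_again <- sample(seq_len(nrow(toy)), round(nrow(toy) / 2), replace = FALSE)
identical(in_train, in_train_again)
```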
3. Building a Neural Network
We use the nnet package to build a neural network with 10 hidden nodes. The model is trained to predict three target variables: gender, education, and age.
# Train the neural network
set.seed(123)
model <- nnet(x = train[,1:25], y = train[,26:28],
size = 10, maxit = 10^6, MaxNWts = 10^6, linout = TRUE)
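The maxit parameter is set to a high value so that training is not cut off by the iteration limit before the optimizer converges, and MaxNWts raises the cap on the number of weights the network is allowed to have. The linout = TRUE argument specifies linear output units, making the network suitable for regression tasks.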
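To get a feel for these settings, the following sketch counts the weights of a 25-10-3 network and fits a small one on simulated data. The data are random numbers used purely for illustration, and the low maxit is chosen on purpose to show how the iteration limit is reported. With 293 weights, this architecture stays below the default MaxNWts of 1000 given in the nnet documentation, so the large cap acts as a safety margin rather than a requirement.

```r
library(nnet)

# Weight count for one hidden layer: each hidden and output unit has a bias
n_inputs <- 25; n_hidden <- 10; n_outputs <- 3
n_weights <- (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
n_weights  # 293

# Small simulated data set (random numbers, for illustration only)
set.seed(1)
x <- matrix(rnorm(100 * n_inputs), ncol = n_inputs)
y <- matrix(rnorm(100 * n_outputs), ncol = n_outputs)

# A deliberately low maxit so the optimizer is likely to stop early
fit <- nnet(x, y, size = n_hidden, linout = TRUE, maxit = 5, trace = FALSE)

length(fit$wts)   # number of fitted weights
fit$convergence   # 1 means the iteration limit was reached, 0 otherwise
```

Checking the convergence component after a real fit is a cheap way to confirm that the chosen maxit was actually large enough.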
4. Making Predictions and Evaluating RMSE
After training the neural network, we use it to make predictions on the test set. The RMSE function is then applied to evaluate the model's accuracy.
# Make predictions using the neural network
NN_Predictions <- data.frame(predict(model, test[,1:25]))
# Calculate RMSE for the neural network
rmse_nnet <- RMSE(Predictions = NN_Predictions, Test_Data = test[,26:28])
This gives us the RMSE for the neural network model, helping us assess how well it performs compared to traditional models.
5. Comparing Predicted and Actual Distributions
For categorical variables like gender and education, we convert the predictions into factors and compare the predicted vs. actual distributions.
# Predicted distribution of gender
NN_Predictions$gender <- factor(round(NN_Predictions$gender,0), levels = c(1,0), labels = c("Male","Female"))
Predicted_Gender <- table(NN_Predictions$gender)
# Actual distribution of gender
Actual_Gender <- table(factor(as.vector(ifelse(test$gender == 0, "Female", "Male")), levels = c("Male", "Female")))
# Summary of predicted and actual gender
rbind(c("Predicted", Predicted_Gender), c("Actual", Actual_Gender))
This code snippet shows how to compare the neural network's predictions with the actual gender values. A similar process is used for the education variable.
# Predicted distribution of education
NN_Predictions$education <- factor(round(NN_Predictions$education, 0), levels = 1:5,
labels = c("HS", "HS complete", "Some College", "BS/BA", "MS/MA"))
Predicted_Edu <- table(NN_Predictions$education)
# Actual distribution of education
Actual_Edu <- table(factor(as.vector(test$education), levels = 1:5,
labels = c("HS", "HS complete", "Some College", "BS/BA", "MS/MA")))
# Summary of predicted and actual education
rbind(c("Predicted", Predicted_Edu), c("Actual", Actual_Edu))
Here, we can see that the neural network may struggle with some targets, such as education, where the predicted distribution can match the actual one poorly. Note that rounded predictions falling outside the listed levels become NA and are left out of these tables.
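Because the output units are linear, a raw prediction can round to a value that is not a valid class. The sketch below uses made-up raw outputs for a 0/1 target to show the effect, one way to guard against it by clamping before rounding, and a cross-tabulation that compares predictions case by case instead of only by totals. The clamping step is an assumption added for illustration and is not part of the code above.

```r
# Hypothetical raw outputs from a linear output unit for a 0/1 target
raw <- c(-0.6, 0.2, 0.7, 1.6)

# Rounding alone yields -1, 0, 1, 2; values outside the levels become NA
unclamped <- factor(round(raw), levels = c(1, 0), labels = c("Male", "Female"))

# Clamping to [0, 1] before rounding keeps every prediction in a valid class
clamped_raw <- pmin(pmax(raw, 0), 1)
clamped <- factor(round(clamped_raw), levels = c(1, 0), labels = c("Male", "Female"))

# Cross-tabulating against (made-up) true labels shows per-case agreement,
# which a comparison of marginal counts cannot
truth <- factor(c("Female", "Female", "Male", "Female"), levels = c("Male", "Female"))
confusion <- table(Predicted = clamped, Actual = truth)
confusion
```

Two distributions can have identical totals while disagreeing on many individual cases, so a table like this is a useful complement to the side-by-side counts.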
6. Visualizing Predicted vs Actual for Age
Next, we plot the predicted vs. actual values for the age variable to visualize how well the neural network captures this continuous target variable.
# Plot predicted vs actual for age
plot(test[,28], NN_Predictions[,3], main = "Predicted vs Actual Neural Network",
xlab = "Actual", ylab = "Predicted")
abline(0,1)
This plot helps to visualize the model's performance on a continuous outcome like age. Ideally, points should lie along the diagonal line (indicating perfect predictions).
7. Benchmarking with Traditional Regression Models
To compare the neural network's performance, we also fit three traditional regression models: a logistic regression for gender, a Poisson regression for education, and a linear regression for age. We calculate their RMSE for comparison.
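# Traditional regression models
# Fit with named columns and data = ..., so predict() can match newdata by name;
# wrapping the predictors in as.matrix() would make predict() silently reuse the training rows
model_gender <- glm(gender ~ ., data = train[, 1:26], family = binomial())
model_education <- glm(education ~ ., data = train[, c(1:25, 27)], family = poisson())
model_age <- lm(age ~ ., data = train[, c(1:25, 28)])
# Predictions for the regression models on the test set
# type = "response" returns probabilities (binomial) and expected values (Poisson)
Reg_Predictions <- data.frame(predict(model_gender, newdata = test[, 1:25], type = "response"),
                              predict(model_education, newdata = test[, 1:25], type = "response"),
                              predict(model_age, newdata = test[, 1:25]))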
names(Reg_Predictions) <- c("gender", "education", "age")
# Calculate RMSE for regression models
rmse_reg <- RMSE(Predictions = Reg_Predictions, Test_Data = test[,26:28])
These traditional models serve as benchmarks for comparing the performance of the neural network. By comparing RMSE, we can see which method performs better for each variable.
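One detail that matters when GLM predictions feed into RMSE is the prediction scale. The sketch below, on simulated count data that has nothing to do with the personality dataset, shows that predict() on a Poisson model returns values on the log (link) scale by default, and that type = "response" returns expected counts, which is the scale comparable to the observed values.

```r
# Simulated count data (illustration only)
set.seed(42)
sim <- data.frame(x = rnorm(200))
sim$y <- rpois(200, lambda = exp(0.5 + 0.3 * sim$x))

# Fitting with a formula and data = ... lets predict() look up x in newdata
fit <- glm(y ~ x, data = sim, family = poisson())
new_points <- data.frame(x = c(-1, 0, 1))

link_scale     <- predict(fit, newdata = new_points)                     # log of expected count
response_scale <- predict(fit, newdata = new_points, type = "response")  # expected count

# The two scales are related through the inverse link, exp() for Poisson
all.equal(exp(link_scale), response_scale)
length(response_scale)  # one prediction per row of new_points
```

Computing RMSE from link-scale predictions would compare logged expectations against raw outcomes, which inflates the error for reasons unrelated to model quality.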
8. RMSE Comparison: Neural Network vs. Regression
We now summarize the RMSE results for both the neural network and the traditional regression models.
# RMSE comparison
rbind(c("NNet", rmse_nnet), c("Reg", rmse_reg))
The table provides a direct comparison of the two approaches, allowing us to see how well the neural network performs against traditional methods for each target variable.
9. Visualizing Predicted vs Actual for Regression
Finally, we plot the predicted vs. actual values for age using the linear regression model to compare its performance with the neural network.
# Plot predicted vs actual for regression
plot(test[,28], Reg_Predictions[,3], main = "Predicted vs Actual Regression",
xlab = "Actual", ylab = "Predicted")
abline(0,1)
This plot allows us to visually compare the performance of the regression model for predicting age, with points closer to the diagonal indicating better predictions.
Conclusion
In this post, we implemented a neural network using the nnet package in R and compared its performance to traditional regression models using RMSE. While neural networks offer the potential to capture complex relationships, their performance may vary depending on the dataset and tuning. Comparing RMSE values allows us to objectively evaluate the strengths and weaknesses of each approach for specific tasks like predicting gender, education, and age.