Missing Data Imputation for Machine Learning

Github link to full R code: github.com/statswithrdotcom/Missing-Data-Imputation

In this article, I will go over methods of data imputation for training machine learning models, which contrasts with what should be done if hypothesis testing is the goal. These techniques are inappropriate for hypothesis testing because they do not account for the uncertainty in the imputed data and will produce narrower confidence intervals than are warranted. However, if you are training a neural network on thousands of rows of data and have missingness, these methods could be a good solution. I like to start my R scripts by loading every package I am using, so I will put that code down first, along with the code that loads the data. I will be using the “iris” data set for this example, but these methods have also been validated on the “bfi” and “gapminder” data sets. To switch between the data sets, you simply need to uncomment the assignment of the “Starting” variable.

library(psych)

library(magrittr)

library(missForest)

library(doParallel)

library(Amelia)

library(mice)

library(gapminder)

# Trial Data sets

Starting <- iris

#Starting <- bfi[1:1000,]

#Starting <- gapminder[1:1000,]

The first method is not really a method of imputation, but it is the easiest if you only have occasional missing values: simply keep the rows with no missing values, which is called list-wise deletion. If you have many missing values, list-wise deletion may cost you more data than you would like, but it is a real solution when the number of missing values is low. I am running this code to ensure that my test data set(s) start out with no missing values.

# Takes only the complete cases for this tutorial

# This is literally all of the code for list-wise deletion

Complete <- Starting[complete.cases(Starting),]
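If you prefer a built-in, base R’s na.omit() does the same thing in a single call:

# Equivalent base R one-liner for list-wise deletion

Complete <- na.omit(Starting)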

Now, I am going to take the “Complete” data and artificially create missing data for the sake of this tutorial. The rate of missingness is defined by the “noNA” argument and is set to 20%.

# Randomly makes 20% of the data missing for imputation

Data <- prodNA(Complete, noNA = 0.2)
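One caveat: prodNA() places the NAs at random, so the exact cells that go missing (and therefore the RMSE values later in this article) will differ between runs. If you want reproducible results, set a seed before the call, for example:

# Optional: fix the seed so the same cells go missing on every run

set.seed(42)

Data <- prodNA(Complete, noNA = 0.2)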

Before I impute the data, I want to understand my data set: the percent missingness, mean, mode, and number of categories of each column. Toward that goal, I am going to create two helper functions. The first is similar to the class() function in R, but it uses more natural language (in my opinion). “Detect_Type” tells the user whether a column or vector of data is “Invariant” (does not change at all), “Numeric”, “Binary”, or “Multicategorical”. You will find that the output of this function is more useful than class(), especially for the gapminder data set.

# Determines the type of vector that is given to it.

Detect_Type <- function(String){

# If the vector contains a single unchanging value it is labelled invariant

if(length(table(String)) == 1) {return("Invariant")}

# If the vector is numeric it is labelled as such

if(is.numeric(String) == TRUE) {return("Numeric")}

# If the vector is categorical it is labelled appropriately

if(is.numeric(String) != TRUE){

ifelse(length(table(String)) == 2,

return("Binary"),return("Multicategorical"))

}

} # End of function

# Demonstration of what it does

Detect_Type(Data[,1])

# A similar, though less useful, answer without the custom function (especially for gapminder)

class(Data[,1])
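Because “Detect_Type” works on any single vector, you can also apply it to every column at once with base R’s sapply():

# Column types for the whole data set in one call

sapply(Data, Detect_Type)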

The next function is simpler; it calculates the percent of missing values in a column or vector. A dedicated function may not be strictly necessary, but it forces the data into a consistent type, which can prevent errors.

# Determines the percent missingness in a vector

Detect_Missingness <- function(String){

# Forces lists to be a matrix

String = as.matrix(String)

# Percent missing values

Percent_Missing = mean(is.na(String))*100

#returns this percentage

return(Percent_Missing)

} # End of function

# Demonstration of the function

Detect_Missingness(Data[,1])
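The same sapply() trick reports the missingness of every column at once, and for comparison, base R’s colMeans(is.na()) gives the same percentages without a helper function:

# Percent missing for every column at once

sapply(Data, Detect_Missingness)

# Equivalent base R one-liner

colMeans(is.na(Data)) * 100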

Now we get to the full function that gives a better idea of what is going on with the data; it is also used by the subsequent imputation functions. It takes a data set as input and returns a collection of relevant information, as promised. The names(Meta_Data) assignment inside the function lists everything that is calculated.

# Generates meta data about a data set given to it.

Meta_Data <- function(data, Show_Plot = TRUE){

# Ensures matrices are converted to data frames.

data = data.frame(data)

# Initializes a progress bar for the meta data calculations

print("Meta Data Generation Progress Bar")

pb <- txtProgressBar(min = 0, max = ncol(data), style = 3)

# Creates an empty data frame and assigns column names

Meta_Data <- data.frame(matrix(NA, nrow = ncol(data), ncol = 9))

names(Meta_Data) <- c("Column Name", "Data Type", "Percent Missing",

"Mean", "Median", "Var.", "Std. Dev.", "N Categories","Mode")

# Loops through each column and generates the meta data

for(i in 1:ncol(data)){ # Start of loop

# Adds the column name, type, and missingness

Meta_Data[i,1] <- names(data)[i]

Meta_Data[i,2] <- Detect_Type(data[,i])

Meta_Data[i,3] <- Detect_Missingness(data[,i])

# If the data is numeric it gives information about it

if(Meta_Data[i,2] == "Numeric") {

Meta_Data[i,4] <- mean(data[,i], na.rm = TRUE)

Meta_Data[i,5] <- median(data[,i], na.rm = TRUE)

Meta_Data[i,6] <- var(data[,i], na.rm = TRUE)

Meta_Data[i,7] <- sd(data[,i], na.rm = TRUE)}

# If the data is categorical it gives the number of categories and mode

if(Meta_Data[i,2] == "Binary" | Meta_Data[i,2] == "Multicategorical"){

Meta_Data[i,8] <- length(table(data[,i]))

Meta_Data[i,9] <- names(table(data[,i])[which.max(table(data[,i]))])}

# changes the progress bar

setTxtProgressBar(pb,i)

} # End of loop

# Creates the optional Missingness bar plot

if(Show_Plot == TRUE) {barplot(Meta_Data$`Percent Missing`,

names.arg = Meta_Data$`Column Name`,

ylim = c(0,max(ceiling(Meta_Data$`Percent Missing`))),

main = "Missingness Plot", xlab = "Column names", ylab = "Percent Missing")}

# Closes the progress bar

close(pb)

# Returns the data frame as the output of the function

return(Meta_Data)

} # End of function

# Assigns the metadata variable and prints it to console

(M_Data <- Meta_Data(data = Data))
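Since the Amelia package is already loaded, its missmap() function offers a complementary view to the bar plot above: a heatmap showing exactly which cells are missing, rather than per-column percentages.

# Heatmap of missing versus observed cells (from the Amelia package)

missmap(Data)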

We are making progress toward the actual imputation methods; I just need to introduce two more custom functions. The first, “Missing_Matrix”, creates a matrix with the same dimensions as the data, with a 1 marking the location of each missing value. Computing this once removes the calculation from each subsequent function, since the result can be passed as an argument to every imputation function. That said, I have designed those functions to run even if the missing matrix is not given.

# Creates a binary matrix that represents missing values

Missing_Matrix <- function(data){

# An empty matrix for the nested loops below to fill in

Missing_Matrix <- matrix(0, nrow = nrow(data), ncol = ncol(data))

# Initializes the progress bar for the calculation below

print("Missing Matrix Generation Progress Bar")

pb <- txtProgressBar(min = 0, max = nrow(data)*ncol(data), style = 3)

# A counter to track the progress of the missing data

counter <- 0

# Searches for missing data and adds a 1 in the missing data matrix

for(i in 1:nrow(data)){

for(j in 1:ncol(data)){

if(is.na(data[i,j]) == TRUE){Missing_Matrix[i,j] = 1}

setTxtProgressBar(pb,counter) # For progress bar

counter = counter + 1 # For progress bar

}

}

# Closes the progress bar

close(pb)

# Returns the missing matrix

return(Missing_Matrix)

} # End of function

# Assigns the missingness matrix as M_Mat

M_Mat <- Missing_Matrix(Data)
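As an aside, because is.na() already returns a logical matrix when given a data frame, the same matrix can be built in one vectorized line; the loop above just makes the bookkeeping (and the progress bar) explicit.

# One-line equivalent: logical TRUE/FALSE coerced to 1/0

M_Mat <- is.na(Data) * 1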

Next, I am creating a function that compares the original “Complete” data with the imputed data, to get a sense of how the imputation methods compare to each other. To do this, the function below calculates the root mean squared error (RMSE) between the original and imputed values, so it requires the imputed data, the original data, and the missing matrix we just created. This comparison is impossible with real missing data, where the true values are unknown, but in the context of this example it gives a rough idea of imputation accuracy. The example below demonstrates the RMSE calculation if we imputed every missing value with the number 1, which inevitably results in a high value.

# Function for comparing real data and imputed data

Imputation_RMSE <- function(Imputed_Set, Original_Set, Missing_Mat){

# Forces the data into a data frame

Original_Set = data.frame(Original_Set)

# Initializes variables and the progress bar

SE <- 0; Counter <- 0; Imp <- Imputed_Set; Runs <- 0

print("Imputation RMSE Calculation Progress Bar")

pb <- txtProgressBar(min = 0, max = nrow(Imp)*ncol(Imp), style = 3)

# Loops through the data sets and calculates the difference between

# actual and imputed values using the missing matrix as a guide

for(i in 1:nrow(Imputed_Set)){

for(j in 1:ncol(Imputed_Set)){

setTxtProgressBar(pb,Runs) # For progress bar

Runs = Runs + 1 # For progress bar

if(Missing_Mat[i,j] == 1){ # Where the missing mat is used

if(is.numeric(Original_Set[i,j]) == TRUE){ # Checks that it is numeric data

SE = SE + (Imputed_Set[i,j] - Original_Set[i,j])^2 # Sum of Squared differences

Counter = Counter + 1 # Counter used for averaging below

}

}

}

}

# Closes the progress bar

close(pb)

# Steps to create and return the desired RMSE value

MSE <- SE/Counter; RMSE <- MSE^(1/2); return(RMSE)

} # End of function

# Demonstration where every imputed value is the number 1, high RMSE...

Imputation_RMSE(Imputed_Set = matrix(1,nrow = nrow(Complete),ncol = ncol(Complete)),

Original_Set = Complete, Missing_Mat = M_Mat)

The first imputation method is mean imputation, where a missing value in a column is imputed with the mean of that column, unless the column is categorical and has no mean, in which case the mode is used instead. You may not want to use mean imputation because it distorts the relationships between variables. That said, mean and median imputation are by far the most computationally efficient methods and may be the only practical option for large data sets (>10,000 rows). In general, mean and median imputation are the least accurate and most efficient precisely because the relationships among variables are not used in the imputation.

# Mean imputation function that uses the pre-made metadata and missing mat functions.

Mean_Imputation <- function(data, Meta_Dat = NA, Missing_Mat = NA){

# Forces the data into a data frame

data = data.frame(data)

# Turns character columns into factor columns

for(i in 1:ncol(data)){

if(is.character(data[,i]) == TRUE){

data[,i] = factor(data[,i])

}

}

# Generates Metadata if a two-dimensional matrix is not provided.

if(length(dim(Meta_Dat)) != 2){

Meta_Dat <- Meta_Data(data, Show_Plot = FALSE)}

# Generates the Missing Data Matrix if it is not provided.

if(length(dim(Missing_Mat)) != 2){

Missing_Mat <- Missing_Matrix(data)}

# Initializes the progress bar and a counter for it

print("Mean Imputation Progress Bar")

pb <- txtProgressBar(min = 0, max = nrow(data)*ncol(data), style = 3)

Counter = 0

# The mean imputation loops; if a column is categorical, mode imputation is used

for(i in 1:nrow(data)){

for(j in 1:ncol(data)){

setTxtProgressBar(pb,Counter) # For progress bar

Counter = Counter + 1 # For progress bar

if(Missing_Mat[i,j] == 1){

if(is.numeric(data[i,j]) == TRUE){data[i,j] = Meta_Dat$Mean[j]}

if(is.numeric(data[i,j]) == FALSE){data[i,j] = Meta_Dat$Mode[j]}

}

}

}

# Closes the progress bar

close(pb)

# Returns the imputed data

return(data)

} # End of function

# The mean imputed data set assigned to "Mean_Test"

Mean_Test <- Mean_Imputation(Data)

# The RMSE value of actual versus imputed values

Mean_Test_RMSE <- Imputation_RMSE(Imputed_Set = Mean_Test, Original_Set = Complete, Missing_Mat = M_Mat)
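As a quick sanity check on the loop-based function, here is a vectorized sketch that computes the same RMSE directly over the numeric columns (the categorical columns contribute nothing to the RMSE in either version):

# Vectorized RMSE over the numeric columns only; should match Mean_Test_RMSE

Num_Cols <- sapply(Complete, is.numeric)

Miss <- M_Mat[, Num_Cols] == 1

sqrt(mean((as.matrix(Mean_Test[, Num_Cols])[Miss] - as.matrix(Complete[, Num_Cols])[Miss])^2))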

Of course, we have a very similar function for median imputation. I have found that median imputation tends to do slightly worse, but some people may prefer this option for handling skewed data and outliers.

# Median Imputation function

Median_Imputation <- function(data, Meta_Dat = NA, Missing_Mat = NA){

# Forces the data into a data frame

data = data.frame(data)

# Turns character columns into factor columns

for(i in 1:ncol(data)){

if(is.character(data[,i]) == TRUE){

data[,i] = factor(data[,i])

}

}

# Generates Metadata if a two-dimensional matrix is not provided.

if(length(dim(Meta_Dat)) != 2){

Meta_Dat <- Meta_Data(data, Show_Plot = FALSE)}

# Generates the Missing Data Matrix if it is not provided.

if(length(dim(Missing_Mat)) != 2){

Missing_Mat <- Missing_Matrix(data)}

# Initializes the progress bar and a counter for it

print("Median Imputation Progress Bar")

pb <- txtProgressBar(min = 0, max = nrow(data)*ncol(data), style = 3)

Counter = 0

# The median imputation loops; if a column is categorical, mode imputation is used

for(i in 1:nrow(data)){

for(j in 1:ncol(data)){

setTxtProgressBar(pb,Counter) # For progress bar

Counter = Counter + 1 # For progress bar

if(Missing_Mat[i,j] == 1){

if(is.numeric(data[i,j]) == TRUE){data[i,j] = Meta_Dat$Median[j]}

if(is.numeric(data[i,j]) == FALSE){data[i,j] = Meta_Dat$Mode[j]}

}

}

}

# Closes the progress bar

close(pb)

# Returns the imputed data frame

return(data)

} # End of function

# Assigns the imputed data to "Median_Test"

Median_Test <- Median_Imputation(Data)

# Calculates the RMSE for Median imputation

Median_Test_RMSE <- Imputation_RMSE(Imputed_Set = Median_Test, Original_Set = Complete, Missing_Mat = M_Mat)

We can now move on to the more computationally intensive methods, which should generally return better imputations as judged by RMSE. The first is Amelia, which uses a bootstrapped version of the EM algorithm to impute the statistically most likely values. The standard amelia() function returns many imputed data sets; the function I have created averages them to produce a single data set that represents the original data as closely as possible. Much of the code below handles the different types of data that can be fed into it. For example, Amelia cannot handle ID variables or multicategorical variables with many categories (more than nine, as implemented here), so missing values in those columns are imputed with the word “Unknown” instead of being left as NA. It is also worth mentioning that Amelia is the only method listed here that will not impute a row consisting entirely of missing values; it will leave the row as missing. If you have this problem, you may want to use the “Mice” or “Random Forest” methods listed below (or just remove the row). On Windows, this function runs on 4 cores by default, as specified by the “Cores” argument. You may need to lower this number if you have processing or memory problems; likewise, if you know you have the resources, you can increase it up to the number of cores on your machine minus one.

# Amelia-based data imputation

Amelia_Imputation <- function(Unimputed_Data, Replicates = 50, Rounding = 8, Cores = 4){

# Forces the data into a data frame

Unimputed_Data = data.frame(Unimputed_Data)

# Generates the metadata

Meta <- Meta_Data(Unimputed_Data, Show_Plot = FALSE)

# Turns character columns into factor columns

for(i in 1:ncol(Unimputed_Data)){

if(is.character(Unimputed_Data[,i]) == TRUE){

Unimputed_Data[,i] = factor(Unimputed_Data[,i])

}

}

# Creates a vector of the categorical variables, if none it will be null

Cat_var <- Meta$`Column Name`[Meta$`Data Type` != "Numeric" & Meta$`N Categories` <= 9]

if(length(Cat_var) == 0){Cat_var = NULL}

# Creates a vector for ID variables, or variables with too many categories

# In this case the values are imputed as Unknown instead of giving a value from the data set

ID_var <- Meta$`Column Name`[Meta$`Data Type` != "Numeric" & Meta$`N Categories` > 9]

for(i in ID_var){

levels(Unimputed_Data[,i]) = c(levels(Unimputed_Data[,i]),"Unknown")

for(j in 1:length(Unimputed_Data[,i])){

if(is.na(Unimputed_Data[j,i]) == TRUE){

Unimputed_Data[j,i] = "Unknown"

}

}

}

if(length(ID_var) == 0){ID_var = NULL}

# Model Output, the "noms" argument treats the categorical variables as nominal.

# On Linux or macOS you can switch the "parallel" argument to "multicore";

# remove the "parallel" and "ncpus" arguments entirely if your operating system is not compatible.

Amelia_Output <- amelia(Unimputed_Data, m = Replicates, p2s = 0, idvars = ID_var,

parallel = "snow", ncpus = Cores, noms = Cat_var)

# An empty matrix to fill with the average of the imputed data sets

Imputation_Mat <- matrix(0, nrow = nrow(Unimputed_Data), ncol = ncol(Unimputed_Data))

# Filling the matrix with these averaged data sets

for(i in 1:Replicates){

for(j in 1:nrow(Unimputed_Data)){

for(k in 1:ncol(Unimputed_Data)){

Imputation_Mat[j,k] <- Imputation_Mat[j,k] + as.numeric(Amelia_Output$imputations[[i]][j,k])/Replicates

}

}

}

# Turning the Output back into a data frame with the original names.

Imputation_Mat <- data.frame(Imputation_Mat)

names(Imputation_Mat) <- names(Unimputed_Data)

# Handles rounding for numeric data, and creates factors for the categories

for(i in 1:ncol(Imputation_Mat)){

if(is.numeric(Unimputed_Data[,i]) == TRUE){

Imputation_Mat[,i] = round(Imputation_Mat[,i],Rounding)

}

if(is.numeric(Unimputed_Data[,i]) == FALSE){

Imputation_Mat[,i] = factor(names(table(Unimputed_Data[,i]))[round(Imputation_Mat[,i],0)])

}

}

# Returning the Imputed data

return(Imputation_Mat)

} # End of function

# The Amelia imputed data set assigned to "Amelia_Test"

Amelia_Test <- Amelia_Imputation(Data)

# The RMSE of the imputed data versus real data

Amelia_Test_RMSE <- Imputation_RMSE(Imputed_Set = Amelia_Test, Original_Set = Complete, Missing_Mat = M_Mat)

The Mice method is a bit different, in the sense that Mice is not actually a single method at all, but rather an R package that supports many different imputation methods; use help(mice) for the full list. In my function, you can change the method with the “type” argument; by default, it is set to “cart”, which stands for “classification and regression trees”. I chose this one because I find that it deals with categorical data very well compared to the other methods, including Amelia and random forest, and it is the only computationally intensive method here that can handle categorical variables with more than 53 categories (like those in the gapminder data set). I found it comparatively more useful than the mice() default of predictive mean matching (“pmm”). You can also provide a “guess” complete data set for the algorithm to start from via the “guess” argument of my “Mice_Imputation” function, although I have not found much value in that.

# Mice-based data imputation with "cart" the default method

# Use help(mice) to learn about the possible imputation methods

Mice_Imputation <- function(data, N = 10, Iterations = 10, Rounding = 8,

Jitter = 0, Verbose = TRUE, guess = NULL, type = "cart"){

# Forces the data into a data frame

data = data.frame(data)

# Turns character columns into factor columns

for(i in 1:ncol(data)){

if(is.character(data[,i]) == TRUE){

data[,i] = factor(data[,i])

}

}

# Sometimes Mice detects too much collinearity, and adding jitter can prevent the error

if(Jitter > 0){

for(i in 1:nrow(data)){

for(j in 1:ncol(data)){

if(is.numeric(data[i,j]) == TRUE){

data[i,j] = data[i,j] + rnorm(1,sd = Jitter)

}

}

}

}

# The output of the mice function, where each data set is an element in a list

Mice_Out <- complete(mice(data, m = N, maxit = Iterations, data.init = guess,

print = Verbose, method = type), action = "all")

# An empty matrix to fill with the average of the imputed data sets

Imputation_Mat <- matrix(0, nrow = nrow(data), ncol = ncol(data))

# Filling the matrix with these averaged data sets

for(i in 1:N){

for(j in 1:nrow(data)){

for(k in 1:ncol(data)){

Imputation_Mat[j,k] <- Imputation_Mat[j,k] + as.numeric(Mice_Out[[i]][j,k])/N

}

}

}

# Turning the Output back into a data frame with the original names.

Imputation_Mat <- data.frame(Imputation_Mat)

names(Imputation_Mat) <- names(data)

# Handles rounding for numeric data, and creates factors for the categories

for(i in 1:ncol(Imputation_Mat)){

if(is.numeric(data[,i]) == TRUE){

Imputation_Mat[,i] = round(Imputation_Mat[,i],Rounding)

}

if(is.numeric(data[,i]) == FALSE){

Imputation_Mat[,i] = factor(names(table(data[,i]))[round(Imputation_Mat[,i],0)])

}

}

# Returns the averaged data set

return(Imputation_Mat)

} # End of function

# The averaged imputed data set using the "cart" method

Mice_Test_Output <- Mice_Imputation(Data, type = "cart")

# The resulting RMSE

Mice_Test_RMSE <- Imputation_RMSE(Imputed_Set = Mice_Test_Output,

Original_Set = Complete, Missing_Mat = M_Mat)
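If you are deciding between mice methods, the package’s md.pattern() function is also worth a look before imputing; it tabulates and plots the distinct patterns of missingness across columns, which gives a sense of how hard the imputation problem is.

# Tabulates and plots the distinct missing-data patterns (from the mice package)

md.pattern(Data)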

Finally, we have random forest imputation, which, as the name suggests, uses random forests to impute missing values. Random forest imputation tends to perform similarly to Amelia, but there are notable differences. As mentioned earlier, unlike Amelia it will impute wholly empty rows, and it can handle categorical variables with up to 53 categories, although the accuracy of those imputations may be low without substantial data. If a variable has more than 50 categories, this function assumes it is an ID variable, removes it from the random forest imputation, and adds it back as the first column with the NAs replaced with “Unknown”s. If multiple ID columns are identified, the function removes them without adding any back, although you could modify this behavior in the code. Like the Amelia function, this one runs on 4 cores by default, as specified by the “Cores” argument.

# Random Forest Missing data Imputation

Forest_Imputation <- function(Forest_Data, Meta_Dat = NA, Replace_ID = TRUE, Cores = 4){

# Registers a parallel backend with the requested number of cores

registerDoParallel(cores=Cores)

# Generates Metadata if a two dimensional matrix is not provided.

if(length(dim(Meta_Dat)) != 2){

Meta_Dat <- Meta_Data(Forest_Data, Show_Plot = FALSE)}

# Turns character columns into factor columns

for(i in 1:ncol(Forest_Data)){

if(is.character(Forest_Data[,i]) == TRUE){

Forest_Data[,i] = factor(Forest_Data[,i])

}

}

# Forces the data into a data frame

Forest_Data = data.frame(Forest_Data)

# Removes ID variables that prevent the random forest model from running

N_Levels = Meta_Dat$`N Categories`

for(i in 1:length(N_Levels)){if(is.na(N_Levels[i]) == TRUE){N_Levels[i] = 0}}

included_Columns = Meta_Dat$`Column Name`[N_Levels <= 50]

if(length(included_Columns) != length(names(Forest_Data))){

print("Warning: One or more ID column(s) have been identified and may be removed!")}

Data = Forest_Data[,included_Columns]

# If a single ID variable is removed, Replace_ID will add it back as the leftmost column with Unknowns

if(Replace_ID == TRUE & (length(included_Columns)+1) == length(names(Forest_Data))){

ID_Var = Forest_Data[,Meta_Dat$`Column Name`[N_Levels > 50]]

levels(ID_Var) = c(levels(ID_Var),"Unknown")

for(i in 1:length(ID_Var)){if(is.na(ID_Var[i])==TRUE){ID_Var[i]="Unknown"}}

}

# The output of the missing forest algorithm, with the output being a 2-element list

Missing_For_Out <- missForest(Data, verbose = TRUE, parallelize = "forests", ntree = 250)

# The first element in the previous list is the imputed data set, so we just select that.

Forest_Out <- Missing_For_Out[[1]]

# Returns the imputed ID variable if "Replace_ID" is TRUE and there is only one ID variable.

if(Replace_ID == TRUE & (length(included_Columns)+1) == length(names(Forest_Data))){return(cbind(ID_Var,Forest_Out))}

if(Replace_ID == FALSE | (length(included_Columns)+1) != length(names(Forest_Data))){return(Forest_Out)}

} # End of function

# The random forest imputed data set is assigned to "Forest_Test"

Forest_Test <- Forest_Imputation(Forest_Data = Data)

# The RMSE is calculated for random forest imputation

Forest_Test_RMSE <- Imputation_RMSE(Imputed_Set = Forest_Test, Original_Set = Complete, Missing_Mat = M_Mat)
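With all five methods run on the same missing cells, the RMSE values can be collected into a single named vector for a side-by-side comparison; lower is better, though the ordering will vary with the data set and the random seed.

# Side-by-side comparison of every method (lower RMSE is better)

round(c(Mean = Mean_Test_RMSE, Median = Median_Test_RMSE, Amelia = Amelia_Test_RMSE, Mice = Mice_Test_RMSE, Forest = Forest_Test_RMSE), 4)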
