How to deal with missing data?

Missing data is a common occurrence in almost all real datasets. When some information about variables is missing, we call the data incomplete. Analyses and decisions based on incomplete data should be treated with caution, because conclusions drawn from a dataset with missing values may not be accurate.

Missing values occur for different reasons, which are commonly divided into three mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).

Data is Missing Completely at Random when two assumptions are met. The first is that missingness on a variable is unrelated to the other observed variables, so there are no systematic differences between observations with and without missing values. The second is that there is no relationship between missingness on a variable and the values of that variable itself.

The assumptions of Missing at Random are weaker than those of Missing Completely at Random. Under Missing at Random, the missingness may depend on the observed data. It cannot, however, depend on the missing values themselves once the observed data is taken into account.

Missing Not at Random means that the missingness depends on the missing values themselves. It also occurs when missingness is caused by other missing values or by related variables that were not measured. The issue is that determining how much such variables influence missingness is difficult. Because the cause of missingness has not been measured and thus cannot be analyzed, this type of missingness is often described as non-ignorable.
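To make the three mechanisms concrete, here is a small simulation sketch in Python. Everything in it is an illustrative assumption: the variables age and income, the 30% missingness rate, and the logistic form of the missingness probabilities are made up for the example and do not come from a real dataset.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_missingness(n=1000, seed=0):
    """Simulate (age, income) and mark which incomes go missing under each mechanism."""
    rng = random.Random(seed)
    ages = [rng.uniform(20, 70) for _ in range(n)]
    # Hypothetical relationship: income rises with age, plus noise
    incomes = [1000 + 40 * a + rng.gauss(0, 300) for a in ages]
    masks = {
        # MCAR: every income has the same chance of being missing
        "MCAR": [rng.random() < 0.3 for _ in range(n)],
        # MAR: the chance depends only on age, which is always observed
        "MAR": [rng.random() < logistic((a - 45) / 10) for a in ages],
        # MNAR: the chance depends on the income value itself
        "MNAR": [rng.random() < logistic((y - 2800) / 400) for y in incomes],
    }
    return ages, incomes, masks

def mean(values):
    return sum(values) / len(values)

ages, incomes, masks = simulate_missingness()
for name, mask in masks.items():
    observed = [y for y, is_missing in zip(incomes, mask) if not is_missing]
    # Difference between the mean of the observed incomes and the true mean
    print(name, round(mean(observed) - mean(incomes), 1))
```

In this setup the mean of the observed incomes stays close to the true mean only under MCAR. Under MAR the gap can in principle be corrected using age, while under MNAR the information needed for the correction is exactly what is missing.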

Listwise Deletion

Listwise Deletion is a method for dealing with missing values. It removes every observation that has a missing value on any variable. As a result, the analysis includes only observations with complete values. Listwise Deletion has two advantages: it can be used with any kind of statistical analysis, and it does not require any special computational methods.
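As a minimal sketch, Listwise Deletion can be written in a few lines of Python. The example assumes that each observation is a dictionary and that missing values are coded as None. The column names are hypothetical.

```python
def listwise_delete(rows):
    """Keep only the rows in which no variable is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]

rows = [
    {"age": 34, "income": 3200, "visits": 2},
    {"age": None, "income": 2900, "visits": 0},   # dropped: age is missing
    {"age": 51, "income": None, "visits": 5},     # dropped: income is missing
    {"age": 45, "income": 4100, "visits": 1},
]

complete = listwise_delete(rows)
print(len(rows), "rows before,", len(complete), "rows after deletion")
```

Note that a row is dropped even if only one of its values is missing, so the observed values in that row are lost as well.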

Listwise Deletion has statistical advantages if the data is Missing Completely at Random. Because missingness is completely random, the deleted observations are distributed randomly across the dataset, so the remaining observations are effectively a random subsample, which yields unbiased estimators and valid standard errors. Because Listwise Deletion is simple to use and compute, it is the default in most statistical packages. The disadvantage is that, even for data Missing Completely at Random, a large number of missing values relative to the sample size discards many observations, which reduces precision and statistical power.

Despite this drawback, Listwise Deletion is a reasonable method for dealing with missing values. Even though it does not use all available information, it usually gives valid inferences for data Missing Completely at Random.

Multiple Imputation by Chained Equations

Multiple Imputation by Chained Equations (MICE) is a multiple imputation method for dealing with missing values, implemented in the R package mice. Multiple imputation works by generating a replacement for each missing value several times. This means that the imputations are used to create multiple completed datasets. The results from these datasets are then combined before the different methods are compared.

The chained equation method is adaptable to variables of various types and to different levels of complexity. MICE assumes that the missing values in the given dataset are Missing Completely at Random or Missing at Random.
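The idea of chained equations can be sketched in Python for the simplest possible case. This is not the algorithm of the R package, only an illustration under strong assumptions: there are exactly two numeric variables, each is imputed from the other with a simple linear regression plus random noise, and missing values are coded as None. A fuller implementation would also reflect the uncertainty of the regression coefficients. All function names are made up for the example.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual standard deviation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(n - 2, 1)) ** 0.5
    return intercept, slope, sd

def impute_once(cols, n_iter, rng):
    """Create one completed copy of a two-column dataset."""
    names = list(cols)  # assumed to hold exactly two column names
    missing = {k: [v is None for v in cols[k]] for k in names}
    # Step 1: fill every gap with the observed mean as a temporary value
    filled = {}
    for k in names:
        observed = [v for v in cols[k] if v is not None]
        m = sum(observed) / len(observed)
        filled[k] = [m if v is None else v for v in cols[k]]
    # Step 2: cycle through the variables, re-imputing each from the other
    for _ in range(n_iter):
        for target, predictor in (names, names[::-1]):
            rows = [i for i, miss in enumerate(missing[target]) if not miss]
            a, b, sd = fit_line([filled[predictor][i] for i in rows],
                                [filled[target][i] for i in rows])
            for i, miss in enumerate(missing[target]):
                if miss:
                    filled[target][i] = a + b * filled[predictor][i] + rng.gauss(0, sd)
    return filled

def multiple_imputation(cols, m=5, n_iter=10, seed=0):
    """Return m completed datasets that differ in their imputed values."""
    rng = random.Random(seed)
    return [impute_once(cols, n_iter, rng) for _ in range(m)]

cols = {
    "x": [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0],
    "y": [2.1, None, 6.2, 8.1, None, 12.3, 13.8, 16.0],
}
datasets = multiple_imputation(cols)
# Analyze each completed dataset separately, then combine the estimates
estimates = [sum(d["y"]) / len(d["y"]) for d in datasets]
pooled_mean = sum(estimates) / len(estimates)
print(round(pooled_mean, 2))
```

The spread of the individual estimates across the completed datasets reflects the uncertainty caused by the missing values, which a single imputation would hide.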

Predictive Mean Matching

Predictive Mean Matching is a hot deck method that uses predicted values to find donors for a variable with missing values. A standard linear regression model is estimated on the observations that are not missing and then used to compute a predicted value for every observation. For each missing value, the method creates a small set of donors with no missing values on that variable. These donors are drawn from the complete observations whose predicted values are closest to the predicted value for the observation with the missing value.

One of the donors is chosen at random, and its observed value replaces the missing value. Predictive Mean Matching assumes that the distribution of the missing values is the same as that of the observed data of the donors. It is a simple method to use, and because the imputed values are taken from observed data, they are realistic. The method will not return values outside the observed data range. Predictive Mean Matching can handle a variety of data types and is recommended for various scenarios.
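The donor selection described above can be sketched as follows. This is a simplified illustration with a single predictor, not the implementation of any particular package. The function name pmm_impute, the donor pool size k=3, and the use of None for missing values are assumptions of the example. Fuller implementations typically also add random variation to the regression coefficients before matching.

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    """Impute the None entries of y from predictor x by predictive mean matching."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    # Fit a simple linear regression of y on x using the observed cases
    n = len(obs)
    mx = sum(x[i] for i in obs) / n
    my = sum(y[i] for i in obs) / n
    sxx = sum((x[i] - mx) ** 2 for i in obs)
    slope = sum((x[i] - mx) * (y[i] - my) for i in obs) / sxx
    intercept = my - slope * mx
    # Predicted value for every case, observed or missing
    pred = [intercept + slope * xi for xi in x]
    imputed = list(y)
    for i, v in enumerate(y):
        if v is None:
            # Donors: the k observed cases with the closest predicted values
            donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
            # Copy the observed value of one randomly chosen donor
            imputed[i] = y[rng.choice(donors)]
    return imputed

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 4.1, None, 8.2, 9.9, None, 14.1, 15.8]
completed = pmm_impute(x, y)
print(completed)
```

Because every imputed value is copied from a donor, the completed variable contains only values that were actually observed.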

Poisson Imputation

Poisson imputation uses Poisson regression to impute missing values under the usual Poisson assumptions. It is therefore suited to missing data in Poisson-distributed variables, such as counts. The observed values and the temporary (currently imputed) values of the other variables are used as predictors in the generalized linear model. The imputations are then drawn from a Poisson distribution whose mean is the predicted mean for each missing value.
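A minimal Python sketch of this idea is shown below. It assumes a single fully observed predictor, fits the Poisson regression with a log link by Newton's method, and draws each imputation from a Poisson distribution with the predicted mean. The function names and the example data are hypothetical, and the sketch ignores the uncertainty of the estimated coefficients, which a full multiple imputation procedure would take into account.

```python
import math
import random

def fit_poisson(xs, ys, n_iter=25):
    """Fit log(mean) = b0 + b1 * x by Newton's method (single predictor)."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0  # start from the intercept-only fit
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the Poisson log-likelihood
        g0 = sum(y - m for y, m in zip(ys, mu))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mu))
        # Entries of the 2x2 information matrix
        h00 = sum(mu)
        h01 = sum(m * x for x, m in zip(xs, mu))
        h11 = sum(m * x * x for x, m in zip(xs, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def draw_poisson(lam, rng):
    """Draw one Poisson variate (Knuth's method, adequate for small means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def poisson_impute(x, y, seed=0):
    """Replace None entries of the count variable y with Poisson draws."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    b0, b1 = fit_poisson([x[i] for i in obs], [y[i] for i in obs])
    return [draw_poisson(math.exp(b0 + b1 * x[i]), rng) if v is None else v
            for i, v in enumerate(y)]

x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
y = [1, 2, None, 4, 6, None, 12, 18]
completed = poisson_impute(x, y)
print(completed)
```

Unlike Predictive Mean Matching, the imputed counts here are not restricted to values that were observed, but they are always non-negative integers.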