What Should We Do About Missing Data? (A Case Study Using Logistic Regression with Missing Data on a Single Covariate)
PWP-CCPR-2003-028
Abstract
Fox et al. (1998) carried out a logistic regression analysis with discrete covariates in which one of the covariates was missing for a substantial percentage of respondents. They addressed the missing data problem using the “approximate Bayesian bootstrap.” We return to this missing data problem as a case study. Using the Fox et al. (1998) data for expository purposes, we carry out a comparative analysis of eight of the most commonly used techniques for dealing with missing data. We then report on two sets of simulations based on the original data. For patterns of missingness we consider realistic, these simulations suggest that case deletion and weighted case deletion are inferior techniques, and that common simple alternatives are better. In addition, the simulations do not affirm the theoretical superiority of Bayesian Multiple Imputation. The apparent explanation is that the imputation model, which is the fully saturated interaction model recommended in the literature, was too detailed for the data. This result is cautionary. Even when the analyst of a single body of data uses a missingness technique with desirable theoretical properties, and the missingness mechanism and imputation model are supposedly correctly specified, the technique can still produce biased estimates. This is in addition to the generic problem posed by missing data, which is that analysts usually do not know the missingness mechanism or which among many alternative imputation models is correct.
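For readers unfamiliar with the approximate Bayesian bootstrap named in the abstract, the following is a minimal sketch of the generic two-step procedure, not the implementation used by Fox et al. (1998). It assumes a single discrete covariate with missing entries marked as `None` and ignores the other covariates; in practice the resampling is typically done within cells defined by the fully observed variables. The function names `abb_impute` and `abb_multiple` and the toy data are hypothetical.

```python
import random


def abb_impute(values, rng):
    """Return one completed copy of `values` (None marks a missing entry)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")
    # Step 1: draw a donor pool of the same size as the observed set,
    # sampling with replacement from the observed values.
    donors = [rng.choice(observed) for _ in observed]
    # Step 2: fill each missing entry with a draw (with replacement)
    # from the donor pool; observed entries are left unchanged.
    return [v if v is not None else rng.choice(donors) for v in values]


def abb_multiple(values, m=5, seed=0):
    """Return m independently completed data sets."""
    rng = random.Random(seed)
    return [abb_impute(values, rng) for _ in range(m)]


# Toy binary covariate with three missing entries (illustrative only).
covariate = [1, 0, None, 1, None, 0, 1, None]
completed = abb_multiple(covariate, m=5, seed=42)
```

Each of the `m` completed data sets would then be analyzed separately (here, by the logistic regression) and the resulting estimates combined. Resampling the donor pool in step 1 is what distinguishes this from a simple hot-deck draw: it lets the imputations reflect uncertainty about the distribution of the observed values.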