Abstract

Missing data imputation is the first critical step of many data analysis pipelines. The challenge is greatest for mixed data sets that combine real-valued, Boolean, and ordinal variables, where standard imputation techniques fail basic sanity checks: for example, the imputed values may not follow the same distribution as the observed data. In this talk, we develop a new semiparametric Gaussian copula model to impute missing values in mixed data. The model handles arbitrary marginals for continuous variables and ordinal variables with many levels, with Boolean variables as a special case. We develop an efficient approximate EM algorithm with no tuning parameters to estimate the model from incomplete mixed data. The fitted model also reveals the statistical associations among variables. Experimental results on several synthetic and real datasets show that the proposed algorithm outperforms state-of-the-art imputation algorithms for mixed data.
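
To make the copula idea concrete, the following is a minimal sketch for two continuous variables only. It is not the approximate EM algorithm described in the talk: as a simplifying assumption, it estimates the latent correlation directly from the normal scores of complete pairs, and it does not treat ordinal or Boolean variables. All function names (`normal_scores`, `empirical_quantile`, `latent_correlation`, `impute_y_given_x`) are hypothetical and chosen for illustration.

```python
from bisect import bisect_right
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the latent scale


def normal_scores(values):
    """Map observed values to latent normal scores via a scaled empirical CDF.

    Missing entries (None) stay None. Dividing the rank by n + 1 keeps the
    probability strictly inside (0, 1), so the inverse CDF is always finite.
    """
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    scores = []
    for v in values:
        if v is None:
            scores.append(None)
        else:
            p = bisect_right(observed, v) / (n + 1)
            scores.append(STD_NORMAL.inv_cdf(p))
    return scores


def empirical_quantile(observed_sorted, p):
    """Map a probability back to the observed scale (inverse of rank / (n + 1))."""
    n = len(observed_sorted)
    idx = round(p * (n + 1)) - 1
    return observed_sorted[min(n - 1, max(0, idx))]


def latent_correlation(zx, zy):
    """Pearson correlation of latent scores over rows where both are observed.

    Assumption of this sketch: a plug-in estimate, not an EM estimate.
    """
    pairs = [(a, b) for a, b in zip(zx, zy) if a is not None and b is not None]
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


def impute_y_given_x(x, y):
    """Fill missing y entries using the latent conditional mean given x."""
    zx, zy = normal_scores(x), normal_scores(y)
    rho = latent_correlation(zx, zy)
    y_sorted = sorted(v for v in y if v is not None)
    completed = list(y)
    for i, v in enumerate(y):
        if v is None and zx[i] is not None:
            # For standard bivariate normal latents, E[z_y | z_x] = rho * z_x.
            z_hat = rho * zx[i]
            completed[i] = empirical_quantile(y_sorted, STD_NORMAL.cdf(z_hat))
    return completed


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [10.0, 20.0, None, 40.0, 50.0, 60.0]
    print(impute_y_given_x(x, y))
```

Because the imputed latent value is mapped back through the empirical quantiles of the observed column, every imputed value lies on the support of the observed data, which is the sanity check mentioned above. The marginal is never given a parametric form; only the dependence is modeled as Gaussian.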