
How do I make my data fit the assumptions of the program?

The imputation model should contain at least as much information as the analysis model. The primary way to go wrong is to include information in your analysis model that is not in the imputation model, and the primary fix is to add that information to the imputation model. Thus, you should give $ {\mathfrak{A}melia}$ all variables you intend to include in your subsequent analyses. (If a variable is present in the analysis model but excluded from the imputation model, estimates of the relationship between this variable and others will be biased, normally towards zero.) For additional efficiency, also give $ {\mathfrak{A}melia}$ any other variables you have that would help predict the missing values in the original variables, even if you do not plan to use them in your analyses. For example, if you have 5 measures of a concept and plan to use only the best one in the analysis model, give $ {\mathfrak{A}melia}$ all 5. However, the speed of the program can be severely compromised when there are too many variables in the imputation model. As a very rough rule of thumb, the imputation model should probably not contain more than twice the number of variables in your analysis model or 40 variables, whichever is greater.
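
The rule of thumb above is simple arithmetic, and a minimal sketch may make the "whichever is greater" part concrete. The function name and its argument are hypothetical and are not part of the program; the result is only the rough ceiling described in the text, not a hard limit.

```python
def rough_variable_ceiling(n_analysis_vars):
    """Rough upper bound on imputation-model variables.

    Encodes the rule of thumb: twice the number of analysis-model
    variables or 40, whichever is greater. (Hypothetical helper.)
    """
    if n_analysis_vars < 0:
        raise ValueError("number of variables cannot be negative")
    return max(2 * n_analysis_vars, 40)


# A small analysis model is dominated by the floor of 40;
# a large one is dominated by the "twice the analysis model" term.
small_model_ceiling = rough_variable_ceiling(8)
large_model_ceiling = rough_variable_ceiling(30)
```

With 8 analysis variables the ceiling is 40, so up to 32 auxiliary variables fit under the rule; with 30 analysis variables the ceiling rises to 60.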

There are also several ways to make the distributional assumptions more realistic.

  1. If you know of an obvious nonlinear relationship between some variables, for example because it is to be the subject of your subsequent analysis, include squared terms to model the nonlinearity.

  2. To make the multivariate normal assumption fit better, variables should be transformed to make them unbounded and relatively symmetric. For example, budget figures, which are often restricted to be positive and are positively skewed, can be logged to make them approximately normal. Event counts can be made closer to normal by taking the square root, which stabilizes the variance and makes them approximately symmetric. The logit (log-odds) transformation, $ \ln(p/(1-p))$, can be used to make proportions unbounded and symmetrically distributed. Ordinal variables should be coded as close to an interval scale as the available information allows. For example, if the categories of a variable measuring the intensity of a dispute are arguing, yelling, punching, and killing, a coding of 1, 2, 3, and 4 would not seem approximately interval; a coding such as 1, 2, 20, and 200 might be closer. Of course, transformations to fit distributional assumptions, and more reasonable ordinal codings like this, are called for in any linear model, even without missing data.
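
The transformations in the list above can be sketched as a small preprocessing step applied before the data are handed to the imputation program. This is only an illustration in Python using the standard library; the column names (`budget`, `events`, `share`, `intensity`, `age`) and the helper functions are hypothetical, and the 1, 2, 20, 200 coding is the tentative example from the text, not a recommended scale.

```python
import math

# Tentative interval-like coding from the dispute-intensity example.
INTENSITY_CODES = {"arguing": 1, "yelling": 2, "punching": 20, "killing": 200}


def log_positive(x):
    """Log a strictly positive, right-skewed value such as a budget figure."""
    if x <= 0:
        raise ValueError("log transform needs a strictly positive value")
    return math.log(x)


def sqrt_count(n):
    """Square root of an event count."""
    if n < 0:
        raise ValueError("event counts cannot be negative")
    return math.sqrt(n)


def logit(p):
    """ln(p / (1 - p)) for a proportion strictly between 0 and 1."""
    if not 0 < p < 1:
        raise ValueError("logit is undefined at or outside 0 and 1")
    return math.log(p / (1 - p))


def prepare_row(row):
    """Return a transformed copy of one record; missing values (None) stay missing."""
    def apply(func, value):
        return None if value is None else func(value)

    return {
        "log_budget": apply(log_positive, row["budget"]),
        "sqrt_events": apply(sqrt_count, row["events"]),
        "logit_share": apply(logit, row["share"]),
        "intensity": apply(INTENSITY_CODES.__getitem__, row["intensity"]),
        "age": row["age"],
        # Squared term for a suspected nonlinear relationship (item 1).
        "age_sq": apply(lambda a: a * a, row["age"]),
    }


example = {"budget": 1200.0, "events": 9, "share": 0.5,
           "intensity": "punching", "age": None}
prepared = prepare_row(example)
```

Missing entries are passed through untouched so that the imputation step, not the preprocessing step, is what fills them in. Proportions recorded as exactly 0 or 1 have no finite logit and would need separate handling, which this sketch leaves as an error.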


Gary King 2003-07-25