The imputation model should contain at least as much information as
the analysis model. The primary way to go wrong is to include
information in your analysis model that is not in the imputation model,
and the primary fix is to add this information. Thus, you should include
in the imputation model all variables you intend to use in your
subsequent analyses. (If a variable is present in the analysis model but
is excluded from the imputation model, estimates of the relationship
between this variable and others will be biased, normally towards
zero.) In addition, for greater efficiency, include any
other variables you have that would help predict the missing values in
the original variables, even if you do not plan to use them in your
analyses. For example, if you have 5 measures of a concept and only
plan to use the best one in the analysis model, include all 5 in the
imputation model.
However, the speed of the program can be severely compromised when
there are too many variables in the imputation model. As a very rough
rule of thumb, you should probably not exceed twice the number of
variables in your analysis model or 40, whichever is greater.
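The two guidelines above can be checked mechanically before running an imputation. The following is only a minimal sketch: the function name `check_imputation_variables` and the variable names in the usage lines are hypothetical and are not part of the program itself.

```python
def check_imputation_variables(analysis_vars, imputation_vars):
    """Compare a proposed imputation model against the analysis model.

    Hypothetical helper for illustration only. Returns a tuple
    (missing, limit, too_many).
    """
    analysis = set(analysis_vars)
    imputation = set(imputation_vars)

    # Analysis variables left out of the imputation model. Per the text,
    # relationships involving these would be biased, normally towards zero.
    missing = sorted(analysis - imputation)

    # Very rough rule of thumb from the text: twice the number of
    # analysis variables or 40, whichever is greater.
    limit = max(2 * len(analysis), 40)
    too_many = len(imputation) > limit

    return missing, limit, too_many


# Made-up variable names, purely for illustration.
analysis_vars = ["gdp", "trade", "polity"]
imputation_vars = ["gdp", "trade", "pop", "area"]

missing, limit, too_many = check_imputation_variables(analysis_vars, imputation_vars)
print("Missing from imputation model:", missing)
print("Rule-of-thumb limit:", limit, "exceeded:", too_many)
```

Here `polity` would be flagged because it appears in the analysis model but not in the imputation model.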
There are also several ways to make the distributional assumptions
more realistic.
- If you know of an obvious nonlinear relationship between some
variables, especially one that is to be the subject of your
subsequent analysis, include squared terms to model the
nonlinearity.
- To make the multivariate normal assumption fit better, variables
should be transformed to make them unbounded and relatively
symmetric. For example, budget figures, which are often restricted
to be positive and are positively skewed, can be logged to make them
approximately normal. Event counts can be made closer to normal by
taking the square root, which stabilizes the variance and makes them
approximately symmetric. The logistic transformation, log(p/(1-p))
for a proportion p, can be used to make proportions unbounded and
symmetrically distributed.
- Ordinal variables should be coded as close to an interval scale as
the available information allows. For example, if the categories of a
variable measuring the degree of intensity of a dispute are arguing,
yelling, punching, and killing, a coding of 1, 2, 3, and 4 would not
seem approximately interval. Perhaps 1, 2, 20, 200 might be closer.
Of course, transformations to fit distributional assumptions, and more
reasonable ordinal codings like this, are called for in any linear
model, even without missing data.
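As an illustration, the transformations described above might be applied to a single record before imputation along the following lines. This is only a sketch under stated assumptions: the helper names (`apply_if_observed`, `logit`, `add_squared_term`) and the record's field names are hypothetical, missing values are assumed to be represented as `None`, and the ordinal codes are the tentative ones suggested in the text.

```python
import math


def apply_if_observed(transform, value):
    """Apply a transformation, leaving missing values (None) missing."""
    return None if value is None else transform(value)


def logit(p):
    """Logistic transformation log(p / (1 - p)).

    Only defined for proportions strictly between 0 and 1.
    """
    return math.log(p / (1.0 - p))


def add_squared_term(record, name):
    """Return a copy of the record with a squared term for one variable."""
    out = dict(record)
    out[name + "_sq"] = apply_if_observed(lambda v: v ** 2, record[name])
    return out


# Tentative interval-like coding of dispute intensity from the text.
INTENSITY_CODES = {"arguing": 1, "yelling": 2, "punching": 20, "killing": 200}

# Hypothetical record; field names and values are made up.
record = {
    "budget": 1500.0,       # positive and right-skewed: take the log
    "events": 9,            # event count: take the square root
    "vote_share": 0.5,      # proportion: logistic transformation
    "dispute": "punching",  # ordinal: recode toward an interval scale
    "age": 30.0,            # suspected nonlinear effect: add a squared term
    "turnout": None,        # missing proportion: stays missing
}

prepared = add_squared_term(record, "age")
prepared["budget"] = apply_if_observed(math.log, record["budget"])
prepared["events"] = apply_if_observed(math.sqrt, record["events"])
prepared["vote_share"] = apply_if_observed(logit, record["vote_share"])
prepared["turnout"] = apply_if_observed(logit, record["turnout"])
prepared["dispute"] = apply_if_observed(INTENSITY_CODES.get, record["dispute"])
```

Wrapping each transformation in `apply_if_observed` keeps missing cells missing, so they can still be imputed on the transformed scale.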
Gary King
2003-07-25