next up previous contents home.gif
Next: Which modules work together Up: Modules for Special Data Previous: Modeling -distributed Data   Contents


Data With Logical Constraints (Gauss version only)

Sometimes imputations from the normal model are not appropriate to the user's analysis or fail known bounds or identities of the data. For the special case of ordinal and nominal variables, there is an internal module to produce valid imputations from the normal model (see Section 7.2).

Two other simple approaches are either censoring or truncating the posterior density. Consider a variable $ p$ with values known to be between zero and one. Normally, we recommend taking transformations (such as $ a=\ln(p/(1-p))$) before Amelia and then untransforming afterwards (with $ p=1/[1+exp(-a)]$), but in some cases (such as when $ p$ sometimes takes on the exact values of 0 or 1), it is better to put the untransformed $ p$ directly into Amelia. Of course, in this situation, a multivariate normal imputation model will sometimes impute a missing value below zero or above one, which is logically impossible. If there are only a few out of bounds predictions, a reasonable strategy is to round the imputations below zero up to zero and likewise those above one down to one. In other situations with more out of bounds imputations, it is better to create a different set of imputations altogether by discarding imputations that do not satisfy the constraint (fall within the unit interval) and impute again until a value between zero and one is reached. $ {\mathfrak{A}melia}$ implements a rejection sampler that allows constraints to be implemented at the end of the simulation process in this way.

The AMrs routine in Amelia, together with a user written constraint checking procedure, implements the rejection sampler. The call to AMrs is made after the call to $ {\mathfrak{A}melia}$ , as in the following example.

library amelia;
dataset=loadd("ukpoll");
dbuf=amelia(dataset);
call amrs(dbuf,dataset,&myconstraint);

The argument myconstraint is the name of a user written procedure, myconstraint.g, that you must place in the current directory. This procedure identifies which observations in some data set meet the user-defined constraint. This procedure should accept as arguments the data buffer, the original data set, and the imputed data set. It then returns a column vector, flag, of ones and zeros with length equal to the number of observations in the dataset. Each value in this vector flag$ _i$ corresponds to observation $ i$ in the imputed dataset. A one should be assigned if the observation fails the constraint and a zero if the observation meets the constraint. For example, the following constraint checks that every observation of the second variable is not greater than ten.

proc myconstraint(dbuf,xorig,ximp);
  local flag;
  flag=ximp[.,2].>10;
  retp(flag);
endp;

Not all constraints require the original data or the information in the buffer. The above example uses neither but only refers to values in the imputed dataset. However, they are provided as they are often useful. The rejection sampler will cycle through each imputed dataset and make sure that every observation passes the user provided constraint. If an observation in a dataset fails the constraint, that observation will be reimputed using the sufficient statistics $ {\mathfrak{A}melia}$ used to originally impute that particular dataset. Then the dataset is checked again and this whole cycle is iterated until all observations pass the user provided constraint.

The procedure AMrs can only be called once in a program, since in later constraints imputations may be created that would fail an earlier constraint. Multiple constraints, or constraints on multiple variables should be written by running a series of constraints within the same procedure and combining results, as in the next example which checks that the fifth variable is less than the fourth and the ninth variable is non-negative:

proc myconstraint(dbuf,xorig,ximp);
local flag1, flag2, allflags, flag;
  flag1=ximp[.,5].>ximp[.,4];
  flag2=ximp[.,9].<=0;
  allflags= flag1+flag2;
  flag=allflags.>0;
  retp(flag);
endp;

Screen output is provided if _AMprt=1 (default) allows printing to the screen. Example screen output looks like:

* Running Rejection Sampler *

Rejected 0 >> Dataset 1 meets constraint.
Rejected 11 , 3 , 0 >> Dataset 3 meets constraint.
Rejected 1 , 0 >> Dataset 2 meets constraint.
Rejected 0 >> Dataset 4 meets constraint.
Rejected 5 , 1 , 1 , 1 , 1 , 1 , 0 >> Dataset 5 meets constraint.

Each line corresponds to one imputed dataset from $ {\mathfrak{A}melia}$ being tested. In the above example, five datasets were imputed by $ {\mathfrak{A}melia}$ (as governed by _AMnds). The second dataset initially failed the user's constraint on eleven observations. After these eleven were reimputed, three continued to fail the constraint and were again reimputed in the next round, after which all observations passed the constraint. Since only failing observations are reimputed the number of failing observations is equal or decreasing in each round. In the fifth dataset we see an observation where very little of the probability density in the unconstrained model is within the bounds of the constraint and thus takes six draws to pass.

There is an upper limit to the number of times the rejection sampler will resample in a dataset, set with the global _AMrs; its default is 20. After the maximum has been reached a message will appear like the one below, and the observations which continue to fail the user constraint are identified so the user can check them in the dataset. These may indicate a problem with the model not fitting the data, an outlier in the data, or a bug in the the constraint procedure.

Maximium number of 20 iterations reached.
In dataset 8 the following observations 
currently fail user constraint: 29 118

Since only the missing data are being reimputed, the user should take care that none of the observed data fails the constraint as then that particular observation will never pass.


next up previous contents home.gif
Next: Which modules work together Up: Modules for Special Data Previous: Modeling -distributed Data   Contents
Gary King 2003-07-25