Next: The Evidence Up: Chapter 1: Qualitative Overview Previous: The Problem

The Solution

  This section sets forth seven characteristics of the proposed solution to the ecological inference problem that previous methods do not share. However, unlike the proof of a mathematical theorem, statistical solutions can usually be improved continually--hence the phrase ``a solution,'' rather than ``the solution,'' in the title of this book. Modern statistical theory does not date back even as far as the ecological inference problem, so as we learn more we should be able to improve on this solution further. Similarly, as computers continue to get faster, we can posit more sophisticated models that incorporate more information. The method offered here is the first that consistently works in practice, but it is also intended to put the ecological inference literature on a firmer theoretical and empirical foundation, helping to lead to further improvements.

First, the solution is scientifically validated with real data. Several extensive collections of real aggregate data, for which the inner cells of the cross-tabulation are known from public records, are used to help validate the method. For example, estimates of the levels of black and white voter registration are compared to the known answer in public records. (These are real issues, not contrived for the purpose of a methodological treatise; they are the subject of considerable academic inquiry, and even much litigation in many states.) Data from the U.S. Census aggregated to precinct-sized units in South Carolina are used to study the relative frequency with which males and females are in poverty. Also used to validate the model are data from Atlanta, Georgia, that include information about voter loyalty and defection rates in the transitions between elections, and turn-of-the-century U.S. county-level data on black and white literacy rates. Finally, I have been able to study the properties of aggregate data extensively with a large collection of merged U.S. Census data and precinct-level aggregate election data for most electoral offices and the entire nation. The method works in practice. In contrast, if the only goal were to develop a method that worked merely in theory, then the problem might already have been considered ``solved'' long ago, as the literature includes many methods that work only if a list of unverifiable assumptions is met.

Using data to evaluate methodological approaches is, of course, good scientific practice, but it has been rare in this field, which has focused so exclusively on hypothetical data, and on theoretical arguments without economic, political, sociological, psychological, or other foundations. Indeed, the entire ecological inference literature contains only forty-nine comparisons between estimates from aggregate data and the known true individual-level answer. (Because this work includes a variety of new data sets, and a method that gives district- and precinct-level estimates, the book presents over sixteen thousand such comparisons between estimates and the truth.) Many of these forty-nine ecological inferences are compared to estimates from sample surveys, but scholars rarely correct for known survey biases with post-stratification or other methods. Others use ``data'' that are made up by the investigator, such as those created with computerized random number generators. All these data sets have their place (and some will have their place here too), but their artificial nature, exclusive use, and especially limited number and diversity fail to present the methodologist with the kinds of problems that arise in using real aggregate data and studying authentic social science problems. Scholars are therefore unable to adapt the methods to the opportunities in the data and will not know how to avoid the likely pitfalls that commonly arise in practice.

Second, the method described here offers realistic assessments of the uncertainty of ecological estimates. Reporting the uncertainty of one's conclusions is one of the hallmarks of modern statistics, but it is an especially important problem here. The reason is that ecological inference is an unusual statistical problem in which, under normal circumstances, we never observe realizations of our quantity of interest. For example, since most German citizens who voted for the Nazi party are no longer around to answer hypothetical survey questions, and could hardly be expected to answer them sincerely even if they were, no method will ever be able to fill in the cross-tabulation with certainty. Thus a key component of any solution to this problem is that correct uncertainty estimates be an integral part of all inferences.

Many methods proposed in the literature provide no uncertainty estimates. Others give uncertainty estimates that are usually incorrect (as for example when 95% confidence intervals do not capture the correct answer about 95% of the time). The method proposed here provides reasonably accurate (and empirically verified) uncertainty estimates. Moreover, these estimates are useful since the intervals turn out to be narrower than one might think.
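The notion of a verified uncertainty estimate can be illustrated with a minimal sketch. When the true precinct-level values are known from public records, the share of intervals that contain the truth can be compared with the nominal level. The function name `empirical_coverage` and the numbers below are hypothetical, chosen only for illustration; they are not data or code from this book.

```python
def empirical_coverage(intervals, truths):
    """Fraction of (lower, upper) intervals that contain the known true value.

    intervals: list of (lower, upper) pairs, one per precinct (hypothetical).
    truths:    list of known true values in the same order.
    """
    if len(intervals) != len(truths) or not truths:
        raise ValueError("need one interval per known truth, and at least one")
    hits = sum(lower <= truth <= upper
               for (lower, upper), truth in zip(intervals, truths))
    return hits / len(truths)


# Made-up intervals and truths for four precincts.
example_intervals = [(0.1, 0.3), (0.4, 0.6), (0.2, 0.5), (0.7, 0.9)]
example_truths = [0.2, 0.65, 0.5, 0.8]
print(empirical_coverage(example_intervals, example_truths))  # 0.75
```

With well-calibrated 95% intervals and many precincts, this fraction should come out close to 0.95; a much lower value signals uncertainty estimates that are too optimistic.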

Third, the basic model is robust to aggregation bias. Although this book also includes modifications of this basic model to compensate for aggregation bias explicitly, these modifications are often unnecessary. That is, even when the process of aggregation causes existing methods to give answers that bear no relationship to the truth, the method proposed here still usually gives accurate answers.

In order to develop an explicit approach to avoiding aggregation bias, I prove that the numerous and apparently conflicting explanations for aggregation bias are mathematically equivalent, even though they each appear to offer very different substantive insights. This theoretical result eliminates the basis for existing scholarly disagreements over which approach is better, or how many problems we need to deal with. All problems identified with aggregation bias are identical; only one problem needs to be solved. In the cases where an explicit treatment of aggregation bias is necessary under the proposed model, this result makes possible the model generalization required to accomplish the task.

Fourth, all components of the proposed model are in large part verifiable in aggregate data. That is, although information is lost in the process of aggregation, and thus ecological inferences will always involve risk, some observable implications of all model assumptions remain in aggregate data. These implications are used to develop diagnostic tests to evaluate the appropriateness of the model to each application, and to develop generalizations for the times when the assumptions of the basic model are contradicted by the data. Thus, the assumptions on which this model is based can usually be verified in aggregate data in sufficient detail to avoid the problems that cause other methods to lose their bearings.

Fifth, the solution offered here corrects for a variety of serious statistical problems, unrelated to aggregation bias, that also affect ecological inferences. It explicitly models the main source of heteroskedasticity in aggregate data, allows precinct-level parameters to vary, and otherwise includes far more known information in the model about the problem.

The sometimes fierce debates between proponents of the deterministic ``method of bounds'' and supporters of various statistical approaches are resolved by combining their (largely noncontradictory) insights into a single model. Including the precinct-level bounds in the statistical model substantially increases the amount of information used in making ecological inferences. For example, imagine that every time you run a regression, you could take some feature of the model (such as a predicted value), hold it outside a window and, if it is wrong--completely wrong with no uncertainty--the clouds would part and a thunderbolt would turn your computer printout into a fiery crisp. Remarkably, although they have not been exploited in previous statistical models, the bounds provide exactly this kind of certain information in all ecological inference problems for each and every observation in a data set (albeit perhaps with a bit less fanfare). In any other field of statistical analysis, this valuable information, and the other more ordinary statistical problems, would be addressed first, and yet most have been ignored. Correcting these basic statistical problems is also what makes this model robust to aggregation bias.
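The deterministic bounds can be sketched for a single precinct as follows. The notation is an assumption introduced here for illustration and does not appear in this section: `x` is the fraction of the precinct's population in one group (say, blacks), `t` is the fraction of the whole precinct with the outcome of interest (say, turning out to vote), and the unknowns are the group-specific fractions `beta_b` and `beta_w`. These are tied together by the accounting identity t = beta_b * x + beta_w * (1 - x), and because each unknown must lie in [0, 1], the identity restricts the other to an interval that is often narrower than [0, 1]. The function name `precinct_bounds` is hypothetical.

```python
def precinct_bounds(x, t):
    """Deterministic bounds on the two unknown group-level fractions.

    x: fraction of the precinct in group b, strictly between 0 and 1.
    t: fraction of the whole precinct with the outcome, between 0 and 1.
    Returns ((lower_b, upper_b), (lower_w, upper_w)).
    """
    if not 0 < x < 1:
        raise ValueError("x must be strictly between 0 and 1")
    if not 0 <= t <= 1:
        raise ValueError("t must be between 0 and 1")
    # Solve the identity for beta_b with beta_w set to 1 (lower) and 0 (upper).
    lower_b = max(0.0, (t - (1 - x)) / x)
    upper_b = min(1.0, t / x)
    # Solve the identity for beta_w with beta_b set to 1 (lower) and 0 (upper).
    lower_w = max(0.0, (t - x) / (1 - x))
    upper_w = min(1.0, t / (1 - x))
    return (lower_b, upper_b), (lower_w, upper_w)


# A made-up precinct: half the population in group b, 75% overall turnout.
print(precinct_bounds(0.5, 0.75))  # ((0.5, 1.0), (0.5, 1.0))
```

In this made-up precinct, neither group's turnout can be below one half, whatever the true cells are. Any estimate outside these intervals is known to be wrong with certainty, which is the kind of information the thunderbolt analogy describes.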

Sixth, the method provides accurate estimates not only of the cells of the cross-tabulation at the level of the district-wide or state-wide aggregates but also at the precinct level. For example, the method enables one to fill in not only Table 1.1 with figures such as the fraction of blacks voting for the Democrats in the entire district, but also the precinct-level fractions for each of the 131 tables corresponding to Table 1.2. This has the obvious advantage of providing far more information to the analyst, information that can be studied, plotted on geographic maps, or used as dependent variables in subsequent analyses. It is also quite advantageous for verifying the method, since 131 tests of the model for each data set are considerably more informative than one.

Finally, the solution to the ecological inference problem turns out to be a solution to what geographers call the ``modifiable areal unit problem.'' The modifiable areal unit problem occurs if widely varying estimates result when most methods are applied to alternate reaggregations of the same geographic (or ``areal'') units. This is a major concern in geography and related fields, where numerous articles have been written that rearrange geographic boundaries only to find that correlation coefficients and other statistics totally change substantive interpretations (see Openshaw, 1979, 1984; Fotheringham and Wong, 1991). In contrast, the method given here is almost invariant to the configuration of district lines. If precinct boundaries were redrawn, even in some random fashion, inferences about the cells of Table 1.1 would not drastically change in most cases.

Every methodologist dreams of inventing a statistical procedure that will work even if the researcher applying it does not understand the procedure or possess much ``local knowledge'' about the substance of the problem. This dream has never been fulfilled in statistics, and the same qualification holds for the method proposed here: The more contextual knowledge a researcher makes use of, the more likely the ecological inference is to be valid. The method gives the researcher with this local knowledge the tools to make a valid ecological inference. That is, with a fixed, even inadequate, amount of local knowledge about a problem, a researcher will almost always do far better by using this method than those previously proposed. But making valid ecological inferences is not usually possible without operator intervention. Valid inferences require that the diagnostic tests described be used to verify that the model fits the data and that the distributional assumptions apply. Because the basic problem is a lack of information, bringing diverse sources of knowledge to bear on ecological inferences can have an especially large payoff.



Gary King
Mon Jan 27 13:02:30 EST 1997