The data for the first application come from the state of Louisiana,
which records by precinct the number of blacks who vote and the number
of whites who vote (among those registered). These data make it
possible to evaluate the ecological inference model described in this
book as follows. For each of Louisiana's 3,262 precincts, the
procedure uses only aggregate data: the fraction of those registered
who are black and the fraction of registered people turning out to
vote for the 1990 elections (as well as the number registered). These
aggregate, precinct-level data are then used to estimate the
fraction of blacks who vote in each precinct. Finally, I validate the
model by comparing these estimates to the true fractions of
blacks who turn out to vote. (That is, the true fractions of black
and white turnout are not used in the estimation
procedure.)
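The distinction between what the procedure observes and what is held back for validation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Precinct` class, the function names, the symbols `x`, `t`, `n`, and the counts themselves are all hypothetical and are not taken from the Louisiana data.

```python
from dataclasses import dataclass


@dataclass
class Precinct:
    # Individual-level counts. In an ecological inference problem these are
    # unknown; only the aggregates computed from them are observed.
    black_registered: int
    white_registered: int
    black_voters: int
    white_voters: int


def aggregates(p):
    """Return (x, t, n): fraction black, fraction voting, number registered."""
    n = p.black_registered + p.white_registered
    x = p.black_registered / n
    t = (p.black_voters + p.white_voters) / n
    return x, t, n


def true_rates(p):
    """Return the black and white turnout fractions held back for validation."""
    return (p.black_voters / p.black_registered,
            p.white_voters / p.white_registered)


# Made-up counts for a single hypothetical precinct.
p = Precinct(black_registered=400, white_registered=600,
             black_voters=180, white_voters=390)
x, t, n = aggregates(p)
beta_b, beta_w = true_rates(p)

# Accounting identity: overall turnout is a weighted average of the two
# group-specific turnout fractions, with weights x and 1 - x.
assert abs(t - (beta_b * x + beta_w * (1 - x))) < 1e-12
```

Only the output of `aggregates` would be available to the estimation procedure; the output of `true_rates` plays the role of the truth against which the estimates are later compared.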
One brief summary of the results of this analysis appears in Figure 1.1. This figure plots the estimated
fraction of blacks turning out to vote in 1990 (horizontally) by the
true fraction of blacks voting in that year (vertically). Each
precinct is represented in the figure by a circle with area
proportional to the number of blacks in the precinct. If the model
estimates were exactly correct in every precinct, each circle would be
centered exactly on the diagonal line.
The results are compelling. If Figure 1.1 were merely a plot of the observed
values of a variable by the fitted values of the same variable used
during the estimation procedure, any empirical researcher should be
pleased: the fit is extremely good. If instead the figure were based
on the harder problem of making out-of-sample predictions, where past
realizations were used to calibrate the prediction, the result would
be more impressive still. But the result here is even more dramatic, since the
estimates in the figure were computed from only aggregate data. The
true fraction of blacks turning out to vote (the vertical dimension in
the figure) was not part of the estimation procedure. Moreover, no
past realizations of the truth being estimated were used.
Part IV provides many more model evaluations, of many types. These
evaluations include data sets for which existing methods do reasonably
well at estimating the statewide average, in which case the method
offered here also gives reasonable statewide results and in addition
much more information in the form of correct confidence intervals and
accurate results for each precinct in the state. Part IV also gives
examples of data sets where existing methods are hopelessly biased,
but the method offered here gives highly accurate estimates. For
example, the best existing method indicates that 20% fewer males in
South Carolina fall below the poverty level than there are males in
that state (see Table 11.2 on page 220). In contrast, the method
offered here gives accurate answers for this statewide aggregate (see
Figure 11.2 on page 222) as well as for the fraction of males in
poverty in each of the 3,187 precinct-sized geographic units (see
Figure 11.3 on page 223).
The book also includes situations in which almost all information was
aggregated away and standard methods give even more ridiculous
results; in those cases, the method described here gives reasonable
results with wider confidence intervals, reflecting accurately the
degree of uncertainty in the ecological inference (see Chapter 12).
The method usually even gives accurate estimates when all the
conditions for "aggregation bias" are met, when the process of
aggregation eliminates most of the variation in one of the aggregate
variables, and when extrapolations far from the range of observed data
are necessary. In all these difficult examples, the method offered
here gives accurate answers with correct confidence intervals. The
method will not always work: since information is lost during
aggregation, no method of ecological inference could work in all data
sets. However, the procedures introduced here come with diagnostics
that researchers can use to evaluate the risks and avoid the problems
in most cases.
Finally, I give a brief report of an analysis of 1990 turnout by race
in New Jersey's 567 minor civil divisions (mostly cities and towns).
These data cannot be used to verify ecological inferences since the
true individual-level answers are not known, but they can be used to
demonstrate how much more information the method offered here provides
to users. The most popular existing method (Goodman's regression)
gives only two numbers of relevance, the statewide fractions of
blacks who vote and whites who vote (the latter estimate,
incidentally, is five standard deviations above its maximum possible
value given by the method of bounds). In contrast, the solution to
the ecological inference problem offered here gives reliable estimates
of these two numbers for the statewide average as well as for each of
the 567 cities and towns.
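To make the contrast concrete, the following Python sketch shows how a Goodman-style regression can produce an estimate outside the range that the method of bounds allows. It is a minimal illustration, not the analysis reported here: the function names and the three data points are hypothetical, and the regression is written as a no-intercept least-squares fit of turnout on the fraction black and the fraction white, which is the usual textbook form.

```python
def goodman(xs, ts):
    """Least-squares fit of t = bb * x + bw * (1 - x), with no intercept.

    Returns (bb, bw), the two aggregate-level turnout estimates.
    """
    s_xx = sum(x * x for x in xs)
    s_xz = sum(x * (1 - x) for x in xs)
    s_zz = sum((1 - x) ** 2 for x in xs)
    s_xt = sum(x * t for x, t in zip(xs, ts))
    s_zt = sum((1 - x) * t for x, t in zip(xs, ts))
    det = s_xx * s_zz - s_xz ** 2
    # Solve the 2x2 normal equations by Cramer's rule.
    bb = (s_xt * s_zz - s_zt * s_xz) / det
    bw = (s_zt * s_xx - s_xt * s_xz) / det
    return bb, bw


def bounds_black(x, t):
    """Deterministic bounds on black turnout in one unit (requires x > 0)."""
    return max(0.0, (t - (1 - x)) / x), min(1.0, t / x)


def bounds_white(x, t):
    """Deterministic bounds on white turnout in one unit (requires x < 1)."""
    return max(0.0, (t - x) / (1 - x)), min(1.0, t / (1 - x))


# Made-up aggregate data in which turnout falls steeply as x rises.
xs = [0.1, 0.2, 0.3]
ts = [0.8, 0.6, 0.4]
bb, bw = goodman(xs, ts)
# bb is roughly -1 here: an impossible turnout fraction, even though
# every unit's own bounds lie inside [0, 1].
unit_bounds = [bounds_black(x, t) for x, t in zip(xs, ts)]
```

The regression returns only the pair `(bb, bw)` for the whole data set, whereas the bounds are computed separately for every unit; that difference in granularity is the point of the comparison in the text.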
In order to emphasize the rich information this method unearths,
Figure 1.2 maps the estimated degree of
voter turnout among non-minorities. In this map, minor civil divisions
in New Jersey are given darker shades when the estimated degree of
non-minority voter turnout is higher. A few landmarks are labeled to
give readers some bearing. The vast increase in information the
method provides is represented by the interesting geographic variation
in this map (and an additional complete map for minority
turnout). For example, Figure 1.2
shows that non-minority turnout is substantially higher in the city of
Newark than in the neighboring city of Elizabeth. Is this because of a
racial threat posed by Newark's larger minority population? Is the
white mobilization in the wealthy towns of Bergen County near
Englewood Cliffs a result of the state government's attempt to
integrate schools by regionalizing its school districts? By providing
reliable individual-level geographic-based information, the solution
to the ecological inference problem can be used to raise numerous
questions such as these. The method also provides opportunities for
answering such questions by using the estimates provided as dependent
variables in second-stage analyses (using, in this case, explanatory
variables such as fraction minority population, or state attempts at
integration).
In Figure 1.1, almost all of the 3,262 precincts fall on or near the
diagonal line along which estimate and truth coincide,
demonstrating the success of this method of making inferences about
individual behavior using only aggregate data. The few precincts that
are farther from the line have tiny numbers of African Americans, so
the vast majority of individual voters are correctly estimated.
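One way to express the last point numerically is to weight each precinct's error by the number of blacks it contains, mirroring the circle areas in the figure. The sketch below is an assumption-laden illustration: the function name and the three precincts are invented, and a weighted mean absolute error is only one of several reasonable summaries.

```python
def mean_abs_error(estimates, truths, weights=None):
    """Mean absolute error, optionally weighted by precinct size."""
    if weights is None:
        weights = [1.0] * len(estimates)
    total = sum(weights)
    return sum(w * abs(e - t)
               for e, t, w in zip(estimates, truths, weights)) / total


# Made-up values: two large precincts estimated well, one tiny precinct
# estimated poorly.
estimates = [0.50, 0.62, 0.10]
truths = [0.52, 0.60, 0.40]
n_black = [900, 1100, 5]

unweighted = mean_abs_error(estimates, truths)
weighted = mean_abs_error(estimates, truths, n_black)
# The tiny precinct dominates the unweighted summary but barely moves
# the weighted one, which reflects error per individual voter.
```

With weights proportional to group size, a large error in a precinct with very few black voters contributes little, which is why outlying small circles do not undermine the overall fit.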