Table 1.1 portrays the issue in this case as an example of the more general ecological inference problem. This table depicts what is known for the election to the Ohio State House that occurred in District 42 in 1990. The black Democratic candidate received 19,896 votes (65% of votes cast) in a race against a white Republican opponent. African Americans constituted 55,054 of the 80,760 people of voting age in this district (68%). Because this known information appears in the margins of the cross-tabulation, it is usually referred to as the marginals. The ecological inference problem involves replacing the question marks in the body of this table with inferences based on information from the marginals. (Ecological inference is traditionally defined in terms of a table like this and thus in terms of discrete individual-level variables. Most political scientists, sociologists, and geographers, and some statisticians, have retained this original definition. Epidemiologists and some others generalize the term to include any aggregation problem, including continuous individual-level variables. I use the traditional definition in this book in order to emphasize the distinctive characteristics of aggregated discrete data, and discuss aggregation problems involving continuous individual-level variables in Chapter 14.)
For example, the question mark in the upper left corner of the table
represents the (unknown) number of blacks who voted for the Democratic
candidate. Obviously, a wide range of different numbers could be put
in this cell of the table without contradicting its row and column
marginals, in this case any number between 0 and 19,896, a logic
referred to in the literature as the method of
bounds.
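The bounds logic just described can be made concrete with a short Python sketch. It uses only the marginals reported for District 42 and computes, for each unknown cell, the smallest and largest counts consistent with the row and column totals. The function name `cell_bounds` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Illustrative sketch of the method of bounds; all names are hypothetical.

def cell_bounds(row_total, col_total, grand_total):
    """Return (lower, upper) limits on one interior cell of a cross-tabulation.

    The cell cannot exceed its row or column total, and it must be at least
    the column total minus everyone who falls in the other rows.
    """
    lower = max(0, row_total + col_total - grand_total)
    upper = min(row_total, col_total)
    return lower, upper


# Marginals for Ohio State House District 42, 1990 (Table 1.1).
rows = {"black": 55_054, "white": 25_706}
cols = {"Democrat": 19_896, "Republican": 10_936, "No vote": 49_928}
grand_total = 80_760

bounds = {
    (race, vote): cell_bounds(r, c, grand_total)
    for race, r in rows.items()
    for vote, c in cols.items()
}

for (race, vote), (lo, hi) in bounds.items():
    print(f"{race:5s} / {vote:10s}: between {lo:,} and {hi:,}")
```

For the upper left cell (blacks voting Democratic) this reproduces the range of 0 to 19,896 given in the text. The marginals alone leave most cells similarly wide open.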
Fortunately, somewhat more information is available in this example,
since the parties in the Ohio case had data at the level of precincts
(or sometimes slightly higher levels of aggregation instead, which I
also will refer to as precincts). Ohio State House District 42 is
composed of 131 precincts, for which information analogous to Table 1.1 is available. For
example, Table 1.2 displays the
information from Precinct P, which in District 42 falls between
Cascade Valley Park and North High School in the First Ward in the
city of Akron. The sum of any item in the precinct tables, across all
precincts, would equal the number in the same position in the district
table. For example, if the number of blacks voting for the Democratic
candidate in Precinct P were added to the same number from each of the
other 130 precincts, we would arrive at the total number of blacks
casting ballots for the Democratic candidate represented as the first
cell in Table 1.1.
The ecological inference problem does not vanish by having access to
the precinct-level data, such as that in Table 1.2, because we ultimately
require individual-level information. Each of the cells in this table
is still unknown. Thus, knowing the parts would tell us about the
whole, but disaggregation to precincts does not appear to reveal much
more about the parts.
With a few minor exceptions, no method has even been proposed to fill
in the unknown quantities at the precinct level in Table 1.2. What scholars have done
is to develop methods to use the observed variation in the marginals
over precincts to help narrow the range of results at the district
level in Table 1.1. For example, if
the Democratic candidate receives the most votes in precincts with the
largest fractions of African Americans, then it seems intuitively
reasonable to suppose that blacks are voting disproportionately for
the Democrats (and thus the upper left cell in Table 1.1 is probably large). This
assumption is often reasonable, but Robinson showed that it can be
dead wrong: the individual-level relationship often has the opposite
sign from this aggregate correlation, as will occur if, for example,
whites in heavily black areas tend to vote more Democratic than whites
living in predominately white neighborhoods.
Unfortunately, even the best available current methods of ecological
inference are often wildly inaccurate. For example, at the federal
trial in Ohio (and in formal sworn deposition and in a prepared
report), the expert witness testifying for the plaintiffs reported
that 109.63% of blacks voted for the Democratic candidate in District
42 in 1990! He also reported in a separate, but obviously related,
statement that a negative number of blacks voted for the Republican
candidate. Lest this seem like one wayward result chosen selectively
from a sea of valid inferences, consider a list of the results from
all districts reported by this witness (every white Republican who
faced a black Democrat since 1986), which I present in Table 1.3. A majority of these results
are over 100%, and thus impossible. No one was accusing the
Democratic candidates of stuffing the ballot box; dead voters were not
suspected of turning out to vote more than they usually do. Rather,
these results point out the failure of the general methodological
approach. For those familiar with existing ecological inference
methods, these results may be disheartening, but they will not be
surprising: impossible results occur with regularity.
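As a quick check on the claim that a majority of the reported results are impossible, the following Python sketch tallies the estimates in Table 1.3. The figures are transcribed from that table. The variable names are illustrative only.

```python
# Estimates transcribed from Table 1.3: (year, district, estimated percent of
# blacks voting for the Democratic candidate).
estimates = [
    (1986, 12, 95.65), (1986, 23, 100.06), (1986, 29, 103.47),
    (1986, 31, 98.92), (1986, 42, 108.41), (1986, 45, 93.58),
    (1988, 12, 95.67), (1988, 23, 102.64), (1988, 29, 105.00),
    (1988, 31, 100.20), (1988, 42, 111.05), (1988, 45, 97.49),
    (1990, 12, 94.79), (1990, 14, 97.83), (1990, 16, 94.36),
    (1990, 23, 101.09), (1990, 25, 98.83), (1990, 29, 103.42),
    (1990, 31, 102.17), (1990, 36, 101.35), (1990, 37, 101.39),
    (1990, 42, 109.63), (1990, 45, 97.62),
]

# A percentage above 100 is logically impossible.
impossible = [(year, district, pct) for year, district, pct in estimates if pct > 100.0]
share = len(impossible) / len(estimates)

print(f"{len(impossible)} of {len(estimates)} estimates exceed 100% ({share:.0%})")
```

By this count, 13 of the 23 reported estimates exceed 100%.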
What of the analyses in Table 1.3 that
produced results that were not impossible? For example, in District
25, the application of this standard method of ecological inference
indicated that 99% of blacks voted for the Democratic candidate in
1990. Is this correct? Since no external information is available,
we have no idea. However, we do know, from other situations where
data do exist with which to verify the results of ecological analyses,
that the methods usually do not work. The problem, of course, is that
when they give results that are technically possible we might be
lulled into believing them. As Robinson so clearly stated, even
technically possible results from these standard methods are usually
wrong.
When ridiculous results appear in academic work, as they sometimes do, there
are few practical ramifications. In contrast, inaccurate results used in making
public policy can have far-reaching consequences. Thus, in order to attempt to
avoid this situation, the witness in this case used the best available methods
at the time and had at his disposal far more resources and time than one would
have for almost any academic project. The partisan control of a state
legislature was at stake, and research resources were the last things that
would be spared if the case could be won. (The witness also had extensive
experience testifying in similar cases.) Moreover, he was using a method (a
version of Goodman's "ecological regression") that the U.S. Supreme Court
had previously declared to be appropriate in applications such as this
(Thornburg v. Gingles, 1986). If there was any way of avoiding these
silly conclusions, he certainly would have done so. Yet, even with all this
going for him he was effectively forced by the lack of better methods to
present results that indicated, in over half the districts he studied, that
more African Americans voted for the Democratic candidate than there were
African Americans who voted.
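To see how a regression on aggregate data can yield figures above 100%, consider the following minimal Python sketch. It is not the witness's actual specification. It assumes the simplest two-group form of Goodman's regression, in which the Democratic share of the vote in each precinct is regressed on the black fraction of voters and the fitted line is read off at 0 and 1. The three precincts are hypothetical, invented purely for illustration, and turnout is ignored.

```python
# Hypothetical precincts, invented for illustration only (not the Ohio data).
# x: fraction of voters who are black; beta_b, beta_w: true fractions of black
# and white voters choosing the Democrat. Whites in heavily black precincts
# are assumed to vote more Democratic than whites elsewhere.
precincts = [
    {"x": 0.2, "beta_b": 0.9, "beta_w": 0.15},
    {"x": 0.5, "beta_b": 0.9, "beta_w": 0.30},
    {"x": 0.8, "beta_b": 0.9, "beta_w": 0.90},
]

# Observed aggregate: Democratic share of the vote in each precinct.
xs = [p["x"] for p in precincts]
ts = [p["beta_b"] * p["x"] + p["beta_w"] * (1 - p["x"]) for p in precincts]

# Ordinary least squares fit of t = intercept + slope * x.
n = len(xs)
x_bar = sum(xs) / n
t_bar = sum(ts) / n
slope = sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, ts)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = t_bar - slope * x_bar

# Goodman-style reading: the fitted line at x = 0 estimates the white rate,
# and the fitted line at x = 1 estimates the black rate.
est_white = intercept
est_black = intercept + slope

print(f"estimated black Democratic share: {est_black:.1%}")
print(f"true black Democratic share:      {0.9:.1%}")
```

In this constructed example, 90% of blacks vote Democratic in every precinct, yet the regression estimate is about 110%. The gap arises because the white rate varies with the racial composition of the precinct, which the regression attributes to black voters.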
Two types of statistical difficulties cause inaccurate results such as these in
ecological inferences. The first is aggregation bias. This is the
effect of the information loss that occurs when individual-level data are
aggregated into the observed marginals. The problem is that in some aggregate
data collections, the type of information loss may be selective, so that
inferences that do not take this into account will be biased.
The second cause of inaccurate results in ecological inferences is a
variety of basic statistical problems, unrelated to aggregation
bias, that have not been incorporated into existing methods. These
are the kinds of issues that would be resolved first in any other
methodological area, although most have not yet been addressed. For
example, much of the data used for ecological inferences has massive levels
of "heteroskedasticity" (a basic problem in regression analysis),
but this has never been noted in the literature (and has sometimes been
explicitly denied), even though it is obviously present in most
published scatter plots (about which more in Chapter 4).
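Table 1.1: Ohio State House District 42, 1990 (district-level marginals; question marks are the unknown cells)

Race of Voting              Voting Decision
Age Person       Democrat   Republican   No vote     Total
black                ?           ?           ?       55,054
white                ?           ?           ?       25,706
Total             19,896      10,936      49,928     80,760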
As a result, some other information or method must be used to further
narrow the range of results.
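Table 1.2: Precinct P, District 42, 1990 (precinct-level marginals; question marks are the unknown cells)

Race of Voting              Voting Decision
Age Person       Democrat   Republican   No vote     Total
black                ?           ?           ?         221
white                ?           ?           ?         484
Total               130          92         483        705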
Table 1.3: Estimated Percent of Blacks Voting for the Democratic Candidate

Year   District   Estimated Percent
1986      12            95.65
          23           100.06
          29           103.47
          31            98.92
          42           108.41
          45            93.58
1988      12            95.67
          23           102.64
          29           105.00
          31           100.20
          42           111.05
          45            97.49
1990      12            94.79
          14            97.83
          16            94.36
          23           101.09
          25            98.83
          29           103.42
          31           102.17
          36           101.35
          37           101.39
          42           109.63
          45            97.62
Mon Jan 27 13:02:30 EST 1997