Next: The Solution Up: Chapter 1: Qualitative Overview Previous: The Necessity of Ecological

The Problem

On 16 and 17 November 1994, a special three-judge federal court met in Cleveland to hear arguments concerning the legality of Ohio's State House districts. A key part of the trial turned on whether African Americans vote differently from whites. The required facts are knowable only for individual voters, but survey data were unavailable (and are unreliable in the context of racial politics), so the only relevant information available to study this question was political and demographic data at the aggregate level.

Table 1.1 portrays the issue in this case as an example of the more general ecological inference problem. This table depicts what is known for the election to the Ohio State House that occurred in District 42 in 1990. The black Democratic candidate received 19,896 votes (65% of votes cast) in a race against a white Republican opponent. African Americans constituted 55,054 of the 80,760 people of voting age in this district (68%). Because this known information appears in the margins of the cross-tabulation, it is usually referred to as the marginals. The ecological inference problem involves replacing the question marks in the body of this table with inferences based on information from the marginals. (Ecological inference is traditionally defined in terms of a table like this and thus in terms of discrete individual-level variables. Most political scientists, sociologists, and geographers, and some statisticians, have retained this original definition. Epidemiologists and some others generalize the term to include any aggregation problem, including continuous individual-level variables. I use the traditional definition in this book in order to emphasize the distinctive characteristics of aggregated discrete data, and discuss aggregation problems involving continuous individual-level variables in Chapter 14.)

 

Race of
Voting Age            Voting Decision
Person         Democrat  Republican  No vote
black                 ?           ?        ?   55,054
white                 ?           ?        ?   25,706
                 19,896      10,936   49,928   80,760

Table 1.1: Ohio State House District 42, 1990 election. The marginals are known; the cell entries (?) are unknown.

For example, the question mark in the upper left corner of the table represents the (unknown) number of blacks who voted for the Democratic candidate. Obviously, a wide range of different numbers could be put in this cell of the table without contradicting its row and column marginals, in this case any number between 0 and 19,896, a logic referred to in the literature as the method of bounds. As a result, some other information or method must be used to further narrow the range of results.
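The bounds logic can be made concrete with a short sketch. The function name `cell_bounds` and the dictionary layout are illustrative assumptions, not anything defined in the text; the only inputs are the marginals of Table 1.1. Each cell can be no larger than the smaller of its row and column totals, and no smaller than the part of its column total that the other rows cannot absorb.

```python
def cell_bounds(row_totals, col_totals):
    """Return {(row, col): (lower, upper)} for every cell of a table
    whose only known entries are its row and column marginals.

    Hypothetical helper for illustration; not part of any library.
    """
    n = sum(row_totals.values())
    if n != sum(col_totals.values()):
        raise ValueError("row and column marginals must sum to the same total")
    bounds = {}
    for row, r_total in row_totals.items():
        for col, c_total in col_totals.items():
            # Lower bound: what is left of this column after every other
            # row has taken as much of it as it possibly can.
            lower = max(0, r_total + c_total - n)
            # Upper bound: a cell cannot exceed either of its marginals.
            upper = min(r_total, c_total)
            bounds[(row, col)] = (lower, upper)
    return bounds


# Marginals of Table 1.1 (District 42, 1990).
rows = {"black": 55054, "white": 25706}
cols = {"Democrat": 19896, "Republican": 10936, "No vote": 49928}

bounds = cell_bounds(rows, cols)
for (row, col), (lower, upper) in bounds.items():
    print(f"{row:5s} {col:10s} between {lower:6d} and {upper:6d}")
```

For the upper left cell this reproduces the range of 0 to 19,896 given above. Only the black "No vote" cell has a lower bound above zero here, which shows how little the district-level marginals alone pin down.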

Fortunately, somewhat more information is available in this example, since the parties in the Ohio case had data at the level of precincts (or sometimes slightly higher levels of aggregation instead, which I also will refer to as precincts). Ohio State House District 42 is composed of 131 precincts, for which information analogous to Table 1.1 is available. For example, Table 1.2 displays the information from Precinct P, which in District 42 falls between Cascade Valley Park and North High School in the First Ward in the city of Akron. The sum of any item in the precinct tables, across all precincts, would equal the number in the same position in the district table. For example, if the number of blacks voting for the Democratic candidate in Precinct P were added to the same number from each of the other 130 precincts, we would arrive at the total number of blacks casting ballots for the Democratic candidate represented as the first cell in Table 1.1.

 

Race of
Voting Age            Voting Decision
Person         Democrat  Republican  No vote
black                 ?           ?        ?      221
white                 ?           ?        ?      484
                    130          92      483      705

Table 1.2: Precinct P, Ohio State House District 42, 1990 election. The marginals are known; the cell entries (?) are unknown.

The ecological inference problem does not vanish by having access to the precinct-level data, such as that in Table 1.2, because we ultimately require individual-level information. Each of the cells in this table is still unknown. Thus, knowing the parts would tell us about the whole, but disaggregation to precincts does not appear to reveal much more about the parts.

With a few minor exceptions, no method has even been proposed to fill in the unknown quantities at the precinct level in Table 1.2. What scholars have done is to develop methods to use the observed variation in the marginals over precincts to help narrow the range of results at the district level in Table 1.1. For example, if the Democratic candidate receives the most votes in precincts with the largest fractions of African Americans, then it seems intuitively reasonable to suppose that blacks are voting disproportionately for the Democrats (and thus the upper left cell in Table 1.1 is probably large). This assumption is often reasonable, but Robinson showed that it can be dead wrong: the individual-level relationship is often of the opposite sign from this aggregate correlation, as will occur if, for example, whites in heavily black areas tend to vote more Democratic than whites living in predominantly white neighborhoods.

Unfortunately, even the best available current methods of ecological inference are often wildly inaccurate. For example, at the federal trial in Ohio (and in a formal sworn deposition and in a prepared report), the expert witness testifying for the plaintiffs reported that 109.63% of blacks voted for the Democratic candidate in District 42 in 1990! He also reported in a separate, but obviously related, statement that a negative number of blacks voted for the Republican candidate. Lest this seem like one wayward result chosen selectively from a sea of valid inferences, consider a list of the results from all districts reported by this witness (every white Republican who faced a black Democrat since 1986), which I present in Table 1.3. A majority of these results are over 100%, and thus impossible. No one was accusing the Democratic candidates of stuffing the ballot box; dead voters were not suspected of turning out to vote more than they usually do. Rather, these results point out the failure of the general methodological approach. For those familiar with existing ecological inference methods, these results may be disheartening, but they will not be surprising: impossible results occur with regularity.

 

                  Estimated Percent of Blacks
Year   District   Voting for the Democratic Candidate
1986      12        95.65%
          23       100.06
          29       103.47
          31        98.92
          42       108.41
          45        93.58
1988      12        95.67
          23       102.64
          29       105.00
          31       100.20
          42       111.05
          45        97.49
1990      12        94.79
          14        97.83
          16        94.36
          23       101.09
          25        98.83
          29       103.42
          31       102.17
          36       101.35
          37       101.39
          42       109.63
          45        97.62

Table 1.3: Estimated percent of blacks voting for the Democratic candidate, by year and district, as reported by the expert witness.

What of the analyses in Table 1.3 that produced results that were not impossible? For example, in District 25, the application of this standard method of ecological inference indicated that 99% of blacks voted for the Democratic candidate in 1990. Is this correct? Since no external information is available, we have no idea. However, we do know, from other situations where data do exist with which to verify the results of ecological analyses, that the methods usually do not work. The problem, of course, is that when they give results that are technically possible we might be lulled into believing them. As Robinson so clearly stated, even technically possible results from these standard methods are usually wrong.

When ridiculous results appear in academic work, as they sometimes do, there are few practical ramifications. In contrast, inaccurate results used in making public policy can have far-reaching consequences. Thus, to avoid this situation, the witness in this case used the best available methods at the time and had at his disposal far more resources and time than one would have for almost any academic project. The partisan control of a state legislature was at stake, and research resources were the last things that would be spared if the case could be won. (The witness also had extensive experience testifying in similar cases.) Moreover, he was using a method (a version of Goodman's "ecological regression") that the U.S. Supreme Court had previously declared to be appropriate in applications such as this (Thornburg v. Gingles, 1986). If there was any way of avoiding these silly conclusions, he certainly would have done so. Yet, even with all this going for him, he was effectively forced by the lack of better methods to present results that indicated, in over half the districts he studied, that more African Americans voted for the Democratic candidate than there were African Americans who voted.
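To see how an estimate above 100% can arise from perfectly possible data, consider a minimal sketch of the basic form of Goodman's regression: regress each precinct's Democratic share on its black fraction, then read the estimated black Democratic share off the fitted line at a black fraction of 1. Everything in the sketch is an assumption made for illustration. The five precincts are hypothetical, turnout is ignored, and the particular version of the method used by the witness is not reproduced here. The hypothetical data build in the pattern described earlier, with whites voting more Democratic in precincts that are more heavily black.

```python
def ols_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical precincts: fraction of voters who are black.
black_frac = [0.1, 0.3, 0.5, 0.7, 0.9]

# Assumed truth: 90% of blacks vote Democratic in every precinct.
TRUE_BLACK_DEM = 0.9


def white_dem_share(x):
    """Assumed truth: white Democratic support rises with the black fraction."""
    return 0.1 + 0.8 * x


# Observed aggregate Democratic share in each precinct (all between 0 and 1).
dem_share = [
    TRUE_BLACK_DEM * x + white_dem_share(x) * (1 - x) for x in black_frac
]

intercept, slope = ols_line(black_frac, dem_share)
beta_white_hat = intercept          # fitted Democratic share at black fraction 0
beta_black_hat = intercept + slope  # fitted Democratic share at black fraction 1

print(f"true black Democratic share:      {TRUE_BLACK_DEM:.1%}")
print(f"estimated black Democratic share: {beta_black_hat:.1%}")  # 103.6%
```

Every observed precinct share in this sketch lies between 0 and 1, yet the fitted line passes above 1 at a black fraction of 1, because the regression attributes to blacks the extra Democratic support that in fact comes from whites in heavily black precincts.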

Two types of statistical difficulties cause inaccurate results such as these in ecological inferences. The first is aggregation bias. This is the effect of the information loss that occurs when individual-level data are aggregated into the observed marginals. The problem is that in some aggregate data collections, the type of information loss may be selective, so that inferences that do not take this into account will be biased.

The second cause of inaccurate results in ecological inferences is a variety of basic statistical problems, unrelated to aggregation bias, that have not been incorporated into existing methods. These are the kinds of issues that would be resolved first in any other methodological area, although most have not yet been addressed. For example, much of the data used for ecological inferences has massive levels of "heteroskedasticity" (a basic problem in regression analysis), but this has never been noted in the literature, and has sometimes been explicitly denied, even though it is obviously present even in most published scatter plots (about which more in Chapter 4).



Gary King
Mon Jan 27 13:02:30 EST 1997