Statistician Builds What May Be a Better Data Mousetrap

July 8, 1997, Science Times Section, Page C8

By KAREN FREEMAN

On Nov. 16, 1994, a political scientist sat in on testimony in a federal voting rights case, watching in dismay as the best statistical method then available produced estimates showing that in district after district, more than 100 percent of black voters had chosen Democratic candidates.

More than 100 percent?

"I thought there's just got to be a better way to do this," the political scientist, Dr. Gary King, a professor of government at Harvard University, said in a recent interview.

That better way appears to be a new statistical approach that was published in the spring by King and is already making its mark in voting rights cases. It is expected to help social scientists deal with a wide range of problems and may prove useful in fields like epidemiology as well.

Dr. Charles H. Franklin, a political science professor at the University of Wisconsin at Madison, called King's method "a really impressive piece of work that comes as close to being a real breakthrough as the social sciences are likely to produce."

King has devised a more accurate way to gauge the behavior of individual members of groups when researchers have data only about the groups, called aggregate data.

His method should make it possible to determine how many black voters in a precinct voted Democratic in an election even if the only information available is the percentage of black voters in the precinct and how many voters chose the Democratic candidate.

This seeming legerdemain is made possible by a new approach contained in a software program that King makes available to other researchers on the Internet. His method starts with the common-sense approach of eliminating all answers that are not logical. No group will show up as throwing more than 100 percent of its support to any candidate when his method is used.

King's method does not assume, as do older methods, that all the units studied -- precincts, for example -- will show the same behavior. He calculates the range of votes possible in each precinct, then uses information gained in analyses of the other precincts to narrow his focus. The result is a range of probable answers for each precinct.
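The first step described above, calculating the range of votes possible in each precinct, comes down to simple arithmetic. What follows is a minimal sketch of that bounding step only, not King's software or his full method; the function name and the sample numbers are hypothetical, chosen for illustration.

```python
def black_dem_bounds(black_share, dem_share):
    """Return (low, high): the logically possible range for the fraction
    of a precinct's black voters who chose the Democratic candidate.

    black_share -- fraction of the precinct's voters who are black (0 < x <= 1)
    dem_share   -- fraction of the precinct's voters who voted Democratic
    """
    if not 0.0 < black_share <= 1.0:
        raise ValueError("black_share must be in (0, 1]")
    if not 0.0 <= dem_share <= 1.0:
        raise ValueError("dem_share must be in [0, 1]")

    nonblack_share = 1.0 - black_share
    # Lowest possible value: suppose every nonblack voter voted Democratic;
    # whatever Democratic vote is left over must have come from black voters.
    low = max(0.0, (dem_share - nonblack_share) / black_share)
    # Highest possible value: suppose every Democratic vote came from black
    # voters, capped at 100 percent.
    high = min(1.0, dem_share / black_share)
    return low, high


# Illustrative precincts (made-up numbers).
narrow = black_dem_bounds(0.8, 0.9)   # heavily black, heavily Democratic
wide = black_dem_bounds(0.2, 0.5)     # mixed precinct
```

In a precinct that is 80 percent black and 90 percent Democratic, the answer must lie between 87.5 and 100 percent. In one that is 20 percent black and 50 percent Democratic, the bounds run from 0 to 100 percent and say nothing on their own, which is why the method goes on to borrow information from the other precincts.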

Researchers have felt uneasy about the statistical ground where King is now treading ever since 1950. That year, a pivotal article by Dr. William S. Robinson of the University of California at Los Angeles in The American Sociological Review warned against making inferences about individuals based on group data.

Robinson pointed out that a statistician who looked at the numbers of immigrants and the literacy rates in each state would find a correlation indicating that immigrants had higher literacy rates than native-born Americans, even though official records of individuals had shown that not to be the case. The reason was that the higher literacy rates were also found in states that had spent more on education, and they just happened to have the highest percentages of immigrants.

Since that time, fledgling statisticians have been warned away from what has been called the ecological fallacy, or ecological inference problem: using group data, like the overall vote totals and racial makeup of a precinct, to infer something about the behavior of individuals, like how many voters of a particular race voted for a candidate.

"Robinson scared the living daylights out of everybody who worked with aggregate, or group-level, data," King said. "About the same time, modern survey research began, and that's been the most important methodological innovation in the social sciences of this century."

Since surveys, or polls, do provide information about what individuals do, there might seem to be little need for a better method for dealing with the fuzzier aggregate data. But in many cases, survey data are not available. Doing a precinct-level survey costs nearly as much as doing a national survey, so few polls look at such small geographic units.

And polls can be unreliable when the issue at hand is highly controversial (when it is racially charged, for example) because people are sometimes reluctant to give their true opinions. King's method could also be used to address historical questions that predate polling, like which German voters supported Hitler.

More accurate information from aggregate data might resolve questions in which group-level and individual-level results appear to disagree, both in social science and in other fields that rely heavily on statistics, like epidemiology.

One area in dispute in epidemiology concerns the relationship between radon exposure and lung cancer. Biological studies of individuals indicate that high levels of radon exposure increase the risk of lung cancer. But analyses of aggregate data by Dr. Bernard Cohen, who retired from the environmental and occupational health department at the University of Pittsburgh, indicate that the states with the highest radon exposures have the lowest lung cancer rates. His explanation is that radon might actually keep people from getting lung cancer.

Cohen's data have been challenged by a number of epidemiologists, including Dr. Sander Greenland at the University of California at Los Angeles School of Public Health. King's method could shed light on the radon question and other areas in epidemiology, Greenland said.

"Epidemiology is very slow and very cautious about adopting methods from other fields," Greenland said, "but the method could be of use, once validated for each specific application. Gary King has certainly done a thorough job of examining the mathematical issues, so I'd be interested to see what happens when someone attempts to apply the method to epidemiological questions."

Dr. Steven Piantadosi, director of biostatistics at the Johns Hopkins Oncology Center, said, "If somebody has discovered a way to make reliable inferences from aggregate data, that could be very helpful. It could be wrong, too."

So far, King's method is being widely accepted by political scientists and is beginning to be used in federal voting rights cases, where the issue is usually whether a particular group is being prevented from electing the candidates it favors.

He validated his method by using it to calculate 16,000 answers about voters' behavior where the individual-level answers were already known, comparing his results with the actual numbers, and he showed it to be highly accurate. The American Political Science Association recognized the study, supported by the National Science Foundation, as the "best methodological work in political science in 1995-96" and selected King to get its Gosnell Award.

The method's introduction into court testimony came this spring, when King testified as an expert witness in Federal District Court for the Southern District of Ohio (Eastern Division) in a voting rights case. Using his method, King testified that the voters split along racial lines to a greater extent in some elections than in others, the kind of information judges need to know in determining whether a particular group of voters was prevented from electing its preferred candidates because of the way district lines are drawn. Those results went unchallenged, he said.

The method will be used again in a case scheduled to begin on July 28 on whether to overturn Proposition 198, adopted by California voters in 1996, which calls for open primary elections, said Dr. R. Michael Alvarez, a political scientist at the California Institute of Technology. The proposition, which would allow voters to mix and match candidates from all parties in primaries, is being fought by the Democratic and Republican Parties in Federal District Court.

Alvarez and Dr. Jonathan Nagler, a political scientist at the University of California at Riverside, have used King's method to study cross-over voting in another open-primary state, Washington, and will present their results to the court. At issue in the case is just how much cross-over voting occurs and whether voters use cross-over voting to make another party nominate weaker candidates.

"This technique is being very well received and will revolutionize the study of political behavior, opening the door to all sorts of analyses that have been neglected in the last 30 or 40 years," Alvarez said.

Simple correlations cannot furnish the answers provided by King's method. Assume the problem is to find out how many black voters chose a Democratic candidate. Merely showing that precincts with large black populations voted heavily for the Democrat would not allow any conclusions to be drawn about how blacks vote because the nonblack voters living in such precincts might favor the Democrats more than nonblack voters in other precincts. But King's method would provide reliable answers.
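The trap can be seen in a small made-up example. The sketch below assumes the older approach amounts to fitting a straight line to precinct totals and reading off its value for an all-black precinct; the article does not name the older method, so that is an assumption, and every number here is invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical truth: 90 percent of black voters vote Democratic everywhere.
TRUE_BLACK_SUPPORT = 0.9


def nonblack_support(black_share):
    # Hypothetical: nonblack voters favor the Democrat more in precincts
    # with larger black populations, the situation described above.
    return 0.2 + 0.7 * black_share


# Nine invented precincts, from 10 percent to 90 percent black.
black_shares = [i / 10 for i in range(1, 10)]
dem_shares = [
    TRUE_BLACK_SUPPORT * x + nonblack_support(x) * (1 - x)
    for x in black_shares
]

# The fitted line's value at black_share = 1.0 is the straight-line
# estimate of how black voters voted.
intercept, slope = fit_line(black_shares, dem_shares)
estimated_black_support = intercept + slope

print(f"true: {TRUE_BLACK_SUPPORT:.3f}  estimated: {estimated_black_support:.3f}")
```

With these invented numbers, the straight-line estimate comes out above 100 percent even though the true figure is 90 percent, the same kind of impossible answer that prompted King's search for a better method.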

The method appears to be largely self-correcting, said Franklin of the University of Wisconsin, because it is easy to spot when it is not working well. "It's far better to know that you don't know as much as you think you do," he said.

"When the new method fails, it tends to produce evidence that shows you you're not doing a good job," Franklin said. "The opposite happens with the older method: it tends to tell you too small a margin of error when it's doing badly."

More studies need to be done to validate King's method for multiple variables, like more than two categories of voters or political parties, Franklin said. "It's clear that Gary has done more toward solving this problem than has been done since 1950, despite some other good work in the field," he said. "Everybody wants experience with it and wants to see what happens when other people apply the method."

Franklin expects a "burst of activity" as political scientists look for previously unavailable answers in census and election data.

King explains his method in his book "A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior From Aggregate Data" (Princeton University Press) and his software is available through his home page (http://gking.harvard.edu).

"I would be very surprised if the old statistical method persisted into the next round of redistricting," Franklin said.
