Connecting the Dots of Data

By John Yemma, Boston Globe

What do sports cars and Cambridge residents have in common? New software may provide the answer.

Cambridge's 02138 Zip code is awash in young people - Harvard students, hangers-out, buskers, shoppers, hipsters. It is also a place where you see more sports cars than normal. Young people in Cambridge, it seems, prefer sports cars.

That is an inference which, if true, means big bucks for car dealers, advertising salesmen, ragtop repairmen, and marketers of all stripes. Trouble is, you can't be sure it's true unless you commission an expensive survey of 02138 employing random sampling. Maybe the aging professors and lecturers and hangers-on who live near Harvard University buy sports cars to recapture their youth. Maybe indulgent parents buy sports cars for their kids.

Social scientists call this an "ecological inference problem," a dilemma first recognized 75 years ago, just as the United States was starting to turn into the data-crazy nation it now is. The ecological inference problem occurs when you have two or more sets of data about a geographical area - young people and sports cars in Cambridge - and want to connect them. ("Ecological" in this context means a geographical region filled with people, instead of the usual caribou, snail darters, or condors associated with ecology.) You can make an educated guess, but you can't link the data scientifically.
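To see how little the totals alone pin down, consider a minimal sketch in Python. It is not King's algorithm; it only computes the widest range of answers that the aggregate counts cannot rule out, a classical idea often called the method of bounds. The function name and the resident and sports-car counts are hypothetical, invented for illustration.

```python
def share_bounds(group_total, outcome_total, population):
    """Return the lowest and highest possible share of a group that has
    some outcome, given only area-wide totals (hypothetical helper)."""
    others = population - group_total
    # If everyone outside the group had the outcome, how many are left over?
    lowest = max(0, outcome_total - others)
    # At most, every outcome belongs to a group member.
    highest = min(group_total, outcome_total)
    return lowest / group_total, highest / group_total


# Made-up numbers: 10,000 residents, 6,000 of them young, 1,500 sports cars.
low, high = share_bounds(group_total=6000, outcome_total=1500, population=10000)
print(f"Young residents owning a sports car: between {low:.0%} and {high:.0%}")
```

With these invented totals, anywhere from none to a quarter of the young residents could own the sports cars. The totals are consistent with every answer in that range, which is why an educated guess is not a scientific link.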

So stymied by the ecological inference problem were social scientists that they had despaired of using the troves of aggregate data constantly accumulating in government files, voting records, and censuses to figure out individual behavior. Then Gary King, a professor of government at Harvard, solved the problem. Frank Scioli, director of the National Science Foundation's political science research program, calls King's solution a breakthrough. "Of course, it has been challenged, like science always is, but unless I'm missing something," Scioli says, "it has withstood all the challenges it has faced."

What King did was develop a complex series of formulas, which he has packaged as a software program. I won't amuse you by pretending I understand much more of the algorithm than a few generalities. It is made up of equations filled with Greek letters and little italic exponents.

The National Science Foundation, which helped fund King's research, believes the formula, detailed in King's new book, A Solution to the Ecological Inference Problem, will offer social scientists much more accurate insight into problems ranging from implementation of the Voting Rights Act to epidemiological studies of the link between radon and lung cancer.

King is a casual, self-effacing 38-year-old professor whose office has a leafy view over the north end of Harvard Square. He concocted the sports car example while looking out that window one summer afternoon during an interview; he doesn't have an answer to the problem it poses. Leave that to a Miata dealer.

King got interested in political science, he says, because "to me, Tuesday night politics was more interesting than Monday night football. You don't just win in politics, you get to control the government." Voting is his primary focus. He had thought about the inference problem for most of his 20 years in academia; his epiphany came while sitting in an Ohio courtroom three years ago, watching lawyers argue over the boundaries for a judicial district.

King noticed that no one could say for certain how many blacks in that district voted for Democrats and how many voted for Republicans - an important point in ensuring that voting rights are protected. Exit polls might have done the trick, but they are expensive, unreliable, and unofficial. Official returns showed how many voters pulled Democratic or Republican levers. And census data showed how many blacks were in the district. But when experts tried to tie the two things together, they would come up with absurd numbers.
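How can arithmetic on honest totals yield an impossible figure? The article does not name the method the experts used, so the following Python sketch assumes one common approach of that kind, ecological regression (often called Goodman's regression): fit a straight line through precinct-level shares and read off its value for a hypothetical all-black precinct. The four precincts and their numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented precincts: share of residents who are black (from census data)
# and share of votes cast for Democrats (from official returns).
black_share = [0.1, 0.3, 0.5, 0.7]
dem_share = [0.30, 0.50, 0.70, 0.90]

intercept, slope = fit_line(black_share, dem_share)

# Extrapolate the line to a precinct that is 100 percent black.
estimate = intercept + slope * 1.0
print(f"Estimated share of blacks voting Democratic: {estimate:.0%}")
```

Nothing restrains the fitted line from climbing past 100 percent, and with these made-up precincts it does. A method that respects the limits implied by the totals themselves cannot return such an answer.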

King recalls plaintiffs reporting that 109.63 percent of blacks voted Democratic - "a ridiculous answer," he says. What he wanted, at the least, was a method that would not produce answers known to be wrong. His algorithm can be downloaded from his Web site.

The formula will be a boon to social scientists, economists, market researchers, and all manner of numbers boffins. It has already been applied in voting rights cases. The formula is likely to be used widely in the redistricting round that will follow the 2000 census.

In education, it could determine the efficacy of the burgeoning school choice movement. While privacy rules prevent disclosure of which students in private schools are using school-choice vouchers, the schools do keep track of how many students overall are on vouchers and how many go on to college or drop out. From those two data sets, researchers can figure how well voucher students are doing.

Historians could use King's formula to determine whether it was, as some suspect, working-class people in Germany who accounted for the rapid rise of the Nazi party in the 1930s. German records show where working-class people lived and which areas voted for Nazis. A scientific link is now possible.

King's solution to the ecological inference problem is a tool that will allow researchers to look much more closely at society, to tease local behavior out of social statistics. "If all politics is local," King says from deep inside Tip O'Neill's former congressional district, "then political science has been missing much of the politics all these years."

© Copyright 1997 Globe Newspaper Company