Areas of Research

    • Evaluating U.S. Social Security Administration Forecasts
      The accuracy of U.S. Social Security Administration (SSA) demographic and financial forecasts is crucial for the solvency of its Trust Funds, government programs comprising greater than 50% of all federal government expenditures, industry decision making, and the evidence base of many scholarly articles. Forecasts are also essential for scoring policy proposals, put forward by both political parties. Because SSA makes public little replication information, and uses ad hoc, qualitative, and antiquated statistical forecasting methods, no one in or out of government has been able to produce fully independent alternative forecasts or policy scorings. Yet, no systematic evaluation of SSA forecasts has ever been published by SSA or anyone else. We show that SSA's forecasting errors were approximately unbiased until about 2000, but then began to grow quickly, with increasingly overconfident uncertainty intervals. Moreover, the errors all turn out to be in the same potentially dangerous direction, each making the Social Security Trust Funds look healthier than they actually are. We also discover the cause of these findings with evidence from a large number of interviews we conducted with participants at every level of the forecasting and policy processes. We show that SSA's forecasting procedures meet all the conditions the modern social-psychology and statistical literatures demonstrate make bias likely. When those conditions mixed with potent new political forces trying to change Social Security and influence the forecasts, SSA's actuaries hunkered down trying hard to insulate themselves from the intense political pressures. Unfortunately, this otherwise laudable resistance to undue influence, along with their ad hoc qualitative forecasting models, led them to also miss important changes in the input data such as retirees living longer lives, and drawing more benefits, than predicted by simple extrapolations. We explain that solving this problem involves using (a) removing human judgment where possible, by using formal statistical methods -- via the revolution in data science and big data; (b) instituting formal structural procedures when human judgment is required -- via the revolution in social psychological research; and (c) requiring transparency and data sharing to catch errors that slip through -- via the revolution in data sharing & replication. An article at Barron's about our work.
    • Incumbency Advantage
      Proof that previously used estimators of electoral incumbency advantage were biased, and a new unbiased estimator. Also, the first systematic demonstration that constituency service by legislators increases the incumbency advantage.
    • Mexican Health Care Evaluation
      An evaluation of the Mexican Seguro Popular program (designed to extend health insurance and regular and preventive medical care, pharmaceuticals, and health facilities to 50 million uninsured Mexicans), one of the world's largest health policy reforms of the last two decades. Our evaluation features a new design for field experiments that is more robust to the political interventions and implementation errors that have ruined many similar previous efforts; new statistical methods that produce more reliable and efficient results using fewer resources, assumptions, and data; and an implementation of these methods in the largest randomized health policy experiment to date. (See the Harvard Gazette story on this project.)
    • Presidency Research; Voting Behavior
      Resolution of the paradox of why polls are so variable over time during presidential campaigns even though the vote outcome is easily predictable before it starts. Also, a resolution of a key controversy over absentee ballots during the 2000 presidential election; and the methodology of small-n research on executives.
    • Informatics and Data Sharing
      Replication Standards New standards, protocols, and software for citing, sharing, analyzing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating scholarly research data and analyses. Also includes proposals to improve the norms of data sharing and replication in science.
    • International Conflict
      Methods for coding, analyzing, and forecasting international conflict and state failure. Evidence that the causes of conflict, theorized to be important but often found to be small or ephemeral, are indeed tiny for the vast majority of dyads, but are large, stable, and replicable wherever the ex ante probability of conflict is large.
    • Legislative Redistricting
      The definition of partisan symmetry as a standard for fairness in redistricting; methods and software for measuring partisan bias and electoral responsiveness; discussion of U.S. Supreme Court rulings about this work. Evidence that U.S. redistricting reduces bias and increases responsiveness, and that the electoral college is fair; applications to legislatures, primaries, and multiparty systems.
    • Mortality Studies
      Methods for forecasting mortality rates (overall or for time series data cross-classified by age, sex, country, and cause); estimating mortality rates in areas without vital registration; measuring inequality in risk of death; applications to US mortality, the future of the Social Security, armed conflict, heart failure, and human security.
    • Teaching and Administration
      Publications and other projects designed to improve teaching, learning, and university administration, as well as broader writings on the future of the social sciences.
    • Automated Text Analysis
      Automated and computer-assisted methods of extracting, organizing, and consuming knowledge from unstructured text.
    • Causal Inference
      Methods for detecting and reducing model dependence (i.e., when minor model changes produce substantively different inferences) in inferring causal effects and other counterfactuals. Matching methods; "politically robust" and cluster-randomized experimental designs; causal bias decompositions.
    • Event Counts and Durations
      Statistical models to explain or predict how many events occur for each fixed time period, or the time between events. An application to cabinet dissolution in parliamentary democracies which united two previously warring scholarly literature. Other applications to international relations and U.S. Supreme Court appointments.
    • Ecological Inference
      Inferring individual behavior from group-level data: The first approach to incorporate both unit-level deterministic bounds and cross-unit statistical information, methods for 2x2 and larger tables, Bayesian model averaging, applications to elections, software.
    • Missing Data
      Statistical methods to accommodate missing information in data sets due to scattered unit nonresponse, missing variables, or cell values or variables measured with error. Easy-to-use algorithms and software for multiple imputation and multiple overimputation for surveys, time series, and time series cross-sectional data. Applications to electoral, and other compositional, data.
    • Qualitative Research
      How the same unified theory of inference underlies quantitative and qualitative research alike; scientific inference when quantification is difficult or impossible; research design; empirical research in legal scholarship.
    • Rare Events
      How to save 99% of your data collection costs; bias corrections for logistic regression in estimating probabilities and causal effects in rare events data; estimating base probabilities or any quantity from case-control data; automated coding of events.
    • Survey Research
      "Anchoring Vignette" methods for when different respondents (perhaps from different cultures, countries, or ethnic groups) understand survey questions in different ways; an approach to developing theoretical definitions of complicated concepts apparently definable only by example (i.e., "you know it when you see it"); how surveys work.
    • Unifying Statistical Analysis
      Development of a unified approach to statistical modeling, inference, interpretation, presentation, analysis, and software; integrated with most of the other projects listed here.

Recent Work

A Unified Approach to Measurement Error and Missing Data: Details and Extensions
Matthew Blackwell, James Honaker, and Gary King. 2015. “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” Sociological Methods and Research, 1-28. Publisher's VersionAbstract

We extend a unified and easy-to-use approach to measurement error and missing data. In our companion article, Blackwell, Honaker, and King give an intuitive overview of the new technique, along with practical suggestions and empirical applications. Here, we offer more precise technical details, more sophisticated measurement error model specifications and estimation procedures, and analyses to assess the approach’s robustness to correlated measurement errors and to errors in categorical variables. These results support using the technique to reduce bias and increase efficiency in a wide variety of empirical research.

A Theory of Statistical Inference for Matching Methods in Applied Causal Research
Stefano M. Iacus, Gary King, and Giuseppe Porro. 2015. “A Theory of Statistical Inference for Matching Methods in Applied Causal Research”.Abstract

To reduce model dependence and bias in causal inference, researchers usually use matching as a data preprocessing step, after which they apply whatever statistical model and uncertainty estimators they would have without matching. Unfortunately, this approach is appropriate in finite samples only under exact matching, which is usually infeasible, or approximate matching only under asymptotic theory if large enough sample sizes are available, but even then requires unfamiliar specialized point and variance estimators. Instead of attempting to change common practices, we show how those analyzing certain specific (but extremely common) types of data can instead appeal to a much easier version of existing theory. This alternative theory is substantively plausible, requires no asymptotic theory, and is simple to understand. Its core conceptualizes continuous variables as having natural breakpoints, which are common in applications (e.g., high school or college degrees in years of education, a governmental poverty level in income, or phase transitions in temperature). The theory allows binary, multicategory, and continuous treatment variables from the outset and straightforward extensions for imperfect treatment assignment and different versions of treatments.

How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It
Gary King and Margaret E Roberts. 2015. “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It.” Political Analysis, 2, 23: 159–179. Publisher's VersionAbstract

"Robust standard errors" are used in a vast array of scholarship to correct standard errors for model misspecification. However, when misspecification is bad enough to make classical and robust standard errors diverge, assuming that it is nevertheless not so bad as to bias everything else requires considerable optimism. And even if the optimism is warranted, settling for a misspecified model, with or without robust standard errors, will still bias estimators of all but a few quantities of interest. The resulting cavernous gap between theory and practice suggests that considerable gains in applied statistics may be possible. We seek to help researchers realize these gains via a more productive way to understand and use robust standard errors; a new general and easier-to-use "generalized information matrix test" statistic that can formally assess misspecification (based on differences between robust and classical variance estimates); and practical illustrations via simulations and real examples from published research. How robust standard errors are used needs to change, but instead of jettisoning this popular tool we show how to use it to provide effective clues about model misspecification, likely biases, and a guide to considerably more reliable, and defensible, inferences. Accompanying this article [soon!] is software that implements the methods we describe. 

Methods for Extremely Large Scale Media Experiments and Observational Studies (Poster)
Gary King, Benjamin Schneer, and Ariel White. 2014. “Methods for Extremely Large Scale Media Experiments and Observational Studies (Poster).” In Society for Political Methodology. Athens, GA, 24 July.Abstract

This is a poster presentation describing (1) the largest ever experimental study of media effects, with more than 50 cooperating traditional media sites, normally unavailable web site analytics, the text of hundreds of thousands of news articles, and tens of millions of social media posts, and (2) a design we used in preparation that attempts to anticipate experimental outcomes

You Lie! Patterns of Partisan Taunting in the U.S. Senate (Poster)
Justin Grimmer, Gary King, and Chiara Superti. 2014. “You Lie! Patterns of Partisan Taunting in the U.S. Senate (Poster).” In Society for Political Methodology. Athens, GA, 24 July.Abstract

This is a poster that describes our analysis of "partisan taunting," the explicit, public, and negative attacks on another political party or its members, usually using vitriolic and derogatory language. We first demonstrate that most projects that hand code text in the social sciences optimize with respect to the wrong criterion, resulting in large, unnecessary biases. We show how to fix this problem and then apply it to taunting. We find empirically that, unlike most claims in the press and the literature, taunting is not inexorably increasing; it appears instead to be a rational political strategy, most often used by those least likely to win by traditional means -- ideological extremists, out-party members when the president is unpopular, and minority party members. However, although taunting appears to be individually rational, it is collectively irrational: Constituents may resonate with one cutting taunt by their Senator, but they might not approve if he or she were devoting large amounts of time to this behavior rather than say trying to solve important national problems. We hope to partially rectify this situation by posting public rankings of Senatorial taunting behavior.

Reverse-engineering censorship in China: Randomized experimentation and participant observation
Gary King, Jennifer Pan, and Margaret E Roberts. 2014. “Reverse-Engineering Censorship in China: Randomized Experimentation and Participant Observation.” Science, 6199, 345: 1-10. Publisher's VersionAbstract

Existing research on the extensive Chinese censorship organization uses observational methods with well-known limitations. We conducted the first large-scale experimental study of censorship by creating accounts on numerous social media sites, randomly submitting different texts, and observing from a worldwide network of computers which texts were censored and which were not. We also supplemented interviews with confidential sources by creating our own social media site, contracting with Chinese firms to install the same censoring technologies as existing sites, and—with their software, documentation, and even customer support—reverse-engineering how it all works. Our results offer rigorous support for the recent hypothesis that criticisms of the state, its leaders, and their policies are published, whereas posts about real-world events with collective action potential are censored.

Computer-Assisted Keyword and Document Set Discovery from Unstructured Text
Gary King, Patrick Lam, and Margaret Roberts. 2014. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text”.Abstract

The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend crucially on this choice, researchers typically pick keywords in ad hoc ways, given the lack of formal statistical methods to help. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depends in practice on the inadequate keyword counting or matching methods they are designed to replace. The same ad hoc keyword selection process is also used in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text, without needing any structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and then summarizing results with Boolean search strings. We illustrate how the technique works with examples in English and Chinese.

Google Flu Trends Still Appears Sick: An Evaluation of the 2013‐2014 Flu Season
David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “Google Flu Trends Still Appears Sick: An Evaluation of the 2013‐2014 Flu Season”.Abstract
Last year was difficult for Google Flu Trends (GFT). In early 2013, Nature reported that GFT was estimating more than double the percentage of doctor visits for influenza like illness than the Centers for Disease Control and Prevention s (CDC) sentinel reports during the 2012 2013 flu season (1). Given that GFT was designed to forecast upcoming CDC reports, this was a problematic finding. In March 2014, our report in Science found that the overestimation problem in GFT was also present in the 2011 2012 flu season (2). The report also found strong evidence of autocorrelation and seasonality in the GFT errors, and presented evidence that the issues were likely, at least in part, due to modifications made by Google s search algorithm and the decision by GFT engineers not to use previous CDC reports or seasonality estimates in their models what the article labeled algorithm dynamics and big data hubris respectively. Moreover, the report and the supporting online materials detailed how difficult/impossible it is to replicate the GFT results, undermining independent efforts to explore the source of GFT errors and formulate improvements.
The Parable of Google Flu: Traps in Big Data Analysis
David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science, 14 March, 343: 1203-1205.Abstract
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.

In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States ( 1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data ( 3, 4), what lessons can we draw from this error?

Restructuring the Social Sciences: Reflections from Harvard's Institute for Quantitative Social Science
Gary King. 2014. “Restructuring the Social Sciences: Reflections from Harvard's Institute for Quantitative Social Science.” PS: Political Science and Politics, 1, 47: 165-172. Cambridge University Press versionAbstract

The social sciences are undergoing a dramatic transformation from studying problems to solving them; from making do with a small number of sparse data sets to analyzing increasing quantities of diverse, highly informative data; from isolated scholars toiling away on their own to larger scale, collaborative, interdisciplinary, lab-style research teams; and from a purely academic pursuit to having a major impact on the world. To facilitate these important developments, universities, funding agencies, and governments need to shore up and adapt the infrastructure that supports social science research. We discuss some of these developments here, as well as a new type of organization we created at Harvard to help encourage them -- the Institute for Quantitative Social Science.  An increasing number of universities are beginning efforts to respond with similar institutions. This paper provides some suggestions for how individual universities might respond and how we might work together to advance social science more generally.

The Troubled Future of Colleges and Universities (with comments from five scholar-administrators)
Gary King and Maya Sen. 2013. “The Troubled Future of Colleges and Universities (With Comments from Five Scholar-Administrators).” PS: Political Science and Politics, 1, 46: 81--113.Abstract

The American system of higher education is under attack by political, economic, and educational forces that threaten to undermine its business model, governmental support, and operating mission. The potential changes are considerably more dramatic and disruptive than what we've already experienced. Traditional colleges and universities urgently need a coherent, thought-out response. Their central role in ensuring the creation, preservation, and distribution of knowledge may be at risk and, as a consequence, so too may be the spectacular progress across fields we have come to expect as a result.

Symposium contributors include Henry E. Brady, John Mark Hansen, Gary King, Nannerl O. Keohane, Michael Laver, Virginia Sapiro, and Maya Sen.

Gary King. 2002. “COUNT: A Program for Estimating Event Count and Duration Regressions”.Abstract
A stand-alone, easy-to-use program for running event count and duration regression models, developed by and/or discussed in a series of journal articles by me. (Event count models have a dependent variable measured as the number of times something happens, such as the number of uncontested seats per state or the number of wars per year. Duration models explain dependent variables measured as the time until some event, such as the number of months a parliamentary cabinet endures.) Winner of the APSA Research Software Award.
Gary King. 1998. “MAXLIK”.Abstract
A set of Gauss programs and datasets (annotated for pedagogical purposes) to implement many of the maximum likelihood-based models I discuss in Unifying Political Methodology: The Likelihood Theory of Statistical Inference, Ann Arbor: University of Michigan Press, 1998, and use in my class. All datasets are real, not simulated.
JudgeIt I: A Program for Evaluating Electoral Systems and Redistricting Plans
Andrew Gelman and Gary King. 1992. “JudgeIt I: A Program for Evaluating Electoral Systems and Redistricting Plans”.Abstract
A program for analyzing almost any feature of district-level legislative elections data, including prediction, evaluating redistricting plans, estimating counterfactual hypotheses (such as what would happen if a term-limitation amendment were imposed), and others. This implements statistical procedures described in a series of journal articles and has been used during redistricting in many states by judges, partisans, governments, private citizens, and many others. Winner of the APSA Research Software Award.
Reverse-Engineering Censorship in China, at Ohio State University, Mershon Center for International Security Studies, Thursday, October 22, 2015:

Chinese government censorship of social media constitutes the largest selective suppression of human communication in recorded history. In three ways, we show, paradoxically, that this large system also leaves large footprints that reveal a great deal about itself and the intentions of the government. First is an observational study where we download all social media posts before the Chinese government can read and censor those they deem objectionable, and then detect from a network of computers all over the world which are censored.

Why Propensity Scores Should Not Be Used For Matching, at Department of Epidemiology, Harvard T.H. Chan School of Public Health, Thursday, October 15, 2015:

This talk summarizes a paper -- Gary King and Richard Nielsen. 2015. “Why Propensity Scores Should Not Be Used for Matching” -- with this abstract:  Researchers use propensity score matching (PSM) as a data preprocessing step to selectively prune units prior to applying a model to estimate a causal effect.

Explaining Systematic Bias and Nontransparency in US Social Security Administration Forecasts, at Inaugural Distinguished Lecture, Institute for Social Science, UC-Davis, Wednesday, October 7, 2015:


The accuracy of U.S. Social Security Administration (SSA) demographic and financial forecasts is crucial for the solvency of its Trust Funds, government programs comprising greater than 50% of all federal government expenditures, industry decision making, and the evidence base of many scholarly articles. Forecasts are also essential for scoring policy proposals put forward by both political parties or anyone else.

Why Propensity Scores Should Not Be Used for Matching, at Harvard Medical School, Brigham and Women's Hosptial, Division of Pharmacoepidemiology and Pharmacoeconomics, Wednesday, September 23, 2015:

This talk summarizes a paper -- Gary King and Richard Nielsen. 2015. “Why Propensity Scores Should Not Be Used for Matching” -- with this abstract:  Researchers use propensity score matching (PSM) as a data preprocessing step to selectively prune units prior to applying a model to estimate a causal effect.

Why Propensity Scores Should Not Be Used For Matching, at Harvard's Applied Statistics Workshop, at IQSS, Wednesday, September 16, 2015:

This talk summarizes a paper -- Gary King and Richard Nielsen. 2015. “Why Propensity Scores Should Not Be Used for Matching” -- with this abstract:  Researchers use propensity score matching (PSM) as a data preprocessing step to selectively prune units prior to applying a model to estimate a causal effect.

Why Propensity Scores Should Not Be Used for Matching, at International Methods Colloquium , Friday, September 11, 2015:

This talk summarizes a paper -- Gary King and Richard Nielsen. 2015. “Why Propensity Scores Should Not Be Used for Matching” -- with this abstract:  Researchers use propensity score matching (PSM) as a data preprocessing step to selectively prune units prior to applying a model to estimate a causal effect.

Why Propensity Scores Should Not Be Used For Matching, at Society for Political Methodology, University of Rochester, Friday, July 24, 2015:

This talk summarizes a paper -- Gary King and Richard Nielsen. 2015. “Why Propensity Scores Should Not Be Used for Matching” -- with this abstract: Researchers use propensity score matching (PSM) as a data preprocessing step to selectively prune units prior to applying a model to estimate a causal effect.

Simplifying Matching Methods for Causal Inference, at MIT, Political Methodology Series, Monday, March 16, 2015:

This talk explains how to make matching methods for causal inference easier to use and more powerful. Applied researchers commonly use matching methods as a data preprocessing step for reducing model dependence and bias, after which they use whatever statistical procedure they would have without matching, such as regression. They routinely ignore the requirement that all matches be exact, and also commonly use ad hoc analyses that iterate between formal matching methods and informal balance and sample size checks.

Reverse-Engineering Censorship in China, at Harvard Graduate Commons Program, Wednesday, February 11, 2015:

Chinese government censorship of social media constitutes the largest selective suppression of human communication in recorded history. In three ways, we show, paradoxically, that this large system also leaves large footprints that reveal a great deal about itself and the intentions of the government. First is an observational study where we download all social media posts before the Chinese government can read and censor those they deem objectionable, and then detect from a network of computers all over the world which are censored.