Areas of Research

    • Automated Text Analysis (6)
      Automated and computer-assisted methods of extracting, organizing, and consuming knowledge from unstructured text.
    • Incumbency Advantage (7)
      Proof that previously used estimators of electoral incumbency advantage were biased, and a new unbiased estimator. Also, the first systematic demonstration that constituency service by legislators increases the incumbency advantage.
    • Mexican Health Care Evaluation (6)
      New designs and statistical methods for large-scale policy evaluations; robustness to implementation errors and political interventions, with very high levels of statistical efficiency. Application to the Mexican Seguro Popular de Salud (Universal Health Insurance) Program.
    • Presidency Research; Voting Behavior (9)
      Resolution of the paradox of why polls are so variable over time during presidential campaigns even though the vote outcome is easily predictable before the campaign starts. Also, a resolution of a key controversy over absentee ballots during the 2000 presidential election; and the methodology of small-n research on executives.
    • Informatics and Data Sharing (15)
      New standards, protocols, and software for citing, sharing, analyzing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating scholarly research data and analyses. Also includes proposals to improve the norms of data sharing and replication in science.
    • International Conflict (10)
      Methods for coding, analyzing, and forecasting international conflict and state failure. Evidence that the causes of conflict, theorized to be important but often found to be small or ephemeral, are indeed tiny for the vast majority of dyads, but are large, stable, and replicable wherever the ex ante probability of conflict is large.
    • Legislative Redistricting (15)
      The definition of partisan symmetry as a standard for fairness in redistricting; methods and software for measuring partisan bias and electoral responsiveness; discussion of U.S. Supreme Court rulings about this work. Evidence that U.S. redistricting reduces bias and increases responsiveness, and that the electoral college is fair; applications to legislatures, primaries, and multiparty systems.
    • Mortality Studies (16)
      Methods for forecasting mortality rates (overall or for time series data cross-classified by age, sex, country, and cause); estimating mortality rates in areas without vital registration; measuring inequality in risk of death; applications to US mortality, the future of Social Security, armed conflict, heart failure, and human security.
  • Methods (102)
    • Causal Inference (22)
      Methods for detecting and reducing model dependence (i.e., when minor model changes produce substantively different inferences) in inferring causal effects and other counterfactuals. Matching methods; "politically robust" and cluster-randomized experimental designs; causal bias decompositions.
    • Event Counts and Durations (15)
      Statistical models to explain or predict how many events occur for each fixed time period, or the time between events. An application to cabinet dissolution in parliamentary democracies which united two previously warring scholarly literatures. Other applications to international relations and U.S. Supreme Court appointments.
    • Ecological Inference (17)
      Inferring individual behavior from group-level data: The first approach to incorporate both unit-level deterministic bounds and cross-unit statistical information, methods for 2x2 and larger tables, Bayesian model averaging, applications to elections, software.
    • Missing Data (11)
      Statistical methods to accommodate missing information in data sets due to scattered unit nonresponse, missing variables, or cell values or variables measured with error. Easy-to-use algorithms and software for multiple imputation and multiple overimputation for surveys, time series, and time series cross-sectional data. Applications to electoral, and other compositional, data.
    • Qualitative Research (5)
      How the same unified theory of inference underlies quantitative and qualitative research alike; scientific inference when quantification is difficult or impossible; research design; empirical research in legal scholarship.
    • Rare Events (10)
      How to save 99% of your data collection costs; bias corrections for logistic regression in estimating probabilities and causal effects in rare events data; estimating base probabilities or any quantity from case-control data; automated coding of events.
    • Survey Research (10)
      "Anchoring Vignette" methods for when different respondents (perhaps from different cultures, countries, or ethnic groups) understand survey questions in different ways; an approach to developing theoretical definitions of complicated concepts apparently definable only by example (i.e., "you know it when you see it"); how surveys work.
    • Unifying Statistical Analysis (12)
      Development of a unified approach to statistical modeling, inference, interpretation, presentation, analysis, and software; integrated with most of the other projects listed here.

Recent Work

Soneji, Samir, and Gary King. "Statistical Security for Social Security." Demography (In Press).
The financial viability of Social Security, the single largest U.S. Government program, depends on accurate forecasts of the solvency of its intergenerational trust fund. We begin by detailing information necessary for replicating the Social Security Administration’s (SSA’s) forecasting procedures, which until now has been unavailable in the public domain. We then offer a way to improve the quality of these procedures via age- and sex-specific mortality forecasts. The most recent SSA mortality forecasts were based on the best available technology at the time, which was a combination of linear extrapolation and qualitative judgments. Unfortunately, linear extrapolation excludes known risk factors and is inconsistent with long-standing demographic patterns such as the smoothness of age profiles. Modern statistical methods typically outperform even the best qualitative judgments in these contexts. We show how to use such methods here, enabling researchers to forecast using far more information, such as the known risk factors of smoking and obesity and known demographic patterns. Including this extra information makes a substantial difference: For example, by only improving mortality forecasting methods, we predict three fewer years of net surplus, $730 billion less in Social Security trust funds, and program costs that are 0.66% greater of projected taxable payroll compared to SSA projections by 2031. More important than specific numerical estimates are the advantages of transparency, replicability, reduction of uncertainty, and what may be the resulting lower vulnerability to the politicization of program forecasts. In addition, by offering with this paper software and detailed replication information, we hope to marshal the efforts of the research community to include ever more informative inputs and to continue to reduce the uncertainties in Social Security forecasts.
King, Gary, Richard Nielsen, and Aaron Wells. "Letter to the Editor (on McCall and Cromwell)." New England Journal of Medicine (In Press).
Iacus, Stefano M., Gary King, and Giuseppe Porro. "Causal Inference Without Balance Checking: Coarsened Exact Matching." Political Analysis (2011).

We discuss a method for improving causal inferences called "Coarsened Exact Matching" (CEM), and the new "Monotonic Imbalance Bounding" (MIB) class of matching methods from which CEM is derived. We summarize what is known about CEM and MIB, derive and illustrate several new desirable statistical properties of CEM, and then propose a variety of useful extensions. We show that CEM possesses a wide range of desirable statistical properties not available in most other matching methods, but is at the same time exceptionally easy to comprehend and use. We focus on the connection between theoretical properties and practical applications. We also make available easy-to-use open source software for R and Stata that implements all our suggestions.
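
To illustrate the idea behind CEM, the sketch below coarsens each covariate into bins, groups units that share the same coarsened values, and keeps only the groups that contain both treated and control units. It is a minimal Python sketch under stated assumptions, not the authors' R or Stata software: the function names (`coarsen`, `cem_match`), the data layout, and the cutpoints are hypothetical choices made for this example.

```python
from collections import defaultdict


def coarsen(value, cutpoints):
    """Return the index of the bin that `value` falls into (hypothetical helper)."""
    return sum(value >= c for c in cutpoints)


def cem_match(units, cutpoints):
    """Group units by their coarsened covariates and keep mixed strata.

    units:     list of dicts with keys 'treated' (bool) and 'x' (covariate dict)
    cutpoints: dict mapping covariate name -> sorted list of bin boundaries
    Returns a dict mapping each retained stratum to the indices of its units.
    """
    strata = defaultdict(list)
    for i, unit in enumerate(units):
        key = tuple(
            coarsen(unit["x"][name], cuts)
            for name, cuts in sorted(cutpoints.items())
        )
        strata[key].append(i)

    # Keep only strata that contain at least one treated and one control unit.
    return {
        key: members
        for key, members in strata.items()
        if {units[i]["treated"] for i in members} == {True, False}
    }


# Made-up example data: two covariates, four units.
units = [
    {"treated": True, "x": {"age": 23, "income": 30}},
    {"treated": False, "x": {"age": 27, "income": 35}},
    {"treated": True, "x": {"age": 52, "income": 90}},
    {"treated": False, "x": {"age": 41, "income": 20}},
]
cutpoints = {"age": [30, 50], "income": [50]}

print(cem_match(units, cutpoints))  # {(0, 0): [0, 1]}
```

Units in strata without a counterpart are dropped. A real analysis would also weight the retained units by stratum, which this sketch omits.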


Blackwell, Matthew, James Honaker, and Gary King. Multiple Overimputation: A Unified Approach to Measurement Error and Missing Data. Working Paper, 2011.

Social scientists typically devote considerable effort to reducing measurement error during data collection and then ignore the issue during data analysis. Although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative that generalizes the popular multiple imputation (MI) framework by treating missing data problems as a special case of extreme measurement error and correcting for both. Like MI, the proposed "multiple overimputation" (MO) framework is a simple two-step procedure. First, multiple (≈5) completed copies of the data set are created where cells measured without error are held constant, those missing are imputed from the distribution of predicted values, and cells (or entire variables) with measurement error are "overimputed," that is, imputed from a predictive distribution with observation-level priors defined by the mismeasured values and available external information, if any. In the second step, analysts can then run whatever statistical method they would have run on each of the overimputed data sets as if there had been no missingness or measurement error; the results are then combined via a simple procedure. We also (will) offer open source software that implements all the methods described herein.
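
The abstract does not spell out the combining step. Assuming it follows the standard multiple imputation combining rules (often called Rubin's rules), the second step might be sketched in Python as follows. The function name `combine_estimates` and the example numbers are hypothetical and do not come from the paper or its software.

```python
from statistics import mean, variance


def combine_estimates(estimates, variances):
    """Pool one quantity of interest across m completed data sets.

    estimates: point estimate from each completed data set
    variances: squared standard error from each completed data set
    Returns (pooled estimate, total variance) under the standard MI rules,
    which are assumed here rather than taken from the abstract.
    """
    m = len(estimates)
    pooled = mean(estimates)       # average of the m point estimates
    within = mean(variances)       # average within-data-set variance
    between = variance(estimates)  # sample variance across data sets (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up results from m = 5 completed data sets.
pooled, total = combine_estimates(
    estimates=[0.42, 0.47, 0.39, 0.45, 0.44],
    variances=[0.010, 0.012, 0.011, 0.009, 0.010],
)
print(pooled, total ** 0.5)  # pooled estimate and its standard error
```

The between-data-set term is what carries the extra uncertainty from imputation into the final standard error.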

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. "MatchIt: Nonparametric Preprocessing for Parametric Causal Inference." Journal of Statistical Software 42, no. 8 (2011).
Goldstein, Edward, Benjamin J. Cowling, Allison E. Aiello, Saki Takahashi, Gary King, Ying Lu, and Marc Lipsitch. "Estimating Incidence Curves of Several Infections Using Symptom Surveillance Data." PLoS ONE 6, no. 8 (2011): e23380.

We introduce a method for estimating incidence curves of several co-circulating infectious pathogens, where each infection has its own probabilities of particular symptom profiles. Our deconvolution method utilizes weekly surveillance data on symptoms from a defined population as well as additional data on symptoms from a sample of virologically confirmed infectious episodes. We illustrate this method by numerical simulations and by using data from a survey conducted on the University of Michigan campus. Last, we describe the data needs to make such estimates accurate.


King, Gary, Richard Nielsen, Carter Coberley, James E. Pope, and Aaron Wells. Comparative Effectiveness of Matching Methods for Causal Inference. Working Paper, 2011.

Matching is an increasingly popular method of causal inference in observational data, but following methodological best practices has proven difficult for applied researchers. We address this problem by providing a simple graphical approach for choosing among the numerous possible matching solutions generated by three methods: the venerable "Mahalanobis Distance Matching" (MDM), the commonly used "Propensity Score Matching" (PSM), and a newer approach called "Coarsened Exact Matching" (CEM). In the process of using our approach, we also discover that PSM often approximates random matching, both in many real applications and in data simulated by the processes that fit PSM theory. Moreover, contrary to conventional wisdom, random matching is not benign: it (and thus PSM) can often degrade inferences relative to not matching at all. We find that MDM and CEM do not have this problem, and in practice CEM usually outperforms the other two approaches. However, with our comparative graphical approach and easy-to-follow procedures, focus can be on choosing a matching solution for a particular application, which is what may improve inferences, rather than the particular method used to generate it.

Honaker, James, Gary King, and Matthew Blackwell. "Amelia II: A Program for Missing Data." Journal of Statistical Software 45, no. 7 (2011): 1-47.

Amelia II is a complete R package for multiple imputation of missing data. The package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use, than various Markov chain Monte Carlo approaches, but gives essentially the same answers. The program also improves imputation models by allowing researchers to put Bayesian priors on individual cell values, thereby including a great deal of potentially valuable and extensive information. It also includes features to accurately impute cross-sectional datasets, individual time series, or sets of time series for different cross-sections. A full set of graphical diagnostics is also available. The program is easy to use, and the simplicity of the algorithm makes it far more robust; both a simple command line and extensive graphical user interface are included.


Wand, Jonathan, Gary King, and Olivia Lau. "Anchors: Software for Anchoring Vignettes Data." Journal of Statistical Software 42, no. 3 (2011): 1-25.

When respondents use the ordinal response categories of standard survey questions in different ways, the validity of analyses based on the resulting data can be biased. Anchoring vignettes is a survey design technique intended to correct for some of these problems. The anchors package in R includes methods for evaluating and choosing anchoring vignettes, and for analyzing the resulting data.

Iacus, Stefano M., Gary King, and Giuseppe Porro. "Multivariate Matching Methods That are Monotonic Imbalance Bounding." Journal of the American Statistical Association 106, no. 493 (2011): 345-361.
We introduce a new "Monotonic Imbalance Bounding" (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, "Equal Percent Bias Reducing" (EPBR), which is designed to satisfy weaker properties and only in expectation. We also offer strategies to obtain specific members of the MIB class, and analyze in more detail a member of this class, called Coarsened Exact Matching, whose properties we analyze from this new perspective. We offer a variety of analytical results and numerical simulations that demonstrate how members of the MIB class can dramatically improve inferences relative to EPBR-based matching methods.
King, Gary, and Samir Soneji. "The Future of Death in America." Demographic Research 25, no. 1 (2011): 1-38.

Population mortality forecasts are widely used for allocating public health expenditures, setting research priorities, and evaluating the viability of public pensions, private pensions, and health care financing systems. In part because existing methods seem to forecast worse when based on more information, most forecasts are still based on simple linear extrapolations that ignore known biological risk factors and other prior information. We adapt a Bayesian hierarchical forecasting model capable of including more known health and demographic information than has previously been possible. This leads to the first age- and sex-specific forecasts of American mortality that simultaneously incorporate, in a formal statistical model, the effects of the recent rapid increase in obesity, the steady decline in tobacco consumption, and the well known patterns of smooth mortality age profiles and time trends. Formally including new information in forecasts can matter a great deal. For example, we estimate an increase in male life expectancy at birth from 76.2 years in 2010 to 79.9 years in 2030, which is 1.8 years greater than the U.S. Social Security Administration projection and 1.5 years more than the U.S. Census projection. For females, we estimate more modest gains in life expectancy at birth over the next twenty years from 80.5 years to 81.9 years, which is virtually identical to the Social Security Administration projection and 2.0 years less than the U.S. Census projection. We show that these patterns are also likely to greatly affect the aging American population structure. We offer an easy-to-use approach so that researchers can include other sources of information and potentially improve on our forecasts too.

King, Gary, Kay Schlozman, and Norman Nie. The Future of Political Science: 100 Perspectives. New York: Routledge Press, 2009.
Girosi, Federico, and Gary King. Demographic Forecasting. Princeton: Princeton University Press, 2008.

We introduce a new framework for forecasting age-sex-country-cause-specific mortality rates that incorporates considerably more information, and thus has the potential to forecast much better, than any existing approach. Mortality forecasts are used in a wide variety of academic fields, and for global and national health policy making, medical and pharmaceutical research, and social security and retirement planning.

As it turns out, the tools we developed in pursuit of this goal also have broader statistical implications, in addition to their use for forecasting mortality or other variables with similar statistical properties. First, our methods make it possible to include different explanatory variables in a time series regression for each cross-section, while still borrowing strength from one regression to improve the estimation of all. Second, we show that many existing Bayesian (hierarchical and spatial) models with explanatory variables use prior densities that incorrectly formalize prior knowledge. Many demographers and public health researchers have fortuitously avoided this problem so prevalent in other fields by using prior knowledge only as an ex post check on empirical results, but this approach excludes considerable information from their models. We show how to incorporate this demographic knowledge into a model in a statistically appropriate way. Finally, we develop a set of tools useful for developing models with Bayesian priors in the presence of partial prior ignorance. This approach also provides many of the attractive features claimed by the empirical Bayes approach, but fully within the standard Bayesian theory of inference.

King, Gary, Ori Rosen, and Martin A. Tanner, eds. Ecological Inference: New Methodological Strategies. New York: Cambridge University Press, 2004.
Ecological Inference: New Methodological Strategies brings together a diverse group of scholars to survey the latest strategies for solving ecological inference problems in various fields. The last half decade has witnessed an explosion of research in ecological inference – the attempt to infer individual behavior from aggregate data. The uncertainties and the information lost in aggregation make ecological inference one of the most difficult areas of statistical inference, but such inferences are required in many academic fields, as well as by legislatures and the courts in redistricting, by businesses in marketing research, and by governments in policy analysis.
Brace, Paul, Christine Harrington, and Gary King. The Presidency in American Politics. New York and London: New York University Press, 1989.
King, Gary, and Lyn Ragsdale. The Elusive Executive: Discovering Statistical Patterns in the Presidency. Washington, D.C.: Congressional Quarterly Press, 1988.
Bischof, Jonathan, Gary King, and Samir Soneji. AutoCast: Automated Bayesian Forecasting with YourCast. 2011.
Gelman, Andrew, Gary King, and Andrew Thomas. JudgeIt II: A Program for Evaluating Electoral Systems and Redistricting Plans. 2010.
A program for analyzing almost any feature of district-level legislative elections data, including prediction, evaluating redistricting plans, and estimating counterfactual hypotheses (such as what would happen if a term-limitation amendment were imposed). It implements statistical procedures described in a series of journal articles and has been used during redistricting in many states by judges, partisans, governments, private citizens, and many others. The earlier version won the APSA Research Software Award.
King, Gary, Matthew Knowles, and Steven Melendez. ReadMe: Software for Automated Content Analysis. 2010.
This program will read and analyze a large set of text documents and report on the proportion of documents in each of a set of given categories.
Honaker, James, Gary King, and Matthew Blackwell. AMELIA II: A Program for Missing Data. 2009.
This program multiply imputes missing data in cross-sectional, time series, and time series cross-sectional data sets. It includes a Windows version (no knowledge of R required), and a version that works with R either from the command line or via a GUI.
Iacus, Stefano, Gary King, and Giuseppe Porro. CEM: Coarsened Exact Matching Software. 2009.
King, Gary, and Ying Lu. VA: Verbal Autopsies. 2008.
Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. 2007.
Wand, Jonathan, Gary King, and Olivia Lau. Anchors: Software for Anchoring Vignettes Data. 2007.
Imai, Kosuke, Gary King, and Olivia Lau. Zelig: Everyone's Statistical Software. 2006.
Stoll, Heather, Gary King, and Langche Zeng. WhatIf: Software for Evaluating Counterfactuals. 2005.
Girosi, Federico, and Gary King. YourCast. 2004.
YourCast is (open source and free) software that makes forecasts by running sets of linear regressions together in a variety of sophisticated ways. YourCast avoids the bias that results when stacking datasets from separate cross-sections and assuming constant parameters, and the inefficiency that results from running independent regressions in each cross-section.
King, Gary, Michael Tomz, and Langche Zeng. ReLogit: Rare Events Logistic Regression. 2003.
King, Gary, and Kenneth Benoit. EzI: A(n Easy) Program for Ecological Inference. 2003.
Tomz, Michael, Jason Wittenberg, and Gary King. "CLARIFY: Software for Interpreting and Presenting Statistical Results." Journal of Statistical Software 8 (2003).
This is a set of easy-to-use Stata macros that implement the techniques described in Gary King, Michael Tomz, and Jason Wittenberg's "Making the Most of Statistical Analyses: Improving Interpretation and Presentation". To install Clarify, type "net from http://gking.harvard.edu/clarify" at the Stata command line. The documentation explains how to do this. We also provide a zip archive for users who want to install Clarify on a computer that is not connected to the internet. Winner of the Okidata Best Research Software Award. Also try -ssc install qsim- to install a wrapper, donated by Fred Wolfe, to automate Clarify's simulation of dummy variables.
An Overview of the Institute for Quantitative Social Science, Harvard University, Social Science Council, 12/15/2011
The Social Science Data Revolution, People, Power, & CyberPolitics Workshop, MIT, 12/8/2011
Matching Methods for Causal Inference, University of Kansas, 12/2/2011
Computer-Assisted Conceptualization, Harvard Law School, 11/17/2011
Matching Methods for Causal Inference, University of Rochester, 11/4/2011
Topics in Measurement for the Social and Health Sciences, "Foundations in Global Health" class, Harvard School of Public Health, 9/16/2011
Computer-Assisted Clustering and Conceptualization from Unstructured Text, Talk at University of Chicago's Computation Institute, 5/9/2011
Computer-Assisted Conceptualization, Talk at Harvard Graduate School of Arts and Sciences Alumni Day, 4/2/2011
The Social Science Data Revolution, Horizons in Political Science talk, Government Department, Harvard University, 3/30/2011
Computer-Assisted Clustering and Conceptualization from Unstructured Text, Machine Learning/Google Distinguished Lecture, Carnegie Mellon University, 3/17/2011
Computer-Assisted Clustering and Conceptualization from Unstructured Text, Center for Research on Computation and Society, School of Engineering and Applied Sciences, Harvard University, 3/7/2011
Computer-Assisted Clustering and Conceptualization, Parthemos Lecture, University of Georgia, 3/4/2011
