# Working Paper

Iacus, Stefano M, Gary King, and Giuseppe Porro. 2015. “A Theory of Statistical Inference for Matching Methods in Applied Causal Research”.

Matching methods for causal inference have become a popular way of reducing model dependence and bias, in large part because of their convenience and conceptual simplicity. Researchers most commonly use matching as a data preprocessing step, after which they apply whatever statistical model and uncertainty estimators they would have without matching. Unfortunately, for a given sample of any finite size, this approach is theoretically appropriate only under exact matching, which is usually infeasible; approximate matching can be justified under asymptotic theory, if large enough sample sizes are available, but then specialized point and variance estimators are required, which sacrifices some of matching's simplicity and convenience. Researchers also violate statistical theory with ad hoc iterations between formal matching methods and informal balance checks. Instead of asking researchers to change their widely used practices, we develop a comprehensive theory of statistical inference able to justify them. The theory we propose is substantively plausible, requires no asymptotic theory, and is simple to understand. Its core conceptualizes continuous variables as having natural breakpoints, which are common in applications (e.g., high school or college degrees in years of education, a governmental poverty level in income, or phase transitions in temperature). The theory allows binary, multicategory, and continuous treatment variables from the outset and straightforward extensions for imperfect treatment assignment and different versions of treatments. Although this theory provides a valid foundation for most commonly used methods of matching, researchers must still satisfy the assumptions in any real application.

King, Gary, Christopher Lucas, and Richard Nielsen. 2015. “The Balance-Sample Size Frontier in Matching Methods for Causal Inference”.

We propose a simplified approach to matching for causal inference that simultaneously optimizes both balance (between the treated and control groups) and matched sample size. This procedure resolves two widespread tensions in the use of this popular methodology. First, current practice is to run a matching method that maximizes one balance metric (such as a propensity score or average Mahalanobis distance), but then to check whether it succeeds with respect to a different balance metric for which it was not designed (such as differences in means or L1). Second, current matching methods either fix the sample size and maximize balance (e.g., Mahalanobis or propensity score matching), fix balance and maximize the sample size (such as coarsened exact matching), or are arbitrary compromises between the two (such as calipers with ad hoc thresholds applied to other methods). These tensions lead researchers to either try to optimize manually, by iteratively tweaking their matching method and rechecking balance, or settle for suboptimal solutions. We address these tensions by first defining and showing how to calculate the *matching frontier* as the set of matching solutions with maximum balance for each possible sample size. Researchers can then choose one, several, or all matching solutions from the frontier for analysis in one step without iteration. The main difficulty in this strategy is that checking all possible solutions is exponentially difficult. We solve this problem with new algorithms that finish fast, optimally, and without iteration or manual tweaking. We also offer easy-to-use software that implements these ideas, along with analyses of the effect of sex on judging and job training programs that show how the methods we introduce enable us to extract new knowledge from existing data sets.

King, Gary, and Richard Nielsen. 2015. “Why Propensity Scores Should Not Be Used for Matching”.

Researchers use propensity score matching (PSM) as a data preprocessing step to selectively prune observations prior to applying a model to estimate a causal effect. The goal of PSM is to reduce imbalance in the chosen pre-treatment covariates between the treatment and control groups, thereby reducing the degree of model dependence and potential for bias. We show here that PSM often accomplishes the opposite of what is intended: increasing imbalance, model dependence, and bias. The weakness of PSM is that it approximates a completely randomized experiment, rather than, as with other matching methods, a more powerful fully blocked randomized experiment. PSM is therefore blind to the often large portion of imbalance that would have been eliminated by approximating full blocking. Moreover, in data balanced enough to approximate complete randomization, either to begin with or after pruning some observations, PSM approximates random matching, which turns out to increase imbalance. For other matching methods, the point where additional pruning increases imbalance occurs much later in the pruning process, when full blocking is approximated, and so the danger of increasing model dependence and bias is considerably less. We show that these problems occur even in data designed for PSM and with as few as two covariates, and they are exacerbated in data with better balance, higher dimensionality, and (in our experience) real applications. Although these results suggest that propensity scores not be used for matching, propensity scores have many other productive uses.

Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “Google Flu Trends Still Appears Sick: an Evaluation of the 2013‐2014 Flu Season”.

Last year was difficult for Google Flu Trends (GFT). In early 2013, Nature reported that GFT was estimating more than double the percentage of doctor visits for influenza-like illness than the Centers for Disease Control and Prevention's (CDC) sentinel reports during the 2012-2013 flu season (1). Given that GFT was designed to forecast upcoming CDC reports, this was a problematic finding. In March 2014, our report in Science found that the overestimation problem in GFT was also present in the 2011-2012 flu season (2). The report also found strong evidence of autocorrelation and seasonality in the GFT errors, and presented evidence that the issues were likely, at least in part, due to modifications made by Google's search algorithm and the decision by GFT engineers not to use previous CDC reports or seasonality estimates in their models, which the article labeled “algorithm dynamics” and “big data hubris,” respectively. Moreover, the report and the supporting online materials detailed how difficult/impossible it is to replicate the GFT results, undermining independent efforts to explore the source of GFT errors and formulate improvements.

King, Gary, Patrick Lam, and Margaret Roberts. 2014. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text”.

The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend crucially on this choice, researchers typically pick keywords in ad hoc ways, given the lack of formal statistical methods to help. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depends in practice on the inadequate keyword counting or matching methods they are designed to replace. The same ad hoc keyword selection process is also used in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text, without needing any structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and then summarizing results with Boolean search strings. We illustrate how the technique works with examples in English and Chinese.

King, Gary, Richard Nielsen, Carter Coberley, James E Pope, and Aaron Wells. 2011. “Comparative Effectiveness of Matching Methods for Causal Inference”.

Matching is an increasingly popular method of causal inference in observational data, but following methodological best practices has proven difficult for applied researchers. We address this problem by providing a simple graphical approach for choosing among the numerous possible matching solutions generated by three methods: the venerable “Mahalanobis Distance Matching” (MDM), the commonly used “Propensity Score Matching” (PSM), and a newer approach called “Coarsened Exact Matching” (CEM). In the process of using our approach, we also discover that PSM often approximates random matching, both in many real applications and in data simulated by the processes that fit PSM theory. Moreover, contrary to conventional wisdom, random matching is not benign: it (and thus PSM) can often degrade inferences relative to not matching at all. We find that MDM and CEM do not have this problem, and in practice CEM usually outperforms the other two approaches. However, with our comparative graphical approach and easy-to-follow procedures, focus can be on choosing a matching solution for a particular application, which is what may improve inferences, rather than the particular method used to generate it.

King, Gary, and Eleanor Neff Powell. 2008. “How Not to Lie Without Statistics”.

We highlight, and suggest ways to avoid, a large number of common misunderstandings in the literature about best practices in qualitative research. We discuss these issues in four areas: theory and data, qualitative and quantitative strategies, causation and explanation, and selection bias. Some of the misunderstandings involve incendiary debates within our discipline that are readily resolved either directly or with results known in research areas that happen to be unknown to political scientists. Many of these misunderstandings can also be found in quantitative research, often with different names, and some of which can be fixed with reference to ideas better understood in the qualitative methods literature. Our goal is to improve the ability of quantitatively and qualitatively oriented scholars to enjoy the advantages of insights from both areas. Thus, throughout, we attempt to construct specific practical guidelines that can be used to improve actual qualitative research designs, not only the qualitative methods literatures that talk about them.