Working Paper
Do Nonpartisan Programmatic Policies Have Partisan Electoral Effects? Evidence from Two Large Scale Randomized Experiments

A vast literature demonstrates that voters around the world who benefit from their governments' discretionary spending cast ballots for the incumbent party in larger proportions than those not receiving funds. But contrary to most theories of political accountability, the evidence seems to indicate that voters also reward incumbent parties for implementing “programmatic” spending legislation, over which incumbents have no discretion, and even when passed with support from all major parties. Why voters would attribute responsibility when none exists is unclear, as is why minority party legislators would approve of legislation that will cost them votes. We address this puzzle with one of the largest randomized social experiments ever, resulting in clear rejection of the claim, at least in this context, that programmatic policies greatly increase voter support for incumbents. We also reanalyze the study cited as claiming the strongest support for the electoral effects of programmatic policies, which is also a very large scale randomized experiment. We show that its key results vanish after correcting either a simple coding error affecting only two observations or highly unconventional data analysis procedures (or both). We discuss how these consistent empirical results from the only two probative experiments on this question may be reconciled with several observational and theoretical studies touching on similar questions in other contexts.

How Human Subjects Research Rules Mislead You and Your University, and What to Do About it

Universities require faculty and students planning research involving human subjects to pass formal certification tests and then submit research plans for prior approval. Those who diligently take the tests may better understand certain important legal requirements but, at the same time, are often misled into thinking they can apply these rules to their own work which, in fact, they are not permitted to do. They will also be missing many other legal requirements not mentioned in their training but which govern their behaviors. Finally, the training leaves them likely to completely misunderstand the essentially political situation they find themselves in. The resulting risks to their universities, collaborators, and careers may be catastrophic, in addition to contributing to the more common ordinary frustrations of researchers with the system. To avoid these problems, faculty and students conducting research about and for the public need to understand that they are public figures, to whom different rules apply, ones that political scientists have long studied. University administrators (and faculty in their part-time roles as administrators) need to reorient their perspectives as well. University research compliance bureaucracies have grown in well-meaning but sometimes unproductive ways that are not required by federal laws or guidelines. We offer advice to faculty and students for how to deal with the system as it exists now, and suggestions for changes in university research compliance bureaucracies that should benefit faculty, students, staff, university budgets, and our research subjects.

PSI (Ψ): a Private data Sharing Interface
Marco Gaboardi, James Honaker, Gary King, Kobbi Nissim, Jonathan Ullman, and Salil Vadhan. Working Paper. “PSI (Ψ): a Private data Sharing Interface”.

We provide an overview of PSI ("a Private data Sharing Interface"), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.
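
For readers unfamiliar with differential privacy, a minimal sketch of the textbook Laplace mechanism for releasing a bounded mean may help. It is not taken from PSI and makes no claim about how PSI is implemented; the function names, the clipping bounds, and the assumption that the number of records is public are all illustrative choices.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release an epsilon-differentially private mean of bounded values.

    Assumptions (illustrative, not from the paper): the record count is
    public, and every value is clipped to [lower, upper] before averaging.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    # Replacing one record can move the clipped mean by at most this much.
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23.0, 35.0, 41.0, 52.0, 67.0]
    print(private_mean(ages, lower=0.0, upper=100.0, epsilon=1.0, rng=rng))
```

Smaller values of epsilon add more noise and give stronger privacy; production systems also need to guard against floating-point and budget-accounting pitfalls that this sketch ignores.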

Why Propensity Scores Should Not Be Used for Matching
Gary King and Richard Nielsen. Working Paper. “Why Propensity Scores Should Not Be Used for Matching”.

We show that propensity score matching (PSM), an enormously popular method of preprocessing data for causal inference, often accomplishes the opposite of its intended goal -- increasing imbalance, inefficiency, model dependence, and bias. PSM supposedly makes it easier to find matches by projecting a large number of covariates to a scalar propensity score and applying a single model to produce an unbiased estimate. However, in observational analysis the data generation process is rarely known and so users typically try many models before choosing one to present. The weakness of PSM comes from its attempts to approximate a completely randomized experiment, rather than, as with other matching methods, a more efficient fully blocked randomized experiment. PSM is thus uniquely blind to the often large portion of imbalance that can be eliminated by approximating full blocking with other matching methods. Moreover, in data balanced enough to approximate complete randomization, either to begin with or after pruning some observations, PSM approximates random matching which, we show, increases imbalance even relative to the original data. Although these results suggest that researchers replace PSM with one of the other available methods when performing matching, propensity scores have many other productive uses.

In Press
The Balance-Sample Size Frontier in Matching Methods for Causal Inference
Gary King, Christopher Lucas, and Richard Nielsen. In Press. “The Balance-Sample Size Frontier in Matching Methods for Causal Inference.” American Journal of Political Science.

We propose a simplified approach to matching for causal inference that simultaneously optimizes balance (similarity between the treated and control groups) and matched sample size. Existing approaches either fix the matched sample size and maximize balance or fix balance and maximize sample size, leaving analysts to settle for suboptimal solutions or attempt manual optimization by iteratively tweaking their matching method and rechecking balance. To jointly maximize balance and sample size, we introduce the matching frontier, the set of matching solutions with maximum possible balance for each sample size. Rather than iterating, researchers can choose matching solutions from the frontier for analysis in one step. We derive fast algorithms that calculate the matching frontier for several commonly used balance metrics. We demonstrate with analyses of the effect of sex on judging and job training programs that show how the methods we introduce can extract new knowledge from existing data sets.
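
To make the idea of a frontier concrete, here is a toy sketch under strong simplifying assumptions: a single covariate, controls reused freely (matching with replacement), and imbalance defined as the average distance from each retained treated unit to its nearest control. Under that particular metric each treated unit's contribution does not depend on which other units are kept, so the best subset of every size is simply the units with the smallest distances. The function name and the metric are illustrative and are not the algorithms derived in the paper.

```python
def matching_frontier(treated, controls):
    """Return [(n_kept, imbalance)] giving the lowest imbalance for each n.

    Imbalance here is the mean absolute distance from each retained treated
    unit to its nearest control (a hypothetical, one-covariate metric).
    """
    # Nearest-control distance for every treated unit, best matches first.
    nearest = sorted(min(abs(t - c) for c in controls) for t in treated)
    frontier = []
    running_total = 0.0
    for n_kept, distance in enumerate(nearest, start=1):
        running_total += distance
        frontier.append((n_kept, running_total / n_kept))
    return frontier


if __name__ == "__main__":
    for n_kept, imbalance in matching_frontier([1.0, 2.0, 10.0], [1.5, 2.0]):
        print(n_kept, round(imbalance, 3))
```

Reading the result from the largest sample size downward shows how much balance is bought by each pruned observation, which is the trade-off the frontier is meant to expose.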

Easy-to-use, open source software is available to implement all methods in the paper.

Computer-Assisted Keyword and Document Set Discovery from Unstructured Text
Gary King, Patrick Lam, and Margaret Roberts. In Press. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science.

The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depends in practice on the inadequate keyword counting or matching methods they are designed to replace. Improved methods of keyword selection would also be valuable in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings and Chinese social media posts designed to evade censorship, among others.

booc.io: An Education System with Hierarchical Concept Maps
Michail Schwab, Hendrik Strobelt, James Tompkin, Colin Fredericks, Connor Huff, Dana Higgins, Anton Strezhnev, Mayya Komisarchik, Gary King, and Hanspeter Pfister. Forthcoming. “booc.io: An Education System with Hierarchical Concept Maps.” IEEE Transactions on Visualization and Computer Graphics.

Information hierarchies are difficult to express when real-world space or time constraints force traversing the hierarchy in linear presentations, such as in educational books and classroom courses. We present booc.io, which allows linear and non-linear presentation and navigation of educational concepts and material. To support a breadth of material for each concept, booc.io is Web based, which allows adding material such as lecture slides, book chapters, videos, and LTIs. A visual interface assists the creation of the needed hierarchical structures. The goals of our system were formed in expert interviews, and we explain how our design meets these goals. We adapt a real-world course into booc.io, and perform introductory qualitative evaluation with students.

How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, not Engaged Argument
Gary King, Jennifer Pan, and Margaret E. Roberts. Forthcoming. “How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, not Engaged Argument.” American Political Science Review.

The Chinese government has long been suspected of hiring as many as 2,000,000 people to surreptitiously insert huge numbers of pseudonymous and other deceptive writings into the stream of real social media posts, as if they were the genuine opinions of ordinary people. Many academics, and most journalists and activists, claim that these so-called “50c party” posts vociferously argue for the government's side in political and policy debates. As we show, this is also true of the vast majority of posts openly accused on social media of being 50c. Yet, almost no systematic empirical evidence exists for this claim, or, more importantly, for the Chinese regime's strategic objective in pursuing this activity. In the first large scale empirical analysis of this operation, we show how to identify the secretive authors of these posts, the posts written by them, and their content. We estimate that the government fabricates and posts about 448 million social media comments a year. In contrast to prior claims, we show that the Chinese regime's strategy is to avoid arguing with skeptics of the party and the government, and to not even discuss controversial issues. We show that the goal of this massive secretive operation is instead to distract the public and change the subject, as most of these posts involve cheerleading for China, the revolutionary history of the Communist Party, or other symbols of the regime. We discuss how these results fit with what is known about the Chinese censorship program, and suggest how they may change our broader theoretical understanding of “common knowledge” and information control in authoritarian regimes.

This paper follows up on our articles in Science, “Reverse-Engineering Censorship In China: Randomized Experimentation And Participant Observation”, and the American Political Science Review, “How Censorship In China Allows Government Criticism But Silences Collective Expression”.

Method and Apparatus for Selecting Clusterings to Classify a Data Set
Gary King and Justin Grimmer. 12/13/2016. “Method and Apparatus for Selecting Clusterings to Classify a Data Set.” United States of America 9,519,705 B2 (Patent and Trademark Office).

In a computer assisted clustering method, a clustering space is generated from fixed basis partitions that embed the entire space of all possible clusterings. A lower dimensional clustering space is created from the space of all possible clusterings by isometrically embedding the space of all possible clusterings in a lower dimensional Euclidean space. This lower dimensional space is then sampled based on the number of documents in the corpus. Partitions are then developed based on the samples that tessellate the space. Finally, using clusterings representative of these tessellations, a two-dimensional representation for users to explore is created.

Cross-Classroom and Cross-Institution Item Validation
Gary King, Brian Lukoff, and Eric Mazur. 11/29/2016. “Cross-Classroom and Cross-Institution Item Validation.” United States of America 9,508,266 (US Patent and Trademark Office).

Anonymous pretesting items for subsequent presentation to participants in a group enable an instructor to validate responses and revise the items accordingly. ... The present invention facilitates anonymous pretesting of items in classrooms (and/or other similar settings) to which the item author has no direct access or knowledge. In some embodiments, pretesting is performed by software used by the instructor/author in his or her own classroom for other tasks. In various implementations, the software shares information with a central clearinghouse anonymously. The central clearinghouse then automatically matches students in the instructor's class with "relevant" students from other classes -- e.g., students that a statistical algorithm predicts will have approximately the same understanding, and will give approximately the same answers, as the instructor's class. ...

Systems and methods for calculating category proportions
Aykut Firat, Mitchell Brooks, Christopher Bingham, Amac Herdagdelen, and Gary King. 11/1/2016. “Systems and methods for calculating category proportions.” United States of America 9,483,544 (U.S. Patent and Trademark Office).

Systems and methods are provided for classifying text based on language using one or more computer servers and storage devices. A computer-implemented method includes receiving a training set of elements, each element in the training set being assigned to one of a plurality of categories and having one of a plurality of content profiles associated therewith; receiving a population set of elements, each element in the population set having one of the plurality of content profiles associated therewith; and calculating using at least one of a stacked regression algorithm, a bias formula algorithm, a noise elimination algorithm, and an ensemble method consisting of a plurality of algorithmic methods the results of which are averaged, based on the content profiles associated with and the categories assigned to elements in the training set and the content profiles associated with the elements of the population set, a distribution of elements of the population set over the categories.

Comment on 'Estimating the Reproducibility of Psychological Science'
Daniel Gilbert, Gary King, Stephen Pettigrew, and Timothy Wilson. 2016. “Comment on 'Estimating the Reproducibility of Psychological Science'.” Science, 6277, 351: 1037a-1038a.

A recent article by the Open Science Collaboration (a group of 270 coauthors) gained considerable academic and public attention due to its sensational conclusion that the replicability of psychological science is surprisingly low. Science magazine lauded this article as one of the top 10 scientific breakthroughs of the year across all fields of science, reports of which appeared on the front pages of newspapers worldwide. We show that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%. (Of course, that doesn't mean that the replicability is 100%, only that the evidence is insufficient to reliably estimate replicability.) The moral of the story is that meta-science must follow the rules of science.

Replication data is available in this dataverse archive. See also the full web site for this article and related materials, and one of the news articles written about it.

The C-SPAN Archives as The Policymaking Record of American Representative Democracy: A Foreword
Gary King. 2016. “The C-SPAN Archives as The Policymaking Record of American Representative Democracy: A Foreword.” In Exploring the C-SPAN Archives: Advancing the Research Agenda, edited by Robert X Browning. West Lafayette, IN: Purdue University Press.

Almost two centuries ago, the idea of research libraries, and the possibility of building them at scale, began to be realized. Although we can find these libraries at every major college and university in the world today, and at many noneducational research institutions, this outcome was by no means obvious at the time. And the benefits we all now enjoy from their existence were then at best merely vague speculations.

How many would have supported the formation of these institutions at the time, without knowing the benefits that have since become obvious? After all, the arguments against this massive ongoing expenditure are impressive. The proposal was to construct large buildings, hire staff, purchase all manner of books and other publications and catalogue and shelve them, provide access to visitors, and continually reorder all the books that the visitors disorder. And the libraries would keep the books, and fund the whole operation, in perpetuity. Publications would be collected without anyone deciding which were of high quality and thus deserving of preservation—leading critics to argue that all this effort would result in expensive buildings packed mostly with junk.  . . .

Effectiveness of the WHO Safe Childbirth Checklist Program in Reducing Severe Maternal, Fetal, and Newborn Harm: Study Protocol for a Matched-Pair, Cluster Randomized Controlled Trial in Uttar Pradesh, India
Katherine Semrau, Lisa R. Hirschhorn, Bhala Kodkany, Jonathan Spector, Danielle E. Tuller, Gary King, Stuart Lipsitz, Narender Sharma, Vinay P. Singh, Bharath Kumar, Neelam Dhingra-Kumar, Rebecca Firestone, Vishwajeet Kumar, and Atul Gawande. 2016. “Effectiveness of the WHO Safe Childbirth Checklist Program in Reducing Severe Maternal, Fetal, and Newborn Harm: Study Protocol for a Matched-Pair, Cluster Randomized Controlled Trial in Uttar Pradesh, India.” Trials, 17, 576: 1-10.

Background: Effective, scalable strategies to improve maternal, fetal, and newborn health and reduce preventable morbidity and mortality are urgently needed in low- and middle-income countries. Building on the successes of previous checklist-based programs, the World Health Organization (WHO) and partners led the development of the Safe Childbirth Checklist (SCC), a 28-item list of evidence-based practices linked with improved maternal and newborn outcomes. Pilot-testing of the Checklist in Southern India demonstrated dramatic improvements in adherence by health workers to essential childbirth-related practices (EBPs). The BetterBirth Trial seeks to measure the effectiveness of SCC impact on EBPs, deaths, and complications at a larger scale.

Methods: This matched-pair, cluster-randomized controlled, adaptive trial will be conducted in 120 facilities across 24 districts in Uttar Pradesh, India. Study sites, identified according to predefined eligibility criteria, were matched by measured covariates before randomization. The intervention, the SCC embedded in a quality improvement program, consists of leadership engagement, a 2-day educational launch of the SCC, and support through placement of a trained peer “coach” to provide supportive supervision and real-time data feedback over an 8-month period with decreasing intensity. A facility-based childbirth quality coordinator is trained and supported to drive sustained behavior change after the BetterBirth team leaves the facility. Study participants are birth attendants and women and their newborns who present to the study facilities for childbirth at 60 intervention and 60 control sites. The primary outcome is a composite measure including maternal death, maternal severe morbidity, stillbirth, and newborn death, occurring within 7 days after birth. The sample size (n = 171,964) was calculated to detect a 15% reduction in the primary outcome. Adherence by health workers to EBPs will be measured in a subset of births (n = 6000). The trial will be conducted in close collaboration with key partners including the Governments of India and Uttar Pradesh, the World Health Organization, an expert Scientific Advisory Committee, an experienced local implementing organization (Population Services International, PSI), and frontline facility leaders and workers.

Discussion: If effective, the WHO Safe Childbirth Checklist program could be a powerful health facility-strengthening intervention to improve quality of care and reduce preventable harm to women and newborns, with millions of potential beneficiaries.

Trial registration: BetterBirth Study Protocol dated: 13 February 2014; ClinicalTrials.gov: NCT02148952; Universal Trial Number: U1111-1131-5647. 

Preface: Big Data is Not About the Data!
Gary King. 2016. “Preface: Big Data is Not About the Data!.” In Computational Social Science: Discovery and Prediction, edited by R. Michael Alvarez. Cambridge: Cambridge University Press.

A few years ago, explaining what you did for a living to Dad, Aunt Rose, or your friend from high school was pretty complicated. Answering that you develop statistical estimators, work on numerical optimization, or, even better, are working on a great new Markov Chain Monte Carlo implementation of a Bayesian model with heteroskedastic errors for automated text analysis is pretty much the definition of conversation stopper.

Then the media noticed the revolution we’re all a part of, and they glued a label to it. Now “Big Data” is what you and I do. As trivial as this change sounds, we should be grateful for it, as the name seems to resonate with the public and so it helps convey the importance of our field to others better than we had managed to do ourselves. Yet, now that we have everyone’s attention, we need to start clarifying for others -- and ourselves -- what the revolution means. This is much of what this book is about.

Throughout, we need to remember that for the most part, Big Data is not about the data....

Scoring Social Security Proposals: Response from Kashin, King, and Soneji
Konstantin Kashin, Gary King, and Samir Soneji. 2016. “Scoring Social Security Proposals: Response from Kashin, King, and Soneji.” Journal of Economic Perspectives, 2, 30: 245-248.

This is a response to Peter Diamond's two-paragraph comment on a passage in our article, Konstantin Kashin, Gary King, and Samir Soneji. 2015. “Systematic Bias and Nontransparency in US Social Security Administration Forecasts.” Journal of Economic Perspectives, 2, 29: 239-258.

Urban observatories: City data can inform decision theory
Aristides A. N. Patrinos, Hannah Bayer, Paul W. Glimcher, Steven Koonin, Miyoung Chun, and Gary King. 3/19/2015. “Urban observatories: City data can inform decision theory.” Nature, 519: 291.

Data are being collected on human behaviour in cities such as London, New York, Singapore and Shanghai, with a view to meeting city dwellers' needs more effectively. Incorporating decision-making theory into analyses of the data from these 'urban observatories' would yield further valuable information.

Automating Open Science for Big Data
Merce Crosas, James Honaker, Gary King, and Latanya Sweeney. 2015. “Automating Open Science for Big Data.” ANNALS of the American Academy of Political and Social Science, 1, 659: 260-273.

The vast majority of social science research presently uses small (MB or GB scale) data sets. These fixed-scale data sets are commonly downloaded to the researcher's computer where the analysis is performed locally, and are often shared and cited with well-established technologies, such as the Dataverse Project (see Dataverse.org), to support the published results. The trend towards Big Data -- including large scale streaming data -- is starting to transform research and has the potential to impact policy-making and our understanding of the social, economic, and political problems that affect human societies. However, this research poses new challenges in execution, accountability, preservation, reuse, and reproducibility. Downloading these data sets to a researcher’s computer is infeasible or impractical; hence, analyses take place in the cloud, require unusual expertise, and benefit from collaborative teamwork and novel tool development. The same informativeness that makes these data sets valuable also means that they are much more likely to contain highly sensitive personally identifiable information. In this paper, we discuss solutions to these new challenges so that the social sciences can realize the potential of Big Data.

Explaining Systematic Bias and Nontransparency in US Social Security Administration Forecasts
Konstantin Kashin, Gary King, and Samir Soneji. 2015. “Explaining Systematic Bias and Nontransparency in US Social Security Administration Forecasts.” Political Analysis, 3, 23: 336-362.

The accuracy of U.S. Social Security Administration (SSA) demographic and financial forecasts is crucial for the solvency of its Trust Funds, other government programs, industry decision making, and the evidence base of many scholarly articles. Because SSA makes public little replication information and uses qualitative and antiquated statistical forecasting methods, fully independent alternative forecasts (and the ability to score policy proposals to change the system) are nonexistent. Yet, no systematic evaluation of SSA forecasts has ever been published by SSA or anyone else --- until a companion paper to this one (King, Kashin, and Soneji, 2015a). We show that SSA's forecasting errors were approximately unbiased until about 2000, but then began to grow quickly, with increasingly overconfident uncertainty intervals. Moreover, the errors are all in the same potentially dangerous direction, making the Social Security Trust Funds look healthier than they actually are. We extend and then attempt to explain these findings with evidence from a large number of interviews we conducted with participants at every level of the forecasting and policy processes. We show that SSA's forecasting procedures meet all the conditions the modern social-psychology and statistical literatures demonstrate make bias likely. When those conditions mixed with potent new political forces trying to change Social Security, SSA's actuaries hunkered down trying hard to insulate their forecasts from strong political pressures. Unfortunately, this otherwise laudable resistance to undue influence, along with their ad hoc qualitative forecasting models, led the actuaries to miss important changes in the input data. Retirees began living longer lives and drawing benefits longer than predicted by simple extrapolations. 
We also show that the solution to this problem involves SSA or Congress implementing in government two of the central projects of political science over the last quarter century: [1] promoting transparency in data and methods and [2] replacing with formal statistical models large numbers of qualitative decisions too complex for unaided humans to make optimally.

How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It
Gary King and Margaret E Roberts. 2015. “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It.” Political Analysis, 2, 23: 159–179.

"Robust standard errors" are used in a vast array of scholarship to correct standard errors for model misspecification. However, when misspecification is bad enough to make classical and robust standard errors diverge, assuming that it is nevertheless not so bad as to bias everything else requires considerable optimism. And even if the optimism is warranted, settling for a misspecified model, with or without robust standard errors, will still bias estimators of all but a few quantities of interest. The resulting cavernous gap between theory and practice suggests that considerable gains in applied statistics may be possible. We seek to help researchers realize these gains via a more productive way to understand and use robust standard errors; a new general and easier-to-use "generalized information matrix test" statistic that can formally assess misspecification (based on differences between robust and classical variance estimates); and practical illustrations via simulations and real examples from published research. How robust standard errors are used needs to change, but instead of jettisoning this popular tool we show how to use it to provide effective clues about model misspecification, likely biases, and a guide to considerably more reliable, and defensible, inferences. Accompanying this article [soon!] is software that implements the methods we describe.
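
The comparison the abstract treats as diagnostic, classical versus robust standard errors, can be sketched for the slope of a one-predictor least squares regression. This is only an illustration using the standard classical and HC0 (White) formulas; it is not the generalized information matrix test the article proposes, and the function name is hypothetical.

```python
import math


def slope_standard_errors(x, y):
    """Return (slope, classical_se, robust_se) for y = a + b*x fit by OLS.

    classical_se assumes constant error variance; robust_se is the HC0
    sandwich estimator. A large gap between the two hints at misspecification.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical: residual variance (n - 2 degrees of freedom) over Sxx.
    classical_se = math.sqrt(sum(e * e for e in residuals) / (n - 2) / sxx)
    # HC0: each squared residual is weighted by its squared leverage term.
    meat = sum(((xi - x_bar) * e) ** 2 for xi, e in zip(x, residuals))
    robust_se = math.sqrt(meat / sxx ** 2)
    return slope, classical_se, robust_se


if __name__ == "__main__":
    slope, classical_se, robust_se = slope_standard_errors(
        [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 2.0, 6.0]
    )
    print(slope, classical_se, robust_se)
```

When the two standard errors differ substantially, the abstract's advice is to treat the gap as evidence about the model rather than to report the robust value and move on.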