A vast literature demonstrates that voters around the world who benefit from their governments' discretionary spending cast ballots for the incumbent party in larger proportions than those not receiving funds. But surprisingly, and contrary to most theories of political accountability, the evidence seems to indicate that voters also reward incumbent parties for implementing "programmatic" spending legislation, passed with support from all major parties, and over which incumbents have no discretion. Why voters would attribute responsibility when none exists is unclear, as is why minority party legislators would approve of legislation that will cost them votes. We address this puzzle with one of the largest randomized social experiments ever, resulting in clear rejection of the claim that programmatic policies greatly increase voter support for incumbents. We also reanalyze the study cited as claiming the strongest support for the electoral effects of programmatic policies, which is also a very large scale randomized experiment. We show that its key results vanish after correcting either a simple coding error affecting only two observations or highly unconventional data analysis procedures (or both). We also discuss how these consistent empirical results from the only two probative experiments on this question may be reconciled with several observational and theoretical studies touching on similar questions in other contexts.
Universities require faculty and students planning research involving human subjects to pass formal certification tests and then submit research plans for prior approval. Those who diligently take the tests may better understand certain important legal requirements but, at the same time, are often misled into thinking they can apply these rules to their own work which, in fact, they are not permitted to do. They will also be missing many other legal requirements not mentioned in their training but which govern their behaviors. Finally, the training leaves them likely to completely misunderstand the essentially political situation they find themselves in. The resulting risks to their universities, collaborators, and careers may be catastrophic, in addition to contributing to the more common ordinary frustrations of researchers with the system. To avoid these problems, faculty and students conducting research about and for the public need to understand that they are public figures, to whom different rules apply, ones that political scientists have long studied. University administrators (and faculty in their part-time roles as administrators) need to reorient their perspectives as well. University research compliance bureaucracies have grown in well-meaning but sometimes unproductive ways that are not required by federal laws or guidelines. We offer advice to faculty and students for how to deal with the system as it exists now, and suggestions for changes in university research compliance bureaucracies that should benefit faculty, students, staff, university budgets, and our research subjects.
A few years ago, explaining what you did for a living to Dad, Aunt Rose, or your friend from high school was pretty complicated. Answering that you develop statistical estimators, work on numerical optimization, or, even better, are working on a great new Markov Chain Monte Carlo implementation of a Bayesian model with heteroskedastic errors for automated text analysis is pretty much the definition of conversation stopper.
Then the media noticed the revolution we’re all a part of, and they glued a label to it. Now “Big Data” is what you and I do. As trivial as this change sounds, we should be grateful for it, as the name seems to resonate with the public and so it helps convey the importance of our field to others better than we had managed to do ourselves. Yet, now that we have everyone’s attention, we need to start clarifying for others -- and ourselves -- what the revolution means. This is much of what this book is about.
Throughout, we need to remember that for the most part, Big Data is not about the data....
We propose a simplified approach to matching for causal inference that simultaneously optimizes balance (similarity between the treated and control groups) and matched sample size. Existing approaches either fix the matched sample size and maximize balance or fix balance and maximize sample size, leaving analysts to settle for suboptimal solutions or attempt manual optimization by iteratively tweaking their matching method and rechecking balance. To jointly maximize balance and sample size, we introduce the matching frontier, the set of matching solutions with maximum possible balance for each sample size. Rather than iterating, researchers can choose matching solutions from the frontier for analysis in one step. We derive fast algorithms (about one million times faster than the best existing approach) that calculate the matching frontier, for several commonly used balance metrics. We demonstrate with analyses of the effect of sex on judging and job training programs that show how the methods we introduce can extract new knowledge from existing data sets.
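The definition of the frontier given above (the best achievable balance at each sample size) can be illustrated by brute force on a toy data set. The sketch below is not the fast algorithm the abstract refers to: it assumes a single covariate, uses the absolute difference in means as an illustrative imbalance metric, prunes only control units, and enumerates every subset, which is feasible only for very small samples. All function names and data values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def imbalance(treated, controls):
    """Absolute difference in covariate means (one simple balance metric)."""
    return abs(mean(treated) - mean(controls))


def brute_force_frontier(treated, controls):
    """For each number of retained controls k, find by exhaustive search
    the subset of controls with the lowest imbalance.

    Returns {k: (best_imbalance, retained_controls)}.
    """
    frontier = {}
    for k in range(len(controls), 0, -1):
        best = min(combinations(controls, k),
                   key=lambda kept: imbalance(treated, kept))
        frontier[k] = (imbalance(treated, best), best)
    return frontier


if __name__ == "__main__":
    treated = [5.0, 6.0]             # hypothetical covariate values
    controls = [1.0, 5.0, 6.0, 9.0]  # hypothetical covariate values
    for k, (score, kept) in brute_force_frontier(treated, controls).items():
        print(k, round(score, 3), kept)
```

Each entry of the returned dictionary is one point on the toy frontier, so an analyst could inspect the whole balance versus sample size trade-off before choosing a subset. The exhaustive search grows combinatorially with the number of controls, which is why dedicated algorithms are needed in practice.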
A recent article by the Open Science Collaboration (a group of 270 coauthors) gained considerable academic and public attention due to its sensational conclusion that the replicability of psychological science is surprisingly low. Science magazine lauded this article as one of the top 10 scientific breakthroughs of the year across all fields of science, reports of which appeared on the front pages of newspapers worldwide. We show that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%. (Of course, that doesn't mean that the replicability is 100%, only that the evidence is insufficient to reliably estimate replicability.) The moral of the story is that meta-science must follow the rules of science.
Almost two centuries ago, the idea of research libraries, and the possibility of building them at scale, began to be realized. Although we can find these libraries at every major college and university in the world today, and at many noneducational research institutions, this outcome was by no means obvious at the time. And the benefits we all now enjoy from their existence were then at best merely vague speculations.
How many would have supported the formation of these institutions at the time, without knowing the benefits that have since become obvious? After all, the arguments against this massive ongoing expenditure are impressive. The proposal was to construct large buildings, hire staff, purchase all manner of books and other publications and catalogue and shelve them, provide access to visitors, and continually reorder all the books that the visitors disorder. And the libraries would keep the books, and fund the whole operation, in perpetuity. Publications would be collected without anyone deciding which were of high quality and thus deserving of preservation—leading critics to argue that all this effort would result in expensive buildings packed mostly with junk. . . .
We show that propensity score matching (PSM), an enormously popular method of preprocessing data for causal inference, often accomplishes the opposite of its intended goal -- increasing imbalance, inefficiency, model dependence, and bias. PSM supposedly makes it easier to find matches by projecting a large number of covariates to a scalar propensity score and applying a single model to produce an unbiased estimate. However, in observational analysis the data generation process is rarely known and so users typically try many models before choosing one to present. The weakness of PSM comes from its attempts to approximate a completely randomized experiment, rather than, as with other matching methods, a more efficient fully blocked randomized experiment. PSM is thus uniquely blind to the often large portion of imbalance that can be eliminated by approximating full blocking with other matching methods. Moreover, in data balanced enough to approximate complete randomization, either to begin with or after pruning some observations, PSM approximates random matching which, we show, increases imbalance even relative to the original data. Although these results suggest that researchers replace PSM with one of the other available methods when performing matching, propensity scores have many other productive uses.
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b).
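As a rough illustration of the second step in a multiple imputation workflow, the sketch below pools point estimates and variances from several completed data sets using the standard MI combining rules (average the estimates, then add the average within-imputation variance to an inflated between-imputation variance). It is an assumption that these generic rules are the relevant ones here; the sketch does not reproduce the authors' software, and the function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def combine_imputations(estimates, variances):
    """Pool results from m >= 2 completed data sets.

    estimates: point estimate of the same quantity from each data set
    variances: squared standard error of that estimate from each data set
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average sampling variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


if __name__ == "__main__":
    # Hypothetical results from three completed data sets.
    estimate, std_error = combine_imputations([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(estimate, round(std_error, 3))
```

The pooled standard error exceeds the average within-imputation standard error whenever the estimates differ across data sets, which is how the extra uncertainty from missing or mismeasured values enters the final inference.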
The financial stability of four of the five largest U.S. federal entitlement programs, strategic decision making in several industries, and many academic publications all depend on the accuracy of demographic and financial forecasts made by the Social Security Administration (SSA). Although the SSA has performed these forecasts since 1942, no systematic and comprehensive evaluation of their accuracy has ever been published by SSA or anyone else. The absence of a systematic evaluation of forecasts is a concern because the SSA relies on informal procedures that are potentially subject to inadvertent biases and does not share with the public, the scientific community, or other parts of SSA sufficient data or information necessary to replicate or improve its forecasts. These issues result in SSA holding a monopoly position in policy debates as the sole supplier of fully independent forecasts and evaluations of proposals to change Social Security. To assist with the forecasting evaluation problem, we collect all SSA forecasts for years that have passed and discover error patterns that could have been---and could now be---used to improve future forecasts. Specifically, we find that after 2000, SSA forecasting errors grew considerably larger and most of these errors made the Social Security Trust Funds look more financially secure than they actually were. In addition, SSA's reported uncertainty intervals are overconfident and increasingly so after 2000. We discuss the implications of these systematic forecasting biases for public policy.
The accuracy of U.S. Social Security Administration (SSA) demographic and financial forecasts is crucial for the solvency of its Trust Funds, other government programs, industry decision making, and the evidence base of many scholarly articles. Because SSA makes public little replication information and uses qualitative and antiquated statistical forecasting methods, fully independent alternative forecasts (and the ability to score policy proposals to change the system) are nonexistent. Yet, no systematic evaluation of SSA forecasts has ever been published by SSA or anyone else --- until a companion paper to this one (King, Kashin, and Soneji, 2015a). We show that SSA's forecasting errors were approximately unbiased until about 2000, but then began to grow quickly, with increasingly overconfident uncertainty intervals. Moreover, the errors are all in the same potentially dangerous direction, making the Social Security Trust Funds look healthier than they actually are. We extend and then attempt to explain these findings with evidence from a large number of interviews we conducted with participants at every level of the forecasting and policy processes. We show that SSA's forecasting procedures meet all the conditions the modern social-psychology and statistical literatures demonstrate make bias likely. When those conditions mixed with potent new political forces trying to change Social Security, SSA's actuaries hunkered down trying hard to insulate their forecasts from strong political pressures. Unfortunately, this otherwise laudable resistance to undue influence, along with their ad hoc qualitative forecasting models, led the actuaries to miss important changes in the input data. Retirees began living longer lives and drawing benefits longer than predicted by simple extrapolations. 
We also show that the solution to this problem involves SSA or Congress implementing in government two of the central projects of political science over the last quarter century: promoting transparency in data and methods and replacing with formal statistical models large numbers of qualitative decisions too complex for unaided humans to make optimally.
The vast majority of social science research presently uses small (MB or GB scale) data sets. These fixed-scale data sets are commonly downloaded to the researcher's computer where the analysis is performed locally, and are often shared and cited with well-established technologies, such as the Dataverse Project (see Dataverse.org), to support the published results. The trend towards Big Data -- including large scale streaming data -- is starting to transform research and has the potential to impact policy-making and our understanding of the social, economic, and political problems that affect human societies. However, this research poses new challenges in execution, accountability, preservation, reuse, and reproducibility. Downloading these data sets to a researcher’s computer is infeasible or impractical; hence, analyses take place in the cloud, require unusual expertise, and benefit from collaborative teamwork and novel tool development. The same richness that makes these data sets so informative also means that they are much more likely to contain highly sensitive personally identifiable information. In this paper, we discuss solutions to these new challenges so that the social sciences can realize the potential of Big Data.
We extend a unified and easy-to-use approach to measurement error and missing data. In our companion article, Blackwell, Honaker, and King give an intuitive overview of the new technique, along with practical suggestions and empirical applications. Here, we offer more precise technical details, more sophisticated measurement error model specifications and estimation procedures, and analyses to assess the approach’s robustness to correlated measurement errors and to errors in categorical variables. These results support using the technique to reduce bias and increase efficiency in a wide variety of empirical research.
To reduce model dependence and bias in causal inference, researchers usually use matching as a data preprocessing step, after which they apply whatever statistical model and uncertainty estimators they would have without matching. Unfortunately, this approach is appropriate in finite samples only under exact matching, which is usually infeasible, or approximate matching only under asymptotic theory if large enough sample sizes are available, but even then requires unfamiliar specialized point and variance estimators. Instead of attempting to change common practices, we show how those analyzing certain specific (but extremely common) types of data can instead appeal to a much easier version of existing theory. This alternative theory is substantively plausible, requires no asymptotic theory, and is simple to understand. Its core conceptualizes continuous variables as having natural breakpoints, which are common in applications (e.g., high school or college degrees in years of education, a governmental poverty level in income, or phase transitions in temperature). The theory allows binary, multicategory, and continuous treatment variables from the outset and straightforward extensions for imperfect treatment assignment and different versions of treatments.
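The idea of natural breakpoints can be made concrete with a small sketch: a continuous covariate is coarsened at substantively meaningful cut points, and units are then matched exactly within the resulting strata. The breakpoints (12 and 16 years of education, standing in for high school and college degrees), the data, and the function names are all hypothetical, and this is only one simple way to operationalize the idea, not the estimator developed in the paper.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical breakpoints: 12 years (high school), 16 years (college).
EDUCATION_BREAKS = [12, 16]


def coarsen(value, breaks):
    """Map a continuous value to the index of its stratum.

    With breaks [12, 16]: below 12 -> 0, 12 up to 16 -> 1, 16 or more -> 2.
    """
    return bisect_right(breaks, value)


def matched_strata(units, breaks):
    """Group (unit_id, treated, years) tuples by coarsened stratum and keep
    only strata containing at least one treated and one control unit."""
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for unit_id, treated, years in units:
        group = "treated" if treated else "control"
        strata[coarsen(years, breaks)][group].append(unit_id)
    return {key: groups for key, groups in strata.items()
            if groups["treated"] and groups["control"]}


if __name__ == "__main__":
    units = [("a", True, 10), ("b", False, 11), ("c", True, 12),
             ("d", False, 14), ("e", True, 17)]
    print(matched_strata(units, EDUCATION_BREAKS))
```

In this toy example the unit with 17 years of education is pruned because its stratum contains no control unit, while the other two strata are retained for analysis.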
"Robust standard errors" are used in a vast array of scholarship to correct standard errors for model misspecification. However, when misspecification is bad enough to make classical and robust standard errors diverge, assuming that it is nevertheless not so bad as to bias everything else requires considerable optimism. And even if the optimism is warranted, settling for a misspecified model, with or without robust standard errors, will still bias estimators of all but a few quantities of interest. The resulting cavernous gap between theory and practice suggests that considerable gains in applied statistics may be possible. We seek to help researchers realize these gains via a more productive way to understand and use robust standard errors; a new general and easier-to-use "generalized information matrix test" statistic that can formally assess misspecification (based on differences between robust and classical variance estimates); and practical illustrations via simulations and real examples from published research. How robust standard errors are used needs to change, but instead of jettisoning this popular tool we show how to use it to provide effective clues about model misspecification, likely biases, and a guide to considerably more reliable, and defensible, inferences. Accompanying this article [soon!] is software that implements the methods we describe.
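To see what it means for classical and robust standard errors to diverge, the sketch below computes both for the slope of a simple linear regression with an intercept. It uses the textbook classical formula and the basic heteroskedasticity-consistent (HC0) sandwich formula; it is not the generalized information matrix test described above, and the function name and data are hypothetical.

```python
from math import sqrt


def slope_standard_errors(x, y):
    """Fit y = a + b*x by least squares and return
    (slope, classical_se, robust_hc0_se) for the slope b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical: one common error variance, estimated with n - 2 df.
    classical_var = sum(e * e for e in resid) / (n - 2) / sxx
    # Robust (HC0): each squared residual is weighted by its own leverage term.
    robust_var = sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2
    return slope, sqrt(classical_var), sqrt(robust_var)


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical data
    y = [3.0, 2.0, 3.0, 6.0, 11.0]
    slope, classical_se, robust_se = slope_standard_errors(x, y)
    print(slope, round(classical_se, 3), round(robust_se, 3))
```

A noticeable gap between the two standard errors, as in this toy example, is the kind of signal the abstract suggests treating as a clue about misspecification rather than as a problem solved by reporting the robust version.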
This is a poster that describes our analysis of "partisan taunting," the explicit, public, and negative attacks on another political party or its members, usually using vitriolic and derogatory language. We first demonstrate that most projects that hand code text in the social sciences optimize with respect to the wrong criterion, resulting in large, unnecessary biases. We show how to fix this problem and then apply it to taunting. We find empirically that, unlike most claims in the press and the literature, taunting is not inexorably increasing; it appears instead to be a rational political strategy, most often used by those least likely to win by traditional means -- ideological extremists, out-party members when the president is unpopular, and minority party members. However, although taunting appears to be individually rational, it is collectively irrational: Constituents may resonate with one cutting taunt by their Senator, but they might not approve if he or she were devoting large amounts of time to this behavior rather than, say, trying to solve important national problems. We hope to partially rectify this situation by posting public rankings of Senatorial taunting behavior.
This is a poster presentation describing (1) the largest ever experimental study of media effects, with more than 50 cooperating traditional media sites, normally unavailable web site analytics, the text of hundreds of thousands of news articles, and tens of millions of social media posts, and (2) a design we used in preparation that attempts to anticipate experimental outcomes.
Representative embodiments of a method for grouping participants in an activity include the steps of: (i) defining a grouping policy; (ii) storing, in a database, participant records that include a participant identifier, a characteristic associated with the participant, and/or an identifier for a participant’s handheld device; (iii) defining groupings based on the policy and characteristics of the participants relating to the policy and to the activity; and (iv) communicating the groupings to the handheld devices to establish the groups.
MatchingFrontier is an easy-to-use R package for making optimal causal inferences from observational data. Despite their popularity, existing matching approaches leave researchers with two fundamental tensions. First, they are designed to maximize one metric (such as propensity score or Mahalanobis distance) but are judged against another for which they were not designed (such as L1 or differences in means). Second, they lack a principled solution to revealing the implicit bias-variance trade-off: matching methods need to optimize with respect to both imbalance (between the treated and control groups) and the number of observations pruned, but existing approaches optimize with respect to only one; users then either ignore the other, or tweak it, usually suboptimally, by hand.
MatchingFrontier resolves both tensions by consolidating previous techniques into a single, optimal, and flexible approach. It calculates the matching solution with maximum balance for each possible sample size (N, N-1, N-2,...). It thus directly calculates the entire balance-sample size frontier, from which the user can easily choose one, several, or all subsamples from which to conduct their final analysis, given their own choice of imbalance metric and quantity of interest. MatchingFrontier solves the joint optimization problem in one run, automatically, without manual tweaking, and without iteration. Although for each subset size k, there exist a huge (N choose k) number of unique subsets, MatchingFrontier includes specially designed fast algorithms that give the optimal answer, usually in a few minutes.