A method of computerized content analysis that gives “approximately unbiased and statistically consistent estimates” of a distribution of elements of structured, unstructured, and partially structured source data among a set of categories. In one embodiment, this is done by analyzing a distribution of small set of individually-classified elements in a plurality of categories and then using the information determined from the analysis to extrapolate a distribution in a larger population set. This extrapolation is performed without constraining the distribution of the unlabeled elements to be equal to the distribution of labeled elements, nor constraining a content distribution of content of elements in the labeled set (e.g., a distribution of words used by elements in the labeled set) to be equal to a content distribution of elements in the unlabeled set. Not being constrained in these ways allows the estimation techniques described herein to provide distinct advantages over conventional aggregation techniques.
Amelia II is a complete R package for multiple imputation of missing data. The package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use, than various Markov chain Monte Carlo approaches, but gives essentially the same answers. The program also improves imputation models by allowing researchers to put Bayesian priors on individual cell values, thereby including a great deal of potentially valuable and extensive information. It also includes features to accurately impute cross-sectional datasets, individual time series, or sets of time series for diﬀerent cross-sections. A full set of graphical diagnostics are also available. The program is easy to use, and the simplicity of the algorithm makes it far more robust; both a simple command line and extensive graphical user interface are included.
When respondents use the ordinal response categories of standard survey questions in different ways, the validity of analyses based on the resulting data can be biased. Anchoring vignettes is a survey design technique intended to correct for some of these problems. The anchors package in R includes methods for evaluating and choosing anchoring vignettes, and for analyzing the resulting data.
We highlight common problems in the application of random treatment assignment in large scale program evaluation. Random assignment is the defining feature of modern experimental design. Yet, errors in design, implementation, and analysis often result in real world applications not benefiting from the advantages of randomization. The errors we highlight cover the control of variability, levels of randomization, size of treatment arms, and power to detect causal effects, as well as the many problems that commonly lead to post-treatment bias. We illustrate with an application to the Medicare Health Support evaluation, including recommendations for improving the design and analysis of this and other large scale randomized experiments.
Matching is an increasingly popular method of causal inference in observational data, but following methodological best practices has proven difficult for applied researchers. We address this problem by providing a simple graphical approach for choosing among the numerous possible matching solutions generated by three methods: the venerable ``Mahalanobis Distance Matching'' (MDM), the commonly used ``Propensity Score Matching'' (PSM), and a newer approach called ``Coarsened Exact Matching'' (CEM). In the process of using our approach, we also discover that PSM often approximates random matching, both in many real applications and in data simulated by the processes that fit PSM theory. Moreover, contrary to conventional wisdom, random matching is not benign: it (and thus PSM) can often degrade inferences relative to not matching at all. We find that MDM and CEM do not have this problem, and in practice CEM usually outperforms the other two approaches. However, with our comparative graphical approach and easy-to-follow procedures, focus can be on choosing a matching solution for a particular application, which is what may improve inferences, rather than the particular method used to generate it.
Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and research ethics and policy, and these are collectively holding back progress. I address these changes and challenges and suggest what can be done.
We introduce a method for estimating incidence curves of several co-circulating infectious pathogens, where each infection has its own probabilities of particular symptom profiles. Our deconvolution method utilizes weekly surveillance data on symptoms from a defined population as well as additional data on symptoms from a sample of virologically confirmed infectious episodes. We illustrate this method by numerical simulations and by using data from a survey conducted on the University of Michigan campus. Last, we describe the data needs to make such estimates accurate.
Population mortality forecasts are widely used for allocating public health expenditures, setting research priorities, and evaluating the viability of public pensions, private pensions, and health care financing systems. In part because existing methods seem to forecast worse when based on more information, most forecasts are still based on simple linear extrapolations that ignore known biological risk factors and other prior information. We adapt a Bayesian hierarchical forecasting model capable of including more known health and demographic information than has previously been possible. This leads to the first age- and sex-specific forecasts of American mortality that simultaneously incorporate, in a formal statistical model, the effects of the recent rapid increase in obesity, the steady decline in tobacco consumption, and the well known patterns of smooth mortality age profiles and time trends. Formally including new information in forecasts can matter a great deal. For example, we estimate an increase in male life expectancy at birth from 76.2 years in 2010 to 79.9 years in 2030, which is 1.8 years greater than the U.S. Social Security Administration projection and 1.5 years more than U.S. Census projection. For females, we estimate more modest gains in life expectancy at birth over the next twenty years from 80.5 years to 81.9 years, which is virtually identical to the Social Security Administration projection and 2.0 years less than U.S. Census projections. We show that these patterns are also likely to greatly affect the aging American population structure. We offer an easy-to-use approach so that researchers can include other sources of information and potentially improve on our forecasts too.
We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimize a different objective function. Almost all are well defined, but how to determine before the fact which one, if any, will partition a given set of objects in an "insightful" or "useful" way for a given user is unknown and difficult, if not logically impossible. We develop a metric space of partitions from all existing cluster analysis methods applied to a given data set (along with millions of other solutions we add based on combinations of existing clusterings), and enable a user to explore and interact with it, and quickly reveal or prompt useful or insightful conceptualizations. In addition, although uncommon in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal in specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than either expert human coders or many existing fully automated methods.
MatchIt implements the suggestions of Ho, Imai, King, and Stuart (2007) for improving parametric statistical models by preprocessing data with nonparametric matching methods. MatchIt implements a wide range of sophisticated matching methods, making it possible to greatly reduce the dependence of causal inferences on hard-to-justify, but commonly made, statistical modeling assumptions. The software also easily fits into existing research practices since, after preprocessing data with MatchIt, researchers can use whatever parametric model they would have used without MatchIt, but produce inferences with substantially more robustness and less sensitivity to modeling assumptions. MatchIt is an R program, and also works seamlessly with Zelig.
We introduce a new "Monotonic Imbalance Bounding" (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, "Equal Percent Bias Reducing" (EPBR), which is designed to satisfy weaker properties and only in expectation. We also offer strategies to obtain specific members of the MIB class, and analyze in more detail a member of this class, called Coarsened Exact Matching, whose properties we analyze from this new perspective. We offer a variety of analytical results and numerical simulations that demonstrate how members of the MIB class can dramatically improve inferences relative to EPBR-based matching methods.
Background: Incomplete information on death certificates makes recorded cause of death data less useful for public health monitoring and planning. Certifying physicians sometimes list only the mode of death (and in particular, list heart failure) without indicating the underlying disease(s) that gave rise to the death. This can prevent valid epidemiologic comparisons across countries and over time. Methods and Results: We propose that coarsened exact matching be used to infer the underlying causes of death where only the mode of death is known; we focus on the case of heart failure in U.S., Mexican and Brazilian death records. Redistribution algorithms derived using this method assign the largest proportion of heart failure deaths to ischemic heart disease in all three countries (53%, 26% and 22%), with larger proportions assigned to hypertensive heart disease and diabetes in Mexico and Brazil (16% and 23% vs. 7% for hypertensive heart disease and 13% and 9% vs. 6% for diabetes). Reassigning these heart failure deaths increases US ischemic heart disease mortality rates by 6%.Conclusions: The frequency with which physicians list heart failure in the causal chain for various underlying causes of death allows for inference about how physicians use heart failure on the death certificate in different settings. This easy-to-use method has the potential to reduce bias and increase comparability in cause-of-death data, thereby improving the public health utility of death records. Key Words: vital statistics, heart failure, population health, mortality, epidemiology
Background: Verbal autopsy analyses are widely used for estimating cause-specific mortality rates (CSMR) in the vast majority of the world without high quality medical death registration. Verbal autopsies -- survey interviews with the caretakers of imminent decedents -- stand in for medical examinations or physical autopsies, which are infeasible or culturally prohibited. Methods and Findings: We introduce methods, simulations, and interpretations that can improve the design of automated, data-derived estimates of CSMRs, building on a new approach by King and Lu (2008). Our results generate advice for choosing symptom questions and sample sizes that is easier to satisfy than existing practices. For example, most prior effort has been devoted to searching for symptoms with high sensitivity and specificity, which has rarely if ever succeeded with multiple causes of death. In contrast, our approach makes this search irrelevant because it can produce unbiased estimates even with symptoms that have very low sensitivity and specificity. In addition, the new method is optimized for survey questions caretakers can easily answer rather than questions physicians would ask themselves. We also offer an automated method of weeding out biased symptom questions and advice on how to choose the number of causes of death, symptom questions to ask, and observations to collect, among others. Conclusions: With the advice offered here, researchers should be able to design verbal autopsy surveys and conduct analyses with greatly reduced statistical biases and research costs.
We report the results of several randomized survey experiments designed to evaluate two intended improvements to anchoring vignettes, an increasingly common technique used to achieve interpersonal comparability in survey research. This technique asks for respondent self-assessments followed by assessments of hypothetical people described in vignettes. Variation in assessments of the vignettes across respondents reveals interpersonal incomparability and allows researchers to make responses more comparable by rescaling them. Our experiments show, first, that switching the question order so that self-assessments follow the vignettes primes respondents to define the response scale in a common way. In this case, priming is not a bias to avoid but a means of better communicating the question’s meaning. We then demonstrate that combining vignettes and self-assessments in a single direct comparison induces inconsistent and less informative responses. Since similar combined strategies are widely employed for related purposes, our results indicate that anchoring vignettes could reduce measurement error in many applications where they are not currently used. Data for our experiments come from a national telephone survey and a separate on-line survey.
Classic (or "cumulative") case-control sampling designs do not admit inferences about quantities of interest other than risk ratios, and then only by making the rare events assumption. Probabilities, risk differences, and other quantities cannot be computed without knowledge of the population incidence fraction. Similarly, density (or "risk set") case-control sampling designs do not allow inferences about quantities other than the rate ratio. Rates, rate differences, cumulative rates, risks, and other quantities cannot be estimated unless auxiliary information about the underlying cohort such as the number of controls in each full risk set is available. Most scholars who have considered the issue recommend reporting more than just the relative risks and rates, but auxiliary population information needed to do this is not usually available. We address this problem by developing methods that allow valid inferences about all relevant quantities of interest from either type of case-control study when completely ignorant of or only partially knowledgeable about relevant auxiliary population information. This is a somewhat revised and extended version of Gary King and Langche Zeng. 2002. "Estimating Risk and Rate Levels, Ratios, and Differences in Case-Control Studies," Statistics in Medicine, 21: 1409-1427. You may also be interested in our related work in other fields, such as in international relations, Gary King and Langche Zeng. "Explaining Rare Events in International Relations," International Organization, 55, 3 (Spring, 2001): 693-715, and in political methodology, Gary King and Langche Zeng, "Logistic Regression in Rare Events Data," Political Analysis, Vol. 9, No. 2, (Spring, 2001): Pp. 137--63.
A program for analyzing most any feature of district-level legislative elections data, including prediction, evaluating redistricting plans, estimating counterfactual hypotheses (such as what would happen if a term-limitation amendment were imposed). This implements statistical procedures described in a series of journal articles and has been used during redistricting in many states by judges, partisans, governments, private citizens, and many others. The earlier version was winner of the APSA Research Software Award.
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.