We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimizes a different objective function. Almost all are well defined, but how to determine before the fact which one, if any, will partition a given set of objects in an "insightful" or "useful" way for a given user is unknown and difficult, if not logically impossible. We develop a metric space of partitions from all existing cluster analysis methods applied to a given data set (along with millions of other solutions we add based on combinations of existing clusterings), and enable a user to explore and interact with it, and quickly reveal or prompt useful or insightful conceptualizations. In addition, although uncommon in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal in specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than either expert human coders or many existing fully automated methods.

# Methods

Automated and computer-assisted methods of extracting, organizing, and consuming knowledge from unstructured text.

## Content Analysis

. 2011. “General Purpose Computer-Assisted Clustering and Conceptualization.” Proceedings of the National Academy of Sciences. Publisher's VersionAbstract

Methods to evaluate automated information extraction systems when coding rare events, the success of one such system, along with considerable data. . 2003. “An Automated Information Extraction Tool For International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design.” International Organization, 57: 617-642, July.Abstract

. 2012. “System for Estimating a Distribution of Message Content Categories in Source Data.” U.S. Patent and Trademark Office. United States of America 8180717 (May 15).Abstract

. In Press. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science, 2017.Abstract

. 2014. “Participant Grouping for Enhanced Interactive Experience.” United States of America US 8,914,373 B2 (U.S. Patent and Trademark Office).Abstract

. 2013. “Method and Apparatus for Selecting Clusterings to Classify A Predetermined Data Set.” U.S. Patent and Trademark Office. United States of America 8,438,162 (May 7).Abstract

A method that gives unbiased estimates of the proportion of text documents in investigator-chosen categories, given only a small subset of hand-coded documents. Also includes the first correction for the far less-than-perfect levels of inter-coder reliability that typically characterize hand coding. Applications to sentiment detection about politicians in blog posts. . 2010. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science, 1, 54: 229–247, 01/2010.Abstract

. 2014. “*You Lie!* Patterns of Partisan Taunting in the U.S. Senate (Poster).” In Society for Political Methodology. Athens, GA, 24 July.Abstract

A version of the previous article for a different audience: . 2003. “Some Statistical Methods for Evaluating Information Extraction Systems.” Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, 19-26.Abstract

. 2013. “How Censorship in China Allows Government Criticism but Silences Collective Expression.” American Political Science Review, 2 (May), 107: 1-18.Abstract

## Software

## Data

## Causal Inference

Methods for detecting and reducing model dependence (i.e., when minor model changes produce substantively different inferences) in inferring causal effects and other counterfactuals. Matching methods; "politically robust" and cluster-randomized experimental designs; causal bias decompositions.

## Methods for Observational Data

. In Press. “The Balance-Sample Size Frontier in Matching Methods for Causal Inference.” American Journal of Political Science, 2016.Abstract

## Evaluating Model Dependence

Evaluating whether counterfactual questions (predictions, what-if questions, and causal effects) can be reasonably answered from given data, or whether inferences will instead be highly model-dependent; also, a new decomposition of bias in causal inference. These articles overlap (and each has been the subject of a journal symposium):

For complete mathematical proofs, general notation, and other technical material, see: . 2006. “The Dangers of Extreme Counterfactuals.” Political Analysis, 14: 131–159.Abstract

For more intuitive, but less general, notation, but with additional examples and more pedagogically oriented material, see: . 2007. “When Can History Be Our Guide? The Pitfalls of Counterfactual Inference.” International Studies Quarterly, 183-210, March.Abstract

## Matching Methods

A simple and powerful method of matching: . 2011. “Causal Inference Without Balance Checking: Coarsened Exact Matching.” Political Analysis.Abstract

. 2009. “CEM: Software for Coarsened Exact Matching.” Journal of Statistical Software, 30. Publisher's VersionAbstract

A technical paper that describes a new class of matching methods, of which coarsened exact matching is an example: . 2011. “Multivariate Matching Methods That are Monotonic Imbalance Bounding.” Journal of the American Statistical Association, 493, 106: 345-361, 2011.Abstract

A unified approach to matching methods as a way to reduce model dependence by preprocessing data and then using any model you would have without matching: . 2007. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis, 15: 199–236.Abstract

. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software, 8, 42. Publisher's VersionAbstract

## Additional Approaches

A method to estimate base probabilities or any quantity of interest from case-control data, even with no (or partial) auxiliary information. Discusses problems with odds-ratios. . 2002. “Estimating Risk and Rate Levels, Ratios, and Differences in Case-Control Studies.” Statistics in Medicine, 21: 1409–1427.Abstract

. 1991. “'Truth' is Stranger than Prediction, More Questionable Than Causal Inference.” American Journal of Political Science, 35: 1047–1053, November.Abstract

Causal inference in qualitative research (Chapter 4). . 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press. Publisher's Version

## Experimental Design

. 2014. “Methods for Extremely Large Scale Media Experiments and Observational Studies (Poster).” In Society for Political Methodology. Athens, GA, 24 July.Abstract

. 2012. “Letter to the Editor on the "Medicare Health Support Pilot Program" (by McCall and Cromwell).” New England Journal of Medicine, 7, 366: 667. New England Journal of Medicine version

. 2011. “Avoiding Randomization Failure in Program Evaluation.” Population Health Management, 1, 14: S11-S22, 2011.Abstract

An evaluation of the Mexican Seguro Popular program (designed to extend health insurance and regular and preventive medical care, pharmaceuticals, and health facilities to 50 million uninsured Mexicans), one of the world's largest health policy reforms of the last two decades. The evaluation features the largest randomized health policy experiment in history, a new design for field experiments that is more robust to the political interventions that have ruined many similar previous efforts, and new statistical methods that produce more reliable and efficient results using substantially fewer resources, assumptions, and data.

**(Articles on the Seguro Popular Evaluation: Website)**

Clarifying serious misunderstandings in the advantages and uses of the most common research designs for making causal inferences. . 2008. “Misunderstandings Among Experimentalists and Observationalists about Causal Inference.” Journal of the Royal Statistical Society, Series A, 171, part 2: 481–502.Abstract

## Software

. 2005. “WhatIf: Software for Evaluating Counterfactuals.” Journal of Statistical Software, 15. Publisher's Version

. 2003. “CLARIFY: Software for Interpreting and Presenting Statistical Results.” Journal of Statistical Software 8.Abstract

## Applications

. 2005. “The Supreme Court During Crisis: How War Affects only Non-War Cases.” New York University Law Review, 80: 1–116, April.Abstract

A brief summary of the above article for an undergraduate audience: . 2006. “The Effect of War on the Supreme Court.” In Principles and Practice in American Politics: Classic and Contemporary Readings, 3rd ed. Washington, D.C.: Congressional Quarterly Press.Abstract

## Event Counts and Durations

Statistical models to explain or predict how many events occur for each fixed time period, or the time between events. An application to cabinet dissolution in parliamentary democracies that united two previously warring scholarly literatures. Other applications to international relations and U.S. Supreme Court appointments.

## Event Counts

A series of methods that introduced existing, and developed new, statistical models for event counts for political science research.

. 1996. “The Generalization in the Generalized Event Count Model, With Comments on Achen, Amato, and Londregan.” Political Analysis, 6: 225–252.Abstract

. 1988. “Statistical Models for Political Science Event Counts: Bias in Conventional Procedures and Evidence for The Exponential Poisson Regression Model.” American Journal of Political Science, 32: 838-863, August.Abstract

. 1998. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Ann Arbor: University of Michigan Press. Publisher's Version

. 1987. “Presidential Appointments to the Supreme Court: Adding Systematic Explanation to Probabilistic Description.” American Politics Quarterly, 15: 373–386, July.Abstract

. 1989. “A Seemingly Unrelated Poisson Regression Model.” Sociological Methods and Research, 17: 235–255, February.Abstract

. 1989. “Event Count Models for International Relations: Generalizations and Applications.” International Studies Quarterly, 33: 123–147, June.Abstract

. 1989. “Variance Specification in Event Count Models: From Restrictive Assumptions to a Generalized Estimator.” American Journal of Political Science, 33: 762–784, August.Abstract

. 1995. “A Correction for an Underdispersed Event Count Probability Distribution.” Political Analysis, 215–228.Abstract

. 2008. Demographic Forecasting. Princeton: Princeton University Press.Abstract - see sections on dealing with small death counts

## Duration of Parliamentary Governments

A statistical model, and related work, that united two warring scholarly literatures.

. 1990. “A Unified Model of Cabinet Dissolution in Parliamentary Democracies.” American Journal of Political Science, 34: 846–871, August.Abstract

. 1994. “Transfers of Governmental Power: The Meaning of Time Dependence.” Comparative Political Studies, 27: 190–210, July.Abstract

. 2001. “Aggregation Among Binary, Count, and Duration Models: Estimating the Same Quantities from Different Levels of Data.” Political Analysis, 9: 21–44, Winter.Abstract

## Software

Includes several methods for count and duration analysis: . 2006. “Zelig: Everyone's Statistical Software”. Publisher's Version

## Related Data

. 2003. “10 Million International Dyadic Events”. Publisher's Version. Coding conflict and cooperation in international relations, 1990-2004, as evaluated by King and Lowe (2003).

## Ecological Inference

Inferring individual behavior from group-level data: The first approach to incorporate both unit-level deterministic bounds and cross-unit statistical information, methods for 2x2 and larger tables, Bayesian model averaging, applications to elections, software.

. 2008. “Ordinary Economic Voting Behavior in the Extraordinary Election of Adolf Hitler.” Journal of Economic History, 4, 68: 996, 12/2008.Abstract

## Methods

Summarizes the explosion of research in ecological inference that has occurred in the previous eight years, all following the key insight of models that include both deterministic and statistical information. . 2004. Ecological Inference: New Methodological Strategies. New York: Cambridge University Press.Abstract

An extension of the work in the above book to use MCMC technology, making models for larger tables possible. . 1999. “Binomial-Beta Hierarchical Models for Ecological Inference.” Sociological Methods and Research, 28: 61–90, August.Abstract

The first ecological inference method to combine, in a single model, unit-level deterministic bounds with cross-unit statistical information, unifying two literatures that had been in conflict since 1953. . 1997. A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton: Princeton University Press.

Outlines some of the history of ecological inference research, and introduces this new book. . 2004. “Information in Ecological Inference: An Introduction.” In Ecological Inference: New Methodological Strategies. New York: Cambridge University Press.

This article uses MCMC technology, and a quicker approximation, to make ecological inferences using deterministic and statistical information in larger tables. . 2001. “Bayesian and Frequentist Inference for Ecological Inference: The RxC Case.” Statistica Neerlandica, 55: 134–156.Abstract

Details of an application conducted for the New York Times, including extensions of ecological inference to Bayesian model averaging. . 2004. “Did Illegal Overseas Absentee Ballots Decide the 2000 U.S. Presidential Election?.” Perspectives on Politics, 2: 537–549, September.Abstract

Related research on aggregation, revealing the logical inconsistency of some popularly used models. . 2001. “Aggregation Among Binary, Count, and Duration Models: Estimating the Same Quantities from Different Levels of Data.” Political Analysis, 9: 21–44, Winter.Abstract
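As a hedged illustration of the unit-level deterministic bounds mentioned in this section, the sketch below computes them for a single 2x2 unit. It assumes only the accounting identity t = x*b1 + (1 - x)*b2, where x is the share of the unit's population in group 1, t is the share of the whole unit with the outcome, and b1 and b2 are the unknown outcome rates in the two groups. The function name and arguments are hypothetical and are not taken from EI, EzI, or Zelig; the statistical part of the model is not shown.

```python
def deterministic_bounds(x, t):
    """Bounds on the unknown group-level rates in one 2x2 unit.

    x: share of the unit's population in group 1 (0 < x < 1)
    t: share of the whole unit with the outcome (0 <= t <= 1)
    Returns ((lo1, hi1), (lo2, hi2)) for the rates b1 and b2.

    Hypothetical helper: it uses only the identity
    t = x*b1 + (1 - x)*b2 with b1, b2 restricted to [0, 1].
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must be strictly between 0 and 1")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")

    # Solve the identity for b1 and let b2 range over [0, 1].
    lo1 = max(0.0, (t - (1.0 - x)) / x)
    hi1 = min(1.0, t / x)

    # Solve the identity for b2 and let b1 range over [0, 1].
    lo2 = max(0.0, (t - x) / (1.0 - x))
    hi2 = min(1.0, t / (1.0 - x))

    return (lo1, hi1), (lo2, hi2)


if __name__ == "__main__":
    # A unit that is half group 1 with a 90% overall rate:
    # each group's rate is bounded well away from zero.
    print(deterministic_bounds(0.5, 0.9))
```

The bounds are informative only in some units; when t is near x or 1 - x they can span most of the unit interval, which is why the works above combine them with statistical information from other units.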

## Software

. 2006. “Zelig: Everyone's Statistical Software”. Publisher's Version includes several methods of ecological inference, and will soon include EI.

. 2003. “EzI: A(n Easy) Program for Ecological Inference”. Publisher's Version. Published as part of the Gauss Package by Aptech Systems, Kent, Washington, and as a stand-alone program called EzI: A(n Easy) Program for Ecological Inference, by Kenneth Benoit and me.

The above is published as . 2004. “EI: A Program for Ecological Inference.” Journal of Statistical Software, 11. Publisher's Version

## Data

. 1998. “The Record of American Democracy, 1984-1990.” Sociological Methods and Research, 26: 424–427, February. Publisher's Version

## Discussions and Extensions

. 1999. “The Future of Ecological Inference Research: A Reply to Freedman et al.” Journal of the American Statistical Association, 94: 352-355, March.Abstract

. 2003. “A Consensus on Second Stage Analyses in Ecological Inference Models.” Political Analysis, 11: 86–94, Winter.Abstract

. 2003. “Analyzing Second Stage Ecological Regressions.” Political Analysis, 11: 65-76, Winter.

. 2004. “Finding New Information for Ecological Inference Models: A Comment on Jon Wakefield, 'Ecological Inference in 2X2 Tables'.” Journal of the Royal Statistical Society, 167: 437.

. 2002. “Isolating Spatial Autocorrelation, Aggregation Bias, and Distributional Violations in Ecological Inference.” Political Analysis, 10: 298–300, Summer.Abstract

. 2000. “Geography, Statistics, and Ecological Inference.” Annals of the Association of American Geographers, 90: 601–606, September.Abstract

## Missing Data

Statistical methods to accommodate missing information in data sets due to scattered unit nonresponse, missing variables, or cell values or variables measured with error. Easy-to-use algorithms and software for multiple imputation and multiple overimputation for surveys, time series, and time series cross-sectional data. Applications to electoral, and other compositional, data.

## Methods

The methods developed in this paper greatly expand the size and types of data sets that can be imputed without difficulty, for cross-sectional, time series, and time series cross-sectional data. . 2001. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political Science Review, 95: 49–69, March.Abstract

Develops multiple imputation methods for when entire survey questions are missing from some of a series of cross-sectional samples. . 1999. “Not Asked and Not Answered: Multiple Imputation for Multiple Surveys.” Journal of the American Statistical Association, 93: 846–857, September.Abstract

. 2015. “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” Sociological Methods and Research, 1-28. Publisher's VersionAbstract

We extend the algorithm in the previous paper to encompass classic missing data as an extreme version of measurement error, and to correct for both. . 2015. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods and Research, 1-39. Publisher's VersionAbstract

Multiple imputation for missing data had long been recognized as theoretically appropriate, but algorithms to use it were difficult, and applications were rare. This article introduced an easy-to-apply algorithm, putting multiple imputation within reach of practicing social scientists. It and the related software have been widely used. . 2010. “What to do About Missing Values in Time Series Cross-Section Data.” American Journal of Political Science, 3, 54: 561-581, 2010. Publisher's VersionAbstract
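The step that turns several imputed data sets into one answer can be sketched as follows. This is a minimal illustration of the standard combining rules for multiply imputed estimates (often called Rubin's rules), under the assumption that the same model has already been fit to each of the m completed data sets. It is not the imputation algorithm described in these articles, and the function name is hypothetical rather than part of any package listed on this page.

```python
from statistics import mean, variance


def combine_imputations(estimates, std_errors):
    """Combine one coefficient across m multiply imputed data sets.

    estimates:  point estimates, one per completed data set
    std_errors: matching standard errors, one per completed data set
    Returns (combined_estimate, combined_std_error).

    Hypothetical helper implementing the usual combining rules:
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching std errors")

    # Point estimate: the simple average over imputations.
    q_bar = mean(estimates)

    # Within-imputation variance: average squared standard error.
    within = mean([se ** 2 for se in std_errors])

    # Between-imputation variance: sample variance of the estimates.
    between = variance(estimates)

    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total ** 0.5


if __name__ == "__main__":
    est, se = combine_imputations([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
    print(est, se)
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analyzing a single imputed data set would omit it and understate the standard error.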

A general purpose method for analyzing multiparty electoral data. . 1999. “A Statistical Model for Multiparty Electoral Data.” American Political Science Review, 93: 15–32, March.Abstract

Uses the insights from the above two articles to greatly increase the number of parties that can be analyzed. . 2002. “A Fast, Easy, and Efficient Estimator for Multiparty Electoral Data.” Political Analysis, 10: 84–100, Winter.Abstract

## Software

. 2006. “Zelig: Everyone's Statistical Software”. Publisher's Version, which easily combines multiply imputed data in R.

. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software, 7, 45: 1-47.Abstract

. 2003. “CLARIFY: Software for Interpreting and Presenting Statistical Results.” Journal of Statistical Software 8.Abstract, which easily combines multiply imputed data in Stata.

## How Surveys Work

. 1995. “Pre-Election Survey Methodology: Details From Nine Polling Organizations, 1988 and 1992.” Public Opinion Quarterly, 59: 98–132, Spring.Abstract

## Qualitative Research

How the same unified theory of inference underlies quantitative and qualitative research alike; scientific inference when quantification is difficult or impossible; research design; empirical research in legal scholarship.

## Scientific Inference in Qualitative Research

. 1995. “The Importance of Research Design in Political Science.” American Political Science Review, 89: 454–481, June.Abstract

. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press. Publisher's Version

## In Legal Research

. 2003. “Building An Infrastructure for Empirical Research in the Law.” Journal of Legal Education, 53: 311–320.Abstract

## Rare Events

How to save 99% of your data collection costs; bias corrections for logistic regression in estimating probabilities and causal effects in rare events data; estimating base probabilities or any quantity from case-control data; automated coding of events.

## Case Control and Rare Events Bias Corrections

## Bias Correction

Develops corrections for the biases in logistic regression that occur when predicting or explaining rare outcomes (such as when you have many more zeros than ones). Corrections developed for standard prospective studies, as well as case-control designs. How to use "case-control designs" to save 99% of your data collection costs. These articles overlap:

For general mathematical proofs and other technical material: . 2001. “Logistic Regression in Rare Events Data.” Political Analysis, 9: 137–163, Spring.Abstract

An applied companion paper to the previous article that has more examples and pedagogical material but none of the mathematical proofs. . 2001. “Explaining Rare Events in International Relations.” International Organization, 55: 693–715, Summer.Abstract
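One correction used with case-control designs, where ones are deliberately oversampled, is prior correction of the logit intercept. The sketch below is a hedged illustration that assumes the population fraction of ones (called `tau` here) is known from outside the sample; the slope coefficients are left unchanged by this particular correction. The function name and arguments are hypothetical and do not reflect the ReLogit interface, and the further finite-sample corrections discussed in these articles are not shown.

```python
import math


def prior_corrected_intercept(b0, tau, ybar):
    """Adjust a logit intercept estimated on a case-control sample.

    b0:   intercept estimated from the (oversampled) data
    tau:  assumed-known fraction of ones in the population
    ybar: fraction of ones in the sample actually analyzed

    Hypothetical helper: subtracts
    ln[((1 - tau) / tau) * (ybar / (1 - ybar))] from b0.
    """
    if not 0.0 < tau < 1.0 or not 0.0 < ybar < 1.0:
        raise ValueError("tau and ybar must be strictly between 0 and 1")
    offset = math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))
    return b0 - offset


if __name__ == "__main__":
    # Events are 1% of the population but 50% of the sample,
    # so the uncorrected intercept is far too high.
    print(prior_corrected_intercept(0.0, tau=0.01, ybar=0.5))
```

When the sample share of ones equals the population share, the offset is zero and the intercept is returned unchanged, which is a quick sanity check on any implementation.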

Example of an analysis of case-control data. Also an independent evaluation of the U.S. State Failure Task Force, including improved methods of forecasting state failure and assessing its causes. . 2001. “Improving Forecasts of State Failure.” World Politics, 53: 623–658, July.Abstract

## Estimating Base Probabilities

A method to estimate base probabilities or any quantity of interest from case-control data, even with no (or partial) auxiliary information. Discusses problems with odds-ratios.

The original article: . 2002. “Estimating Risk and Rate Levels, Ratios, and Differences in Case-Control Studies.” Statistics in Medicine, 21: 1409–1427.Abstract

A revised and extended version of the previous article. . 2004. “Inference in Case-Control Studies.” In Encyclopedia of Biopharmaceutical Statistics, 2nd ed. New York: Marcel Dekker.Abstract

The first extensive empirical study of the probability of your vote changing the outcome of a U.S. presidential election. Most previous studies of the probability of a tied vote have involved theoretical calculation without data. . 1998. “Estimating the Probability of Events that Have Never Occurred: When Is Your Vote Decisive?.” Journal of the American Statistical Association, 93: 1–9, March.Abstract

## Automatic Coding of Rare Events

## Software

. 2003. “ReLogit: Rare Events Logistic Regression.” Journal of Statistical Software, 8. Publisher's Version

## Data

. 2003. “10 Million International Dyadic Events”. Publisher's Version. Coding conflict and cooperation in international relations, 1990-2004, as evaluated by King and Lowe (2003).

## Survey Research

"Anchoring Vignette" methods for when different respondents (perhaps from different cultures, countries, or ethnic groups) understand survey questions in different ways; an approach to developing theoretical definitions of complicated concepts apparently definable only by example (i.e., "you know it when you see it"); how surveys work.

## Anchoring Vignettes

Methods for when different respondents (perhaps from different cultures, countries, or ethnic groups), or respondents and investigators, understand survey questions in different ways. Also includes an approach to developing theoretical definitions of complicated concepts apparently definable only by example (i.e., "you know it when you see it").

Develops methods for selecting vignettes and new, simpler, nonparametric methods requiring fewer assumptions for analyzing anchoring vignettes data. . 2007. “Comparing Incomparable Survey Responses: New Tools for Anchoring Vignettes.” Political Analysis, 15: 46-66, Winter.Abstract

. 2010. “Improving Anchoring Vignettes: Designing Surveys to Correct Interpersonal Incomparability.” Public Opinion Quarterly, 1-22.Abstract

The original article that lays out the idea, develops the basic models, and gives examples. . 2004. “Enhancing the Validity and Cross-cultural Comparability of Measurement in Survey Research.” American Political Science Review, 98: 191–207, February.Abstract

Many more details, examples, videos, software, etc. can be found at the Anchoring Vignettes Website.

## Software

. 2011. “Anchors: Software for Anchoring Vignettes Data.” Journal of Statistical Software, 3, 42: 1–25. Publisher's VersionAbstract

## How Surveys Work

Resolution of a paradox in the study of American voting behavior. . 1993. “Why are American Presidential Election Campaign Polls so Variable when Votes are so Predictable?.” British Journal of Political Science, 23: 409–451, October.Abstract

## Related Research

**Imputing Missing Data** due to survey nonresponse: Website

**Analyzing Rare Events**, including rare survey outcomes and alternative methods of sampling for rare events: Website

**Estimating Mortality by Survey** using surveys of siblings or other groups, as well as methods designed for estimating cause-specific mortality that applies more generally for extrapolating from one population to another: Website

## Unifying Statistical Analysis

Development of a unified approach to statistical modeling, inference, interpretation, presentation, analysis, and software; integrated with most of the other projects listed here.

## Unifying Approaches to Statistical Analysis

A generalization of Clarify, and much other software, implemented in R. The extensive manual encompasses most of the above works and can be read independently as an introduction to a wide range of models. Under active development. . 2006. “Zelig: Everyone's Statistical Software”. Publisher's Version

Generalizes the unification in the book (replacing its Section 5.2 with simulation to compute quantities of interest). This paper, which was originally titled "Enough with the Logit Coefficients, Already!", explains how to compute any quantity of interest from almost any statistical model; and shows, with replications of several published works, how to extract considerably more information than standard practices, without changing any data or statistical assumptions. . 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science, 44: 341–355, April. Publisher's VersionAbstract

. 2015. “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It.” Political Analysis, 2, 23: 159–179. Publisher's VersionAbstract

A paper that describes the advances underlying Zelig software: . 2008. “Toward A Common Framework for Statistical Analysis and Development.” Journal of Computational and Graphical Statistics, 17: 1–22.Abstract

Sets out the general framework. . 1998. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Ann Arbor: University of Michigan Press. Publisher's Version

Software that accompanies the above article and implements its key ideas in easy-to-use Stata macros. . 2003. “CLARIFY: Software for Interpreting and Presenting Statistical Results.” Journal of Statistical Software 8.Abstract

## Related Materials

. 2009. “The Changing Evidence Base of Social Science Research.” In The Future of Political Science: 100 Perspectives. New York: Routledge Press.Abstract

. 2003. “Numerical Issues Involved in Inverting Hessian Matrices.” In Numerical Issues in Statistical Computing for the Social Scientist, 143-176. Hoboken, NJ: John Wiley and Sons, Inc.

. 1991. “Calculating Standard Errors of Predicted Values based on Nonlinear Functional Forms.” The Political Methodologist, 4, Fall.

. 1986. “How Not to Lie With Statistics: Avoiding Common Mistakes in Quantitative Political Science.” American Journal of Political Science, 30: 666–687, August.Abstract

. 2004. “What to do When Your Hessian is Not Invertible: Alternatives to Model Respecification in Nonlinear Estimation.” Sociological Methods and Research, 32: 54-87, August.Abstract