Survey researchers have long sought to protect the privacy of their respondents via de-identification (removing names and other directly identifying information) before sharing data. Although these procedures can help, recent research demonstrates that they fail to protect respondents from intentional re-identification attacks, a problem that threatens to undermine vast survey enterprises in academia, government, and industry. This is especially a problem in political science because political beliefs are not merely the subject of our scholarship; they represent some of the most important information respondents want to keep private. We confirm the problem in practice by re-identifying individuals from a survey about a controversial referendum declaring life beginning at conception. We build on the concept of "differential privacy" to offer new data sharing procedures with mathematical guarantees for protecting respondent privacy and statistical validity guarantees for social scientists analyzing differentially private data. The cost of these new procedures is larger standard errors, which can be overcome with somewhat larger sample sizes.
Katz, King, and Rosenblatt (2020) introduces a theoretical framework for understanding redistricting and electoral systems, built on basic statistical and social science principles of inference. DeFord et al. (Forthcoming, 2021) instead focuses solely on descriptive measures, which lead to the problems identified in our arti- cle. In this paper, we illustrate the essential role of these basic principles and then offer statistical, mathematical, and substantive corrections required to apply DeFord et al.’s calculations to social science questions of interest, while also showing how to easily resolve all claimed paradoxes and problems. We are grateful to the authors for their interest in our work and for this opportunity to clarify these principles and our theoretical framework.
Some scholars build models to classify documents into chosen categories. Others, especially social scientists who tend to focus on population characteristics, instead usually estimate the proportion of documents in each category -- using either parametric "classify-and-count" methods or "direct" nonparametric estimation of proportions without individual classification. Unfortunately, classify-and-count methods can be highly model dependent or generate more bias in the proportions even as the percent of documents correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning of language changes between training and test sets or is too similar across categories. We develop an improved direct estimation approach without these issues by including and optimizing continuous text features, along with a form of matching adapted from the causal inference literature. Our approach substantially improves performance in a diverse collection of 73 data sets. We also offer easy-to-use software software that implements all ideas discussed herein.
We are grateful to DeFord et al. for the continued attention to our work and the crucial issues of fair representation in democratic electoral systems. Our response (Katz, King, and Rosenblatt, forthcoming) was designed to help readers avoid being misled by mistaken claims in DeFord et al. (forthcoming-a), and does not address other literature or uses of our prior work. As it happens, none of our corrections were addressed (or contradicted) in the most recent submission (DeFord et al., forthcoming-b).
We also offer a recommendation regarding DeFord et al.’s (forthcoming-b) concern with how expert witnesses, consultants, and commentators should present academic scholarship to academic novices, such as judges, public officials, the media, and the general public. In these public service roles, scholars attempt to translate academic understanding of sophisticated scholarly literatures, technical methodologies, and complex theories for those without sufficient background in social science or statistics.
Unprecedented quantities of data that could help social scientists understand and ameliorate the challenges of human society are presently locked away inside companies, governments, and other organizations, in part because of privacy concerns. We address this problem with a general-purpose data access and analysis system with mathematical guarantees of privacy for research subjects, and statistical validity guarantees for researchers seeking social science insights. We build on the standard of ``differential privacy,'' correct for biases induced by the privacy-preserving procedures, provide a proper accounting of uncertainty, and impose minimal constraints on the choice of statistical methods and quantities estimated. We also replicate two recent published articles and show how we can obtain approximately the same substantive results while simultaneously protecting the privacy. Our approach is simple to use and computationally efficient; we also offer open source software that implements all our methods.