We are grateful to DeFord et al. for the continued attention to our work and the crucial issues of fair representation in democratic electoral systems. Our response (Katz, King, and Rosenblatt, forthcoming) was designed to help readers avoid being misled by mistaken claims in DeFord et al. (forthcoming-a), and does not address other literature or uses of our prior work. As it happens, none of our corrections were addressed (or contradicted) in the most recent submission (DeFord et al., forthcoming-b).
We also offer a recommendation regarding DeFord et al.’s (forthcoming-b) concern with how expert witnesses, consultants, and commentators should present academic scholarship to academic novices, such as judges, public officials, the media, and the general public. In these public service roles, scholars attempt to translate academic understanding of sophisticated scholarly literatures, technical methodologies, and complex theories for those without sufficient background in social science or statistics.
Survey researchers have long sought to protect the privacy of their respondents via de-identification (removing names and other directly identifying information) before sharing data. Although these procedures can help, recent research demonstrates that they fail to protect respondents from intentional re-identification attacks, a problem that threatens to undermine vast survey enterprises in academia, government, and industry. This is especially a problem in political science because political beliefs are not merely the subject of our scholarship; they represent some of the most important information respondents want to keep private. We confirm the problem in practice by re-identifying individuals from a survey about a controversial referendum declaring life beginning at conception. We build on the concept of "differential privacy" to offer new data sharing procedures with mathematical guarantees for protecting respondent privacy and statistical validity guarantees for social scientists analyzing differentially private data. The cost of these new procedures is larger standard errors, which can be overcome with somewhat larger sample sizes.
Katz, King, and Rosenblatt (2020) introduces a theoretical framework for understanding redistricting and electoral systems, built on basic statistical and social science principles of inference. DeFord et al. (Forthcoming, 2021) instead focuses solely on descriptive measures, which lead to the problems identified in our arti- cle. In this paper, we illustrate the essential role of these basic principles and then offer statistical, mathematical, and substantive corrections required to apply DeFord et al.’s calculations to social science questions of interest, while also showing how to easily resolve all claimed paradoxes and problems. We are grateful to the authors for their interest in our work and for this opportunity to clarify these principles and our theoretical framework.
Some scholars build models to classify documents into chosen categories. Others, especially social scientists who tend to focus on population characteristics, instead usually estimate the proportion of documents in each category -- using either parametric "classify-and-count" methods or "direct" nonparametric estimation of proportions without individual classification. Unfortunately, classify-and-count methods can be highly model dependent or generate more bias in the proportions even as the percent of documents correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning of language changes between training and test sets or is too similar across categories. We develop an improved direct estimation approach without these issues by including and optimizing continuous text features, along with a form of matching adapted from the causal inference literature. Our approach substantially improves performance in a diverse collection of 73 data sets. We also offer easy-to-use software software that implements all ideas discussed herein.
We offer methods to analyze the "differentially private" Facebook URLs Dataset which, at over 40 trillion cell values, is one of the largest social science research datasets ever constructed. The version of differential privacy used in the URLs dataset has specially calibrated random noise added, which provides mathematical guarantees for the privacy of individual research subjects while still making it possible to learn about aggregate patterns of interest to social scientists. Unfortunately, random noise creates measurement error which induces statistical bias -- including attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We adapt methods developed to correct for naturally occurring measurement error, with special attention to computational efficiency for large datasets. The result is statistically valid linear regression estimates and descriptive statistics that can be interpreted as ordinary analyses of non-confidential data but with appropriately larger standard errors.
We have implemented these methods in open source software for R called PrivacyUnbiased. Facebook has ported PrivacyUnbiased to open source Python code called svinfer. We have extended these results in Evans and King (2021).