To deter gerrymandering, many state constitutions require legislative districts to be "compact." Yet, the law offers few precise definitions other than "you know it when you see it," which effectively implies a common understanding of the concept. In contrast, academics have shown that compactness has multiple dimensions and have generated many conflicting measures. We hypothesize that both are correct -- that compactness is complex and multidimensional, but a common understanding exists across people. We develop a survey to elicit this understanding, with high reliability (in data where the standard paired comparisons approach fails). We create a statistical model that predicts, with high accuracy, solely from the geometric features of the district, compactness evaluations by judges and public officials responsible for redistricting, among others. We also offer compactness data from our validated measure for 20,160 state legislative and congressional districts, as well as open source software to compute this measure from any district.
Winner of the 2018 Robert H Durr Award from the MPSA.
Some scholars build models to classify documents into chosen categories. Others, especially social scientists who tend to focus on population characteristics, instead usually estimate the proportion of documents in each category -- using either parametric "classify-and-count" methods or "direct" nonparametric estimation of proportions without individual classification. Unfortunately, classify-and-count methods can be highly model dependent or generate more bias in the proportions even as the percent of documents correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning of language changes between training and test sets or is too similar across categories. We develop an improved direct estimation approach without these issues by including and optimizing continuous text features, along with a form of matching adapted from the causal inference literature. Our approach substantially improves performance in a diverse collection of 73 data sets. We also offer easy-to-use software software that implements all ideas discussed herein.
We offer methods to analyze the "differentially private" Facebook URLs Dataset which, at over 10 trillion cell values, is one of the largest social science research datasets ever constructed. The version of differential privacy used in the URLs dataset has specially calibrated random noise added, which provides mathematical guarantees for the privacy of individual research subjects while still making it possible to learn about aggregate patterns of interest to social scientists. Unfortunately, random noise creates measurement error which induces statistical bias -- including attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We adapt methods developed to correct for naturally occurring measurement error, with special attention to computational efficiency for large datasets. The result is statistically consistent and approximately unbiased linear regression estimates and descriptive statistics that can be interpreted as ordinary analyses of non-confidential data but with appropriately larger standard errors.
We have implemented these methods in open source software for R called PrivacyUnbiased. Facebook has ported PrivacyUnbiased to open source Python code called svinfer. We have extended these results in Evans and King (2021).