Universities require faculty and students planning research involving human subjects to pass formal certification tests and then submit research plans for prior approval. Those who diligently take the tests may better understand certain important legal requirements but, at the same time, are often misled into thinking they may apply these rules to their own work when, in fact, they are not permitted to do so. They will also miss many other legal requirements that are not mentioned in their training but that nevertheless govern their behavior. Finally, the training leaves them likely to misunderstand entirely the essentially political situation in which they find themselves. The resulting risks to their universities, collaborators, and careers may be catastrophic, in addition to the more ordinary frustrations researchers experience with the system. To avoid these problems, faculty and students conducting research about and for the public need to understand that they are public figures, to whom different rules apply, ones that political scientists have long studied. University administrators (and faculty in their part-time roles as administrators) need to reorient their perspectives as well. University research compliance bureaucracies have grown in well-meaning but sometimes unproductive ways that are not required by federal laws or guidelines. We offer advice to faculty and students on how to deal with the system as it exists now, and suggest changes to university research compliance bureaucracies that should benefit faculty, students, staff, university budgets, and our research subjects.
Computer scientists and statisticians often try to classify individual textual documents into chosen categories. In contrast, social scientists more commonly focus on populations and thus estimate the proportion of documents falling in each category. The two existing types of techniques for estimating these category proportions are parametric "classify-and-count" methods and "direct" nonparametric estimation of category proportions without an individual classification step. Unfortunately, classify-and-count methods can be highly model dependent and can generate more bias in the proportions even as the percent correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning and usage of language is too similar across categories or too different between training and test sets. We develop an improved direct estimation approach without these issues by introducing continuously valued text features optimized for this problem, along with a form of matching adapted from the causal inference literature. We evaluate our approach in analyses of a diverse collection of 73 data sets, showing that it substantially improves performance compared to existing approaches. As a companion to this paper, we offer easy-to-use software that implements all ideas discussed herein.
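To see why raw classify-and-count is biased when category proportions shift between training and test sets, consider the following sketch. This is not the paper's estimator (which uses continuous text features and matching); it is the simpler textbook adjusted-count correction, shown only to illustrate the bias that motivates direct estimation. The rates `tpr` and `fpr` are assumed known from labeled training data.

```python
# Sketch: bias of raw "classify and count" under class-balance shift,
# and the simple adjusted-count correction. Hypothetical numbers; this
# is NOT the paper's direct-estimation method.

def classify_and_count(preds):
    """Raw proportion of documents the classifier labels positive."""
    return sum(preds) / len(preds)

def adjusted_count(p_observed, tpr, fpr):
    """Correct the raw proportion using the classifier's true- and
    false-positive rates estimated on labeled data."""
    return (p_observed - fpr) / (tpr - fpr)

# A classifier with tpr=0.8, fpr=0.1 applied to a test set that is
# actually 50% positive yields an expected observed positive rate of
# 0.5*0.8 + 0.5*0.1 = 0.45 -- biased, even though accuracy is decent.
tpr, fpr, true_prev = 0.8, 0.1, 0.5
p_obs = true_prev * tpr + (1 - true_prev) * fpr
print(p_obs)                            # biased raw estimate
print(adjusted_count(p_obs, tpr, fpr))  # recovers the true prevalence
```

Note that improving individual-level accuracy does not fix this: the bias depends on how the error rates interact with the unknown prevalence, which is why the paper estimates proportions directly.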
The mission of the academic social sciences is to understand and ameliorate society's greatest challenges. The data held by private companies hold vast potential to further this mission. Yet, because these data touch on highly politicized issues, customer privacy, proprietary content, and the differing goals of business and academia, they are often inaccessible to university researchers. We propose here a model for industry-academic partnerships that addresses these problems via a novel organizational structure: Respected scholars form a commission which, as a trusted third party, receives access to all relevant company information and systems, and then invites independent academics to do research in specific areas, following standard peer review protocols, funded by nonprofit foundations, and with no required pre-publication approval by the company. We also report on a partnership we helped forge under this model to make data available about the incendiary issues surrounding the impact of social media on elections and democracy. In our first partnership, Facebook provides (privacy-preserving) data access; eight major ideologically and substantively diverse nonprofit foundations fund the research; an organization of academics we created, Social Science One, leads the project; and logistical help is provided by the Institute for Quantitative Social Science at Harvard and a respected nonprofit.
First version released April 9, 2018, with subsequent updates.
We provide an overview of PSI ("a Private data Sharing Interface"), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.
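The core primitive underlying a system like PSI is the differentially private release of statistics via calibrated noise. The sketch below is not PSI's actual interface; it illustrates only the standard Laplace mechanism for a count query, whose sensitivity is 1, so noise with scale 1/&epsilon; yields &epsilon;-differential privacy.

```python
# Sketch: the Laplace mechanism for a differentially private count.
# Illustrative only -- this is the textbook primitive, not PSI's API.
import math
import random

def laplace_noise(scale):
    """Draw from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_count(data, predicate, epsilon):
    """Epsilon-DP count of records satisfying `predicate`.
    A count has sensitivity 1 (adding or removing one person changes
    it by at most 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

# With a generous privacy budget the answer is close to the truth;
# smaller epsilon means more noise and stronger privacy.
print(dp_count(range(100), lambda x: x < 30, epsilon=1.0))
```

The privacy-accuracy trade-off that systems like PSI help researchers manage is visible directly in the `1.0 / epsilon` noise scale.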
We clarify the theoretical foundations of partisan fairness standards for district-based democratic electoral systems, including essential assumptions and definitions that have not been formalized or, in some cases, even discussed. We pare assumptions down to their minimal essential components and add extensive empirical evidence for those with observable implications. Throughout, we follow a fundamental principle of statistics too often ignored -- defining the quantity of interest separately from its measures, so that measures can be proven wrong, evaluated, and improved. This enables us to prove which approaches -- claimed in the literature to be estimators of partisan symmetry, the most widely accepted standard -- are statistically appropriate and which are biased, limited, or not measures of symmetry at all. Because real-world redistricting involves complicated politics with numerous participants and conflicting goals, measures that are biased as estimates of partisan fairness sometimes still provide useful descriptions of other aspects of electoral systems.
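Partisan symmetry requires that if the parties swapped statewide vote shares, they would swap seat shares; in particular, a party winning half the votes should win half the seats. A minimal sketch of that benchmark, under the standard uniform partisan swing assumption (one of the assumptions the paper scrutinizes), with hypothetical district vote shares:

```python
# Sketch: partisan bias at an even statewide split, under uniform
# partisan swing. Illustrative only; the paper examines which published
# estimators of symmetry are statistically appropriate.

def seat_share(district_votes, statewide):
    """Party A's seat share if its statewide vote share were `statewide`,
    shifting every district by the same amount (uniform partisan swing)."""
    mean_v = sum(district_votes) / len(district_votes)
    swing = statewide - mean_v
    wins = sum(1 for v in district_votes if v + swing > 0.5)
    return wins / len(district_votes)

def partisan_bias(district_votes):
    """Symmetry implies each party wins half the seats with half the
    votes; this is the deviation from that benchmark (positive favors A)."""
    return seat_share(district_votes, 0.5) - 0.5

# Hypothetical plan: party B's voters packed into two lopsided districts.
plan = [0.7, 0.7, 0.7, 0.4, 0.4]
print(partisan_bias(plan))  # positive: the plan favors party A
```

A plan can therefore be asymmetric even when the observed election looks competitive; the counterfactual at an even split, not the observed seat count, is what the symmetry standard evaluates.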
Inference is the process of using facts we know to learn about facts we do not know. A theory of inference gives assumptions necessary to get from the former to the latter, along with a definition for and summary of the resulting uncertainty. Any one theory of inference is neither right nor wrong, but merely an axiom that may or may not be useful. Each of the many diverse theories of inference can be valuable for certain applications. However, no existing theory of inference addresses the tendency to choose, from the range of plausible data analysis specifications consistent with prior evidence, those that inadvertently favor one's own hypotheses. Since the biases from these choices are a growing concern across scientific fields, and in a sense the reason the scientific community was invented in the first place, we introduce a new theory of inference designed to address this critical problem. We derive "hacking intervals," which are the range of a summary statistic one may obtain given a class of possible endogenous manipulations of the data. Hacking intervals require no appeal to hypothetical data sets drawn from imaginary superpopulations. A scientific result with a small hacking interval is more robust to researcher manipulation than one with a larger interval, and is often easier to interpret than a classical confidence interval. Some versions of hacking intervals turn out to be equivalent to classical confidence intervals, which means they may also provide a more intuitive and potentially more useful interpretation of classical confidence intervals.
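The idea of a hacking interval can be sketched concretely: enumerate a class of defensible analysis choices and report the range of the resulting estimates. The specification class below (how many extreme points to trim, whether to drop the largest observation) is hypothetical and chosen only for illustration; the paper defines such classes formally.

```python
# Sketch: a hacking interval as the min and max of an estimate over a
# small, hypothetical class of defensible analysis specifications.
from itertools import product
from statistics import mean

def hacking_interval(data, trims=(0, 1, 2), drop_max=(False, True)):
    """Range of the mean over every combination of two analysis choices:
    trimming k extremes from each tail, and dropping the largest point."""
    estimates = []
    for k, drop in product(trims, drop_max):
        d = sorted(data)
        if drop:
            d = d[:-1]
        if k and len(d) > 2 * k:
            d = d[k:len(d) - k]
        estimates.append(mean(d))
    return min(estimates), max(estimates)

# One influential outlier makes the interval wide: the conclusion is
# highly sensitive to which "reasonable" specification the analyst picks.
print(hacking_interval([1, 2, 3, 4, 100]))
```

A narrow interval means no specification in the class could have moved the result much, which is exactly the robustness-to-manipulation guarantee the abstract describes; note that no hypothetical superpopulation is invoked anywhere in the computation.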