An Improved Method of Automated Nonparametric Content Analysis for Social Science

Presentation Date: Thursday, December 1, 2016

New York University, Text as Data Speaker Series

A vast literature in computer science and statistics develops methods to automatically classify textual documents into chosen categories. In contrast, social scientists are often more interested in aggregate generalizations about populations of documents, such as the percentage of social media posts that speak favorably of a candidate's foreign policy. Unfortunately, trying to maximize the percentage of individual documents correctly classified often yields biased estimates of statistical aggregates. Fortunately, classification is neither a necessary nor even a desirable step in estimating aggregate percentages, as shown by the widely used approach developed in King and Lu (2008) and Hopkins and King (2010). In this paper, we build a new approach on this methodology and show how to substantially improve its estimates of category percentages. We evaluate our approach with analyses of 72 separate data sets. This talk is based on joint work with Connor Jerzak and Anton Strezhnev.
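To make the contrast between classifying documents and estimating proportions concrete, here is a minimal sketch of the general idea behind the earlier Hopkins and King (2010) approach, not the improved method presented in this talk. It assumes only two categories and made-up word-profile distributions: the share of each profile in the target population is treated as a mixture of the within-category profile distributions, P(S) = pi * P(S | A) + (1 - pi) * P(S | B), and pi is recovered by least squares. The function names and all numbers are hypothetical and chosen for illustration only.

```python
def estimate_share(p_s_given_a, p_s_given_b, p_s_target):
    """Solve P(S) = pi*P(S|A) + (1-pi)*P(S|B) for pi by least squares,
    then clip the result to the valid range [0, 1]."""
    diff = [a - b for a, b in zip(p_s_given_a, p_s_given_b)]
    resid = [t - b for t, b in zip(p_s_target, p_s_given_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("profiles do not distinguish the two categories")
    pi = sum(r * d for r, d in zip(resid, diff)) / denom
    return min(1.0, max(0.0, pi))


def classify_and_count(p_s_given_a, p_s_given_b, p_s_target, prior_a=0.5):
    """Assign each profile to its most probable category under the
    labeled-set prior, then report the target share assigned to A."""
    return sum(
        t
        for a, b, t in zip(p_s_given_a, p_s_given_b, p_s_target)
        if prior_a * a > (1 - prior_a) * b
    )


# Hypothetical distributions over four word profiles within each category,
# as they might be estimated from a hand-coded (labeled) set.
P_S_GIVEN_FAVORABLE = [0.1, 0.2, 0.3, 0.4]
P_S_GIVEN_UNFAVORABLE = [0.4, 0.3, 0.2, 0.1]

# Assumed true share of favorable posts in the unlabeled target population.
TRUE_SHARE = 0.8
P_S_TARGET = [
    TRUE_SHARE * a + (1 - TRUE_SHARE) * b
    for a, b in zip(P_S_GIVEN_FAVORABLE, P_S_GIVEN_UNFAVORABLE)
]

direct = estimate_share(P_S_GIVEN_FAVORABLE, P_S_GIVEN_UNFAVORABLE, P_S_TARGET)
counted = classify_and_count(P_S_GIVEN_FAVORABLE, P_S_GIVEN_UNFAVORABLE, P_S_TARGET)
print(f"true share: {TRUE_SHARE:.2f}")
print(f"direct proportion estimate: {direct:.2f}")
print(f"classify-and-count estimate: {counted:.2f}")
```

In this toy setup the direct estimate recovers the assumed share of 0.80, while classify-and-count reports about 0.62, because the classifier's errors do not cancel when the target population's category mix differs from the labeled set's. Real applications involve many categories, sampled rather than exact profile distributions, and a constrained solve over the whole probability simplex.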