Daniel Hopkins and Gary King. "Extracting Systematic Social Science Meaning from Text," copy at http://gking.harvard.edu/files/abs/words-abs.shtml.

Abstract

We develop a method of automated content analysis that gives approximately unbiased estimates of quantities of theoretical interest to social scientists. With a small sample of documents hand coded into investigator-chosen categories, our method can give accurate estimates of the proportion of documents in each category in a larger population. Existing methods allow for the possibility of substantial bias in estimating the category proportions that are often of interest to social scientists. We first show how to correct the bias for any existing classifier, and then go further to estimate the proportions without the intermediate step of individual document classification and with greatly reduced assumptions. We also introduce a statistical correction for the less-than-perfect levels of inter-coder reliability that typically characterize human document classification. These methods allow us to measure the classical conception of public opinion as those views that are actively and publicly expressed, rather than the attitudes or non-attitudes of the populace as a whole. Specifically, we track the daily opinions of thousands of people about President Bush using a massive data set of online blogs we develop and make available with this article. We also offer easy-to-use software that implements our methods, and we demonstrate its effectiveness with several other text categorization problems.
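
To illustrate the idea of correcting the bias of an existing classifier when only aggregate category proportions are of interest, here is a minimal sketch. It is not the authors' software and makes several assumptions: that the misclassification rates P(predicted | true) estimated on the hand-coded sample also hold in the larger population, that categories are coded as integers 0..K-1, and that a least-squares solve followed by clipping is an acceptable way to invert the relationship P(predicted) = P(predicted | true) P(true). The function name `corrected_proportions` and the toy data are hypothetical.

```python
import numpy as np


def corrected_proportions(true_labeled, pred_labeled, pred_population, n_categories):
    """Estimate population category proportions from classifier output.

    Solves P(pred) = P(pred | true) @ P(true) for P(true), where
    P(pred | true) is estimated on the hand-coded sample.
    """
    true_labeled = np.asarray(true_labeled)
    pred_labeled = np.asarray(pred_labeled)
    pred_population = np.asarray(pred_population)

    # confusion[i, j] = P(pred = i | true = j), estimated from the labeled sample.
    confusion = np.zeros((n_categories, n_categories))
    for j in range(n_categories):
        mask = true_labeled == j
        if not mask.any():
            raise ValueError(f"no hand-coded documents in category {j}")
        counts = np.bincount(pred_labeled[mask], minlength=n_categories)
        confusion[:, j] = counts / mask.sum()

    # Observed share of each predicted category in the unlabeled population.
    observed = np.bincount(pred_population, minlength=n_categories) / len(pred_population)

    # Least-squares solve, then clip negatives and renormalize to sum to 1.
    estimate, *_ = np.linalg.lstsq(confusion, observed, rcond=None)
    estimate = np.clip(estimate, 0.0, None)
    return estimate / estimate.sum()


# Toy hand-coded sample (assumed data): the classifier labels 80% of
# category-0 documents correctly and 60% of category-1 documents correctly.
true_labeled = [0] * 10 + [1] * 10
pred_labeled = [0] * 8 + [1] * 2 + [0] * 4 + [1] * 6

# Toy population: 52 documents predicted as category 0, 48 as category 1.
pred_population = [0] * 52 + [1] * 48

naive = np.bincount(pred_population, minlength=2) / len(pred_population)
estimate = corrected_proportions(true_labeled, pred_labeled, pred_population, 2)
print("naive:", naive)
print("corrected:", estimate)
```

In this toy setup the raw classifier shares differ noticeably from the corrected estimate, which is the kind of aggregate bias the abstract refers to. The abstract's second step, estimating proportions without classifying individual documents at all, is not covered by this sketch.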
