Daniel Hopkins and Gary King. "Extracting Systematic Social Science Meaning from Text," copy at
http://gking.harvard.edu/files/abs/words-abs.shtml.
Abstract
We develop a method of automated content analysis that gives
approximately unbiased estimates of quantities of theoretical
interest to social scientists. With a small sample of documents
hand coded into investigator-chosen categories, our method can give
accurate estimates of the proportion of documents in each category
in a larger population. Existing methods allow for the possibility
of substantial bias in estimating the category proportions that are
often of interest to social scientists. We first show how to
correct the bias for any existing classifier, and then go further to
estimate the proportions without the intermediate step of individual
document classification and with greatly reduced assumptions. We
also introduce a statistical correction for the less-than-perfect
levels of inter-coder reliability that typically characterize human
document classification. These methods allow us to measure the
classical conception of public opinion as those views that are
actively and publicly expressed, rather than the attitudes or
non-attitudes of the populace as a whole. Specifically, we track
the daily opinions of thousands of people about President Bush using
a massive data set of online blogs we develop and make available
with this article. We also offer easy-to-use software that
implements our methods, and we demonstrate its effectiveness with
several other text categorization problems.
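
The bias correction for an existing classifier mentioned above can be illustrated with a small sketch. This is not the authors' software and not necessarily their exact procedure; it is one standard misclassification correction, written under stated assumptions. The idea is that the classifier's observed category shares equal its error rates times the true shares, so the true shares can be recovered by solving a linear system. The sketch assumes the error rates measured on the hand-coded sample also hold in the larger population, and that every category appears at least once in the hand-coded sample. All function names and the toy data are hypothetical.

```python
from collections import Counter


def confusion_rates(true_labels, predicted_labels, categories):
    """Estimate P(predicted = i | true = j) from the hand-coded sample.

    Returns a matrix m where m[i][j] is the share of documents truly in
    category j that the classifier assigned to category i.
    Assumes every category occurs at least once in true_labels.
    """
    true_counts = Counter(true_labels)
    pair_counts = Counter(zip(predicted_labels, true_labels))
    return [
        [pair_counts[(pred, true)] / true_counts[true] for true in categories]
        for pred in categories
    ]


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is singular; correction undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def corrected_proportions(labeled_true, labeled_pred, population_pred, categories):
    """Correct raw classifier proportions using labeled-sample error rates."""
    rates = confusion_rates(labeled_true, labeled_pred, categories)
    pop_counts = Counter(population_pred)
    raw = [pop_counts[c] / len(population_pred) for c in categories]
    return dict(zip(categories, solve_linear_system(rates, raw)))


# Toy data (hypothetical): 20 hand-coded documents and 100 unlabeled ones.
categories = ["pos", "neg"]
labeled_true = ["pos"] * 10 + ["neg"] * 10
labeled_pred = ["pos"] * 8 + ["neg"] * 2 + ["pos"] * 3 + ["neg"] * 7
population_pred = ["pos"] * 40 + ["neg"] * 60

estimate = corrected_proportions(
    labeled_true, labeled_pred, population_pred, categories
)
print(estimate)
```

In this toy case the classifier labels 40% of the population "pos", but because it mislabels 30% of true "neg" documents as "pos" and misses 20% of true "pos" documents, the corrected "pos" share comes out near 20%. With noisy error rates a correction of this kind can return values outside the interval from 0 to 1, so a practical version would need to constrain the solution.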