Daniel Hopkins and Gary King. "A Method of Automated
Nonparametric Content Analysis for Social Science," forthcoming in
American Journal of Political Science, copy at
http://gking.harvard.edu/files/abs/words-abs.shtml.
Abstract
The massive increase in text available in digital formats presents
enormous opportunities for social scientists. Yet systematically
hand coding a significant share of the available blogs, speeches,
emails, web pages, government records, newspapers, or other
digitized texts is infeasible. Although computer scientists have
developed effective methods for automated content analysis, those
methods aim to classify individual documents correctly, whereas
social scientists are usually interested in generalizations about
the population of documents, such as the proportion in a
given category. Unfortunately, even classifiers that categorize
individual documents with high accuracy can be hugely biased when
estimating category proportions. By directly optimizing for the
broader goal of many social scientists, we develop a method that gives
approximately unbiased estimates of the category proportions. We
illustrate the method with several diverse data sources, including
the daily expressed opinions of hundreds of thousands of people
about the U.S. presidency. We also make available easy-to-use
software that implements our methods and large corpora of text for
further analysis.
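
The claim that an accurate classifier can still give biased category proportions can be checked with simple arithmetic. The sketch below is an illustration only and is not the method the paper develops. The function names and the numbers (a 5% true share, 90% sensitivity and specificity) are assumptions chosen for the example. It computes the share a "classify and count" approach would be expected to report, then applies a textbook misclassification correction that assumes the error rates are known exactly.

```python
# Illustrative sketch with assumed numbers; not the authors' estimator.

def classify_and_count(true_share, sensitivity, specificity):
    """Expected share of documents the classifier labels as in-category."""
    true_positives = true_share * sensitivity
    false_positives = (1.0 - true_share) * (1.0 - specificity)
    return true_positives + false_positives


def accuracy(true_share, sensitivity, specificity):
    """Expected share of individual documents classified correctly."""
    return true_share * sensitivity + (1.0 - true_share) * specificity


def corrected_share(observed_share, sensitivity, specificity):
    """Invert the misclassification rates (assumes they are known exactly)."""
    false_positive_rate = 1.0 - specificity
    return (observed_share - false_positive_rate) / (sensitivity - false_positive_rate)


# Hypothetical scenario: a rare category and a classifier that is right
# 90% of the time on both in-category and out-of-category documents.
true_share = 0.05
sensitivity = 0.90
specificity = 0.90

observed = classify_and_count(true_share, sensitivity, specificity)
acc = accuracy(true_share, sensitivity, specificity)
corrected = corrected_share(observed, sensitivity, specificity)

print(f"per-document accuracy:       {acc:.2f}")
print(f"true category share:         {true_share:.2f}")
print(f"classify-and-count estimate: {observed:.2f}")
print(f"corrected estimate:          {corrected:.2f}")
```

With these assumed numbers the classifier is 90% accurate on individual documents, yet counting its labels reports a 14% share for a category whose true share is 5%, because false positives from the large out-of-category group swamp the true positives.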
Also see related research on content analysis.