Computer-Assisted Keyword and Document Set Discovery from Unstructured Text
Gary King, Patrick Lam, Margaret Roberts. 2017.
"Computer-Assisted Keyword and Document Set Discovery from Unstructured Text".
American Journal of Political Science, 61, 4, Pp. 971–988.

Abstract
The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depends in practice on the inadequate keyword counting or matching methods they are designed to replace. Improved methods of keyword selection would also be valuable in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings and Chinese social media posts designed to evade censorship, among others.

See Also
- [Dataset] Replication Data for: Computer-Assisted Keyword and Document Set Discovery from Unstructured Text
- [Paper] A Method of Automated Nonparametric Content Analysis for Social Science (2010)
- [Paper] An Automated Information Extraction Tool For International Conflict Data With Performance As Good As Human Coders: A Rare Events Evaluation Design (2003)
- [Paper] An Improved Method of Automated Nonparametric Content Analysis for Social Science (2022)
- [Paper] General Purpose Computer-Assisted Clustering and Conceptualization (2011)
- [Paper] How Censorship in China Allows Government Criticism But Silences Collective Expression (2013)
- [Patent] Method and Apparatus for Selecting Clusterings to Classify A Predetermined Data Set (2013)
- [Patent] Participant Grouping for Enhanced Interactive Experience (2014)