We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimizes a different objective function. Almost all are well defined, but how to determine before the fact which one, if any, will partition a given set of objects in an "insightful" or "useful" way for a given user is unknown and difficult, if not logically impossible. We develop a metric space of partitions from all existing cluster analysis methods applied to a given data set (along with millions of other solutions we add based on combinations of existing clusterings), and enable a user to explore and interact with it and to quickly reveal or prompt useful or insightful conceptualizations. In addition, although uncommon in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal in specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than either expert human coders or many existing fully automated methods.
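To make the idea of a metric space of partitions concrete, the following minimal sketch computes a distance between two clusterings of the same objects. It assumes the variation of information as the distance, which is one standard metric on partitions; the abstract above does not say which metric the method uses, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def variation_of_information(a, b):
    """Variation of information between two partitions of the same objects.

    `a` and `b` are equal-length sequences of cluster labels; position i
    holds the cluster assigned to object i. The choice of this metric is
    an assumption made for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("both partitions must label the same objects")
    n = len(a)
    count_a = Counter(a)           # cluster sizes in partition a
    count_b = Counter(b)           # cluster sizes in partition b
    count_ab = Counter(zip(a, b))  # sizes of the pairwise intersections

    vi = 0.0
    for (label_a, label_b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[label_a] / n
        p_b = count_b[label_b] / n
        # Sum of the two conditional entropies H(a|b) + H(b|a).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


def pairwise_distances(partitions):
    """Distances between every pair of partitions, keyed by index pair."""
    return {
        (i, j): variation_of_information(partitions[i], partitions[j])
        for i, j in combinations(range(len(partitions)), 2)
    }
```

Given several clusterings of one data set, `pairwise_distances` yields the kind of distance table that could be laid out for a user to explore. The distance is zero when two clusterings differ only in how their clusters are labeled.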
Automated Text Analysis
Automated and computer-assisted methods of extracting, organizing, understanding, conceptualizing, and consuming knowledge from massive quantities of unstructured text.
Content Analysis
2011. “General Purpose Computer-Assisted Clustering and Conceptualization.” Proceedings of the National Academy of Sciences.
Methods to evaluate automated information extraction systems when coding rare events, the success of one such system, along with considerable data. 2003. “An Automated Information Extraction Tool For International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design.” International Organization, 57, Pp. 617-642.
2012. “System for Estimating a Distribution of Message Content Categories in Source Data.” United States of America 8,180,717 (May 15).
2017. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science, 61, 4, Pp. 971-988.
2014. “Participant Grouping for Enhanced Interactive Experience.” United States of America US 8,914,373 B2 (U.S. Patent and Trademark Office).
2022. “An Improved Method of Automated Nonparametric Content Analysis for Social Science.” Political Analysis, 31, Pp. 42-58.
2013. “Method and Apparatus for Selecting Clusterings to Classify A Predetermined Data Set.” United States of America 8,438,162 (May 7).
2014. “Reverse-engineering censorship in China: Randomized experimentation and participant observation.” Science, 345, 6199, Pp. 1-10.
A method that gives unbiased estimates of the proportion of text documents in investigator-chosen categories, given only a small subset of hand-coded documents. Also includes the first correction for the far less-than-perfect levels of inter-coder reliability that typically characterize hand coding. Applications to sentiment detection about politicians in blog posts. 2010. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science, 54, 1, Pp. 229–247.
2014. “You Lie! Patterns of Partisan Taunting in the U.S. Senate (Poster).” In Society for Political Methodology. Athens, GA.
A version of the previous article for a different audience: 2003. “Some Statistical Methods for Evaluating Information Extraction Systems.” Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Pp. 19-26.
11/1/2016. “Systems and methods for calculating category proportions.” United States of America 9,483,544 (U.S. Patent and Trademark Office).
2013. “How Censorship in China Allows Government Criticism but Silences Collective Expression.” American Political Science Review, 107, 2 (May), Pp. 1-18.
11/17/2015. “System for Estimating a Distribution of Message Content Categories in Source Data (2nd).” United States of America US 9,189,538 B2 (U.S. Patent and Trademark Office).