System for Estimating a Distribution of Message Content Categories in Source Data

Citation:

Daniel Hopkins, Gary King, and Ying Lu. 2012. “System for Estimating a Distribution of Message Content Categories in Source Data.” United States of America 8180717 (May 15). US Copy at http://j.mp/2oDf7B3

Abstract:

A method of computerized content analysis that gives “approximately unbiased and statistically consistent estimates” of a distribution of elements of structured, unstructured, and partially structured source data among a set of categories. In one embodiment, this is done by analyzing the distribution of a small set of individually classified elements across a plurality of categories and then using the information determined from that analysis to extrapolate a distribution in a larger population set. This extrapolation is performed without constraining the distribution of the unlabeled elements to be equal to the distribution of the labeled elements, and without constraining the content distribution of elements in the labeled set (e.g., the distribution of words used by elements in the labeled set) to be equal to the content distribution of elements in the unlabeled set. Not being constrained in these ways allows the estimation techniques described herein to provide distinct advantages over conventional aggregation techniques.
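
The idea in the abstract can be illustrated with a minimal sketch. This is not the patented procedure, only one simple way to estimate category proportions directly rather than by classifying and counting individual elements. It makes several assumptions that the abstract does not state: there are exactly two categories, each element is reduced to a word-presence profile over a small hypothetical vocabulary, and the distribution of profiles within a category is the same in the labeled and unlabeled sets. All function names, variable names, and sample texts are hypothetical.

```python
from collections import Counter


def profile_distribution(docs, vocab):
    """Return the fraction of docs showing each word-presence profile."""
    counts = Counter()
    for doc in docs:
        words = set(doc.split())
        counts[tuple(int(w in words) for w in vocab)] += 1
    return {profile: n / len(docs) for profile, n in counts.items()}


def estimate_share(labeled, unlabeled, vocab, cat_a, cat_b):
    """Least-squares estimate of the two category shares in `unlabeled`.

    Assumed model (not taken from the patent text):
        P(profile) = share_a * P(profile | cat_a) + (1 - share_a) * P(profile | cat_b)
    with the conditional distributions estimated from the labeled set.
    """
    p_a = profile_distribution([t for t, c in labeled if c == cat_a], vocab)
    p_b = profile_distribution([t for t, c in labeled if c == cat_b], vocab)
    p_u = profile_distribution(unlabeled, vocab)

    num = den = 0.0
    for s in set(p_a) | set(p_b) | set(p_u):
        a, b, p = p_a.get(s, 0.0), p_b.get(s, 0.0), p_u.get(s, 0.0)
        num += (p - b) * (a - b)
        den += (a - b) ** 2
    if den == 0.0:
        raise ValueError("categories are indistinguishable on this vocabulary")

    share_a = min(1.0, max(0.0, num / den))  # keep the estimate in [0, 1]
    return {cat_a: share_a, cat_b: 1.0 - share_a}


# Hypothetical data: the labeled set is split 50/50 between categories,
# while the unlabeled set is built to be 20% "pos" and 80% "neg".
vocab = ["great", "awful"]
labeled = [
    ("great service", "pos"), ("great food", "pos"),
    ("great value", "pos"), ("it was fine", "pos"),
    ("awful service", "neg"), ("awful food", "neg"),
    ("awful value", "neg"), ("nothing special", "neg"),
]
unlabeled = ["great stay"] * 3 + ["ok"] * 5 + ["awful stay"] * 12

estimate = estimate_share(labeled, unlabeled, vocab, "pos", "neg")
print(estimate)
```

On this constructed data the estimate comes out at roughly 0.2 for "pos" and 0.8 for "neg", even though the labeled set is evenly split and the overall word distributions of the two sets differ. Extending the sketch to more than two categories would require a constrained regression over all category shares, which is left out here for brevity.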

Last updated on 03/06/2015