We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimize a different objective function. Almost all are well defined, but how to determine before the fact which one, if any, will partition a given set of objects in an "insightful" or "useful" way for a given user is unknown and difficult, if not logically impossible. We develop a metric space of partitions from all existing cluster analysis methods applied to a given data set (along with millions of other solutions we add based on combinations of existing clusterings), and enable a user to explore and interact with it, and quickly reveal or prompt useful or insightful conceptualizations. In addition, although uncommon in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal in specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than either expert human coders or many existing fully automated methods.
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.
This article led to the formation of Crimson Hexagon
This is a poster that describes our analysis of "partisan taunting," the explicit, public, and negative attacks on another political party or its members, usually using vitriolic and derogatory language. We first demonstrate that most projects that hand code text in the social sciences optimize with respect to the wrong criterion, resulting in large, unnecessary biases. We show how to fix this problem and then apply it to taunting. We find empirically that, unlike most claims in the press and the literature, taunting is not inexorably increasing; it appears instead to be a rational political strategy, most often used by those least likely to win by traditional means -- ideological extremists, out-party members when the president is unpopular, and minority party members. However, although taunting appears to be individually rational, it is collectively irrational: Constituents may resonate with one cutting taunt by their Senator, but they might not approve if he or she were devoting large amounts of time to this behavior rather than say trying to solve important national problems. We hope to partially rectify this situation by posting public rankings of Senatorial taunting behavior.
We present new statistical methods for evaluating information extraction systems. The methods were developed to evaluate a system used by political scientists to extract event information from news leads about international politics. The nature of this data presents two problems for evaluators: 1) the frequency distribution of event types in international event data is strongly skewed, so a random sample of newsleads will typically fail to contain any low frequency events. 2) Manual information extraction necessary to create evaluation sets is costly, and most effort is wasted coding high frequency categories . We present an evaluation scheme that overcomes these problems with considerably less manual effort than traditional methods, and also allows us to interpret an information extraction system as an estimator (in the statistical sense) and to estimate its bias.
Systems and methods are provided for classifying text based on language using one or more computer servers and storage devices. A computer-implemented method includes receiving a training set of elements, each element in the training set being assigned to one of a plurality of categories and having one of a plurality of content profiles associated therewith; receiving a population set of elements, each element in the population set having one of the plurality of content profiles associated therewith; and calculating using at least one of a stacked regression algorithm, a bias formula algorithm, a noise elimination algorithm, and an ensemble method consisting of a plurality of algorithmic methods the results of which are averaged, based on the content profiles associated with and the categories assigned to elements in the training set and the content profiles associated with the elements of the population set, a distribution of elements of the population set over the categories.
We offer the first large scale, multiple source analysis of the outcome of what may be the most extensive effort to selectively censor human expression ever implemented. To do this, we have devised a system to locate, download, and analyze the content of millions of social media posts originating from nearly 1,400 different social media services all over China before the Chinese government is able to find, evaluate, and censor (i.e., remove from the Internet) the large subset they deem objectionable. Using modern computer-assisted text analytic methods that we adapt to and validate in the Chinese language, we compare the substantive content of posts censored to those not censored over time in each of 85 topic areas. Contrary to previous understandings, posts with negative, even vitriolic, criticism of the state, its leaders, and its policies are not more likely to be censored. Instead, we show that the censorship program is aimed at curtailing collective action by silencing comments that represent, reinforce, or spur social mobilization, regardless of content. Censorship is oriented toward attempting to forestall collective activities that are occurring now or may occur in the future --- and, as such, seem to clearly expose government intent.
A method of computerized content analysis that gives “approximately unbiased and statistically consistent estimates” of a distribution of elements of structured, unstructured, and partially structured source data among a set of categories. In one embodiment, this is done by analyzing a distribution of small set of individually-classified elements in a plurality of categories and then using the information determined from the analysis to extrapolate a distribution in a larger population set. This extrapolation is performed without constraining the distribution of the unlabeled elements to be equal to the distribution of labeled elements, nor constraining a content distribution of content of elements in the labeled set (e.g., a distribution of words used by elements in the labeled set) to be equal to a content distribution of elements in the unlabeled set. Not being constrained in these ways allows the estimation techniques described herein to provide distinct advantages over conventional aggregation techniques.
The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depend in practice on the inadequate keyword counting or matching methods they are designed to replace. Improved methods of keyword selection would also be valuable in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings, Chinese social media posts designed to evade censorship, among others.
Representative embodiments of a method for grouping participants in an activity include the steps of: (i) defining a grouping policy; (ii) storing, in a database, participant records that include a participant identifer, a characteristic associated With the participant, and/or an identifier for a participant’s handheld device; (iii) defining groupings based on the policy and characteristics of the participants relating to the policy and to the activity; and (iv) communicating the groupings to the handheld devices to establish the groups.
Computer scientists and statisticians are often interested in classifying textual documents into chosen categories. Social scientists and others are often less interested in any one document and instead try to estimate the proportion falling in each category. The two existing types of techniques for estimating these category proportions are parametric "classify and count" methods and "direct" nonparametric estimation of category proportions without an individual classification step. Unfortunately, classify and count methods can sometimes be highly model dependent or generate more bias in the proportions even as the percent correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning and usage of language is too similar across categories or too different between training and test sets. We develop an improved direct estimation approach without these problems by introducing continuously valued text features optimized for this problem, along with a form of matching adapted from the causal inference literature. We evaluate our approach in analyses of a diverse collection of 73 data sets, showing that it substantially improves performance compared to existing approaches. As a companion to this paper, we offer easy-to-use software that implements all ideas discussed herein.
A method for selecting clusterings to classify a predetermined data set of numerical data comprises five steps. First, a plurality of known clustering methods are applied, one at a time, to the data set to generate clusterings for each method. Second, a metric space of clusterings is generated using a metric that measures the similarity between two clusterings. Third, the metric space is projected to a lower dimensional representation useful for visualization. Fourth, a “local cluster ensemble” method generates a clustering for each point in the lower dimensional space. Fifth, an animated visualization method uses the output of the local cluster ensemble method to display the lower dimensional space and to allow a user to move around and explore the space of clustering.
Existing research on the extensive Chinese censorship organization uses observational methods with well-known limitations. We conducted the first large-scale experimental study of censorship by creating accounts on numerous social media sites, randomly submitting different texts, and observing from a worldwide network of computers which texts were censored and which were not. We also supplemented interviews with confidential sources by creating our own social media site, contracting with Chinese firms to install the same censoring technologies as existing sites, and—with their software, documentation, and even customer support—reverse-engineering how it all works. Our results offer rigorous support for the recent hypothesis that criticisms of the state, its leaders, and their policies are published, whereas posts about real-world events with collective action potential are censored.