RSNL: R Statistics for Natural Language

Anthony Fader, Gary King, Daniel Pemstein, Kevin Quinn

Version: 0.11

RSNL (pronounced "arsenal") is an open source software platform for the statistical analysis of unstructured textual data, written for the R language for statistical computing. While the base R language has exceptional capabilities for data analysis, it currently has only rudimentary capabilities for processing textual data. Standard packages for statistical natural language processing, on the other hand, have extensive support for text processing but only limited support for statistical modeling and data analysis. RSNL bridges these two worlds by providing the language fundamentals R needs to analyze text data and by making it easy to call text-processing routines written in other languages from within R. RSNL thus provides a unified (and eventually comprehensive) suite of tools for data processing, statistical analysis, and graphical display of textual data, and a platform on which it is easy to build new tools. We hope this accelerates the development of new statistical methods for textual data and encourages more applied researchers to think of text sources as just another data set.

The initial version of RSNL will allow users to perform common natural language processing tasks: tokenization; stemming; token filtering (by frequency, regular expression, or arbitrary function); token replacement; part-of-speech tagging; calculation of document and corpus distance measures; management of document relationships in a corpus (e.g., a hyperlink network); management of document structure (e.g., sentences and paragraphs); HTML parsing; readability tests (e.g., Flesch-Kincaid); and computation of n-gram statistics. In addition, RSNL seamlessly integrates its own internal data structures with the standard data types expected by existing R functions. These data structures will be designed to accommodate the large datasets that are common in statistical natural language processing work.
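To make a few of these tasks concrete, the following is a minimal sketch in base R only. It does not use the RSNL API: every name in it (docs, tokenize, filter_by_freq, bigrams, cosine) is hypothetical and chosen purely for illustration, and the tokenizer is a deliberately crude one that splits on non-letter characters. It shows tokenization, frequency-based token filtering, bigram counting, and a cosine distance computed on an ordinary R matrix.

```r
# Illustrative base R only; none of these names are RSNL functions.
docs <- c(
  d1 = "The cat sat on the mat.",
  d2 = "The dog sat on the log."
)

# Tokenize: lowercase, then split on runs of non-letter characters.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]  # drop empty strings from leading separators
}
tokens <- lapply(docs, tokenize)

# Filter by frequency: keep tokens seen at least min_count times in the corpus.
corpus_freq <- table(unlist(tokens))
filter_by_freq <- function(toks, freq, min_count = 2) {
  toks[as.vector(freq[toks]) >= min_count]
}
filtered <- lapply(tokens, filter_by_freq, freq = corpus_freq)

# Bigrams: pair each token with its successor.
bigrams <- function(toks) {
  if (length(toks) < 2) return(character(0))
  paste(head(toks, -1), tail(toks, -1))
}
bigram_counts <- table(bigrams(tokens$d1))

# Document-term matrix as a plain integer matrix (documents in rows).
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(toks) as.integer(table(factor(toks, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Cosine distance between the two documents.
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
cosine_distance <- 1 - cosine(dtm["d1", ], dtm["d2", ])
```

Because dtm here is an ordinary matrix, it can be passed directly to existing R functions such as dist() or prcomp(), which is the kind of interoperability with standard data types described above.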

RSNL will feature two sets of documentation: a developer API reference for those wishing to add functionality to the package, and a standard user manual designed to help applied researchers use the package to solve substantive problems of interest to them.

RSNL is currently in early development and the source distribution available here is best described as pre-release code.