Informatics and Data Sharing

Replication Standards

New standards, protocols, and software for citing, sharing, analyzing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating scholarly research data and analyses. Also includes proposals to improve the norms of data sharing and replication in science.
Replication, Replication
"The replication standard holds that sufficient information exists with which to understand, evaluate, and build upon a prior work if a third party can replicate the results without any additional information from the author." This, and the data sharing to support it, was proposed for political science, along with policy suggestions, in King, Gary. 1995. “Replication, Replication.” PS: Political Science and Politics 28: 444-452.

Political science is a community enterprise and the community of empirical political scientists needs access to the body of data necessary to replicate existing studies to understand, evaluate, and especially build on this work. Unfortunately, the norms we have in place now do not encourage, or in some cases even permit, this aim. Following are suggestions that would facilitate replication and are easy to implement – by teachers, students, dissertation writers, graduate programs, authors, reviewers, funding agencies, and journal and book editors.

Preface: Big Data is Not About the Data!
King, Gary. In Press. “Preface: Big Data is Not About the Data!” Computational Social Science: Discovery and Prediction, edited by R. Michael Alvarez. Cambridge: Cambridge University Press.

A few years ago, explaining what you did for a living to Dad, Aunt Rose, or your friend from high school was pretty complicated. Answering that you develop statistical estimators, work on numerical optimization, or, even better, are working on a great new Markov Chain Monte Carlo implementation of a Bayesian model with heteroskedastic errors for automated text analysis is pretty much the definition of conversation stopper.

Then the media noticed the revolution we’re all a part of, and they glued a label to it. Now “Big Data” is what you and I do. As trivial as this change sounds, we should be grateful for it, as the name seems to resonate with the public and so it helps convey the importance of our field to others better than we had managed to do ourselves. Yet, now that we have everyone’s attention, we need to start clarifying for others -- and ourselves -- what the revolution means. This is much of what this book is about.

Throughout, we need to remember that for the most part, Big Data is not about the data....

Publication, Publication
King, Gary. 2006. “Publication, Publication.” PS: Political Science and Politics 39: 119–125. Continuing updates to this paper.

I show herein how to write a publishable paper by beginning with the replication of a published article. This strategy seems to work well for class projects in producing papers that ultimately get published, helping to professionalize students into the discipline, and teaching them the scientific norms of the free exchange of academic information. I begin by briefly revisiting the prominent debate on replication our discipline had a decade ago and some of the progress made in data sharing since.

A Revised Proposal, Proposal
Comments from nineteen authors and a response to the above: King, Gary. 1995. “A Revised Proposal, Proposal.” PS: Political Science and Politics XXVIII: 494–499.

The Dataverse Network Project

The Dataverse Network Project: a major ongoing project to write web applications, standards, protocols, and software for automating the process of citing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating data and associated analyses (Website: TheData.Org). See also:
An Introduction to the Dataverse Network as an Infrastructure for Data Sharing
King, Gary. 2007. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research 36: 173–199.

We introduce a set of integrated developments in web application software, networking, data citation standards, and statistical methods designed to put some of the universe of data and data sharing practices on somewhat firmer ground. We have focused on social science data, but aspects of what we have developed may apply more widely. The idea is to facilitate the public distribution of persistent, authorized, and verifiable data, with powerful but easy-to-use technology, even when the data are confidential or proprietary. We intend to solve some of the sociological problems of data sharing via technological means, with the result intended to benefit both the scientific community and the sometimes apparently contradictory goals of individual researchers.

Automating Open Science for Big Data
Crosas, Merce, James Honaker, Gary King, and Latanya Sweeney. 2015. “Automating Open Science for Big Data.” ANNALS of the American Academy of Political and Social Science 659 (1): 260-273.

The vast majority of social science research presently uses small (MB or GB scale) data sets. These fixed-scale data sets are commonly downloaded to the researcher's computer where the analysis is performed locally, and are often shared and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend towards Big Data -- including large scale streaming data -- is starting to transform research and has the potential to impact policy-making and our understanding of the social, economic, and political problems that affect human societies. However, this research poses new challenges in execution, accountability, preservation, reuse, and reproducibility. Downloading these data sets to a researcher’s computer is infeasible or not practical; hence, analyses take place in the cloud, require unusual expertise, and benefit from collaborative teamwork and novel tool development. The advantage of these data sets in how informative they are also means that they are much more likely to contain highly sensitive personally identifiable information. In this paper, we discuss solutions to these new challenges so that the social sciences can realize the potential of Big Data.

From Preserving the Past to Preserving the Future: The Data-PASS Project and the Challenges of Preserving Digital Social Science Data
Gutmann, Myron P, Mark Abrahamson, Margaret O Adams, Micah Altman, Caroline Arms, Kenneth Bollen, Michael Carlson, et al. 2009. “From Preserving the Past to Preserving the Future: The Data-PASS Project and the Challenges of Preserving Digital Social Science Data.” Library Trends 57: 315–337.

Social science data are an unusual part of the past, present, and future of digital preservation. They are both an unqualified success, due to long-lived and sustainable archival organizations, and in need of further development because not all digital content is being preserved. This article is about the Data Preservation Alliance for Social Sciences (Data-PASS), a project supported by the National Digital Information Infrastructure and Preservation Program (NDIIPP), which is a partnership of five major U.S. social science data archives. Broadly speaking, Data-PASS has the goal of ensuring that at-risk social science data are identified, acquired, and preserved, and that we have a future-oriented organization that could collaborate on those preservation tasks for the future. Throughout the life of the Data-PASS project we have worked to identify digital materials that have never been systematically archived, and to appraise and acquire them. As the project has progressed, however, it has increasingly turned its attention from identifying and acquiring legacy and at-risk social science data to identifying ongoing and future research projects that will produce data. This article is about the project's history, with an emphasis on the issues that underlay the transition from looking backward to looking forward.

A symposium on replication, edited by Nils Petter Gleditsch and Claire Metelits, with several articles including mine, King, Gary. 2003. “The Future of Replication.” International Studies Perspectives 4: 443–499.

Since the replication standard was proposed for political science research, more journals have required or encouraged authors to make data available, and more authors have shared their data. The calls for continuing this trend are more persistent than ever, and the agreement among journal editors in this Symposium continues this trend. In this article, I offer a vision of a possible future of the replication movement. The plan is to implement this vision via the Virtual Data Center project, which – by automating the process of finding, sharing, archiving, subsetting, converting, analyzing, and distributing data – may greatly facilitate adherence to the replication standard.

The Virtual Data Center

The Virtual Data Center, the predecessor to the Dataverse Network. See:
A Digital Library for the Dissemination and Replication of Quantitative Social Science Research
Altman, Micah, Leonid Andreev, Mark Diggory, Gary King, Daniel L Kiskis, Elizabeth Kolster, Michael Krot, and Sidney Verba. 2001. “A Digital Library for the Dissemination and Replication of Quantitative Social Science Research.” Social Science Computer Review 19: 458–470.

The Virtual Data Center (VDC) software is an open-source, digital library system for quantitative data. We discuss what the software does, and how it provides an infrastructure for the management and dissemination of distributed collections of quantitative data, and the replication of results derived from this data.

See Also

A Proposed Standard for the Scholarly Citation of Quantitative Data
Altman, Micah, and Gary King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data.” D-Lib Magazine 13.

An essential aspect of science is a community of scholars cooperating and competing in the pursuit of common goals. A critical component of this community is the common language of and the universal standards for scholarly citation, credit attribution, and the location and retrieval of articles and books. We propose a similar universal standard for citing quantitative data that retains the advantages of print citations, adds other components made possible by, and needed due to, the digital form and systematic nature of quantitative data sets, and is consistent with most existing subfield-specific approaches. Although the digital library field includes numerous creative ideas, we limit ourselves to only those elements that appear ready for easy practical use by scientists, journal editors, publishers, librarians, and archivists.

Related Papers on New Forms of Data

Ensuring the Data Rich Future of the Social Sciences
King, Gary. 2011. “Ensuring the Data Rich Future of the Social Sciences.” Science 331 (11 February): 719-721.

Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and research ethics and policy, and these are collectively holding back progress. I address these changes and challenges and suggest what can be done.

The Changing Evidence Base of Social Science Research
King, Gary. 2009. “The Changing Evidence Base of Social Science Research.” The Future of Political Science: 100 Perspectives, edited by Gary King, Kay Schlozman, and Norman Nie. New York: Routledge Press.

This (two-page) article argues that the evidence base of political science and the related social sciences is beginning an underappreciated but historic change.

Preserving Quantitative Research-Elicited Data for Longitudinal Analysis. New Developments in Archiving Survey Data in the U.S.
Abrahamson, Mark, Kenneth A Bollen, Myron P Gutmann, Gary King, and Amy Pienta. 2009. “Preserving Quantitative Research-Elicited Data for Longitudinal Analysis. New Developments in Archiving Survey Data in the U.S.” Historical Social Research 34 (3): 51-59.

Social science data collected in the United States, both historically and at present, have often not been placed in any public archive -- even when the data collection was supported by government grants. The availability of the data for future use is, therefore, in jeopardy. Enforcing archiving norms may be the only way to increase data preservation and availability in the future.

Computational Social Science
Lazer, David, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, et al. 2009. “Computational Social Science.” Science 323: 721-723.

A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.