Informatics and Data Sharing

Replication Standards

New standards, protocols, and software for citing, sharing, analyzing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating scholarly research data and analyses. Also includes proposals to improve the norms of data sharing and replication in science.
A New Model for Industry-Academic Partnerships
Gary King and Nathaniel Persily. 2019. “A New Model for Industry-Academic Partnerships.” PS: Political Science and Politics, 53, 4, Pp. 703-709.

The mission of the social sciences is to understand and ameliorate society’s greatest challenges. The data held by private companies, collected for different purposes, hold vast potential to further this mission. Yet, because of consumer privacy, trade secrets, proprietary content, and political sensitivities, these datasets are often inaccessible to scholars. We propose a novel organizational model to address these problems. We also report on the first partnership under this model, to study the incendiary issues surrounding the impact of social media on elections and democracy: Facebook provides (privacy-preserving) data access; eight ideologically and substantively diverse charitable foundations provide funding; an organization of academics we created, Social Science One (see SocialScience.One), leads the project; and the Institute for Quantitative Social Science at Harvard and the Social Science Research Council provide logistical help.

Replication, Replication
"The replication standard holds that sufficient information exists with which to understand, evaluate, and build upon a prior work if a third party can replicate the results without any additional information from the author." This, and the data sharing to support it, was proposed for political science, along with policy suggestions in Gary King. 1995. “Replication, Replication.” PS: Political Science and Politics, 28, Pp. 444-452.

Political science is a community enterprise and the community of empirical political scientists needs access to the body of data necessary to replicate existing studies to understand, evaluate, and especially build on this work. Unfortunately, the norms we have in place now do not encourage, or in some cases even permit, this aim. Following are suggestions that would facilitate replication and are easy to implement – by teachers, students, dissertation writers, graduate programs, authors, reviewers, funding agencies, and journal and book editors.

Preface: Big Data is Not About the Data!
Gary King. 2016. “Preface: Big Data is Not About the Data!” In Computational Social Science: Discovery and Prediction, edited by R. Michael Alvarez. Cambridge: Cambridge University Press.

A few years ago, explaining what you did for a living to Dad, Aunt Rose, or your friend from high school was pretty complicated. Answering that you develop statistical estimators, work on numerical optimization, or, even better, are working on a great new Markov Chain Monte Carlo implementation of a Bayesian model with heteroskedastic errors for automated text analysis is pretty much the definition of conversation stopper.

Then the media noticed the revolution we’re all a part of, and they glued a label to it. Now “Big Data” is what you and I do. As trivial as this change sounds, we should be grateful for it, as the name seems to resonate with the public and so it helps convey the importance of our field to others better than we had managed to do ourselves. Yet, now that we have everyone’s attention, we need to start clarifying for others -- and ourselves -- what the revolution means. This is much of what this book is about.

Throughout, we need to remember that for the most part, Big Data is not about the data....

Precision mapping child undernutrition for nearly 600,000 inhabited census villages in India
Rockli Kim, Avleen S. Bijral, Yun Xu, Xiuyuan Zhang, Jeffrey C. Blossom, Akshay Swaminathan, Gary King, Alok Kumar, Rakesh Sarwal, Juan M. Lavista Ferres, and S.V. Subramanian. 2021. “Precision mapping child undernutrition for nearly 600,000 inhabited census villages in India.” Proceedings of the National Academy of Sciences, 118, 18, Pp. 1-11.
There are emerging opportunities to assess health indicators at truly small areas with increasing availability of data geocoded to micro geographic units and advanced modeling techniques. The utility of such fine-grained data can be fully leveraged if linked to local governance units that are accountable for implementation of programs and interventions. We used data from the 2011 Indian Census for village-level demographic and amenities features and the 2016 Indian Demographic and Health Survey in a bias-corrected semisupervised regression framework to predict child anthropometric failures for all villages in India. Of the total geographic variation in predicted child anthropometric failure estimates, 54.2 to 72.3% were attributed to the village level followed by 20.6 to 39.5% to the state level. The mean predicted stunting was 37.9% (SD: 10.1%; IQR: 31.2 to 44.7%), and substantial variation was found across villages ranging from less than 5% for 691 villages to over 70% in 453 villages. Estimates at the village level can potentially shift the paradigm of policy discussion in India by enabling more informed prioritization and precise targeting. The proposed methodology can be adapted and applied to diverse population health indicators, and in other contexts, to reveal spatial heterogeneity at a finer geographic scale and identify local areas with the greatest needs and with direct implications for actions to take place.
A Revised Proposal, Proposal
Comments from nineteen authors and a response to the above: Gary King. 1995. “A Revised Proposal, Proposal.” PS: Political Science and Politics, XXVIII, Pp. 494–499.
Publication, Publication
Gary King. 2006. “Publication, Publication.” PS: Political Science and Politics, 39, Pp. 119–125. Continuing updates to this paper.

I show herein how to write a publishable paper by beginning with the replication of a published article. This strategy seems to work well for class projects in producing papers that ultimately get published, helping to professionalize students into the discipline, and teaching them the scientific norms of the free exchange of academic information. I begin by briefly revisiting the prominent debate on replication our discipline had a decade ago and some of the progress made in data sharing since.

The Dataverse Network Project

The Dataverse Network Project: a major ongoing project to write web applications, standards, protocols, and software for automating the process of citing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating data and associated analyses (Website: TheData.Org). See also:
An Introduction to the Dataverse Network as an Infrastructure for Data Sharing
Gary King. 2007. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research, 36, Pp. 173–199.

We introduce a set of integrated developments in web application software, networking, data citation standards, and statistical methods designed to put some of the universe of data and data sharing practices on somewhat firmer ground. We have focused on social science data, but aspects of what we have developed may apply more widely. The idea is to facilitate the public distribution of persistent, authorized, and verifiable data, with powerful but easy-to-use technology, even when the data are confidential or proprietary. We intend to solve some of the sociological problems of data sharing via technological means, with the result intended to benefit both the scientific community and the sometimes apparently contradictory goals of individual researchers.

Automating Open Science for Big Data
Merce Crosas, Gary King, James Honaker, and Latanya Sweeney. 2015. “Automating Open Science for Big Data.” ANNALS of the American Academy of Political and Social Science, 659, 1, Pp. 260-273.

The vast majority of social science research presently uses small (MB or GB scale) data sets. These fixed-scale data sets are commonly downloaded to the researcher's computer where the analysis is performed locally, and are often shared and cited with well-established technologies, such as the Dataverse Project (see Dataverse.org), to support the published results.  The trend towards Big Data -- including large scale streaming data -- is starting to transform research and has the potential to impact policy-making and our understanding of the social, economic, and political problems that affect human societies.  However, this research poses new challenges in execution, accountability, preservation, reuse, and reproducibility. Downloading these data sets to a researcher’s computer is infeasible or not practical; hence, analyses take place in the cloud, require unusual expertise, and benefit from collaborative teamwork and novel tool development. The advantage of these data sets in how informative they are also means that they are much more likely to contain highly sensitive personally identifiable information. In this paper, we discuss solutions to these new challenges so that the social sciences can realize the potential of Big Data.

From Preserving the Past to Preserving the Future: The Data-PASS Project and the Challenges of Preserving Digital Social Science Data
Myron P Gutmann, Mark Abrahamson, Margaret O Adams, Micah Altman, Caroline Arms, Kenneth Bollen, Michael Carlson, Jonathan Crabtree, Darrell Donakowski, Gary King, Jared Lyle, Marc Maynard, Amy Pienta, Richard Rockwell, Lois Timms-Ferrara, and Copeland H Young. 2009. “From Preserving the Past to Preserving the Future: The Data-PASS Project and the Challenges of Preserving Digital Social Science Data.” Library Trends, 57, Pp. 315–337.

Social science data are an unusual part of the past, present, and future of digital preservation. They are both an unqualified success, due to long-lived and sustainable archival organizations, and in need of further development because not all digital content is being preserved. This article is about the Data Preservation Alliance for Social Sciences (Data-PASS), a project supported by the National Digital Information Infrastructure and Preservation Program (NDIIPP), which is a partnership of five major U.S. social science data archives. Broadly speaking, Data-PASS has the goal of ensuring that at-risk social science data are identified, acquired, and preserved, and that we have a future-oriented organization that could collaborate on those preservation tasks for the future. Throughout the life of the Data-PASS project we have worked to identify digital materials that have never been systematically archived, and to appraise and acquire them. As the project has progressed, however, it has increasingly turned its attention from identifying and acquiring legacy and at-risk social science data to identifying ongoing and future research projects that will produce data. This article is about the project's history, with an emphasis on the issues that underlay the transition from looking backward to looking forward.

An update on Dataverse
Gary King. 12/7/2014. “An update on Dataverse.” Oxford University Press Blog.
At the American Political Science Association meetings earlier this year, Gary King, Albert J. Weatherhead III University Professor at Harvard University, gave a presentation on Dataverse. Dataverse is an important tool that many researchers use to archive and share their research materials. As many readers of this blog may already know, the journal that I co-edit, Political Analysis, uses Dataverse to archive and disseminate the replication materials for the articles we publish in our journal. I asked Gary to write some remarks about Dataverse, based on his APSA presentation. His remarks are below.  -- Michael Alvarez, Editor, Political Analysis.

A symposium on replication, edited by Nils Petter Gleditsch and Claire Metelits, with several articles including mine, Gary King. 2003. “The Future of Replication.” International Studies Perspectives, 4, Pp. 443–499.

Since the replication standard was proposed for political science research, more journals have required or encouraged authors to make data available, and more authors have shared their data. The calls for continuing this trend are more persistent than ever, and the agreement among journal editors in this Symposium continues this trend. In this article, I offer a vision of a possible future of the replication movement. The plan is to implement this vision via the Virtual Data Center project, which – by automating the process of finding, sharing, archiving, subsetting, converting, analyzing, and distributing data – may greatly facilitate adherence to the replication standard.

The Virtual Data Center

The Virtual Data Center, the predecessor to the Dataverse Network. See:
A Digital Library for the Dissemination and Replication of Quantitative Social Science Research
Micah Altman, Leonid Andreev, Mark Diggory, Gary King, Daniel L Kiskis, Elizabeth Kolster, Michael Krot, and Sidney Verba. 2001. “A Digital Library for the Dissemination and Replication of Quantitative Social Science Research.” Social Science Computer Review, 19, Pp. 458–470.
The Virtual Data Center (VDC) software is an open-source, digital library system for quantitative data. We discuss what the software does, and how it provides an infrastructure for the management and dissemination of distributed collections of quantitative data, and the replication of results derived from this data.

See Also

Comment on 'Estimating the Reproducibility of Psychological Science'
Daniel Gilbert, Gary King, Stephen Pettigrew, and Timothy Wilson. 2016. “Comment on 'Estimating the Reproducibility of Psychological Science'.” Science, 351, 6277, Pp. 1037a-1038a.

A recent article by the Open Science Collaboration (a group of 270 coauthors) gained considerable academic and public attention due to its sensational conclusion that the replicability of psychological science is surprisingly low. Science magazine lauded this article as one of the top 10 scientific breakthroughs of the year across all fields of science, reports of which appeared on the front pages of newspapers worldwide. We show that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%. (Of course, that doesn't mean that the replicability is 100%, only that the evidence is insufficient to reliably estimate replicability.) The moral of the story is that meta-science must follow the rules of science.

Replication data is available in this dataverse archive. See also the full web site for this article and related materials, and one of the news articles written about it.

A Proposed Standard for the Scholarly Citation of Quantitative Data
Micah Altman and Gary King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data.” D-Lib Magazine, 13.

An essential aspect of science is a community of scholars cooperating and competing in the pursuit of common goals. A critical component of this community is the common language of and the universal standards for scholarly citation, credit attribution, and the location and retrieval of articles and books. We propose a similar universal standard for citing quantitative data that retains the advantages of print citations, adds other components made possible by, and needed due to, the digital form and systematic nature of quantitative data sets, and is consistent with most existing subfield-specific approaches. Although the digital library field includes numerous creative ideas, we limit ourselves to only those elements that appear ready for easy practical use by scientists, journal editors, publishers, librarians, and archivists.

Related Papers on New Forms of Data

Ensuring the Data Rich Future of the Social Sciences
Gary King. 2011. “Ensuring the Data Rich Future of the Social Sciences.” Science, 331, 11 February, Pp. 719-721.

Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and research ethics and policy, and these are collectively holding back progress. I address these changes and challenges and suggest what can be done.

The Changing Evidence Base of Social Science Research
Gary King. 2009. “The Changing Evidence Base of Social Science Research.” In The Future of Political Science: 100 Perspectives, edited by Gary King, Kay Schlozman, and Norman Nie. New York: Routledge Press.

This (two-page) article argues that the evidence base of political science and the related social sciences are beginning an underappreciated but historic change.

Preserving Quantitative Research-Elicited Data for Longitudinal Analysis.  New Developments in Archiving Survey Data in the U.S.
Mark Abrahamson, Kenneth A Bollen, Myron P Gutmann, Gary King, and Amy Pienta. 2009. “Preserving Quantitative Research-Elicited Data for Longitudinal Analysis. New Developments in Archiving Survey Data in the U.S.” Historical Social Research, 34, 3, Pp. 51-59.

Social science data collected in the United States, both historically and at present, have often not been placed in any public archive -- even when the data collection was supported by government grants. The availability of the data for future use is, therefore, in jeopardy. Enforcing archiving norms may be the only way to increase data preservation and availability in the future.

Computational Social Science
David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science, 323, Pp. 721-723.

A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.