Statistically Valid Inferences from Differentially Private Data Releases

Abstract:

In a major development in data sharing, data providers are beginning to supplement insecure privacy-protection strategies, such as "de-identification," with a formal approach called "differential privacy." One version of differential privacy adds specially calibrated random noise to a dataset, which is then released to researchers. This offers mathematical guarantees for the privacy of research subjects while still making it possible to learn about aggregate patterns of interest. Unfortunately, the added random noise acts as measurement error, which induces statistical bias: attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We offer an easy-to-use, computationally efficient approach that corrects for these biases, can be applied as researchers would apply linear regression, and gives statistically consistent and approximately unbiased estimates and standard errors. As our running example, we use the Full URLs Dataset recently released by Social Science One and Facebook, which contains more than 10 trillion cell values.
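
To make the bias concrete, the following is a minimal simulation sketch (not the paper's estimator) of how known-scale Laplace noise in a differentially private release attenuates a regression slope, and how the classical errors-in-variables correction recovers it when the noise variance is public. The data-generating process, Laplace scale, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a "true" confidential dataset: y = 2*x + residual noise.
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# Differentially private release: add Laplace noise with a known scale
# (the scale is public because the privacy mechanism is published).
laplace_scale = 1.0
x_dp = x + rng.laplace(scale=laplace_scale, size=n)

# Naive OLS on the noisy release is attenuated toward zero.
cov = np.cov(x_dp, y)
var_xdp = cov[0, 0]
beta_naive = cov[0, 1] / var_xdp

# Classical errors-in-variables correction: Var(Laplace(b)) = 2*b**2 is
# known, so rescale the naive slope by the inverse reliability ratio.
noise_var = 2.0 * laplace_scale**2
beta_corrected = beta_naive * var_xdp / (var_xdp - noise_var)

print(f"naive slope:     {beta_naive:.3f}")      # biased toward 0
print(f"corrected slope: {beta_corrected:.3f}")  # approximately 2.0
```

The same logic extends beyond this toy case: because the noise distribution is published as part of the differential privacy guarantee, its variance can be treated as known rather than estimated, which is what makes bias correction feasible here where ordinary measurement-error problems are harder.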
