Description
Official statistics pursues a pragmatic goal: limiting statistical disclosure with standard methods while preserving utility and accuracy. Differential privacy, by contrast, is regarded as a practical goal for technological institutions that use or even stream data and rely on open-source tools and libraries.
In this paper we explore the potential of Bayesian methods to both estimate and reduce the disclosure risk when publishing official statistical output datasets, e.g. aggregated and/or spatial data or survey data.
Bayesian modelling has already been employed in numerous instances for the purpose of:
(i) generating multiple synthetic data sets (Graham and Penny, 2005; Graham et al., 2009; Drechsler and Reiter, 2010, 2012)
(ii) simultaneous detection and correction of errors as well as disclosure limitation for microdata (Kim, Reiter and Karr, 2016)
(iii) differentially private algorithms through posterior sampling (Dimitrakakis et al., 2017)
We focus, however, on two important but less explored directions:
(i) risk quantification, and the solutions that Bayesian formulations offer for attack scenarios and risk assessment
(ii) using non-parametric hierarchical Bayesian models to build a data protection procedure and evaluate the risk (see Battiston and Rimella, 2024, for the same type of models)
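To make the idea of a Bayesian attack scenario concrete, the following is a minimal sketch and not the method of this paper. It assumes an attacker who knows the perturbation mechanism applied to a published cell count and holds a prior over the true count, and it computes the posterior over the true count with Bayes' rule. The Poisson prior, the noise distribution, and all function names are hypothetical choices made for illustration only.

```python
from math import exp, factorial


def poisson_prior(mean, max_count):
    """Attacker's prior over the true cell count, truncated at max_count.

    The Poisson form is an illustrative assumption, not taken from the paper.
    """
    weights = [exp(-mean) * mean**n / factorial(n) for n in range(max_count + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_true_count(published, prior, noise):
    """Posterior over the true count n given the published value y.

    Bayes' rule: P(n | y) is proportional to P(n) * P(noise = y - n).
    `noise` maps an additive perturbation to its probability.
    """
    unnormalised = [p * noise.get(published - n, 0.0) for n, p in enumerate(prior)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("published value is impossible under this prior and noise")
    return [u / total for u in unnormalised]


# Hypothetical perturbation mechanism: add -1, 0 or +1 to the true count.
noise = {-1: 0.25, 0: 0.5, 1: 0.25}
prior = poisson_prior(mean=3.0, max_count=20)
posterior = posterior_true_count(published=2, prior=prior, noise=noise)

# One possible risk measure: posterior probability that the true count is 1 or 2.
risk_small_cell = posterior[1] + posterior[2]
print(f"P(true count in {{1, 2}} | published = 2) = {risk_small_cell:.3f}")
```

A risk measure of this kind can be computed cell by cell and compared across protection methods; which prior and which risk threshold are appropriate depends on the attack scenario being modelled.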
We illustrate the new approach on a typical case study, the Icelandic 2021 Census data, comparing the new results with standard solutions such as record swapping and the cell key method. Viewed through the Census grid(s), Iceland looks like a “virtual archipelago”, i.e. a large set of disconnected populated cells, many of them with very few inhabitants, separated by large unpopulated areas. This means that even aggregated data for such cell systems pose a high disclosure risk when published, making the dataset an ideal candidate for data protection.
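For readers unfamiliar with the cell key method mentioned above, here is a deliberately simplified sketch of its core idea: each record carries a fixed random record key, the cell key is the fractional part of the sum of the record keys in a cell, and the cell key selects a perturbation from a lookup table. The table below is hypothetical, and unlike production perturbation tables it does not depend on the size of the count.

```python
import random


def assign_record_keys(n_records, seed=0):
    """Draw one fixed record key in [0, 1) per record (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_records)]


def cell_key(record_keys):
    """Cell key: fractional part of the sum of the record keys in the cell."""
    return sum(record_keys) % 1.0


# Hypothetical perturbation table: (upper bound on cell key, perturbation).
# Real tables are designed per count value; this one is for illustration only.
PERTURBATION_TABLE = [(0.25, -1), (0.75, 0), (1.0, 1)]


def perturb_count(record_keys):
    """Return the perturbed count for the cell made up of the given records."""
    count = len(record_keys)
    if count == 0:
        return 0  # empty cells are left unperturbed in this sketch
    key = cell_key(record_keys)
    for upper, delta in PERTURBATION_TABLE:
        if key < upper:
            return max(count + delta, 0)
    return count


# The same set of records always yields the same perturbed count.
keys = assign_record_keys(5)
print(perturb_count(keys), perturb_count(keys))
```

Because the perturbation is a deterministic function of the records in a cell, the same cell receives the same perturbed value in every table in which it appears, which is the consistency property that motivates the method.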
In addition, we impose several requirements on this data protection strategy: it should preserve the relevant distributions over regional (and further, e.g. urban/rural) divisions, and therefore produce results consistent with public knowledge, and it should be automatically and straightforwardly applicable to an arbitrary output dataset.
The code and results are openly available and linked from our open-code repository, which already includes information on our project on SDC for the small output area system and preliminary code for evaluating the new methods.