We implement a synthetic data generation framework on a pseudonymized subset of the 2021 Census data for Luxembourg. Focusing on seven categorical variables—including ordered age and education—we drop unique records upfront to mitigate the risk of singling out. Synthetic data are produced via the CART method in the synthpop package. Utility is measured using the propensity score mean-squared...
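The propensity-score utility measure mentioned above scores how well a classifier can distinguish original from synthetic records: if the two are indistinguishable, every propensity estimate equals the synthetic share and the score is zero. A minimal sketch, using within-category frequencies of a single categorical variable as a stand-in propensity model (the function name and this simplified model are illustrative, not taken from the paper):

```python
import numpy as np

def pmse(orig, syn):
    """Propensity-score mean-squared error (pMSE) for one categorical
    variable, using within-category frequencies as the propensity model.
    orig, syn: sequences of category labels."""
    data = np.concatenate([orig, syn])
    label = np.concatenate([np.zeros(len(orig)), np.ones(len(syn))])
    c = len(syn) / len(data)               # expected propensity
    scores = np.empty(len(data))
    for cat in np.unique(data):
        mask = data == cat
        scores[mask] = label[mask].mean()  # share of synthetic in this cell
    return float(np.mean((scores - c) ** 2))
```

A pMSE of zero indicates the synthetic data reproduce the original category frequencies exactly; larger values indicate the classifier can separate the two files.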
When access to sensitive data is restricted, disseminating synthetic data enables easy access to data that retains the statistical properties of the original. However, the model employed when generating the synthetic data may influence the structure of the data, potentially affecting subsequent predictive analysis. This paper empirically investigates whether the choice of synthesis model...
The demand for georeferenced data is increasing, while sharing proprietary location data poses privacy and confidentiality challenges. This study investigates the use of synthetic data generators (SDGs) to protect sensitive locations in georeferenced datasets. We propose transforming spatial coordinates into a one-dimensional index via a Hilbert space-filling curve, thereby preserving local...
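The coordinate-to-index step can be sketched in a few lines. Below is the standard iterative xy-to-Hilbert-distance algorithm on a 2^order x 2^order grid (a generic sketch; the function name and parameters are illustrative, and the study's actual implementation may differ):

```python
def hilbert_index(order, x, y):
    """Map grid cell (x, y) to its distance along a Hilbert curve covering
    a 2**order x 2**order grid. Nearby cells tend to get nearby indices,
    which is what makes the 1-D transform locality-preserving."""
    n = 1 << order  # grid side length
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the sub-curve is oriented correctly
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Because consecutive indices along the curve correspond to adjacent grid cells, one-dimensional protection methods applied to the index tend to respect spatial locality.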
The Trusted Research Environment (TRE; also known as a Research Data Centre, RDC) has been the great success story of data access this century. By providing highly secure yet flexible access, the TRE has enabled research use of the most sensitive data. In turn, the development of the TRE has led to significant developments in research data governance, particularly output disclosure control. The TRE...
Differential privacy (DP) has become the de facto data protection mechanism due to its strong privacy guarantees. The mathematical foundation of $\epsilon$-DP is based on the principle that the presence or absence of any record in a data set should not influence the protected result by more than an exponential factor determined by the parameter $\epsilon$. Even though DP was originally...
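For reference, the "exponential factor" in the abstract can be written out explicitly: a randomized mechanism $M$ satisfies $\epsilon$-DP if, for all data sets $D, D'$ differing in a single record and all measurable output sets $S$,

```latex
\Pr[M(D) \in S] \;\le\; e^{\epsilon} \, \Pr[M(D') \in S].
```

Smaller values of $\epsilon$ force the two output distributions to be closer, i.e. any single record has less influence on the published result.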
This contribution addresses the intersection of statistical disclosure control and the special requirements of psychological research. As an example, we show the unique sensitivity and complexity of empirical data from psychological research, and discuss the problems and possibilities of anonymizing such data.
The replication crisis in psychology (Open Science Collaboration, 2015; Camerer et al., 2018) has...
When evaluating the scientific worth of microdata, formally anonymized data provide maximum research potential. However, such data can only be accessed onsite via Remote Execution or Safe Centers, which offers little convenience for data users. In contrast, factually anonymized data can be accessed from the institutional workspace (offsite access, e.g. Scientific Use Files, SUFs) but offer less...
Data in tables published for the Swedish R&D survey in the business enterprise sector (BERD) were previously protected by cell suppression to prevent disclosure of sensitive information. In order to avoid cell suppression, key respondents were asked to sign waivers allowing the publication of their data. However, consent was rarely given to disseminate cells where an enterprise’s data...
The package τ-ARGUS is a widely used, EU-funded open-source tool for disclosure control in tabular data. It is automatable via its batch functionality and, as an open-source package, is meant to be easy to adapt and transparent. As there is growing demand for specific τ-ARGUS functionalities to be provided “as a service”, the paper will discuss the pros and cons of a fundamental revision of the...
In many countries, perturbative methods are increasingly used to protect privacy in official statistics. The U.S. Census Bureau has applied the mechanism of differential privacy, specifically Zero-Concentrated Differential Privacy (zCDP), in creating statistical tables based on data from the 2020 Census as well as Privacy-Protected Microdata Files (PPMFs) as a...
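For context, zCDP (in the formulation of Bun and Steinke) bounds Rényi divergences of all orders: a mechanism $M$ satisfies $\rho$-zCDP if, for all neighboring data sets $D, D'$ and all $\alpha \in (1, \infty)$,

```latex
D_{\alpha}\bigl(M(D) \,\|\, M(D')\bigr) \;\le\; \rho \, \alpha,
```

where $D_{\alpha}$ denotes the Rényi divergence of order $\alpha$. This relaxation composes cleanly and is well suited to the Gaussian noise used in the Census Bureau's mechanisms.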
The presentation outlines the practical application and evaluation of the targeted record swapping (TRS) method in the context of the 2021 Polish Census (NSP2021), specifically for population data dissemination within a 1 km² grid framework. The method, recommended by Eurostat, was employed to address statistical disclosure control (SDC) requirements while preserving data utility. The talk...
For official statistics, the pragmatic goal is to limit statistical disclosure with standard methods while preserving utility and accuracy; differential privacy, by contrast, is regarded as a practical goal for technology organizations that use or even stream data and rely on open-source tools and libraries.
In this paper we explore the potential of using Bayesian methods to both estimate...
This article presents recommendations on data sharing written in the framework of the third phase of the G20 Data Gaps Initiative (DGI-3).
The recommendations were written by a task team led by Eurostat and the ECB, bringing together representatives of 21 countries.
The recommendations comprise, among others: definitions of terms, general principles of data sharing, modalities of...
The traditional approach to accessing register-based microdata requires researchers to apply for data on a project-by-project basis, a time-consuming process as each application must be manually reviewed and approved before the relevant data can be extracted and handed out.
A more flexible approach is to grant a broader range of researchers from authorized institutions quick access to a...
National Statistical Offices collect massive volumes of data to fulfill their missions. These data fuel the generation of regional, national, and international statistics across various sectors. However, their immense potential remains largely untapped due to strict and legitimate privacy regulations. In this context, Lomas is a novel open-source platform designed to realize the full potential...
The International Network for Exchanging Experience on Statistical Handling of Granular Data (INEXDA) is a collaborative project involving central banks, the ECB, Eurostat, and other international organizations and national statistical institutes, with strong support from the BIS. The primary goal of INEXDA is to facilitate the exchange of experiences related to the statistical handling of...
Establishing standardized Statistical Disclosure Control (SDC) processes is vital as data-sharing demands increase, anonymization techniques advance, and principles for privacy preservation continue to develop. In response, we present an SDC Architecture to systematically plan, implement, and document SDC of microdata, with the objective of improving consistency, transparency, and adaptability in...
Ensuring secure, efficient, and transparent data-sharing mechanisms is a key challenge for modern public administration, particularly when handling sensitive microdata for policy development and evaluation. The Ministry of Foreign Affairs of the Hellenic Republic (MFA) has implemented an innovative Strategic and Operational Planning (SOP) function, incorporating digital tools and Business...
The practice of Output Statistical Disclosure Control has developed largely by consensus, a situation now being challenged by several factors. The first is an almost Cambrian explosion in the number and scope of Trusted Research Environments, as many domains move away from the ‘download’ model of enabling research. A second challenge is the accompanying proliferation of...
The Humanitarian Data Exchange (HDX) is an open platform managed by the Centre for Humanitarian Data, designed to facilitate the sharing of humanitarian data among organizations and improve decision-making and response efforts in humanitarian crises. As part of its role in managing HDX, the Centre recognizes the various types of sensitive data being collected and used by partners to...
Trusted research environments have historically used rounding and thresholding as the recommended disclosure control method for exports of population data. However, within ONS Trusted Research Environments, for some datasets, perturbation is allowed in combination with thresholding. Code has been made available so researchers can create perturbed outputs using a specific level of noise.
This...
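The general perturb-then-threshold pattern described above can be illustrated in a few lines. This is a generic sketch only; the actual ONS rules, noise distribution, and parameter values differ, and the function name and defaults here are hypothetical:

```python
import numpy as np

def protect_counts(counts, noise_sd=2.0, threshold=10, seed=0):
    """Illustrative perturb-then-threshold rule for frequency counts:
    add zero-centred discrete noise, clip at zero, then suppress any cell
    whose perturbed count falls below the publication threshold.
    (A generic sketch; real TRE rules and parameters differ.)"""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    noise = np.rint(rng.normal(0.0, noise_sd, counts.shape)).astype(int)
    noisy = np.clip(counts + noise, 0, None)
    # None marks a suppressed (unpublishable) cell
    return [int(c) if c >= threshold else None for c in noisy]
```

For example, with the noise switched off, `protect_counts([120, 4], noise_sd=0.0)` returns `[120, None]`: the small cell is suppressed by the threshold rule while the large one is published.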
In 2023, CASD introduced a system to detect exports that do not comply with statistical secrecy [1]. This approach, based on feature generation from groups of exported files and the training of a boosting model, showed promise, but its precision could be improved. The system relied on historical data from past Statistical Disclosure Control (SDC) expert reviews, where decisions (Accepted/Refused)...
ε-Differential privacy (DP) is a popular privacy model that has been promoted as the de facto standard in most data-intensive areas. However, the selection of the privacy parameter ε (also called the budget) in applications of DP remains an open challenge. Even though the meaning and implications of the value of ε are not fully understood, it is clear that large budget values are less...
This paper evaluates disclosure risk measures for synthetic data generated by CART-based models, using both a controlled simulated dataset and publicly available data. We find that common disclosure risk measures may fail to detect disclosure risks and, in some cases, misrepresent actual disclosure risks. Additionally, CART-based models, while maintaining high statistical utility, may...
The rise in access to public data on the internet, and specifically online social networks (OSNs), is creating new pressures on the statistical disclosure control of microdata. Currently, Statistics Netherlands applies a criterion that looks at three properties of variables: rarity, visibility and searchability. Underlying this criterion are, similar to other methods used to assess the...
Government statistical agencies increasingly rely on sensitive tabular data to guide evidence-based policymaking, yet restrictions on data access hinder research and transparency. Synthetic data generated with Generative Adversarial Networks (GANs) offers a promising solution, but conventional GANs often produce unrealistic tables or fail to preserve the statistical relationships that matter...
Introduction: The de-identification of unstructured free-text data is important for sharing large amounts of healthcare information generated by electronic health records, publications and clinical trials. To automate this process, information extraction (IE) and natural language processing (NLP) are essential tools. However, evaluating NLP performance in de-identification requires...
At Statistics Norway, the methodology department is responsible for the internal education of staff. Traditionally, SDC training has been offered on demand, with courses held at most once a year. For a statistician who is newly employed or interested in applying a new SDC method, waiting up to a year for training is both impractical and unproductive. In addition, Statistics Norway has...
To reflect key advances in statistical disclosure control (SDC), we present a revised and unified version of the World Bank’s microdata anonymization guides. The World Bank previously published three separate guides: one on SDC theory and two practice guides for implementing microdata anonymization using the R package sdcMicro, both via command line and the sdcApp GUI. These guides have been...
Synthetic data is often hailed as the future of safe data access – but in practice, it is insufficient for a method to be mathematically private or analytically useful: if legal and privacy teams do not understand the guarantees, they cannot confidently allow its use. This creates a critical but underexplored tension between cutting-edge privacy techniques and real-world operational...