We implement a synthetic data generation framework on a pseudonymized subset of the 2021 Census data for Luxembourg. Focusing on seven categorical variables—including ordered age and education—we drop unique records upfront to mitigate the risk of singling out. Synthetic data are produced via the CART method in the synthpop package. Utility is measured using the propensity score mean-squared...
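The propensity-score utility measure mentioned above scores how well a classifier can distinguish original from synthetic records: if the two are indistinguishable, every propensity estimate equals the synthetic share and the score is zero. A minimal sketch, using within-category frequencies of a single categorical variable as a stand-in propensity model (the function name and this simplified model are illustrative, not taken from the paper):

```python
import numpy as np

def pmse(orig, syn):
    """Propensity-score mean-squared error (pMSE) for one categorical
    variable, using within-category frequencies as the propensity model.
    orig, syn: sequences of category labels."""
    data = np.concatenate([orig, syn])
    label = np.concatenate([np.zeros(len(orig)), np.ones(len(syn))])
    c = len(syn) / len(data)               # expected propensity
    scores = np.empty(len(data))
    for cat in np.unique(data):
        mask = data == cat
        scores[mask] = label[mask].mean()  # share of synthetic in this cell
    return float(np.mean((scores - c) ** 2))
```

A pMSE of zero indicates the synthetic data reproduce the original category frequencies exactly; larger values indicate the classifier can separate the two files.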
When access to sensitive data is restricted, disseminating synthetic data enables easy access to data that retains the statistical properties of the original. However, the model employed when generating the synthetic data may influence the structure of the data, potentially affecting subsequent predictive analysis. This paper empirically investigates whether the choice of synthesis model...
The demand for georeferenced data is increasing, while sharing proprietary location data poses privacy and confidentiality challenges. This study investigates the use of synthetic data generators (SDGs) to protect sensitive locations in georeferenced datasets. We propose transforming spatial coordinates into a one-dimensional index via a Hilbert space-filling curve, thereby preserving local...
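The coordinate-to-index step can be sketched in a few lines. Below is the standard iterative xy-to-Hilbert-distance algorithm on a 2^order x 2^order grid (a generic sketch; the function name and parameters are illustrative, and the study's actual implementation may differ):

```python
def hilbert_index(order, x, y):
    """Map grid cell (x, y) to its distance along a Hilbert curve covering
    a 2**order x 2**order grid. Nearby cells tend to get nearby indices,
    which is what makes the 1-D transform locality-preserving."""
    n = 1 << order  # grid side length
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the sub-curve is oriented correctly
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Because consecutive indices along the curve correspond to adjacent grid cells, one-dimensional protection methods applied to the index tend to respect spatial locality.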
The Trusted Research Environment (TRE; also known as a Research Data Centre, RDC) has been the great success story of data access this century. By providing highly secure yet flexible access, the TRE has enabled research use of the most sensitive data. In turn, the development of the TRE has led to significant developments in research data governance, particularly output disclosure control. The TRE...
Differential privacy (DP) has become the de facto data protection mechanism due to its strong privacy guarantees. The mathematical foundation of $\epsilon$-DP is based on the principle that the presence or absence of any record in a data set should not influence the protected result by more than an exponential factor determined by the parameter $\epsilon$. Even though DP was originally...
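For reference, the "exponential factor" in the abstract can be written out explicitly: a randomized mechanism $M$ satisfies $\epsilon$-DP if, for all data sets $D, D'$ differing in a single record and all measurable output sets $S$,

```latex
\Pr[M(D) \in S] \;\le\; e^{\epsilon} \, \Pr[M(D') \in S].
```

Smaller values of $\epsilon$ force the two output distributions to be closer, i.e. any single record has less influence on the published result.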
This contribution addresses the intersection of statistical disclosure control and the special requirements of psychological research. As an example, we show the unique sensitivity and complexity of empirical data from psychological research, and discuss the problems and possibilities of anonymizing such data.
The replication crisis in psychology (Open Science Collaboration, 2015; Camerer et al., 2018) has...
When evaluating the scientific worth of microdata, formally anonymized data provide maximum research potential. However, such data can only be accessed onsite via Remote Execution or Safe Centers, which offers little convenience for data users. In contrast, factually anonymized data can be accessed from the institutional workspace (offsite access, e.g. Scientific Use Files, SUFs) but offer less...
Data in tables published for the Swedish R&D survey in the business enterprise sector (BERD) were previously protected by cell suppression to prevent disclosure of sensitive information. In order to avoid cell suppression, key respondents were asked to sign waivers allowing the publication of their data. However, consent was rarely given to disseminate cells where an enterprise’s data...
The package τ-ARGUS is a widely used, EU-funded open-source tool for disclosure control in tabular data. It is automatable via its batch functionality and, as an open-source package, is meant to be easy to adapt and transparent. As there is growing demand for specific τ-ARGUS functionalities to be provided “as a service”, the paper will discuss the pros and cons of a fundamental revision of the...
In many countries, perturbative methods are increasingly used to protect privacy in official statistics. The U.S. Census Bureau has applied the mechanism of differential privacy, specifically Zero-Concentrated Differential Privacy (zCDP), in creating statistical tables based on data from the 2020 Census as well as Privacy-Protected Microdata Files (PPMFs) as a...
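For context, zCDP (in the formulation of Bun and Steinke) bounds Rényi divergences of all orders: a mechanism $M$ satisfies $\rho$-zCDP if, for all neighboring data sets $D, D'$ and all $\alpha \in (1, \infty)$,

```latex
D_{\alpha}\bigl(M(D) \,\|\, M(D')\bigr) \;\le\; \rho \, \alpha,
```

where $D_{\alpha}$ denotes the Rényi divergence of order $\alpha$. This relaxation composes cleanly and is well suited to the Gaussian noise used in the Census Bureau's mechanisms.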
The presentation outlines the practical application and evaluation of the targeted record swapping (TRS) method in the context of the 2021 Polish Census (NSP2021), specifically for population data dissemination within a 1 km² grid framework. The method, recommended by Eurostat, was employed to address statistical disclosure control (SDC) requirements while preserving data utility. The talk...
For official statistics, the pragmatic goal is to limit statistical disclosure with standard methods while preserving utility and accuracy; differential privacy, by contrast, is regarded as a practical goal for technology organizations that use or even stream data and rely on open-source tools and libraries.
In this paper we explore the potential of using Bayesian methods to both estimate...
This article presents recommendations on data sharing written in the framework of the third phase of the G20 Data Gaps Initiative (DGI-3).
The recommendations were written by a task team led by Eurostat and the ECB, bringing together representatives of 21 countries.
The recommendations comprise, among others: definitions of terms, general principles of data sharing, modalities of...
The traditional approach to accessing register-based microdata requires researchers to apply for data on a project-by-project basis, a time-consuming process as each application must be manually reviewed and approved before the relevant data can be extracted and handed out.
A more flexible approach is to grant a broader range of researchers from authorized institutions quick access to a...
National Statistical Offices collect massive volumes of data to fulfill their missions. These data fuel the generation of regional, national, and international statistics across various sectors. However, their immense potential remains largely untapped due to strict and legitimate privacy regulations. In this context, Lomas is a novel open-source platform designed to realize the full potential...
The International Network for Exchanging Experience on Statistical Handling of Granular Data (INEXDA) is a collaborative project involving central banks, the ECB, Eurostat, and other international organizations and national statistical institutes, with strong support from the BIS. The primary goal of INEXDA is to facilitate the exchange of experiences related to the statistical handling of...
Establishing standardized Statistical Disclosure Control (SDC) processes is vital as data-sharing demands increase, anonymization techniques advance, and principles for privacy preservation continue to develop. In response, we present an SDC Architecture to systematically plan, implement, and document SDC of microdata, with the objective of improving consistency, transparency, and adaptability in...
Ensuring secure, efficient, and transparent data-sharing mechanisms is a key challenge for modern public administration, particularly when handling sensitive microdata for policy development and evaluation. The Ministry of Foreign Affairs of the Hellenic Republic (MFA) has implemented an innovative Strategic and Operational Planning (SOP) function, incorporating digital tools and Business...
The practice of Output Statistical Disclosure Control has developed largely by consensus, a situation now being challenged by several factors. The first is an almost Cambrian explosion in the number and scope of Trusted Research Environments, as many domains move away from the ‘download’ model of enabling research. A second challenge is the accompanying proliferation of...
The Humanitarian Data Exchange (HDX) is an open platform managed by the Centre for Humanitarian Data, designed to facilitate the sharing of humanitarian data among organizations and improve decision-making and response efforts in humanitarian crises. As part of its role in managing HDX, the Centre recognizes the various types of sensitive data being collected and used by partners to...
Trusted research environments have historically used rounding and thresholding as the recommended disclosure control method for exports of population data. However, within ONS Trusted Research Environments, for some datasets, perturbation is allowed in combination with thresholding. Code has been made available so researchers can create perturbed outputs using a specific level of noise.
This...
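The general perturb-then-threshold pattern described above can be illustrated in a few lines. This is a generic sketch only; the actual ONS rules, noise distribution, and parameter values differ, and the function name and defaults here are hypothetical:

```python
import numpy as np

def protect_counts(counts, noise_sd=2.0, threshold=10, seed=0):
    """Illustrative perturb-then-threshold rule for frequency counts:
    add zero-centred discrete noise, clip at zero, then suppress any cell
    whose perturbed count falls below the publication threshold.
    (A generic sketch; real TRE rules and parameters differ.)"""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    noise = np.rint(rng.normal(0.0, noise_sd, counts.shape)).astype(int)
    noisy = np.clip(counts + noise, 0, None)
    # None marks a suppressed (unpublishable) cell
    return [int(c) if c >= threshold else None for c in noisy]
```

For example, with the noise switched off, `protect_counts([120, 4], noise_sd=0.0)` returns `[120, None]`: the small cell is suppressed by the threshold rule while the large one is published.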
In 2023, CASD introduced a system to detect exports that do not comply with statistical secrecy [1]. This approach, based on feature generation from groups of exported files and the training of a boosting model, showed promise, but its precision could be improved. The system relied on historical data from past Statistical Disclosure Control (SDC) expert reviews, where decisions (Accepted/Refused)...
ε-Differential privacy (DP) is a popular privacy model that has been promoted as the de facto standard in most data-intensive areas. However, the selection of the privacy parameter ε (also called the budget) in applications of DP remains an open challenge. Even though the meaning and implications of the value of ε are not fully understood, it is clear that large budget values are less...
This paper evaluates disclosure risk measures for synthetic data generated by CART-based models, using both a controlled simulated dataset and publicly available data. We find that common disclosure risk measures may fail to detect disclosure risks and, in some cases, misrepresent actual disclosure risks. Additionally, CART-based models, while maintaining high statistical utility, may...
The rise in access to public data on the internet, and specifically online social networks (OSNs), is creating new pressures on the statistical disclosure control of microdata. Currently, Statistics Netherlands applies a criterion that looks at three properties of variables: rarity, visibility and searchability. Underlying this criterion are, similar to other methods used to assess the...
Government statistical agencies increasingly rely on sensitive tabular data to guide evidence-based policymaking, yet restrictions on data access hinder research and transparency. Synthetic data generated with Generative Adversarial Networks (GANs) offers a promising solution, but conventional GANs often produce unrealistic tables or fail to preserve the statistical relationships that matter...
Introduction: The de-identification of unstructured free-text data is important for sharing large amounts of healthcare information generated by electronic health records, publications and clinical trials. To automate this process, information extraction (IE) and natural language processing (NLP) are essential tools. However, evaluating NLP performance in de-identification requires...
At Statistics Norway, the methodology department is responsible for the internal education of staff. Traditionally, SDC training has been offered on demand, with courses held at most once a year. For a statistician who is newly employed or interested in applying a new SDC method, waiting up to a year for training is both impractical and unproductive. In addition, Statistics Norway has...
To reflect key advances in statistical disclosure control (SDC), we present a revised and unified version of the World Bank’s microdata anonymization guides. The World Bank previously published three separate guides: one on SDC theory and two practice guides for implementing microdata anonymization using the R package sdcMicro, both via command line and the sdcApp GUI. These guides have been...
Synthetic data is often hailed as the future of safe data access – but in practice, it is insufficient for a method to be mathematically private or analytically useful: if legal and privacy teams do not understand the guarantees, they cannot confidently allow its use. This creates a critical but underexplored tension between cutting-edge privacy techniques and real-world operational...