15–17 Oct 2025
Poblenou Campus Auditorium
Europe/Zurich timezone

Approaches to Synthetic Data Generation: Insights from Luxembourg's Census Data

15 Oct 2025, 09:45
14m
In-Person
Poblenou Campus Auditorium, Barcelona, Spain

Roc Boronat, 138 08018 Barcelona

Speaker

Dr Basheer Kalash (Luxembourg National Data Service (LNDS))

Description

We implement a synthetic data generation framework on a pseudonymized subset of Luxembourg's 2021 Census data. Focusing on seven categorical variables, including ordered age and education, we drop unique records upfront to mitigate the risk of singling out. Synthetic data are produced with the CART method in the synthpop R package. Utility is measured using the propensity score mean squared error (pMSE) and regression analyses, which confirm that key statistical relationships are retained. Privacy is further reinforced by aggregating numerical values (e.g., grouping ages), which, combined with the inherent randomness of the data generation process, mitigates disclosure risks related to linkability and inference. By addressing the three core concerns in EU guidance (singling out, linkability, and inference), our methodology meets standards for anonymization and statistical disclosure control, keeping disclosure risk minimal while maintaining data utility. In this presentation, we describe our approach and highlight lessons learned.
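To make two of the steps above concrete, here is a minimal Python sketch of dropping unique records and computing a pMSE-style utility score on purely categorical data. It is an illustration under stated assumptions, not the authors' pipeline: the work described uses synthpop in R, and the abstract does not say which propensity model was fitted. The sketch swaps in a saturated cell-frequency propensity model (each distinct combination of values is a cell), and the function names `drop_unique_records` and `pmse`, as well as the toy records, are hypothetical.

```python
from collections import Counter


def drop_unique_records(records):
    """Remove records whose full combination of values occurs exactly once.

    `records` is a list of tuples of categorical values (hypothetical layout).
    """
    counts = Counter(records)
    return [r for r in records if counts[r] > 1]


def pmse(original, synthetic):
    """Propensity score mean squared error with a saturated cell model.

    Assumption: the propensity of a record is the share of synthetic records
    among all records (original + synthetic) with the same combination of
    values. A fitted logistic or CART propensity model would replace this
    step in a fuller analysis.
    """
    n_orig, n_syn = len(original), len(synthetic)
    total = n_orig + n_syn
    if total == 0:
        raise ValueError("both datasets are empty")
    # Expected propensity if the two datasets were indistinguishable.
    c = n_syn / total
    orig_counts = Counter(original)
    syn_counts = Counter(synthetic)
    squared_error = 0.0
    for cell in set(orig_counts) | set(syn_counts):
        n_cell = orig_counts[cell] + syn_counts[cell]
        p = syn_counts[cell] / n_cell
        # Every record in the cell shares the same estimated propensity.
        squared_error += n_cell * (p - c) ** 2
    return squared_error / total


# Toy records: (age group, education level). Values are made up.
toy = [
    ("20-29", "tertiary"),
    ("20-29", "tertiary"),
    ("30-39", "secondary"),  # unique combination, dropped below
]
kept = drop_unique_records(toy)
score_identical = pmse(kept, kept)  # 0.0: the datasets cannot be told apart
```

A score of 0 means the cell model cannot distinguish synthetic from original records. With equally sized datasets that share no cells, the score reaches its maximum of 0.25.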

Authors

Dr Basheer Kalash (Luxembourg National Data Service (LNDS))
Claude Lamboray (STATEC)
