Speaker
Description
We implement a synthetic data generation framework on a pseudonymized subset of the 2021 Luxembourg Census data. Focusing on seven categorical variables, including ordered age and education, we drop unique records upfront to mitigate the risk of singling out. Synthetic data are produced with the CART method in the synthpop package. Utility is measured using the propensity score mean squared error (pMSE) and regression analyses, which confirm that key statistical relationships are retained. Privacy is further reinforced by aggregating numerical values (e.g., grouping ages into bands), which, combined with the inherent randomness of the data generation process, mitigates disclosure risks related to linkability and inference. By addressing the three core concerns in EU guidance (singling out, linkability, and inference), our methodology meets standards for anonymization and statistical disclosure control, keeping disclosure risk minimal while maintaining data utility. In this talk, we present our approach and highlight lessons learned.
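To make two of the steps above concrete, here is a minimal Python sketch of dropping unique records and computing a pMSE value. It is an illustration only, not the implementation used in the talk, which relies on the synthpop package in R. The function names `drop_unique_records` and `pmse` are hypothetical, and the toy records are invented for the example. The propensity score is estimated here as the share of synthetic records within each cell of the cross-tabulation, which is an assumption that only suits small, fully categorical data; a fitted propensity model (for example logistic regression or CART) would normally replace it.

```python
from collections import Counter


def drop_unique_records(records):
    """Keep only records whose full combination of values occurs more than once.

    `records` is a list of tuples, one tuple of categorical values per person.
    """
    counts = Counter(records)
    return [r for r in records if counts[r] > 1]


def pmse(original, synthetic):
    """Propensity score mean squared error for categorical records.

    pMSE = (1/N) * sum_i (p_i - c)^2, where N is the size of the stacked
    data, c is the share of synthetic records, and p_i is the estimated
    probability that record i is synthetic.

    Assumption: p_i is the synthetic share within the record's cell
    (a saturated model), not a fitted regression or tree model.
    """
    total = len(original) + len(synthetic)
    c = len(synthetic) / total
    orig_counts = Counter(original)
    syn_counts = Counter(synthetic)

    squared_error = 0.0
    for cell in set(orig_counts) | set(syn_counts):
        n_cell = orig_counts[cell] + syn_counts[cell]
        p = syn_counts[cell] / n_cell
        # Every record in the cell shares the same propensity score.
        squared_error += n_cell * (p - c) ** 2
    return squared_error / total


# Toy data: (age band, education level); values are illustrative only.
records = [
    ("30-39", "tertiary"),
    ("30-39", "tertiary"),
    ("40-49", "secondary"),
    ("40-49", "secondary"),
    ("80+", "primary"),  # unique combination, removed below
]
kept = drop_unique_records(records)
print(len(kept))
print(pmse(kept, kept))
```

A pMSE of zero means the original and synthetic records cannot be told apart by the propensity estimate; with equal-sized data sets the value is bounded above by 0.25, reached when the two sets share no cells at all.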