15–17 Oct 2025
Poblenou Campus Auditorium
Europe/Zurich timezone

Enhancing Statistical Disclosure Control using Large Language Models

16 Oct 2025, 14:50
14m
In-Person
Poblenou Campus Auditorium, Barcelona, Spain

Roc Boronat, 138 08018 Barcelona

Speaker

Titouan Rigaud (CASD Secure Data Hub)

Description

In 2023, CASD introduced a system to detect exports that do not comply with statistical secrecy [1]. This approach, which generates features from groups of exported files and trains a boosting model on them, showed promise, but its precision left room for improvement. The system relied on historical data from past Statistical Disclosure Control (SDC) expert reviews: the experts' decisions (Accepted/Refused) served as labels, and structured features were extracted from the reviewed exports. The primary goal was to identify situations that could pose a risk to confidential data and to trigger alerts for expert review. This dataset also provided a foundation for data augmentation techniques intended to make the model more robust.
An initial improvement to the system involved shifting the compliance prediction from the export level (a group of files) to the individual file level. This refinement allowed the model to be trained on a larger dataset, improving detection accuracy. The transition also necessitated a new approach to file representation and analysis, particularly through the use of structural components.
In this paper, we propose a further enhancement to increase the system’s reliability. By leveraging Large Language Models (LLMs), we introduce new predictive features that enrich the risk rating model. In particular, LLMs enable the detection of key attributes previously missing from the system, such as identifying columns containing headcounts and ensuring compliance with minimum value requirements.
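The minimum-value check on headcount columns can be illustrated with a minimal sketch. It assumes that an earlier step (in the abstract, an LLM) has already identified which columns hold headcounts; the threshold of 5, the function names, and the feature names are all hypothetical and are not taken from the CASD system.

```python
from dataclasses import dataclass

# Hypothetical threshold: the abstract does not state the minimum value required.
MIN_HEADCOUNT = 5


@dataclass
class ColumnCheck:
    column: str
    min_value: int
    compliant: bool


def check_headcount_columns(rows, headcount_columns, threshold=MIN_HEADCOUNT):
    """Check each headcount column against a minimum cell value.

    `rows` is a list of dicts (one per table row). `headcount_columns` is
    assumed to come from an upstream detection step, such as an LLM.
    """
    results = []
    for col in headcount_columns:
        # Skip missing or empty cells; everything else must parse as an integer.
        values = [int(r[col]) for r in rows if r.get(col) not in (None, "")]
        if not values:
            continue
        lowest = min(values)
        results.append(ColumnCheck(col, lowest, lowest >= threshold))
    return results


def file_features(rows, headcount_columns, threshold=MIN_HEADCOUNT):
    """Summarise the checks as features that a risk model could consume."""
    checks = check_headcount_columns(rows, headcount_columns, threshold)
    return {
        "n_headcount_columns": len(checks),
        "n_noncompliant_columns": sum(not c.compliant for c in checks),
        "min_headcount": min((c.min_value for c in checks), default=None),
    }


rows = [
    {"region": "A", "n_firms": 12, "n_employees": 3},
    {"region": "B", "n_firms": 7, "n_employees": 9},
]
features = file_features(rows, ["n_firms", "n_employees"])
```

In this toy table, `n_employees` contains a cell below the assumed threshold, so the file would contribute one non-compliant column to the feature set. How such features are weighted is left to the downstream risk rating model.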
Beyond numerical validation, LLMs also enhance the system’s ability to analyze textual content, determining whether exported data consists of code, structured datasets, or free-text documents. A crucial aspect is assessing the human readability of exported content, ensuring that flagged files can be manually reviewed when necessary. By improving transparency and traceability, this integration strengthens both the reliability and interpretability of statistical secrecy compliance checks.
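As a rough illustration of a human-readability signal, the sketch below uses a simple character-based heuristic. This is a stand-in chosen for illustration, not the LLM-based assessment described above; the function names and the 0.95 cut-off are hypothetical.

```python
import string

PRINTABLE = set(string.printable)


def printable_ratio(text: str) -> float:
    """Share of characters that are ASCII-printable or alphabetic (any script)."""
    if not text:
        return 0.0
    return sum(ch in PRINTABLE or ch.isalpha() for ch in text) / len(text)


def looks_human_readable(raw: bytes, min_ratio: float = 0.95) -> bool:
    """Heuristic: the content decodes as UTF-8 and is mostly printable text.

    The 0.95 cut-off is an arbitrary illustrative value.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return printable_ratio(text) >= min_ratio


csv_like = looks_human_readable(b"region,count\nA,12\nB,7\n")
binary_like = looks_human_readable(b"\x00\x01\x02abc")
```

A heuristic of this kind can only say whether a reviewer could open and read the file; it says nothing about whether the content is code, a dataset, or free text, which is the part the abstract attributes to the LLM.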

[1] Rigaud, Titouan, et al. "Checking data outputs from research works: a mixed method with AI and human control." United Nations Economic Commission for Europe (UNECE), Conference of European Statisticians (CES), Expert Meeting on Statistical Data Confidentiality, 2023.

Authors

Mr Cédric Hansen (CASD Secure Data Hub)
Titouan Rigaud (CASD Secure Data Hub)
