In 2023, CASD introduced a system to detect exports that do not comply with statistical secrecy [1]. This approach, based on generating features from groups of exported files and training a boosting model, showed promise, but its precision left room for improvement. The system relied on historical data from past Statistical Disclosure Control (SDC) expert reviews: the experts' decisions (Accepted/Refused) served as labels, and structured features were extracted from the reviewed exports. The primary goal was to identify situations that could put confidential data at risk and to trigger alerts for expert review. This dataset also provided a foundation for data augmentation techniques that enhance model robustness.
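The training setup described above can be sketched as follows. This is a minimal illustration, not CASD's actual pipeline: the feature names, the toy labeling rule, and the scikit-learn `GradientBoostingClassifier` choice are all assumptions standing in for the real feature extraction and boosting model.

```python
# Hedged sketch: a boosting model trained on structured features of past
# exports, with expert decisions (Accepted/Refused) as labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
# Hypothetical structured features extracted from each reviewed export:
X = np.column_stack([
    rng.integers(1, 20, n),    # number of files in the export
    rng.random(n),             # share of numeric columns
    rng.integers(0, 50, n),    # minimum observed cell count
])
# Toy labeling rule standing in for past expert decisions: exports whose
# minimum cell count is small tend to be refused.
y = (X[:, 2] < 11).astype(int)  # 1 = Refused, 0 = Accepted

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
risk_scores = clf.predict_proba(X_test)[:, 1]  # probability of refusal
```

At inference time, `risk_scores` would drive the alerts: exports above a chosen threshold are routed to an SDC expert for manual review.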
An initial improvement to the system involved shifting the compliance prediction from the export level (a group of files) to the individual file level. This refinement allowed the model to be trained on a larger dataset, improving detection accuracy. The transition also necessitated a new approach to file representation and analysis, particularly through the use of structural components.
In this paper, we propose a further enhancement to increase the system's reliability. By leveraging Large Language Models (LLMs), we introduce new predictive features that enrich the risk-rating model. In particular, LLMs make it possible to detect key attributes previously missing from the system, such as identifying columns that contain headcounts and verifying that their values comply with minimum-count requirements.
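Once the LLM has flagged which columns hold headcounts, the minimum-value check itself is deterministic. The sketch below illustrates this split; the `detect_headcount_columns` stub (a keyword heuristic standing in for the LLM call) and the threshold value are illustrative assumptions, not the paper's actual rule.

```python
# Hedged sketch: semantic column detection (here a stub) feeding a
# deterministic minimum-count check.
import pandas as pd

MIN_COUNT = 11  # example threshold; actual SDC rules vary by data source

def detect_headcount_columns(df: pd.DataFrame) -> list[str]:
    """Stand-in for an LLM call that labels columns by semantic type."""
    return [c for c in df.columns
            if "count" in c.lower() or "effectif" in c.lower()]

def minimum_count_violations(df: pd.DataFrame) -> dict[str, int]:
    """Per headcount column, count cells falling below the threshold."""
    return {
        col: int((df[col] < MIN_COUNT).sum())
        for col in detect_headcount_columns(df)
    }

table = pd.DataFrame({"region": ["A", "B", "C"],
                      "employee_count": [120, 7, 45]})
print(minimum_count_violations(table))  # {'employee_count': 1}
```

Keeping the numerical rule outside the LLM makes the resulting feature auditable: the model only decides *which* columns the rule applies to.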
Beyond numerical validation, LLMs also enhance the system’s ability to analyze textual content, determining whether exported data consists of code, structured datasets, or free-text documents. A crucial aspect is assessing the human readability of exported content, ensuring that flagged files can be manually reviewed when necessary. By improving transparency and traceability, this integration strengthens both the reliability and interpretability of statistical secrecy compliance checks.
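The content-type and readability features described above can be sketched with a simple heuristic standing in for the LLM classification; the token lists and the binary-detection logic are illustrative assumptions only.

```python
# Hedged sketch: classify exported content as code, structured dataset,
# or free text, and assess whether it is human-readable at all.
def content_features(raw: bytes) -> dict:
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes cannot be manually reviewed as text.
        return {"kind": "binary", "human_readable": False}
    readable = all(ch.isprintable() or ch in "\n\t\r " for ch in text)
    first_line = text.splitlines()[0] if text else ""
    # Crude type guess standing in for an LLM classification:
    if any(tok in text for tok in ("def ", "import ", "SELECT ")):
        kind = "code"
    elif ";" in first_line or "," in first_line:
        kind = "structured dataset"
    else:
        kind = "free text"
    return {"kind": kind, "human_readable": readable}

print(content_features(b"region;employee_count\nA;120"))
# {'kind': 'structured dataset', 'human_readable': True}
```

Files flagged as non-readable (e.g. binary blobs) are exactly those an expert cannot inspect directly, which is why this feature matters for the manual-review path.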
[1] Rigaud, Titouan, et al. "Checking Data Outputs from Research Works: A Mixed Method with AI and Human Control." United Nations Economic Commission for Europe (UNECE) Conference of European Statisticians (CES): Expert Meeting on Statistical Data Confidentiality, 2023.