Expert Meeting on Statistical Data Confidentiality

Name: Expert Meeting on Statistical Data Confidentiality
Start: 2025-10-15T08:45:00+02:00
End: 2025-10-17T16:00:00+02:00
Location: Poblenou Campus Auditorium

15–17 Oct 2025

Europe/Zurich timezone

Chris Jones

jonesc@un.org

Enhancing Data Confidentiality in Research: A Dual-Layered Approach to Secure Import of Text-Based Content

16 Oct 2025, 12:05

14m

In-Person

Poblenou Campus Auditorium, Barcelona, Spain

Roc Boronat, 138 08018 Barcelona

Machine Learning and Artificial Intelligence versus Disclosure Control

Speaker: Cédric Hansen (CASD)

The Secure Data Access Center (CASD) platform provides controlled access to sensitive administrative data for research purposes, in strict compliance with confidentiality regulations. Because it operates offline, CASD limits exposure to external threats while allowing researchers to access and analyze the data within a secure, fully isolated environment.
To improve research efficiency, CASD is integrating a feature that lets researchers import small pieces of text-based content, such as statistical code, structured datasets, or nomenclature mappings for categorical variables. This capability reduces the need for manual coding tasks and streamlines research workflows. Imports are initiated by the researchers themselves, without intervention by CASD staff.
However, text imports raise potential security concerns, in particular cyber risks such as the injection of hidden scripts or binaries that could exploit system vulnerabilities. These risks are already significantly mitigated by the strict isolation of research environments within CASD and by the size limit on imports.
To address the remaining risks, CASD employs a dual-layered control system that evaluates imported content before it is integrated into the secure environment. The first layer is deterministic and rule-based: it uses regular expressions and heuristic rules to scan text chunks for recognizable structured formats (e.g., CSV, JSON), as well as binary and code structures. When such content is detected, the system flags it and describes the associated potential risks.
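As an illustration only, the first layer could be sketched along the following lines. The rule set, the function names (scan_chunk, looks_like_csv, and so on) and the risk descriptions are assumptions made for this example; the abstract does not specify CASD's actual rules.

```python
import json
import re

# Hypothetical rules for illustration; not CASD's actual rule set.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{80,}={0,2}")
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0e-\x1f]")
CODE_HINTS = re.compile(
    r"^\s*(import\s+\w+|from\s+\w+\s+import|def\s+\w+\s*\(|library\(|#!/)",
    re.MULTILINE,
)


def looks_like_json(text):
    """True if the whole chunk parses as a JSON object or array."""
    stripped = text.strip()
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
    except ValueError:
        return False
    return True


def looks_like_csv(text):
    """Heuristic: at least two non-empty lines sharing one delimiter count."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return False
    for sep in (",", ";", "\t"):
        counts = {ln.count(sep) for ln in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return True
    return False


def has_binary_content(text):
    """Heuristic: control characters or a long base64-like run."""
    return bool(CONTROL_CHARS.search(text) or BASE64_RUN.search(text))


def scan_chunk(text):
    """Return a list of flags, each with a kind and a risk description."""
    flags = []
    if looks_like_json(text):
        flags.append({"kind": "json", "risk": "structured data may carry records"})
    if looks_like_csv(text):
        flags.append({"kind": "csv", "risk": "tabular data may carry records"})
    if has_binary_content(text):
        flags.append({"kind": "binary", "risk": "possible encoded executable"})
    if CODE_HINTS.search(text):
        flags.append({"kind": "code", "risk": "script could run in the environment"})
    return flags
```

Because every rule here is a plain function or a compiled pattern, the same input always yields the same flags, which is the property that makes this layer deterministic.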
The second layer uses an offline Large Language Model (LLM) to refine content classification and detect anomalies. Running in an isolated environment, the LLM analyzes the text to distinguish between scripts, structured data, categorical nomenclature, binary structures, and free text. It also identifies content that could pose a security risk to the platform, such as potentially malicious scripts, binaries, or content of unknown type. The LLM achieves over 90% accuracy in classification tasks on the training dataset.
This AI-driven analysis significantly improves the platform’s ability to detect subtle confidentiality risks that rule-based filtering alone may miss.
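One way to picture the second layer is a thin wrapper that asks the model for a single label and rejects anything else. This is a sketch under assumptions: the generate callable stands in for whatever offline model interface is used, and the label names and prompt wording are invented for the example.

```python
# Labels mirror the five categories named in the abstract; the exact
# strings are hypothetical.
CATEGORIES = ("script", "structured_data", "nomenclature", "binary", "free_text")

PROMPT_TEMPLATE = (
    "Classify the following text as exactly one of: {labels}.\n"
    "Answer with the label only.\n\n"
    "Text:\n{text}\n"
)


def classify_with_llm(text, generate):
    """Classify text using generate, a hypothetical callable (str -> str)
    that wraps the offline model. Replies that are not an exact known
    label are treated as "unknown" rather than guessed at."""
    prompt = PROMPT_TEMPLATE.format(labels=", ".join(CATEGORIES), text=text)
    reply = generate(prompt).strip().lower().rstrip(".")
    return reply if reply in CATEGORIES else "unknown"
```

Mapping every unexpected reply to "unknown" is a deliberately conservative choice in this sketch: an ambiguous model answer leads to review instead of silent acceptance.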
Finally, the system generates a detailed report for each imported text, documenting the detected features and anomalies. This report facilitates human oversight by providing security personnel with a transparent view of the evaluation process, allowing for manual review and validation of flagged content.
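A per-import report of this kind might look like the sketch below. The field names, the build_report helper, and the rule deciding when manual review is needed are assumptions for illustration; the abstract does not describe the report's actual layout.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical set of model labels that always trigger manual review.
REVIEW_LABELS = ("script", "binary", "unknown")


def build_report(text, rule_flags, llm_label):
    """Combine the output of both layers into one reviewable record.

    rule_flags: list of dicts from the rule-based layer (assumed shape).
    llm_label: label returned by the model-based layer (assumed shape).
    """
    data = text.encode("utf-8")
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # identifies the import
        "size_bytes": len(data),
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "rule_flags": list(rule_flags),
        "llm_label": llm_label,
        # Assumed policy: review if any rule fired or the label is risky.
        "needs_manual_review": bool(rule_flags) or llm_label in REVIEW_LABELS,
    }


if __name__ == "__main__":
    example = build_report(
        "id,label\n1,Agriculture",
        [{"kind": "csv", "risk": "tabular data may carry records"}],
        "structured_data",
    )
    print(json.dumps(example, indent=2))
```

Storing a hash of the import instead of the text itself keeps the report small while still letting a reviewer tie it to one specific submission.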

Author: Cédric Hansen (CASD)

Co-authors: Mr Rémy Marquier (CASD), Mr Titouan Rigaud (CASD)

Presentation materials: SDC2025_Sc_CASD_Hansen.pdf
