15–17 Oct 2025
Poblenou Campus Auditorium
Europe/Zurich timezone

Detecting Contextually Sensitive Data with AI

16 Oct 2025, 12:35
10m
In-Person
Poblenou Campus Auditorium, Barcelona, Spain

Poblenou Campus Auditorium

Roc Boronat, 138, 08018 Barcelona

Speaker

Ms Madelon Hulsebos (Centrum Wiskunde & Informatica (CWI))

Description

The Humanitarian Data Exchange (HDX) is an open platform managed by the Centre for Humanitarian Data, designed to facilitate the sharing of humanitarian data among organizations and to improve decision-making and response efforts in humanitarian crises. As part of its role in managing HDX, the Centre recognizes the various types of sensitive data collected and used by partners to support humanitarian operations. While humanitarian organizations are not allowed to share personally identifiable information (PII) on HDX, they may upload survey or multi-needs assessment datasets that are sensitive due to the risk of re-identifying individuals. In addition, data can be deemed highly sensitive in some contexts (e.g., active conflict) or countries (e.g., due to local laws or restrictions) while not being sensitive in others. To ensure sensitive data is not exposed on the platform, the HDX team manually reviews every dataset at the time of publication. However, as the platform has grown and the diversity of shared data has increased, detecting potentially sensitive data has become a significant challenge.

In collaboration with researchers at the Centrum Wiskunde & Informatica (CWI), we developed an approach to detecting contextually sensitive data in tabular datasets that leverages a multi-stage pipeline powered by large language models (LLMs). The system processes raw tabular data through sequential stages: standardization and metadata extraction, personal data detection using lightweight LLMs, non-personal sensitive data detection using LLMs with retrieval-augmented generation (RAG), and microdata-level sensitivity analysis. This approach enables automatic detection of sensitive data that considers both the inherent properties of the data and contextual information such as geopolitical conditions and Information Sharing Protocols. Key considerations in this implementation span both technical and ethical dimensions. For data protection, we support both proprietary model APIs (which receive only table headers and metadata) for maximum accuracy and locally hosted models that ensure sensitive data is never sent to third parties. The system is designed to adapt to different humanitarian contexts with high accuracy and to be easy to maintain.
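
As a concrete illustration, the minimal Python sketch below shows how such a four-stage pipeline might be wired together. The class names, prompts, and interfaces are assumptions made for illustration only; they are not the actual HDX/CWI implementation.

from dataclasses import dataclass, field
from typing import Protocol


class LLM(Protocol):
    """Any completion backend: a proprietary API or a locally hosted model."""
    def complete(self, prompt: str) -> str: ...


@dataclass
class TableProfile:
    headers: list[str]
    metadata: dict                        # e.g. country, crisis context, source org
    sample_rows: list[dict] = field(default_factory=list)


def standardize(headers: list[str], metadata: dict, rows: list[dict]) -> TableProfile:
    # Stage 1: normalize headers and extract/attach metadata.
    return TableProfile([h.strip().lower() for h in headers], metadata, rows)


def detect_personal(profile: TableProfile, llm: LLM) -> str:
    # Stage 2: a lightweight LLM sees only headers and metadata, so this
    # stage can safely use a proprietary API without exposing cell values.
    return llm.complete(
        f"Columns: {profile.headers}. Metadata: {profile.metadata}. "
        "Which columns may contain personal data?"
    )


def detect_contextual(profile: TableProfile, llm: LLM, guidance: list[str]) -> str:
    # Stage 3: RAG over context documents (e.g. Information Sharing
    # Protocols, country guidance) retrieved for this dataset's metadata.
    return llm.complete(
        f"Guidance: {guidance}. Columns: {profile.headers}. "
        "Which columns are sensitive in this context?"
    )


def analyze_microdata(profile: TableProfile, local_llm: LLM) -> str:
    # Stage 4: row-level re-identification risk; raw cell values are passed
    # only to a locally hosted model, never to a third-party API.
    return local_llm.complete(
        f"Rows: {profile.sample_rows}. Assess re-identification risk."
    )

Separating the header-only stages from the row-level stage is what allows the accuracy/data-protection trade-off described above: the high-accuracy proprietary backend never sees microdata.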

Initial experiments demonstrate significant performance improvements over methods previously used in HDX. The LLM-based approach using GPT-4o-mini, without any tuning or examples, already achieves an F1 score of 0.78, substantially outperforming baseline methods including Google Cloud DLP (F1: 0.54-0.56) and Microsoft Presidio (F1: 0.52). The LLM approach also demonstrates superior precision (0.78 vs. 0.39-0.55) and dramatically lower false positive rates (0.047 vs. 0.087-0.219), addressing a key limitation of current tools. Open models such as Cohere's Aya show promising results (F1: 0.65-0.70), offering a viable option for local deployment. These improvements enable more accurate identification of sensitive information while reducing the burden of manual reviews, ultimately enhancing data protection for vulnerable populations in humanitarian contexts.
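
For reference, these scores follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall), with the false positive rate given by FP / (FP + TN). Since F1 is the harmonic mean of precision and recall, the GPT-4o-mini result (F1 of 0.78 at precision 0.78) implies a recall of about 0.78 as well.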

Authors

Mr Javier Teran (OCHA's Centre for Humanitarian Data)
Ms Liang Telkamp (Centrum Wiskunde & Informatica (CWI))
Ms Madelon Hulsebos (Centrum Wiskunde & Informatica (CWI))

Co-author

Ms Melanie Rabier (OCHA's Centre for Humanitarian Data)
