Expert Meeting on Statistical Data Confidentiality

Name: Expert Meeting on Statistical Data Confidentiality
Start: 2025-10-15T08:45:00+02:00
End: 2025-10-17T16:00:00+02:00
Location: Poblenou Campus Auditorium

15–17 Oct 2025

Europe/Zurich timezone

Chris Jones

jonesc@un.org

Beyond Standard Metrics: Assessing NLP for De-Identification of Clinical Trial Narratives

16 Oct 2025, 17:35

10m

In-Person

Poblenou Campus Auditorium, Barcelona, Spain

Roc Boronat, 138 08018 Barcelona

Disclosure Risk

Speaker: Prof. Mark Elliot (University of Manchester)

Introduction: De-identifying unstructured free-text data is important for sharing the large volumes of healthcare information generated by electronic health records, publications and clinical trials. Information extraction (IE) and natural language processing (NLP) are essential tools for automating this process. However, evaluating NLP performance in de-identification requires understanding the types of entities involved and their significance for re-identification.

Methods: This paper examines patient safety narratives from clinical trials from F. Hoffmann-La Roche as a case study of free-text data that would undergo de-identification for public release. The study employs a mixed-methods approach, combining quantitative text analysis and qualitative content analysis, to identify challenges in the use of NLP for de-identification and to build a framework for the impact of different entity types on NLP performance in de-identification.

Results: The findings show that clinical trial narratives are dense in patient-level medical information and that the weighting of entities for evaluating NLP performance in de-identification depends on the intruder's background knowledge, although some commonalities exist across different intruder types. Current NLP evaluation metrics do not translate well to de-identification tasks because some entities are more crucial to remove than others. Qualitative insights suggest that different therapeutic areas may require different de-identification approaches due to the nature of the disclosive information. Notably, psychiatric clinical trial narratives posed significant challenges for NLP due to the complex semantic characteristics of the free-text data.

Conclusions: Current NLP evaluation metrics are not well suited to de-identification tasks. Certain therapeutic areas, such as psychiatry, push the limits of NLP capabilities more than others. Further research is needed to develop robust mathematical equations for evaluating NLP in the context of de-identification.
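The point that some entities are more crucial to remove than others can be illustrated with a minimal sketch comparing plain recall with a weighted recall. This is not the authors' framework: the entity labels, the placeholder texts, the weights and the function names below are all hypothetical, chosen only to show how a single missed high-risk entity can be hidden by an unweighted score.

```python
from collections import namedtuple

# Hypothetical entity representation: surface text plus an entity-type label.
Entity = namedtuple("Entity", ["text", "label"])


def recall(gold, detected):
    """Standard recall: share of gold entities the system found."""
    if not gold:
        return 1.0
    hits = sum(1 for entity in gold if entity in detected)
    return hits / len(gold)


def weighted_recall(gold, detected, weights, default_weight=1.0):
    """Recall where each gold entity counts in proportion to the weight
    assigned to its label (a stand-in for re-identification risk)."""
    total = sum(weights.get(e.label, default_weight) for e in gold)
    if total == 0:
        return 1.0
    caught = sum(
        weights.get(e.label, default_weight) for e in gold if e in detected
    )
    return caught / total


# Toy annotations (placeholders, not real data).
gold = [
    Entity("<date 1>", "DATE"),
    Entity("<date 2>", "DATE"),
    Entity("<site>", "LOCATION"),
    Entity("<rare event>", "RARE_EVENT"),
]

# The hypothetical system misses only the rare-event mention.
detected = {gold[0], gold[1], gold[2]}

# Assumed weights for one intruder type; another intruder with different
# background knowledge would call for a different weight table.
weights = {"DATE": 1.0, "LOCATION": 2.0, "RARE_EVENT": 5.0}

plain = recall(gold, detected)
weighted = weighted_recall(gold, detected, weights)
print(f"recall={plain:.3f} weighted_recall={weighted:.3f}")
```

Under these assumed weights the unweighted score counts three of four entities as found, while the weighted score falls below one half because the missed entity carries most of the weight. Swapping in a different weight table per intruder type is one simple way to reflect the dependence on background knowledge described in the results.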

Author: Nastazja Laskowski

Co-authors: Prof. Mark Elliot (University of Manchester); Prof. Goran Nenadic (University of Manchester)

Presentation materials: SDC2025_Sb_UnivManchester_Elliot.pdf
