Description
Introduction: De-identifying unstructured free-text data is important for sharing the large volumes of healthcare information generated by electronic health records, publications and clinical trials. Information extraction (IE) and natural language processing (NLP) are essential tools for automating this process. However, evaluating NLP performance in de-identification requires understanding the types of entities involved and their significance for re-identification.
Methods: This paper examines patient safety narratives from F. Hoffmann-La Roche clinical trials as a case study of free-text data that would undergo de-identification for public release. The study employs a mixed-methods approach, including quantitative text analysis and qualitative content analysis, to identify challenges in using NLP for de-identification and to build a framework for the impact of different entity types on NLP performance in de-identification.
Results: The findings show that clinical trial narratives are dense in patient-level medical information, and that the weighting of entities for evaluating NLP performance in de-identification depends on the intruder’s background knowledge, although some commonalities exist across intruder types. Current NLP evaluation metrics do not translate well to de-identification tasks because some entities are more crucial to remove than others. Qualitative insights suggest that different therapeutic areas may require different de-identification approaches because of the nature of the disclosive information. Notably, psychiatric clinical trial narratives posed significant challenges for NLP because of the complex semantic characteristics of the free-text data.
Conclusions: Current NLP evaluation metrics are not well suited to de-identification tasks. Certain therapeutic areas, such as psychiatry, push the limits of NLP capabilities more than others. Further research is needed to develop robust mathematical equations for evaluating NLP in the context of de-identification.
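To illustrate why an unweighted metric can mislead when some entities matter more than others, the following is a minimal sketch, not the study's method. The entity types, the weights in `EXAMPLE_WEIGHTS`, and the function names are all hypothetical and chosen only for illustration; the abstract does not specify any weighting scheme.

```python
from typing import Dict, List, Set, Tuple

# An annotated entity: (entity_type, surface_text).
Entity = Tuple[str, str]

# Hypothetical re-identification weights, for illustration only.
# They are NOT taken from the study.
EXAMPLE_WEIGHTS: Dict[str, float] = {"NAME": 1.0, "DATE": 0.5, "AGE": 0.25}


def recall(gold: List[Entity], found: Set[Entity]) -> float:
    """Plain recall: every gold entity counts the same."""
    if not gold:
        return 1.0
    hits = sum(1 for entity in gold if entity in found)
    return hits / len(gold)


def weighted_recall(
    gold: List[Entity], found: Set[Entity], weights: Dict[str, float]
) -> float:
    """Recall where each gold entity counts in proportion to its type weight.

    Entity types missing from `weights` default to a weight of 1.0.
    """
    total = sum(weights.get(etype, 1.0) for etype, _ in gold)
    if total == 0:
        return 1.0
    hit = sum(
        weights.get(etype, 1.0) for (etype, text) in gold if (etype, text) in found
    )
    return hit / total


# Toy example: the system removes the date and both ages but misses the name.
gold = [
    ("NAME", "John Smith"),
    ("DATE", "2020-01-01"),
    ("AGE", "45"),
    ("AGE", "46"),
]
found = {("DATE", "2020-01-01"), ("AGE", "45"), ("AGE", "46")}

plain = recall(gold, found)
weighted = weighted_recall(gold, found, EXAMPLE_WEIGHTS)
print(plain, weighted)
```

In this toy case the plain recall is 0.75 while the weighted recall is 0.5, because the single missed entity is the one assumed to be most disclosive. Since the abstract reports that appropriate weights depend on the intruder's background knowledge, any real weighting would have to be defined per intruder type rather than fixed as it is here.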