Description
The Secure Data Access Center (CASD) platform provides controlled access to sensitive administrative data for research purposes, ensuring strict adherence to confidentiality regulations. Because it operates offline, CASD reduces exposure to external threats while allowing researchers to access and analyze the data within a secure, fully isolated environment.
To improve research efficiency, CASD is integrating a feature that allows researchers to import small pieces of text-based content, such as statistical code, structured datasets, or nomenclature mappings for categorical variables. This capability reduces the need for manual coding tasks and streamlines research workflows. These imports are initiated by the researchers themselves, without intervention by CASD teams.
However, text imports raise security concerns, in particular the possibility of injecting hidden scripts or binaries that could exploit system vulnerabilities. These risks are already significantly mitigated by the strict isolation of research environments within CASD and by the restricted size of possible imports.
To address these risks, CASD employs a dual-layered control system that evaluates imported content prior to its integration into the secure environment. The first layer utilizes a deterministic, rule-based approach that employs regular expressions and heuristic rules to scan text chunks for recognizable structured formats (e.g., CSV, JSON), as well as binary and code structures. When such content is detected, the system flags it and provides descriptions of the associated potential risks.
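As an illustration of what such a rule-based layer might look like, here is a minimal Python sketch. The patterns, thresholds, function names, and risk descriptions are hypothetical and are not CASD's actual rules; they only show how regular expressions and simple heuristics can flag structured formats, code, and binary-like content in a text chunk.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Finding:
    kind: str  # e.g. "csv", "json", "code", "binary", "encoded_blob"
    risk: str  # human-readable description of the potential risk


# Hypothetical patterns; a real deployment would tune and extend these.
CODE_PATTERN = re.compile(
    r"^\s*(import\s+\w+|from\s+\w+\s+import\b|def\s+\w+\s*\(|library\(|#!/)",
    re.MULTILINE,
)
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/]{200,}={0,2}")
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0e-\x1f]")


def looks_like_json(text: str) -> bool:
    stripped = text.strip()
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
    except ValueError:
        return False
    return True


def looks_like_csv(text: str) -> bool:
    # Heuristic: at least two non-empty lines sharing the same delimiter count.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return False
    for delim in (",", ";", "\t"):
        counts = {ln.count(delim) for ln in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return True
    return False


def scan_chunk(text: str) -> list[Finding]:
    findings = []
    if CONTROL_CHARS.search(text):
        findings.append(Finding("binary", "control characters suggest embedded binary data"))
    if BASE64_PATTERN.search(text):
        findings.append(Finding("encoded_blob", "long base64-like run may hide a payload"))
    if CODE_PATTERN.search(text):
        findings.append(Finding("code", "executable statements detected"))
    if looks_like_json(text):
        findings.append(Finding("json", "structured data; check for sensitive fields"))
    elif looks_like_csv(text):
        findings.append(Finding("csv", "tabular data; check volume and content"))
    return findings
```

Calling `scan_chunk` on each chunk of an import yields a list of findings, each pairing a detected structure with a short risk description, which mirrors the flag-and-describe behaviour outlined above.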
The second layer leverages an offline Large Language Model (LLM) to enhance content classification and detect anomalies. Operating in an isolated environment, the LLM analyzes the text to distinguish between scripts, structured data, categorical nomenclature, binary structures, and free text. It also flags potentially malicious content, such as suspicious scripts, embedded binaries, or content it cannot identify, which could pose security risks to the platform. The LLM achieves over 90% accuracy in classification tasks on the training dataset.
This AI-driven analysis improves the platform’s ability to detect subtle confidentiality risks that may not be apparent through rule-based filtering alone.
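The classification step could be wrapped roughly as in the sketch below. It assumes the offline model is exposed as a plain Python callable that takes a prompt string and returns a text reply; the label names, the prompt wording, and the `classify` function are hypothetical and do not describe CASD's actual model interface.

```python
from typing import Callable

# Hypothetical label set mirroring the categories named in the text.
LABELS = ("script", "structured_data", "nomenclature", "binary", "free_text")


def build_prompt(text: str) -> str:
    return (
        "Classify the following text as exactly one of: "
        + ", ".join(LABELS)
        + ". Answer with the label only.\n\n"
        + text
    )


def classify(text: str, llm: Callable[[str], str]) -> str:
    """Return one of LABELS, or "unknown" if the reply cannot be parsed."""
    words = llm(build_prompt(text)).strip().lower().split()
    if not words:
        return "unknown"
    candidate = words[0].strip(".,:;\"'")
    # Anything outside the expected label set is treated as unknown,
    # so that unparseable replies are surfaced rather than trusted.
    return candidate if candidate in LABELS else "unknown"
```

Mapping every unrecognized reply to `"unknown"` is a conservative design choice: content the model cannot place in a known category is surfaced for review instead of being silently accepted.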
Finally, the system generates a detailed report for each imported text, documenting the detected features and anomalies. This report facilitates human oversight by providing security personnel with a transparent view of the evaluation process, allowing for manual review and validation of flagged content.
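A per-import report of this kind might be assembled as in the following sketch. The field names, the SHA-256 fingerprint, and the `needs_review` rule are illustrative assumptions, not the actual CASD report format.

```python
import hashlib
import json

# Hypothetical set of categories that should always trigger manual review.
REVIEW_KINDS = {"binary", "encoded_blob", "code", "script", "unknown"}


def build_report(text: str, rule_findings: list[dict], llm_label: str) -> dict:
    """Combine both layers' results into one reviewable record."""
    flagged = {f["kind"] for f in rule_findings} | {llm_label}
    return {
        # Fingerprint lets reviewers tie the report to the exact imported text.
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "size_chars": len(text),
        "rule_findings": rule_findings,
        "llm_label": llm_label,
        "needs_review": bool(flagged & REVIEW_KINDS),
    }


if __name__ == "__main__":
    report = build_report(
        "import os",
        [{"kind": "code", "risk": "executable statements detected"}],
        "script",
    )
    print(json.dumps(report, indent=2))
```

Keeping the rule-based findings and the LLM label side by side in the same record lets security personnel see which layer raised each flag.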