Description
Trusted research environments have historically used rounding and thresholding as the recommended disclosure control method for exports of population data. However, within ONS Trusted Research Environments, for some datasets, perturbation is allowed in combination with thresholding. Code has been made available so researchers can create perturbed outputs using a specific level of noise.
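As a rough illustration of perturbation combined with thresholding, the sketch below suppresses counts under a threshold and adds rounded random noise to the rest. It is not the ONS code: the function name `perturb_counts`, the Gaussian noise model, the `noise_sd` value and the threshold of 10 are all assumptions made for the example.

```python
import random


def perturb_counts(raw_counts, noise_sd=2.0, threshold=10, seed=None):
    """Suppress small counts, then add rounded Gaussian noise to the rest.

    All parameter values are hypothetical; the approved method and noise
    level are not stated in this abstract.
    """
    rng = random.Random(seed)
    output = []
    for count in raw_counts:
        if count < threshold:
            # Thresholding: counts below the threshold are not released.
            output.append(None)
        else:
            # Perturbation: add zero-mean noise and round to a whole count.
            noisy = count + rng.gauss(0.0, noise_sd)
            output.append(max(0, round(noisy)))
    return output


raw = [3, 12, 57, 8, 240]
perturbed = perturb_counts(raw, seed=1)
```

Fixing `seed` makes the example repeatable; a real export would not normally use a fixed, known seed.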
This creates a problem: how can export checkers tell if an output has been correctly perturbed? Even with supporting information showing the raw counts, it is not obvious that a researcher has used the right method and parameters to create the perturbed counts for export.
To this end, machine learning methods were trialled on a set of synthetic training data (n=5000). Training data was created using the perturbation code so that the datasets would resemble genuine exports. Five different types were produced; 50% were generated with the ‘correct’ method and parameters. Logistic Regression, XGBoost, Random Forest, K Nearest Neighbours, Naive Bayes and Support Vector Machine models were trained and evaluated.
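One way such a labelled training set could be built is sketched below, using only the Python standard library. It is a simplified illustration under stated assumptions, not the authors' procedure: the Gaussian noise model, the value of `CORRECT_SD`, the alternative "incorrect" noise levels, the 20 cells per table and the two summary features are all hypothetical, and the five dataset types mentioned above are not reproduced.

```python
import random
import statistics

CORRECT_SD = 2.0  # hypothetical "approved" noise level


def make_example(rng, correct, n_cells=20):
    """Return (features, label) for one synthetic export table."""
    # Incorrect examples use a noise level that differs from the approved one.
    sd = CORRECT_SD if correct else rng.choice([0.5, 1.0, 4.0, 8.0])
    raw = [rng.randint(10, 500) for _ in range(n_cells)]
    # Residual = perturbed count minus raw count, available to a checker
    # when raw counts are supplied as supporting information.
    residuals = [round(c + rng.gauss(0.0, sd)) - c for c in raw]
    features = (statistics.mean(residuals), statistics.pstdev(residuals))
    return features, int(correct)


def make_training_set(n=5000, seed=0):
    """Build n labelled examples, half generated with the correct noise."""
    rng = random.Random(seed)
    labels = [True] * (n // 2) + [False] * (n - n // 2)
    rng.shuffle(labels)
    return [make_example(rng, label) for label in labels]


training_set = make_training_set()
```

Each element of `training_set` is a `(features, label)` pair that could then be passed to any of the classifiers listed above.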
This paper explores the results and how these models could be applied in the Trusted Research Environment context.