Speakers
Description
We will present preliminary outcomes of a project at Statistics Netherlands (CBS) that runs from February to June 2026. The project extends prior efforts and builds on an existing first proof of concept. It explores the potential of Large Language Model (LLM)-based Retrieval-Augmented Generation (RAG) to help researchers efficiently identify relevant variables and automatically generate analysis datasets from CBS microdata.
CBS provides researchers from academia and public agencies with access to detailed microdata under strict conditions, enabling new insights in the social and economic sciences. However, because the microdata are large and complex, researchers often struggle to locate relevant variables and assemble datasets. This project aims to address these challenges by exploring how AI and (metadata) standards, such as the Generic Statistical Information Model (GSIM), can streamline and enhance the data discovery and dataset creation process.
We plan to present an initial prototype system that demonstrates how LLMs and RAG can support researchers in variable discovery and dataset creation. This system will build on the existing proof of concept and will be designed to leverage CBS metadata standards in order to improve the structure and organization of metadata for more effective use by LLMs. The prototype will be tested through interactive interfaces, such as chatbots or research assistants, to evaluate the potential of RAG-based systems in supporting researchers throughout the data discovery and dataset assembly process.
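To make the retrieval step of such a RAG setup concrete, the following is a minimal sketch and not the project's implementation. It assumes a small list of hypothetical variable metadata records (the variable names, dataset names, and descriptions are invented for illustration and are not CBS metadata), ranks them against a researcher's question with a simple bag-of-words cosine similarity, and assembles the top matches into a prompt that could be passed to an LLM. A real system would likely use embedding-based retrieval and structured metadata instead.

```python
import math
import re
from collections import Counter

# Hypothetical metadata records; all names and descriptions are invented.
VARIABLES = [
    {"name": "VAR_INCOME", "dataset": "DATASET_A",
     "description": "gross annual household income in euros"},
    {"name": "VAR_EDUCATION", "dataset": "DATASET_B",
     "description": "highest education level attained by person"},
    {"name": "VAR_MUNICIPALITY", "dataset": "DATASET_A",
     "description": "municipality of residence of the household"},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, records, k=2):
    """Return up to k records whose description overlaps with the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(r["description"]))), r)
              for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored[:k] if score > 0]


def build_prompt(query, records):
    """Combine the retrieved metadata and the question into one LLM prompt."""
    context = "\n".join(
        f"- {r['name']} ({r['dataset']}): {r['description']}" for r in records
    )
    return (f"Candidate variables:\n{context}\n\n"
            f"Researcher question: {query}\n"
            "Suggest which variables to use.")


if __name__ == "__main__":
    question = "household income"
    print(build_prompt(question, retrieve(question, VARIABLES)))
```

The generation step is deliberately left out: the prompt returned by `build_prompt` is what a chatbot or research assistant interface would send to the LLM of choice.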
In addition to presenting the prototype, we will share preliminary recommendations for future development and integration of AI tools within CBS's statistical production workflows. We will also provide a GSIM mapping of the relevant metadata elements used in the project, offering a structured reference for future enhancements and standardization efforts.