26–28 May 2026
Dorint Pallas Hotel
Europe/Zurich timezone

Lessons learned from a pilot project on more effective description of statistical data using AI and RAG

Not scheduled
15m
In-Person
Dorint Pallas Hotel, Wiesbaden, Germany

Speaker

Veiko Berendsen (Statistics Estonia)

Description

Statistics Estonia receives data from more than 100 administrative data sources, which may come with minimal metadata. While a new metadata information system (Colectica) is being implemented, our data warehouse remains the same (an Oracle DB with one schema for each statistical activity or administrative data source). Schema types for administrative data sources and statistical activities differ, and each statistical activity has its own data stages in the data warehouse. The metadata system currently in use does not meet customers' needs for a data catalogue in which they can find the right data.

As more metadata is needed, both for data lineage and for entities in a data model, we carried out a project to enhance metadata creation using LLMs (GPT-4 and GPT-5 Mini) and RAG (Retrieval-Augmented Generation). The RAG component drew on the following sources:
(1) Similar, previously generated descriptions; and
(2) Other text documents, such as the contracts for data delivery/service.
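
The retrieval step described above can be illustrated with a minimal sketch. It is not the project's implementation: the bag-of-words cosine ranking stands in for whatever retriever was actually used, and the function names, document fields, sample texts and prompt wording are all hypothetical. The sketch stops at prompt assembly and does not call an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents that share vocabulary with the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d["text"]))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_prompt(column_name, table_name, context_docs):
    """Assemble a prompt that places the retrieved context before the task."""
    context = "\n".join(f"- [{d['kind']}] {d['text']}" for d in context_docs)
    return (
        f"Context:\n{context}\n\n"
        f"Write a one-sentence description of column {column_name} "
        f"in table {table_name}."
    )


# Hypothetical sources of the two kinds named in the text:
# earlier generated descriptions and contract documents.
documents = [
    {"kind": "earlier_description",
     "text": "PERSON_ID: unique identifier of a person in the population register"},
    {"kind": "contract",
     "text": "The data provider delivers monthly employment records including employer code and salary"},
    {"kind": "earlier_description",
     "text": "BIRTH_DATE: date of birth of the person"},
]

top = retrieve("PERSON_ID person identifier", documents, k=2)
prompt = build_prompt("PERSON_ID", "EMPLOYMENT", top)
```

In a real setting the ranking would more likely use embeddings, but the flow is the same: rank the uploaded sources against the object being described, then pass the best matches to the model as context.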

As part of our wider project, we have created data dictionaries and business vocabularies for official registers, many of which are sources for official statistics.

We tried to focus on a single process for generating descriptions at different levels and for different objects in the DDI model. In the end we limited the scope to Dataset (DDI.PhysicalInstance), Table (DDI.RecordLayout) and Variable (DDI.InstanceVariable), and to three attributes: Name, Label and Description. We created a lean UI with features to:
(1) Upload sources,
(2) Select sources (by topic) and create new metadata by using RAG,
(3) Fine-tune the prompts, and
(4) Create output files.
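
As an illustration of the scope described above, the following sketch models one generated record with the three attributes (Name, Label, Description) at the three levels and serialises a batch to an output file format. The class name, field names and the choice of JSON are assumptions made for this example; the abstract does not specify the actual output format.

```python
import json
from dataclasses import asdict, dataclass

# Mapping from the three levels in scope to the DDI object types named in the text.
DDI_TYPES = {
    "dataset": "PhysicalInstance",
    "table": "RecordLayout",
    "variable": "InstanceVariable",
}


@dataclass
class MetadataRecord:
    """One generated description (hypothetical structure)."""
    level: str
    name: str
    label: str
    description: str

    def __post_init__(self):
        # Reject levels outside the limited scope.
        if self.level not in DDI_TYPES:
            raise ValueError(f"unsupported level: {self.level!r}")

    @property
    def ddi_type(self):
        return DDI_TYPES[self.level]


def to_json(records):
    """Serialise records to JSON text, adding the DDI type of each record."""
    rows = [{**asdict(r), "ddi_type": r.ddi_type} for r in records]
    return json.dumps(rows, indent=2, ensure_ascii=False)


# Hypothetical example content.
records = [
    MetadataRecord("table", "EMPLOYMENT", "Employment records",
                   "Monthly employment records delivered by the data provider."),
    MetadataRecord("variable", "PERSON_ID", "Person identifier",
                   "Unique identifier of a person in the population register."),
]
output = to_json(records)
```

Keeping the level-to-type mapping in one place means the same generation process can serve all three object types, which matches the aim of a single process for descriptions.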

This presentation will outline some of the lessons learned from this work.

Author

Veiko Berendsen (Statistics Estonia)
