21 October 2024 to 28 May 2026
Europe/Zurich timezone

Reusing, sharing, linking and standardising semantically rich metadata at Insee

14 Feb 2026, 12:00
10m

Speaker

Guillaume Duffes (Insee)

Description

For many years, Insee has been building an ecosystem of repositories that manage standardised metadata for statistical purposes. The finest level currently described is the instance variable and its representation (numeric, text, code list, etc.). These objects serve multiple purposes and are consumed by several internal and external stakeholders:
- An internal platform to centralise ready-for-use datasets
- Archival system of data files
- Generation of codebooks
- Data structure and documentation of research files made available in a restricted and secured manner
These objects are currently expressed as DDI 3.3 fragments and managed by the Colectica tools suite.
In addition, substantial effort is going into enriching the content of a core set of shared resources and expanding their reuse across instance and represented variables. These are mainly geographic information (lists of regions, municipalities, etc.) and statistical classifications (activities, products, occupations, etc.). All these items follow strict structural conventions (hierarchy between objects) and naming conventions.
In parallel, work is underway to harmonise and detect similar variables or code lists on the one hand, and to associate variables with the concepts they measure on the other, using similarity metrics and embedding techniques. The aim is to significantly improve metadata quality, reduce the number of items feeding search engines and indexers, and streamline their management.
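
As an illustration of the detection of similar code lists, here is a minimal sketch that assumes a plain Jaccard index over the sets of codes. The abstract does not say which similarity metrics are used, so the metric, the threshold, the function names and the sample code lists are all hypothetical. Comparing labels with embeddings would follow the same pairwise pattern with a different scoring function.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of codes common to both lists (0.0 = disjoint, 1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_code_lists(code_lists: dict[str, set[str]], threshold: float = 0.8):
    """Return (name_a, name_b, score) for every pair scoring at or above the threshold."""
    pairs = []
    for (name_a, codes_a), (name_b, codes_b) in combinations(code_lists.items(), 2):
        score = jaccard(codes_a, codes_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Best candidates for harmonisation first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical code lists, for illustration only.
code_lists = {
    "SEX_A": {"1", "2"},
    "SEX_B": {"1", "2", "9"},
    "YES_NO": {"0", "1"},
}
candidates = similar_code_lists(code_lists, threshold=0.6)
```

A score based on codes alone would also pair lists that share codes but not meaning, so in practice candidates like these would presumably be reviewed, or combined with a label-based score, before any merge.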

Next steps are expected to:
- Extend the documentation of statistical datasets to all internal data producers
- Enrich the semantics and the scope of the information made available:
  - associate concepts with variables by completing the variable cascade (conceptual variable, represented variable, instance variable)
  - add harmonised unit types at different granularity levels (physical instances, logical records, variables, etc.)
  - add harmonised sentinel values to the variable representations
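
The variable cascade and the planned enrichments can be pictured with a small data model. This is a sketch under stated assumptions, not the DDI 3.3 structure or the Colectica API: the class names mirror the three levels named above, while the field names, the sample values and the `is_sentinel` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptualVariable:
    name: str
    concept: str    # the concept the variable measures
    unit_type: str  # harmonised unit type, e.g. "person"


@dataclass
class RepresentedVariable:
    conceptual: ConceptualVariable
    representation: str  # "numeric", "text", "code list", ...
    # Harmonised sentinel values: raw value -> meaning.
    sentinel_values: dict[str, str] = field(default_factory=dict)


@dataclass
class InstanceVariable:
    represented: RepresentedVariable
    dataset: str  # dataset in which the variable occurs
    column: str   # physical column name in that dataset

    def concept(self) -> str:
        """Walk up the cascade to the measured concept."""
        return self.represented.conceptual.concept

    def is_sentinel(self, value: str) -> bool:
        """True when the raw value is a declared sentinel, not a measurement."""
        return value in self.represented.sentinel_values


# Hypothetical example values, for illustration only.
age = ConceptualVariable(name="Age", concept="age in completed years", unit_type="person")
age_numeric = RepresentedVariable(age, "numeric", {"999": "not stated"})
age_in_survey = InstanceVariable(age_numeric, dataset="survey_2024", column="AGE")
```

Because sentinel values and unit types sit on the shared upper levels in this sketch, every instance variable that reuses `age_numeric` inherits them without restating them.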
