Description
For many years, Insee has been building an ecosystem of repositories for standardised statistical metadata. The finest level currently described is the instance variable and its representation (numeric, text, code list, etc.). These objects serve multiple purposes and are consumed by several internal and external stakeholders:
- An internal platform to centralise ready-for-use datasets
- An archival system for data files
- Generation of codebooks
- Data structure and documentation of research files made available under restricted and secure access
These objects are currently expressed as DDI 3.3 fragments and managed with the Colectica tool suite.
In addition, a substantial effort is being made to enrich the content of a core set of shared resources and to expand their reuse across instance and represented variables. These resources are mainly geographic information (lists of regions, municipalities, etc.) and statistical classifications (activities, products, occupations, etc.). All these items follow strict structural conventions (hierarchy between objects) and naming conventions.
In parallel, work is underway on two fronts: detecting and harmonising similar variables or code lists, and associating variables with the concepts they measure, using similarity metrics and embedding techniques. These efforts aim to significantly improve metadata quality, reduce the number of items feeding search engines and indexers, and streamline their management.
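As an illustration of the kind of similarity detection mentioned above, here is a minimal sketch. It does not reproduce Insee's method: it substitutes a simple Jaccard overlap on (code, label) pairs for the embedding-based metrics, and the code list names, contents and threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of hashable items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def label_similarity(x, y):
    """Case-insensitive string similarity between two labels (0.0 to 1.0)."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def similar_code_lists(code_lists, threshold=0.8):
    """Return (name1, name2, score) for pairs whose overlap reaches threshold.

    code_lists maps a code list name to a {code: label} dictionary.
    Two entries match only if both the code and the label are identical.
    """
    pairs = []
    for (n1, c1), (n2, c2) in combinations(sorted(code_lists.items()), 2):
        score = jaccard(c1.items(), c2.items())
        if score >= threshold:
            pairs.append((n1, n2, score))
    return pairs


# Hypothetical code lists, for illustration only.
code_lists = {
    "SEX_A": {"1": "Male", "2": "Female"},
    "SEX_B": {"1": "Male", "2": "Female", "9": "Unknown"},
    "REGION": {"11": "Île-de-France", "84": "Auvergne-Rhône-Alpes"},
}

candidates = similar_code_lists(code_lists, threshold=0.6)
```

Here SEX_A and SEX_B would be flagged as candidates for harmonisation, since they share two of three distinct entries. An embedding-based approach would replace the exact match on labels with a vector similarity, which tolerates rewording.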
Next steps are expected to:
- Extend the documentation of statistical datasets to all internal data producers
- Enrich the semantics and the scope of the information made available:
  o associate concepts with variables by completing the variable cascade (conceptual variable, represented variable, instance variable)
  o add harmonised unit types at different granularity levels (physical instances, logical records, variables, etc.)
  o add harmonised sentinel values to the variable representations
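The variable cascade and the planned enrichments can be pictured with a small data model. This is only a sketch under stated assumptions: the class and field names are hypothetical, they simplify the DDI concepts, and they do not reflect the DDI 3.3 serialisation or the Colectica data model.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ConceptualVariable:
    name: str
    concept: str      # the concept the variable measures
    unit_type: str    # harmonised unit type, e.g. "Person" (hypothetical)


@dataclass
class RepresentedVariable:
    conceptual: ConceptualVariable
    representation: str  # "numeric", "text", "code list", ...
    # Harmonised sentinel values: raw value -> meaning (hypothetical codes).
    sentinel_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class InstanceVariable:
    name: str
    dataset: str
    represented: RepresentedVariable

    def concept(self) -> str:
        """Walk up the cascade to the measured concept."""
        return self.represented.conceptual.concept

    def is_sentinel(self, value) -> bool:
        """True if value is a declared sentinel rather than a real measure."""
        return str(value) in self.represented.sentinel_values


# Hypothetical example: an age variable in a fictitious dataset.
age_concept = ConceptualVariable("Age", "Age in completed years", "Person")
age_repr = RepresentedVariable(age_concept, "numeric", {"999": "Not stated"})
age = InstanceVariable("AGE", "hypothetical_survey_2024", age_repr)
```

With a complete cascade, any instance variable resolves to its concept (`age.concept()`), and consumers can tell `999` apart from a genuine age by checking `age.is_sentinel(999)`.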