The BReg-DCAT-AP model has three major components to describe: 1) public services; 2) data registries; 3) data components. Both public services and data registries are based on existing vocabulary application profiles, namely CPSV-AP and DCAT-AP.
To make base registries interoperable, we need to define in detail the data that is stored and managed by those registries. A registry may contain datasets, services (e.g. web services, SPARQL endpoints) and other registries.
So the question is: how can we define data models and establish constraints in their definitions? On this discussion page, we will explore the available possibilities, taking into account the potential use cases and the requirements for creating machine-readable descriptions.
According to the requirements collected in previous working group meetings, the master data may be composed of different components that can be described by the following metadata:
• Textual descriptions of the data (i.e. multilingual and expressed in natural language);
• Semantic concept (i.e. from a concept scheme) of the resource;
• Quality information (e.g. authoritativeness, completeness) of the resource;
• Data type (e.g. a date, number, text).
For instance, a business registry holds data about organisations. Each EU Member State may include as much information as it wants, but we could agree on a minimum set of metadata. Just as an example:
• Legal representative (a person);
• Registration number;
• VAT ID;
• Email; and
• Postal Address.
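As a rough illustration of the metadata facets and the example minimum set above, the following Python sketch models a data component and reports which minimum-set entries a registry description lacks. All names here (`DataComponent`, `MINIMUM_SET`, `missing_components`, the component names and the example.org concept URIs) are hypothetical and are not defined by BReg-DCAT-AP.

```python
from dataclasses import dataclass, field


@dataclass
class DataComponent:
    """One component of master data, described by the four metadata facets."""
    name: str
    descriptions: dict[str, str]  # language tag -> natural-language description
    concept: str                  # URI of a concept in a concept scheme (made up here)
    datatype: str                 # e.g. an XSD datatype name
    quality: dict[str, str] = field(default_factory=dict)  # e.g. authoritativeness


# Illustrative names for the example minimum set of a business registry.
MINIMUM_SET = {
    "legalRepresentative",
    "registrationNumber",
    "vatId",
    "email",
    "postalAddress",
}


def missing_components(components):
    """Return, sorted, the minimum-set names that the description does not cover."""
    present = {c.name for c in components}
    return sorted(MINIMUM_SET - present)


described = [
    DataComponent(
        name="registrationNumber",
        descriptions={"en": "Registration number", "es": "Número de registro"},
        concept="http://example.org/concepts/registrationNumber",
        datatype="xsd:string",
        quality={"authoritativeness": "authoritative"},
    ),
    DataComponent(
        name="email",
        descriptions={"en": "Email"},
        concept="http://example.org/concepts/email",
        datatype="xsd:string",
    ),
]

print(missing_components(described))
```

A check like this only verifies that the agreed components are declared; it says nothing about whether the values held in the registry are valid.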
How to represent conformance to the model
If we had this minimum set of data predefined, we could establish constraints based on these specific “templates” using RDF + OWL. This approach is followed by the TOOP project. Such ontologies could serve as standard models for registry data.
In DCAT 2, Datasets and Distributions may express conformance through the dcterms:conformsTo property. This would indicate that a specific dataset or its distribution follows the specific “template” defined above.
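A minimal sketch of such a conformance statement, written in Python with the standard library only and emitting N-Triples lines. The DCAT and Dublin Core Terms namespaces are the published ones; the dataset and template URIs, and the helper names `conformance_triples` and `conforms_to`, are assumptions made for illustration.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def conformance_triples(dataset_uri, template_uri):
    """Return N-Triples lines stating that a dataset conforms to a template."""
    return [
        f"<{dataset_uri}> <{RDF_TYPE}> <{DCAT}Dataset> .",
        f"<{dataset_uri}> <{DCTERMS}conformsTo> <{template_uri}> .",
    ]


def conforms_to(triples, dataset_uri, template_uri):
    """Check whether the conformance statement is among the given lines."""
    wanted = f"<{dataset_uri}> <{DCTERMS}conformsTo> <{template_uri}> ."
    return wanted in triples


# Hypothetical URIs for a business registry dataset and its agreed template.
dataset = "http://example.org/registry/business/dataset"
template = "http://example.org/templates/business-registry"

triples = conformance_triples(dataset, template)
print("\n".join(triples))
print(conforms_to(triples, dataset, template))
```

The statement is only a declaration by the publisher; whether the data really follows the template would still have to be validated separately.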
How to represent the structure of the datasets
If the registry data is represented in RDF, there would be no need to add further information: using the common ontology would be enough to understand the nature of each resource, property, and value.
In all other cases, where datasets are delivered in other formats (e.g. spreadsheets, JSON, XML documents), the dataset / distribution metadata should include additional information about the kind of data we can expect from the repository: at least the structure, the datatypes, and the concepts represented.
W3C Data Cube Vocabulary
As discussed in the latest working group meeting on the 28th of April, the W3C Data Cube Vocabulary may be a solution to represent the structure of datasets. Using the qb:structure property, which links a dataset to its data structure definition, a dataset may be described through its different components, including semantic concepts (e.g. birth date), data types (e.g. date), textual descriptions, and other quality annotations (e.g. accuracy, authoritativeness).
The Data Cube Vocabulary is compatible with the cube model that underlies the SDMX (Statistical Data and Metadata eXchange) standard. With this vocabulary, we should be able to represent complex data structures.
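To make this more concrete, the sketch below uses plain Python string formatting to emit a small Turtle description of a dataset, its structure definition, and two components. The qb:, rdfs: and xsd: terms come from the published vocabularies (in Data Cube, `qb:structure` links a `qb:DataSet` to its `qb:DataStructureDefinition`); the `ex:` namespace, every `ex:` name, and the function `structure_turtle` are invented for illustration. Deciding which fields are dimensions, measures or attributes is a modelling choice that this sketch does not settle.

```python
# Published namespaces, plus a made-up ex: namespace for the example resources.
PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
    "@prefix ex: <http://example.org/breg/> .\n"
)

# Component kind -> (linking property, class of the component property).
KINDS = {
    "dimension": ("qb:dimension", "qb:DimensionProperty"),
    "measure": ("qb:measure", "qb:MeasureProperty"),
    "attribute": ("qb:attribute", "qb:AttributeProperty"),
}


def structure_turtle(dataset, structure, components):
    """Serialise a dataset, its structure definition and its components as Turtle.

    components: list of (local_name, kind, english_label, xsd_datatype) tuples.
    """
    specs = " ,\n        ".join(
        f"[ {KINDS[kind][0]} ex:{name} ]" for name, kind, _, _ in components
    )
    blocks = [
        f"ex:{dataset} a qb:DataSet ;\n    qb:structure ex:{structure} .",
        f"ex:{structure} a qb:DataStructureDefinition ;\n    qb:component {specs} .",
    ]
    for name, kind, label, datatype in components:
        blocks.append(
            f"ex:{name} a {KINDS[kind][1]} ;\n"
            f'    rdfs:label "{label}"@en ;\n'      # textual description
            f"    qb:concept ex:concept-{name} ;\n"  # semantic concept (hypothetical)
            f"    rdfs:range xsd:{datatype} ."       # data type
        )
    return PREFIXES + "\n" + "\n\n".join(blocks) + "\n"


turtle = structure_turtle(
    "personRegister",
    "personRegisterStructure",
    [
        ("personId", "dimension", "Person identifier", "string"),
        ("birthDate", "measure", "Birth date", "date"),
    ],
)
print(turtle)
```

Quality annotations such as accuracy or authoritativeness are not covered here; they would need additional properties on the components or on the dataset.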
Some participants in the previous meeting expressed disagreement with this proposal for describing the structure of datasets, but raised no concrete objections and proposed no alternatives.
This discussion page will help us find concrete examples that support keeping this proposal, or discarding it in favour of better solutions.
We invite you to share your feedback in the comments.