[Issue #05] Content of Datasets (DataElement)

Published on: 23/05/2019
Last update: 29/05/2019
Discussion

This issue follows from Issue #4, on how to describe the themes and the content of datasets. Issue #4 covers the thematic classification of datasets; this one covers how to specify their content, that is, whether datasets may be broken down into 'data elements' as in the Belgian Registry (http://vocab.belgif.be/ns/authsrc#):

  • Datasets are described as DataSource(s)
  • A DataSource contains (0..n) DataElement(s)
  • A DataElement describes what is contained within the dataset, including literals and the dcterms:type property with a set of SKOS concepts to be used (i.e., businesses, locations, and people).
Overview of classes and relations

Proposals/options to discuss:

  1. Create a similar structure of (DataSource ->) DataElement to contain the relevant pieces of data within datasets.
  2. Create a controlled fine-grained scheme to classify those topics (aligned with Eurovoc)
  3. Use Eurovoc terms directly to classify the topics.
  4. Use dcterms:type to classify Datasets directly.
  5. Create a new sub-property of dcterms:subject/dcat:theme to classify the fine-grained elements.

 

Comments

Mon, 27/05/2019 - 19:00

This proposal would benefit from a pause and consideration of the changes being proposed by DCAT Version 2 (because in the Belgian model :DataSource is a subClass of dcat:Dataset [this is DCAT Version 1] - see https://vocab.belgif.be/ns/authsrc#DataSource ).  

There are significant changes from version 1 for dcat:Dataset

From https://www.w3.org/TR/vocab-dcat-2/#changes-since-20140116, Class: Dataset: In DCAT 2014 [VOCAB-DCAT-20140116] dcat:Dataset was a sub-class of dctype:Dataset, which is a term of the DCMI Types vocabulary [DCTERMS]. This relationship has been removed in the revised DCAT vocabulary - see Issue #98.

[note that the Editor's Draft - https://w3c.github.io/dxwg/dcat/ - is now only going to have stylistic changes.  It is effectively completed]

In DCAT v2 a dcat:Resource dct:conformsTo some specification, and that is where the model or set of constraints describing the data elements that make up the resource is placed. This is perhaps the way to go for describing a dataset: the specification can be an XML schema, a spreadsheet of element names and descriptions, a UML diagram, or anything else that can be identified by a URI.
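A minimal Turtle sketch of this approach, assuming a placeholder namespace (eg:) and a made-up schema URL; neither comes from an actual registry:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace

# The dataset points to the specification that describes its data elements.
eg:vehicleRegister a dcat:Dataset ;
    dct:title "Vehicle register"@en ;
    dct:conformsTo <http://example.org/schemas/vehicle-register.xsd> .   # hypothetical XML schema

# The specification itself is just a resource identified by a URI.
<http://example.org/schemas/vehicle-register.xsd> a dct:Standard ;
    dct:title "XML schema for the vehicle register"@en .
```

Nothing here says what is inside the schema; a consumer has to dereference and interpret the specification itself.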

Regarding proposal #1, it is probably worth discussing the merits of an approach similar to UN/CEFACT, where the "core components" can be either basic or aggregate business information entities. This means that a DataElement can contain other DataElements, and at each level there might be a link to a skos:Concept [or any other theme classifiers].  

 

 

Mon, 08/07/2019 - 12:57

We need a standardized way to describe the content (the data elements) in a dataset, in particular 1) if a DataElement isAuthoritative and 2) the Concept that a DataElement represents (the meaning of the DataElement). The content in the specification that a dataset conformsTo could be anything so we will not have a standardized way of describing these aspects using dct:conformsTo.

How about letting Dataset hasPart DataElement (Dataset as in DCAT2, a subclass of dcat:Resource)?
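A rough Turtle sketch of what this could look like. The class eg:DataElement, the property eg:isAuthoritative and the use of dct:subject to reach the concept are all assumptions made for illustration; none of them is defined in DCAT 2 or agreed in this thread:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace

eg:driversDataset a dcat:Dataset ;
    dct:hasPart eg:passportElement .

eg:passportElement a eg:DataElement ;   # hypothetical class
    eg:isAuthoritative true ;           # hypothetical property (see Issue #02)
    dct:subject eg:passportConcept .    # assumed way of linking to the concept

eg:passportConcept a skos:Concept ;
    skos:prefLabel "Passport number"@en .
```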

Tue, 03/09/2019 - 09:46

I've been trying to find a proposal that is compatible with all the existing implementations and future proposals. The following may be a simple answer to this challenge, using only the options we already have. (Challenge: To describe the atomic parts of datasets, indicating the specific topics and whether the elements of the dataset are authoritative.)

Proposal: using dct:hasPart to break down dcat:Datasets into lower-level dcat:Datasets. For instance, a large spreadsheet containing several themes could be split into several parts.

This allows describing atomic parts of datasets, including information about quality (isAuthoritative or not) and theme. This is compatible also with registries (catalogues) serving datasets using APIs (i.e., DataServices). 

I've created several scenarios to illustrate the implementation using this model:

Scenario 1: Country A has a registry with two different documents that contain vehicles and drivers. Both are considered authoritative.

  • The <dcat:Catalog> has two <dcat:Dataset> described as authoritative and with the proper themes. 
  • See scenario 1
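Scenario 1 could be sketched in Turtle roughly as follows. The eg: namespace, the theme concepts and the eg:isAuthoritative flag are placeholders; the actual mechanism for authoritativeness is the subject of Issue #02:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace

eg:registryA a dcat:Catalog ;
    dcat:dataset eg:vehicles , eg:drivers .

eg:vehicles a dcat:Dataset ;
    dct:title "Vehicles"@en ;
    dcat:theme eg:vehicleTheme ;    # placeholder theme concept
    eg:isAuthoritative true .       # hypothetical flag

eg:drivers a dcat:Dataset ;
    dct:title "Drivers"@en ;
    dcat:theme eg:driverTheme ;     # placeholder theme concept
    eg:isAuthoritative true .       # hypothetical flag
```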

Scenario 2: Country B has a registry with an API that serves several documents: vehicles, drivers and other information about third-party organisations. Only data about vehicles and drivers are considered authoritative.

  • The <dcat:Catalog> has a <dcat:DataService> that serves three <dcat:Dataset> described as authoritative (or not) and with the proper themes. 
  • See scenario 2.
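A possible Turtle rendering of scenario 2, using the DCAT 2 terms dcat:service and dcat:servesDataset. The endpoint URL, the eg: namespace and the eg:isAuthoritative flag are placeholders for illustration:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace

eg:registryB a dcat:Catalog ;
    dcat:service eg:registryApi ;
    dcat:dataset eg:vehicles , eg:drivers , eg:thirdParties .

eg:registryApi a dcat:DataService ;
    dcat:endpointURL <http://example.org/api> ;   # hypothetical endpoint
    dcat:servesDataset eg:vehicles , eg:drivers , eg:thirdParties .

eg:vehicles     a dcat:Dataset ; eg:isAuthoritative true .    # hypothetical flag
eg:drivers      a dcat:Dataset ; eg:isAuthoritative true .
eg:thirdParties a dcat:Dataset ; eg:isAuthoritative false .
```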

Scenario 3: Country C has a registry with a large spreadsheet that contains: vehicles, drivers and information about corporations. Only data about vehicles and corporations are considered authoritative.

  • The <dcat:Catalog> has a <dcat:Dataset> that is composed of three <dcat:Dataset> that are described as authoritative (or not) and with the proper themes. 
  • See scenario 3.
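Scenario 3, with the spreadsheet split into parts through dct:hasPart, might look like this in Turtle. As before, the eg: namespace, the theme concepts and the eg:isAuthoritative flag are assumptions, not agreed terms:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace

eg:registryC a dcat:Catalog ;
    dcat:dataset eg:spreadsheet .

# The complex dataset is composed of three lower-level datasets.
eg:spreadsheet a dcat:Dataset ;
    dct:hasPart eg:vehicles , eg:drivers , eg:corporations .

eg:vehicles a dcat:Dataset ;
    dcat:theme eg:vehicleTheme ;        # placeholder theme concept
    eg:isAuthoritative true .           # hypothetical flag

eg:drivers a dcat:Dataset ;
    dcat:theme eg:driverTheme ;
    eg:isAuthoritative false .

eg:corporations a dcat:Dataset ;
    dcat:theme eg:corporationTheme ;
    eg:isAuthoritative true .
```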

To sum it up, complex dcat:Datasets can be broken down into atomic dcat:Datasets. It is compatible with existing DCAT implementations and with the future DCAT 2 specification. No need to create new elements.

Thu, 05/09/2019 - 10:19

Using dct:hasPart instead of creating a new class DataElement may work, at the class level. 

At the property level, in addition to "isAuthoritative" (issue #02), we (in our national data catalog) need to link (using URIs) from a given (part of a) dcat:Dataset to the concept(s) that define(s) the meaning of the (dataelements in the) dataset. If we agree on the need (you have to know the meaning of the data in order to decide whether the data is reusable for you), how is this solved in other countries, and how is it going to be solved in BRegDCAT-AP? In our current version of DCAT-AP-NO we had to use a national extension in order to link to concepts; conformsTo pointing to e.g. a datamodel (which may or may not contain concept definitions) was not considered machine-processable.

Fri, 06/09/2019 - 15:03

Scenario 3, in the previous comment, shows a solution that covers both authoritativeness and theme using existing properties (see diagram). Any atomic dataset (elements of complex datasets) has a theme.

In this case, dcat:theme is used to specify the concept that represents the theme of the data. Since we agreed that Eurovoc is the right vocabulary to use, these dcat:theme properties would link to Eurovoc terms (e.g., NT3 motor car or http://eurovoc.europa.eu/4261). Eurovoc also includes mappings to other schemes. Additionally, national bodies may also use their own terms.

Apart from this, the model also includes the possibility of linking specific models/schemes/standards through dct:conformsTo. This would facilitate the automatic processing of the information. For instance, we could have a dataset about organisations that conformsTo CPOV.
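As an illustration, theme and conformance could be combined as in the Turtle sketch below. The Eurovoc URI is the one cited above; the eg: namespace, the national concept and the IRI standing in for CPOV are placeholders, since the thread does not give the actual identifiers:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace

eg:vehicles a dcat:Dataset ;
    dcat:theme <http://eurovoc.europa.eu/4261> ,   # Eurovoc term "motor car"
               eg:nationalVehicleConcept .         # hypothetical national term

eg:organisations a dcat:Dataset ;
    dct:conformsTo eg:cpovSpecification .          # placeholder IRI standing in for CPOV

eg:cpovSpecification a dct:Standard ;
    dct:title "Core Public Organisation Vocabulary (CPOV)"@en .
```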

Could this cover NO's requirements? Other comments from the rest?

 

Fri, 20/09/2019 - 10:15

Since we need to represent more than the theme of the data (already solved in the previous comments), I've been looking for a way to represent this. As mentioned in previous meetings, a standard solution for indicating the structure of a dataset in terms of the concepts being represented (e.g. person, organisation) is the W3C Data Cube Vocabulary.

A data cube is organized according to a set of components: dimensions, attributes and measures. This vocabulary provides a class (qb:DataStructureDefinition) that defines the structure of datasets (as qb:DimensionProperty, qb:AttributeProperty and qb:MeasureProperty). With this, we can indicate the concept and the type or code list used to represent the value.
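To make the three component kinds concrete, here is a small Turtle sketch of a structure definition with one dimension, one measure and one attribute. All eg: names are invented for illustration:

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace

eg:vehicleStatsStructure a qb:DataStructureDefinition ;
    qb:component
        [ qb:dimension eg:registrationYear ] ,   # what an observation applies to
        [ qb:measure   eg:vehicleCount ] ,       # what is observed
        [ qb:attribute eg:unitOfMeasure ] .      # how the value is expressed

eg:registrationYear a qb:DimensionProperty ;
    rdfs:range xsd:gYear .

eg:vehicleCount a qb:MeasureProperty ;
    rdfs:range xsd:integer .

eg:unitOfMeasure a qb:AttributeProperty .
```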

Also, the SDMX standard includes a set of guidelines that define common statistical concepts and associated code lists intended to be reusable across datasets (e.g., sdmx-dimension:civilStatus, sdmx-dimension:age).

With this representation, we could have a dataset about 'persons', with data such as sex, age, or other measurements. We could also use specific code lists for them (e.g. sdmx-dimension:sex for the component 'sex' and sdmx-code:sex-M for 'male'). In any case, as I understand it, this specification should focus on the measurements, not the values.

To clarify this, see the next example to define the structure of a sample dataset:

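@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix eg:   <http://example.org/ns#> .   # placeholder namespace for this example

eg:driversDataset a qb:DataSet ;
    qb:structure eg:myDatasetDefinition .

eg:myDatasetDefinition a qb:DataStructureDefinition ;
    rdfs:comment "personal data about drivers"@en ;
    # The measure(s)
    qb:component
        [ qb:measure eg:sex ] ,
        [ qb:measure eg:age ] ,
        [ qb:measure eg:passport ] ,
        [ qb:measure eg:firstName ] ,
        [ qb:measure eg:lastName ] .   # whatever we need to define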

To sum up, this lets us define the structure of datasets.

This representation is widely used in statistical data and enables the automatic discoverability of compatible data. Does this fit in your models?

Tue, 15/10/2019 - 08:48

Again, we need 1) to say that a particular dataelement in the dataset isAuthoritative, and 2) to connect a particular dataelement in the dataset to its concept definition.

Referring to the example above using the Data Cube Vocabulary, eg:myDatasetDefinition, 1) how do we say that for instance eg:passport isAuthoritative, and 2) how do we connect for instance eg:sex to its concept definition (e.g. a code list that defines the valid values of sex)? 

The proposal above using dct:hasPart (i.e. instead of using qb:structure) has one drawback: we have to create as many "datasets" as there are dataelements in the dataset, with each so-called "dataset" containing only a single element. In addition, how do we connect a lower-level dataset (= dataelement) to its concept definition? Neither dct:conformsTo nor dct:type will do. We need skos:Concept, which is not in dcat:Dataset (yet).

Tue, 15/10/2019 - 12:09

Thanks for your feedback, Jim. 

1) eg:passport as authoritative element:

Using the previous example, and based on the Data Quality Vocabulary mechanism (isAuthoritative, as shown in Issue #02) for describing the quality of datasets or components, we could describe the structure of the dataset as: 

eg:myDatasetDefinition a qb:DataStructureDefinition ;
   rdfs:comment "personal data about drivers"@en ;
   qb:component
     [ qb:measure eg:sex ] ,
     [ qb:measure eg:age ] ,
     [
       qb:measure eg:passport ;
       dqv:hasQualityAnnotation isa:isAuthoritative   # marks this component as authoritative
     ] ,
    ...

 

2) How to specify the components of the dataset structure

The most complex data we can find in our definitions are statistical data cubes. The cube model is a series of observations that can be characterized by dimensions (e.g. time, height), the metadata related to the measurements (e.g. population, life expectancy), and how data was measured and how it is expressed (e.g. units, status). 

Using the Data Cube Vocabulary, we can define the structure of the datasets through these specific three types of components (dimensions, measures, and attributes). Also, we can define the specific properties of components in depth. For instance, as you mentioned, we could consider a component 'sex', in our own namespace (eg:sex):

eg:sex a qb:MeasureProperty, qb:CodedProperty ;
    qb:concept eg-concept:sex ;
    rdfs:label "Sex"@en, "Kjønn"@no ;
    rdfs:comment """The state of being male or female."""@en ;
    qb:codeList eg:sexCodes .

Roughly, the previous example shows how we could describe the basic components that make up the structure of datasets. There we can specify the concept (in our SKOS scheme of themes) and the list of codes the value may take (also expressed as a SKOS concept scheme: male, female, unknown, etc.).
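For completeness, the code list and concept referenced by eg:sex could be defined along these lines. This is only a sketch: the namespaces, labels and notations are invented for illustration:

```turtle
@prefix skos:       <http://www.w3.org/2004/02/skos/core#> .
@prefix eg:         <http://example.org/ns#> .        # placeholder namespace
@prefix eg-concept: <http://example.org/concept#> .   # placeholder namespace

# The concept that gives eg:sex its meaning.
eg-concept:sex a skos:Concept ;
    skos:prefLabel "Sex"@en .

# The code list: a SKOS concept scheme with one concept per allowed value.
eg:sexCodes a skos:ConceptScheme ;
    skos:prefLabel "Codes for sex"@en ;
    skos:hasTopConcept eg:sex-M , eg:sex-F , eg:sex-U .

eg:sex-M a skos:Concept ;
    skos:prefLabel "Male"@en ;
    skos:notation "M" ;
    skos:topConceptOf eg:sexCodes ;
    skos:inScheme eg:sexCodes .

eg:sex-F a skos:Concept ;
    skos:prefLabel "Female"@en ;
    skos:notation "F" ;
    skos:topConceptOf eg:sexCodes ;
    skos:inScheme eg:sexCodes .

eg:sex-U a skos:Concept ;
    skos:prefLabel "Unknown"@en ;
    skos:notation "U" ;
    skos:topConceptOf eg:sexCodes ;
    skos:inScheme eg:sexCodes .
```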

Fortunately, some definitions like this already exist, so we do not have to define everything ourselves. The SDMX standard includes guidelines with sets of common statistical concepts and associated code lists that can be reused directly, for components such as civil status, currency, education level, and age. Following our example, we could have used 'sdmx-dimension:sex' as the component, for which a set of concepts is already defined.
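A hedged sketch of such reuse is shown below. It assumes the SDMX-to-RDF vocabulary published alongside the Data Cube Vocabulary, where sdmx-dimension:sex and sdmx-dimension:age are declared as dimension properties, which is why qb:dimension is used here rather than qb:measure. The eg: names remain placeholders:

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix rdfs:           <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix eg:             <http://example.org/ns#> .   # placeholder namespace

eg:myDatasetDefinition a qb:DataStructureDefinition ;
    rdfs:comment "personal data about drivers"@en ;
    qb:component
        [ qb:dimension sdmx-dimension:sex ] ,   # reused instead of a local eg:sex
        [ qb:dimension sdmx-dimension:age ] ,   # reused instead of a local eg:age
        [ qb:measure   eg:passport ] .          # still defined locally
```

Only components without a suitable SDMX counterpart, such as eg:passport, would then need a local definition.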

Using this approach, we have a complete and extensible way to describe the internal structure of datasets: the type of observation, the concept, potential values that may take, and other metadata. 

 

Thu, 17/10/2019 - 12:44

Thanks, Martin! 

That's what I wanted to be able to do. 

Concerning 1: We need to extend qb:ComponentSpecification with dqv:hasQualityAnnotation (and dqv:hasQualityMeasurement), explicitly in BRegDCAT-AP (otherwise we will have to extend in our national application profiles, for each and every MS).

Concerning 2: Similarly, we need to have qb:concept and qb:codeList, explicitly in BRegDCAT-AP (for the same reason as above).