Discussion - Data Cube Vocabulary to Describe Dataset Structure - or alternative?

Published on: 05/05/2020

The BReg-DCAT-AP model has three major components to describe: 1) public services; 2) data registries; 3) data components. Both public services and data registries are based on existing vocabulary application profiles, i.e. CPSV-AP and DCAT-AP.

To make base registries interoperable, we need to define in detail the data that is stored and managed by those registries. Those catalogues may contain datasets, services (e.g. web services, SPARQL endpoints, etc.), and other registries.

So the question is: how can we define data models and establish constraints in the definitions? In this discussion page, we will explore the available possibilities, taking into account all the potential use cases and the requirements to create machine-readable descriptions.

According to the requirements collected in previous working group meetings, the master data may be composed of different components, which can be described by the following metadata:
•    Textual descriptions of the data (i.e. multilingual and expressed in natural language);
•    Semantic concept (i.e. from a concept scheme) of the resource;
•    Quality information (e.g. authoritativeness, completeness, etc.) of the resource;
•    Data type (e.g. a date, number, text, etc.).

For instance, a business registry holds data about organisations. Every EU Member State may include as much information as it wants, but we could agree on a minimum set of metadata. Just as an example:
•    Name;
•    Legal representative (a person);
•    Registration number;
•    VAT ID;
•    Email; and
•    Postal Address.

How to represent conformance to the model

If we had this minimum set of data predefined, we could establish constraints based on these specific “templates” using RDF + OWL. This approach is followed by the TOOP project. These specific ontologies may create standard models for registry data.

In DCAT2, Datasets and Distributions may express conformance through the dcterms:conformsTo property. This would indicate that a specific dataset or its distribution follows the specific “template” defined above.
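
As a minimal sketch of the conformance statement described above, the following Turtle fragment declares that a dataset and one of its distributions follow a shared template. The eg: namespace and the names eg:businessRegister, eg:businessRegisterCsv and eg:organisationTemplate are hypothetical; only the dcat: and dcterms: terms come from existing vocabularies.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix eg:      <http://example.org/ns#> .

# Hypothetical dataset of a business registry
eg:businessRegister a dcat:Dataset ;
    dcterms:title "Business register"@en ;
    dcterms:conformsTo eg:organisationTemplate ;
    dcat:distribution eg:businessRegisterCsv .

# A distribution may state its own conformance as well
eg:businessRegisterCsv a dcat:Distribution ;
    dcterms:conformsTo eg:organisationTemplate .

# The agreed "template" (assumed here to be published as a standard)
eg:organisationTemplate a dcterms:Standard ;
    dcterms:title "Minimum organisation data template"@en .
```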

How to represent the structure of the datasets

If the registry data is represented under the RDF model, there would be no need to add additional information; using the common ontology would be enough to understand the nature of each resource, property, and value.

In the rest of the cases with datasets delivered in any format, i.e. spreadsheets, JSON, XML documents, etc., the dataset / distribution metadata should include additional information about the type of information we can expect from the repository. At least, structure, datatypes, and the concepts represented.

W3C Data Cube Vocabulary

As discussed in the latest working group meeting on the 28th of April, the W3C Data Cube Vocabulary may be a solution to represent the structure of datasets. Using the qb:structure property, a dataset may be described through its different components, including semantic concepts (e.g. birth date), data types (e.g. date), textual descriptions, and other quality annotations (e.g. accuracy, authoritativeness, etc.).

Data Cube Vocabulary is compatible with the data cube model and the SDMX (Statistical Data and Metadata eXchange) standard. With this vocabulary, we would be able to represent any complex structure of data.

Some participants in the previous meeting expressed disagreement with this proposal to describe the structure of datasets, but gave no concrete objections or proposed alternatives.

This discussion page will help us to find concrete examples to keep this proposal - or discard it and find better solutions.

We invite you to share your feedback in comments. 
 

Comments

Mon, 11/05/2020 - 10:58

I think we need to start with a use case, i.e. an answer to "What do we want to achieve?", related to the structural description of the base registries. This is not clear to me, and therefore it is hard to talk about what can be done and how.

With this vocabulary, we would be able to represent any complex structure of data.

I disagree that DCV can be used to represent any complex structure of data, especially in connection to the other statement:

In the rest of the cases with datasets delivered in any format, i.e. spreadsheets, JSON, XML documents, etc., the dataset / distribution metadata should include additional information about the type of information we can expect from the repository. At least, structure, datatypes, and the concepts represented.

DCV can only be used to describe statistical data cubes, i.e. data, which is structured like a multidimensional cube. This is nowhere near how e.g. a registry of citizens is structured (i.e. a set of records "First name", "Last Name", "place of birth", etc.) - there is no observation identified by dimensions, etc.

This leads back to the question: what do we want to achieve here? Is it a simple search for registries concerning "citizens"? If so, there is no need for DCV, as we can simply connect a DCAT-AP record of the registry to a concept in a skos:ConceptScheme using dcat:theme.
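
A minimal sketch of that simpler option, assuming a hypothetical concept scheme of registry subjects (all eg: names are illustrative, not an existing code list):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix eg:   <http://example.org/ns#> .

# Hypothetical shared code list of concepts found in base registries
eg:registryConcepts a skos:ConceptScheme .

eg:concept-citizen a skos:Concept ;
    skos:prefLabel "Citizen"@en ;
    skos:inScheme eg:registryConcepts .

# The DCAT-AP record of the registry points to the concept
eg:citizenRegistry a dcat:Dataset ;
    dcat:theme eg:concept-citizen .
```

A search for registries about "citizens" then only needs to match on the dcat:theme value.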

If it is something more, then what? With the description of the structure of various registries, I see multiple problems here, especially given the time frame.

- When we talk about data structures of non-RDF data, these are syntactically defined by various schema languages for various data formats (XSD for XML, JSON Schema for JSON, CSV on the Web for CSVs, etc.). Some files, like Excel spreadsheets, do not even have a way of describing their structure. And here we are talking about files. However, some base registries (e.g. a registry of all citizens) may not even exist in the form of files. They would consist of a database (which may have a relational database schema). Here we can assume (at least that is how it is in Czechia) that there is a set of W3C-style Web services to access the data, which are described by XSDs and WSDLs.

- I think it is safe to assume that each of those registries in different countries will use a different data model, a different data format, and a different schema (if any), with no mapping to a shared ontology, which would be required to facilitate any kind of search over the base registry records. Their integration is clearly out of scope of ABR. All we can hope for is a very high-level mapping of DCAT-AP representations of those registries to some ontology/code list of concepts frequently found in a base registry, such as "Citizen", "Land parcel", "Company", etc., and then we can hope to be able to search for base registries related to "Citizens" across countries.

Based on the comments above, I think we need to say to which common ontology or code list we will map the DCAT-AP records of the registries (an existing or a new one? Who will maintain it after ABR ends?), how (dcat:theme?), and whether we will be able to fulfill the use case defined like this.

Mon, 11/05/2020 - 15:57

Thanks for your input, Jakub.  

About "What do we want to achieve?", I must admit that the scope we defined was a bit ambiguous, since we did not have all the use cases defined when we launched the working group. We intend to describe registries of base registries in a similar way to open data catalogues, but collecting additional information about the entities running the registries, as well as the legal aspects of the service. We realised that these could be covered using the existing standards: DCAT-AP, CPSV-AP, and ELI.

In one of the first meetings, we realised most of the member states wanted to describe the parts of a dataset. For instance, the Belgian initiative already has a class DataSource composed of DataElement(s) that are typed as specific concepts (e.g., person details, address, etc.). So, describing the concepts that are included in a dataset was added as an optional feature of the conceptual model. So, we need to describe in detail the conceptual elements of a dataset, not how to access the data nor the distribution format. Two main proposals were presented: VoID and Data Cube Vocabulary.

To date, Data Cube Vocabulary is the best standard we know of to describe this, so we want to find examples, as you mentioned, to see whether its model is inadequate and should be discarded. The registry of citizens (census) is a good example to analyse. Those records would have a similar structure in all countries, as you mentioned: first name, last name, place of birth, etc.

My question is why we cannot represent that dataset structure as:

  eg:citizens a dcat:Dataset;
    qb:structure [
      a qb:DataStructureDefinition;
      rdfs:comment "Data about citizens..."@en;
      qb:component
            [ qb:measure eg:personalId ],
            [ qb:measure eg:sex ],
            [ qb:measure eg:firstName ],
            [ qb:measure eg:lastName ],
            [ qb:measure eg:birthDate ]
     ] …

Where those measures are described by their concept:

  eg:sex a qb:MeasureProperty, qb:CodedProperty ;
    qb:concept eg-concept:sex ;
    rdfs:label "Sex"@en, "Kjønn"@no ;
    rdfs:comment """The state of being male or female."""@en ;
    qb:codeList eg:sexCodes .

The sex component is clearly well defined using the Data Cube Vocabulary. And what about the rest of the components? Couldn't they use the same mechanism?

  eg:lastName a qb:MeasureProperty;
    qb:concept eg-concept:lastName ;
    rdfs:label "Surname"@en, "Etternavn"@no .

Elements such as last/first names are also used in statistics. For instance, this dataset about the occurring surnames (this other spreadsheet contains 12 Mbytes of surnames). 

About the conceptual elements, they could be defined as skos:Concepts, if they do not exist. For instance, Malta already has defined (and published) some statistical concepts we could use like Surname (or Family name), Name (or First name), Nationality, Languages, ability to speak, Formal education, Date of first live birth, and some other statistical concepts they use for the business registry, census, and other services. 
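
As a sketch of how such a concept could be declared when it does not exist yet: the eg-concept: namespace and the scheme name below are hypothetical, the labels are taken from the example above, and the definition text is illustrative only.

```turtle
@prefix skos:       <http://www.w3.org/2004/02/skos/core#> .
@prefix eg-concept: <http://example.org/concept/> .

# Hypothetical concept scheme for registry data elements
eg-concept:scheme a skos:ConceptScheme .

eg-concept:lastName a skos:Concept ;
    skos:inScheme   eg-concept:scheme ;
    skos:prefLabel  "Surname"@en, "Etternavn"@no ;
    skos:altLabel   "Family name"@en ;
    skos:definition "The family name of a person."@en .   # illustrative wording
```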

In the case we want to have common representations in terms of formats and structure, we could use dcterms:conformsTo, according to a specific ontology, but this is out of the scope at this point.

Regarding the theme of the resources, we will preserve the standard approach of DCAT-AP, to use dcat:theme with the Data Theme Taxonomy NAL, and more detailed using EUROVOC if possible (both maintained by the EU Publications Office).

Again, thanks for your feedback. The team will be glad to revise the whole specification and make as many updates as needed.

Mon, 11/05/2020 - 17:05

The reason I do not think DCV actually fits can be seen directly in your example when compared with the specification (https://www.w3.org/TR/vocab-data-cube/#data-cubes).

The main entity in DCV is an Observation, which represents a "cell" in a table, and contains measures, (typically) numerical values, maybe with units of measurement (attributes), identified by values on dimensions, such as cities, sex, date, etc.
In a typical data cube, you can do the slice operation, i.e. freeze one dimension to a specific value and get a cube with n-1 dimensions, and other statistical/mathematical operations.

With a data cube, you expect that you first pick values on dimensions to indicate what you are interested in (i.e. I am interested in life expectancy in Cardiff, in men, between 2004-2006) and you get a number (measurement). And you get a measurement for any combination of values on the dimensions, within the ranges of those dimensions.

In the case of the registry of citizens, you have no dimensions, it is not a data cube.
And even if first name, last name, sex, etc. were dimensions, there would be no measurement.
And in the extreme case where we would define the measurement as a "yes"/"no" value indicating whether a person identified by the values on dimensions such as first name, last name, sex, place of birth, etc. exists or not, working with such a cube would be unnatural: first, you would have to pick values on the dimensions, e.g. first name "Jakub", then last name "Klímek", then sex "Male", then place of birth "Prague", and then you would get a measurement "true" saying there is such a person. This is exactly the opposite of how such a registry should work, i.e. look up an ID and see the record. Also, it makes no sense to iterate through the dimensions, i.e. to see if a person with the same last name, sex, birthplace, but a different first name exists or not. Also, you would have to have a record for every combination of existing first name, last name, sex, birthplace, etc. For registries, which are lists of things, DCV simply does not fit.

You say "we need to describe in detail the conceptual elements of a dataset" - the question remains - why? What is the task (example?) we want to be able to do using the description we are working on here?
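
For comparison, a sketch of what a genuine Data Cube observation looks like for the life expectancy case mentioned above. All eg: terms are hypothetical and the measured value is a placeholder, not real data.

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/ns#> .

# Hypothetical component properties: three dimensions and one measure
eg:refArea        a qb:DimensionProperty .
eg:refPeriod      a qb:DimensionProperty .
eg:sex            a qb:DimensionProperty .
eg:lifeExpectancy a qb:MeasureProperty .

# One observation: the dimension values select the cell,
# the measure is the content of that cell
eg:obs1 a qb:Observation ;
    qb:dataSet        eg:lifeExpectancyDataset ;
    eg:refArea        eg:Cardiff ;
    eg:refPeriod      "2004-2006" ;
    eg:sex            eg:male ;
    eg:lifeExpectancy 78.0 .   # placeholder value, not real data
```

A citizen record has no equivalent split: none of its fields selects a cell and none of them is a measured value.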

Isn't a simple list of concepts represented in the dataset enough?

Wed, 13/05/2020 - 12:15

I agree that this example of the citizens' registry is not the typical case of statistical data, and it does not follow the concept of a data cube; I cannot find many dimensions either. Also, the DCV was conceived as a complete vocabulary to represent the most complex models of statistical data. The question is why we cannot use some parts of the DCV to represent the simplest models as well. FOAF was conceived as a vocabulary to describe people and their social relations, but one of its most-used properties is foaf:homepage (a homepage for something). It is about reusing standards instead of reinventing them.

In our example, we perhaps do not have more than one dimension (e.g., administrative area, country, or similar), but I do not see that a dimension is required in the specification. So, according to the specification (in the non-normative section):

A DataSet is a collection of statistical data that corresponds to a defined structure.

Retrieving the definitions of statistical data, we find the concept of collections of quantitative data. This is very ambiguous, but perhaps this is the constraint. I think our example could be considered as statistical data.

Also, as you mentioned:

Observations: [...] is the actual data, the measured values.

As I explained in the previous comment, I think a 'Last Name' might be considered as an observation. It is the actual data that identifies the record we have in the database. It is a literal text, but it is still a value. We do not need to represent observations, we need to represent the structure of those observations.

Apart from the ambiguity of the term 'statistics', I wonder why we cannot use specific properties such as qb:structure or qb:measure to describe the structure of the dataset. In terms of the specification, the only limitation I see is the domain of the structure property:

qb:structure a rdf:Property, owl:ObjectProperty;
    rdfs:label "structure"@en;
    rdfs:comment "indicates the structure to which this data set conforms"@en;
    rdfs:domain qb:DataSet;
    rdfs:range qb:DataStructureDefinition.

In strict terms, the barrier I see is that our BReg-DCAT-AP datasets would intrinsically be considered qb:DataSet(s), and this could be a problem.
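
The consequence of that domain declaration can be sketched as follows. The eg: names are hypothetical; the last triple is shown as a comment because it is inferred under RDFS entailment rather than asserted by the publisher.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix eg:   <http://example.org/ns#> .

# Declared in the Data Cube Vocabulary
qb:structure rdfs:domain qb:DataSet .

# Asserted in a (hypothetical) BReg-DCAT-AP catalogue
eg:citizens a dcat:Dataset ;
    qb:structure eg:citizensStructure .

# Inferred by an RDFS reasoner from the two statements above:
# eg:citizens a qb:DataSet .
```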

 

Regarding your question (why do we need this?), we are still looking for real examples to drive the work. In the meantime, we would like to avoid creating new vocabularies and reuse the existing ones (and this optional data cube structural approach seemed the most feasible for that).

In the first versions of the specification, this was not included. Indeed, all the approaches were related to describing the theme of the datasets (i.e. the discussion on using Eurovoc and breaking down datasets into sub-datasets), but several members proposed the inclusion of more information about the internal structure of these datasets. See how we opened the discussion; it was present in all the meetings last year, and in September we reopened it again.

So, again, thanks for your input. And we really appreciate feedback from those who pushed for adding this feature. 

 

Wed, 13/05/2020 - 13:56

I agree with Jakub that DCV is the wrong choice. I see three main reasons:

  1. It is aimed at statistical data.
  2. It will not manage to describe non-tabular data very well.
  3. From my perspective, it has the wrong granularity.

Reason 3 is my main argument. I also provide a suggestion for how it could be done differently and a use case for how the information could be used.

Regarding 1

This is not a huge problem since the vocabulary is rather flexible. As you point out, Martin, all datasets would, via the entailment rules, be instances of qb:DataSet. (This is the generic problem when vocabularies use strict domains or ranges: it limits reusability.)

Regarding 2

Let's take an example: an organization decides to make available a dataset that describes its organizational structure together with information about the physical locations where it is located and its employees. The representation is either an RDF graph (using the W3C Org ontology) or a zip file with three CSV files corresponding to the three "entity types" Organization, Person and Site.

Clearly, this cannot be described by DCV, as it corresponds to at least three "tables" with interconnections.

It could be argued that the example above is ill-defined and that you should never have these three "entity types" in one dataset. But I think it is better to provide a model that matches the complexity of the world than to try to change the world by forcing a model on it.

Regarding 3

Describing that a dataset contains a firstname and lastname of a person is very specific information that should be provided in a schema level information indicated via dcterms:conformsTo (potentially indirectly by pointing to a prof:Profile that has a resource in the role "schema", see https://www.w3.org/TR/dx-prof/).
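
A minimal sketch of that indirection using the Profiles Vocabulary: the dataset conforms to a profile, and the profile has a resource in the role "schema". The eg: names and the schema URL are hypothetical; the prof: and role: terms are those of the W3C Profiles Vocabulary.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prof:    <http://www.w3.org/ns/dx/prof/> .
@prefix role:    <http://www.w3.org/ns/dx/prof/role/> .
@prefix eg:      <http://example.org/ns#> .

# The dataset only states which profile it follows
eg:personRegister a dcat:Dataset ;
    dcterms:conformsTo eg:personProfile .

# The profile carries the schema-level detail (first name, last name, ...)
eg:personProfile a prof:Profile ;
    dcterms:title "Person record profile"@en ;
    prof:hasResource [
        a prof:ResourceDescriptor ;
        prof:hasRole role:schema ;
        prof:hasArtifact <http://example.org/schemas/person.xsd>   # hypothetical schema file
    ] .
```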

Instead I think it would be more useful to indicate which "entity types" a dataset contains. In the example above it would be "Person", "Organization" and "Site".

Now, what happens if we split the above example into three datasets, one for each entity type? The Person and Site datasets could be very simple, as each would contain only one entity type ("Person" and "Site"). But the Organization dataset would be richer, as it connects the other datasets together. It would have one main entity type and two entity types that are referenced. It could look like this:

Organization dataset
  title "Agency X organizational structure" 
  ... (more properties from DCAT / DCAT-AP here)

  entityType
     title "Organization"
     dct:subject   concept_URI1  (in EuroVoc or domain specific concept scheme)
     presenceInDataset  "authoritative"  // other options here could be "full", "stub"
     comment "The organization is the main entity described in this dataset, it references persons and sites."

  entityType
     title "Person"
     dct:subject concept_URI2
     presenceInDataset "reference"
     referenceMechanism "URI"

  entityType
     title "Site"
     dct:subject concept_URI3
     presenceInDataset "reference"
     referenceMechanism "URI"

Clearly more properties can be added at the entityType level, e.g. to point directly to datasets if they are from the same data provider and considered in some sense to be "authoritative".

Note that it is not defined how the entity types are organized: they can be simply listed, or structured in a hierarchy or a graph. In fact, it does not even say that each entity type is in itself reasonable to express as a table; it can be more complicated. What it DOES say is that these are the entity types that the data provider considers to be core constituents or important in some way, and highlighting them is important for various reasons. Likely they want to support a specific use case like the combination use case below.
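
For illustration only, the outline above could be written in Turtle roughly as follows. The ex: properties (ex:entityType, ex:presenceInDataset, ex:referenceMechanism) and the eg: resources are hypothetical and do not belong to any published vocabulary; they simply mirror the names used in the outline.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:      <http://example.org/vocab#> .
@prefix eg:      <http://example.org/ns#> .

eg:orgDataset a dcat:Dataset ;
    dcterms:title "Agency X organizational structure" ;
    ex:entityType
        [ dcterms:title "Organization" ;
          dcterms:subject eg:concept_URI1 ;
          ex:presenceInDataset "authoritative" ;
          rdfs:comment "The organization is the main entity described in this dataset, it references persons and sites." ] ,
        [ dcterms:title "Person" ;
          dcterms:subject eg:concept_URI2 ;
          ex:presenceInDataset "reference" ;
          ex:referenceMechanism "URI" ] ,
        [ dcterms:title "Site" ;
          dcterms:subject eg:concept_URI3 ;
          ex:presenceInDataset "reference" ;
          ex:referenceMechanism "URI" ] .
```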

Combination / interoperability use cases

If information about entity types is provided on datasets it would allow discovery of datasets that are good candidates for being combined. I see two main categories for combination:

  1. Complement - if they share a few, but not all of their entity types.
  2. Expand - if they share most of their entity types

Sharing of entity types is based on using the same concepts; detection of similar concepts using the SKOS narrower/broader/*match properties is also possible but requires more advanced inference.
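
A small sketch of the "Complement" case, assuming a hypothetical ex:entityType property and hypothetical eg: and other: concepts: the two datasets share only the concept eg:Site, which makes them candidates for being combined.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/vocab#> .
@prefix eg:      <http://example.org/ns#> .
@prefix other:   <http://example.com/concepts#> .

eg:orgDataset a dcat:Dataset ;
    ex:entityType [ dcterms:subject eg:Organization ] ,
                  [ dcterms:subject eg:Person ] ,
                  [ dcterms:subject eg:Site ] .

eg:buildingsDataset a dcat:Dataset ;
    ex:entityType [ dcterms:subject eg:Site ] ,
                  [ dcterms:subject eg:Building ] .

# If another provider uses its own concept scheme, a SKOS mapping
# lets a portal still detect the overlap, at the cost of extra inference
eg:Site skos:closeMatch other:Location .
```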

Why a structure

It could be argued that the construction is overly complicated: why not just use dct:subject directly on the dataset? I think there are two main arguments:

  1. Using dct:subject on the dataset provides a categorization of the dataset as a whole; I think the semantics are slightly wrong.
  2. There would be no room for describing how each entity type is expressed in this dataset, e.g. see the presenceInDataset and referenceMechanism suggestions above.

Naming

Above in the text I have used the term "entity type" as it better represents what I am trying to convey, referring to it as "data element" in previous discussion is also ok. But I would want to stress that (in my view) semantics should be clarified to mean things like "Person" and "Site" rather than properties like "foaf:givenName". That is, in RDF terminology it would correspond to Class rather than Property. In any case, the name should not exclude non-RDF datasets from expressing that they contain entity types.

    Wed, 13/05/2020 - 16:02

    Thanks for your input, Mattias.

    +1 to using the Profiles Vocabulary along with dcterms:conformsTo. I think this is key to enabling real interoperability. The future European Registry of Registries could define schemas for the Member States to follow.

    [Regarding your second statement]

    Apart from the already presented semantic doubts (about dimensions, statistical data, etc.), I still believe we could represent the structure of the dataset (the concepts included), independently of the distribution. To represent the organizational structure we could indicate something like:

      :organizationStructure ex:component
                [ ex:measure ex-concept:organization-info ],
                [ ex:measure ex-concept:person-info ],
                [ ex:measure ex-concept:site-info ],
          …

    So far, it seems that nobody is in favour of the Data Cube Vocabulary approach, so we are discarding this option. VoID could help us to describe the datasets in detail, but it is for Linked Data. Any suggestions for an existing vocabulary?

    At least, there is one proposal that could serve to describe the dataset components: the Belgif vocabulary. It defines belgif:DataSources (subClassOf dcat:Dataset) that are composed of belgif:DataElements. Those data elements are typed (dcterms:type) as concepts. So, the approach would be similar:

        :mydataset a belgif:DataSource;
           dct:hasPart 
                     [ a belgif:DataElement; dct:type :concept1 ],
                     [ a belgif:DataElement; dct:type :concepts ], …

     

    Anyway, before deciding which vocabulary we should use or implement, we still need to clarify with real use cases whether we really need this (to break down the datasets into conceptual data elements). If this is not required, we can skip this part and focus on the dct:conformsTo description, which we all agree is useful.

    Please send your thoughts.

    Wed, 13/05/2020 - 18:22

    I think the use of DCV makes sense for statistical datasets; I guess that is why StatDCAT-AP incorporated it.

    I like DCV, it is very powerful, but from my reading I do not see that it is a good enough fit. The Belgif vocabulary is much closer to the needs I am seeing, although I am not sure I understand the need for the DataSource class. I would also like to have some more properties along the lines I gave in the example in the last post.

    I also agree that VoID is not a good enough option as it is restricted to the world of linked data.

    Real use-case / needs

    One need I have observed is that we have organizations that want to divide their big complex datasets into smaller reusable datasets. The example I took above is quite close to a real example with two tweaks.

    1. Not everyone is allowed to access the Persons dataset.
    2. The persons dataset refers to the organization dataset instead of the other way around.

    Hence, dividing into 3 datasets makes it possible to have 2 public datasets and one private, instead of having one big dataset that is private. This is of course a good thing, but it makes it harder for the data consumers to figure things out. How are they going to find out that these three datasets are suitable to use together?

    Today I see five ways to inform the user that datasets are connected (Using DCAT2): 

    1. Provide the information in free text, e.g. the dct:description.
    2. Provide the information in separate documentation / a landing page.
    3. Provide the information in an information model that is reachable via conformsTo.
    4. Explicitly provide links between datasets via resource relations, e.g. dcterms:hasPart.
    5. Explicitly provide links between datasets via qualified resource relations.

    Note that only 4 and 5 above are rigid enough and reasonably easy to parse for a data portal to act on. What I mean by the latter is that when the user clicks on a dataset, the portal should have a section on the side which says "This dataset is complemented by / can be used together with the following datasets."

    But even 4 and 5 are hard to use since the range is unrestricted, hence it cannot always be expected that it points to a dataset. For the qualified case (5) it is possible to provide a set of different resource roles that would correspond to a "best used in conjunction with" relation; in those cases it could be inferred that the range is another dataset.

    If there is a hard requirement to avoid introducing brand new vocabulary, then perhaps option 5 is the best approach (for the need I outlined here). However, I would like to point out that there are a few drawbacks to this approach:
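
    A sketch of option 5 with the DCAT2 qualified relation pattern. The eg: datasets and the role eg:bestUsedInConjunctionWith are hypothetical; dcat:qualifiedRelation, dcat:Relationship, dcat:hadRole and dcat:Role are DCAT2 terms.

    ```turtle
    @prefix dcat:    <http://www.w3.org/ns/dcat#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
    @prefix eg:      <http://example.org/ns#> .

    # The persons dataset declares a qualified link to the organization dataset
    eg:personsDataset a dcat:Dataset ;
        dcat:qualifiedRelation [
            a dcat:Relationship ;
            dcterms:relation eg:orgDataset ;
            dcat:hadRole eg:bestUsedInConjunctionWith
        ] .

    # Hypothetical role; a portal that knows it can assume the target is a dataset
    eg:bestUsedInConjunctionWith a dcat:Role ;
        skos:prefLabel "best used in conjunction with"@en .
    ```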

    1. It is not stated how the datasets fit together, just that they are related.
    2. There may be potentially many relations that need to be expressed.
    3. The focus is shifted away from describing the nature of each dataset to providing a set of links that needs to be maintained.
    4. All relations have to be expressed by a data provider; new combinations cannot be discovered based on different organizations reusing the same concepts.

    In general, I think that DataElements is a good idea and it would definitely foster interoperability. Whether it will happen now or later, I guess, depends on how many others have real use cases for it. We did submit an application here in Sweden for a project that would have focused on developing this idea further; unfortunately it did not get funded even though we had a really good consortium. My guess is that interoperability and arguments around combining datasets are hard to make unless you have enough momentum and quantity of open data to work with.

    Thu, 21/05/2020 - 11:04

    The ABR team has been in contact with the TOOP and SDG teams, who will use this specification to find and collect information from Member States' base registries, and for other use cases. After exploring the potential scenarios with them, we have concluded that for the scope of these projects:

    1) Datasets in a base registry that comply with a specific common representation of the information will indicate this through the dcterms:conformsTo property and the Profiles Vocabulary. Thus, there will be several common ontologies that model the "internal structure" of the data (e.g., a corporation with a unique ID, VAT number, official address, etc.) that will be used as a reference for the registries. If a dataset complies with the proposed structure, it will be expressed using dcterms:conformsTo.

    2) There is no need to share the details of the internal composition of the dataset. If the internal model reflects this information, and this is useful for other inter-departmental purposes, they may use specific vocabularies to describe them. This extension will be fully compatible with the specification in terms of semantics.

    I want also to point out that TOOP and SDG are considering the use of the Core Criterion and Core Evidence Vocabulary (CCCEV), which is being updated to the 2.0 version. The new version will include updated references to dcat:Datasets and dcat:DataServices. This would serve to indicate that datasets fulfill concrete requirements, but it would be out of scope for base registries.

    So, according to your comments (thanks for that) and the list of use cases we have received, we propose to resolve this open issue.

    1. To RECOMMEND the use of dcterms:conformsTo in the data resource descriptions, in the case there is any known ontology or schema of reference.
    2. To include references to the W3C Profiles Vocabulary as a mechanism to create application profiles or sub-schemes of reference.
    3. To remove the references to Data Cube Vocabulary as a mechanism to describe the structure of datasets. This will not impede its use in ad hoc cases. 

    As always, we follow a consensus-based process, so feel free to send objections to these proposals by the end of 28th May 2020. All comments are welcome.

    Tue, 17/08/2021 - 14:26

    I agree with Jakub and Mattias that Data Cube (DCV) is not suited for this task, because

    1. Measure values are supposed to be literals or lookups, not structured objects.
    2. The subject of a measure is an Observation, which is driven by dimensions (qb:dimension). Persons are not observations and are not driven by dimensions (except in some eugenic Nazi anti-utopia).

    If the datasets are RDF, there are well-established ways to describe them: ontologies (RDFS, OWL and schema.org) and shapes (ShEx, SHACL).

    If they are tabular, you can describe them with CSVW.
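
    As a sketch of the shapes option, a minimal SHACL node shape for a citizen record could look like this. The eg: class and properties are hypothetical; the sh: terms are standard SHACL.

    ```turtle
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix eg:  <http://example.org/ns#> .

    eg:CitizenShape a sh:NodeShape ;
        sh:targetClass eg:Citizen ;
        # Every citizen record needs at least one last name, given as a string
        sh:property [
            sh:path eg:lastName ;
            sh:datatype xsd:string ;
            sh:minCount 1
        ] ;
        # At most one birth date, typed as a date
        sh:property [
            sh:path eg:birthDate ;
            sh:datatype xsd:date ;
            sh:maxCount 1
        ] .
    ```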

    We did describe dataset details in the euBusinessGraph project (how many companies per jurisdiction and provider, which props, how many prop instances, which are open and which are closed/paid). But the props themselves were "notional" i.e. as used in the target RDF model, but not describing the source relational models.

    ABR, contact me vladimir dot alexiev at ontotext dot com, in case there's interest.

    Cheers!