MO12 - Grouping Datasets

08/04/2015

Requests have been made for the possibility to group Datasets together that have common characteristics, for example containing the same kind of data for different time periods (e.g. annual statistics) or geographical coverage (e.g. statistics per region).

It should be noted that DCAT proper does not consider any type of relationship between Datasets. As our intention is to avoid, as far as possible, creating new model elements or new properties that implementers of the more general DCAT specification would not understand, a proposal could be to use existing classes and only add dct:hasPart and dct:isPartOf relationships, as follows:

  • create a description for a Dataset that acts as the 'master', e.g. "Annual statistics for temperature"
  • create descriptions for the constituent Datasets, e.g. "Annual statistics for temperature in 2014", "Annual statistics for temperature in 2013" etc.
  • link from the 'master' to the specific year using dct:hasPart
  • link from the specific year to the 'master' using dct:isPartOf
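
The four steps above can be sketched as a handful of RDF triples. This is only a minimal illustration in plain Python, not a normative pattern: the http://example.org/ URIs and the helper names link_series and to_ntriples are assumptions made up for this example.

```python
# Namespace URIs for Dublin Core Terms, DCAT and rdf:type.
DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical base URI, used only for this illustration.
EX = "http://example.org/dataset/"


def link_series(master, parts):
    """Return (subject, predicate, object) triples that link a 'master'
    Dataset to its constituent Datasets in both directions."""
    triples = [(master, RDF_TYPE, DCAT + "Dataset")]
    for part in parts:
        triples.append((part, RDF_TYPE, DCAT + "Dataset"))
        triples.append((master, DCT + "hasPart", part))
        triples.append((part, DCT + "isPartOf", master))
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


master = EX + "temperature-annual"
years = [EX + "temperature-annual-2013", EX + "temperature-annual-2014"]
triples = link_series(master, years)
print(to_ntriples(triples))
```

Note that both the 'master' and the yearly descriptions are typed as dcat:Dataset here; nothing in the triples themselves says that the parts differ by time period.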

Alternative approaches are welcome, but a main consideration should be whether this approach or any alternatives are based on existing data and not on theoretical use cases.

Component

Documentation

Category

feature

Comments

Thu, 09/04/2015 - 20:38

I would support the option of using dct:hasPart and dct:isPartOf relationships to solve this.

Fri, 17/04/2015 - 10:56

I think there is a real use case for this. In fact, most data catalogues currently publish temporal data series in their own way, basically as different distributions of the same dataset, which is IMO a non-optimal way given that (1) there is no machine-readable way to know the temporal relationship between them and (2) it is confusing with respect to the main purpose of distributions (alternative "forms" of the same dataset).

I like the hasPart/isPartOf approach here, but I'm also wondering about the weaknesses of reusing the Dataset class instead of a new dedicated one (for example, how would we know that the parts have a temporal relationship?).

Thu, 23/04/2015 - 11:01

I can't see any reason why a dataset that has been split up into different time periods or locations should be modelled as different dcat:Datasets.

DCAT defines a dcat:Dataset as a "collection of data", not a slice or a particular release. DCAT defines each dcat:Distribution to have its own release date.
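
As a rough sketch of what that looks like for a weekly series, the following Python builds one Dataset description whose releases are Distributions, each with its own dct:issued date. The URI, the start date and the helper name weekly_distributions are hypothetical, and the dictionary layout is just a convenient in-memory shape, not a DCAT serialisation.

```python
from datetime import date, timedelta


def weekly_distributions(dataset_uri, first_release, count):
    """Describe `count` weekly releases as Distributions of one Dataset."""
    return [
        {
            "@id": f"{dataset_uri}/release/{n}",
            "@type": "dcat:Distribution",
            # Each release carries its own issue date, one week apart.
            "dct:issued": (first_release + timedelta(weeks=n - 1)).isoformat(),
        }
        for n in range(1, count + 1)
    ]


# Hypothetical URI and start date, chosen only for illustration.
dataset_uri = "http://example.org/dataset/weekly-series"
dataset = {
    "@id": dataset_uri,
    "@type": "dcat:Dataset",
    "dcat:distribution": weekly_distributions(dataset_uri, date(2015, 1, 5), 3),
}

# ISO 8601 dates sort lexicographically, so max() finds the newest release.
latest = max(dataset["dcat:distribution"], key=lambda d: d["dct:issued"])
print(latest["@id"], latest["dct:issued"])
```

The metadata shared by all releases is stated once, on the Dataset; only the per-release fields vary.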

The user is not helped by having a new dcat:Dataset for every slice. Each "Dataset" would have all the same metadata, apart from one field - what a waste. There are plenty of datasets which are updated monthly or weekly and run into hundreds of slices. The natural understanding is that all these updates are of the same single dataset.

Take a simple example: http://data.gov.uk/dataset/highways_agency_planned_roadworks#tab-data

This is a time series with 173 weekly updates so far, which displays naturally on a 'dataset page' in the same way that another (non-time-series) dataset displays on a single page. If a time series were split into separate Datasets and then grouped using isPartOf, the data portal would still want to show them on a single page, so you'd have this weird "Datasets together that have common characteristics" type of page for time series and other splits, alongside other normal datasets. And in the search results you'd want to add the groups and hide the Datasets that are grouped. So you create a huge amount of bother by using groups of Datasets rather than just distributions, for the modeller, the programmer and the user experience alike.

The idea that distributions are 'alternative "forms" of the same data' seems a bit out of date, compared to the wide variety of real-world datasets we publish and catalogue nowadays. There's not a lot of publishing data in parallel in different formats. However, it is quite normal to slice up a large dataset for ease of use, by time/geography/table, as well as to provide different aggregations or an API. Browsing the top ten used datasets on data.gov.uk: all are time series; one is a live API and the rest are separate files; two are further broken down by data table. None are supplied in multiple formats.

So let's agree that a time series is a Dataset, not a group of Datasets. And let's write that in DCAT-AP-2015 as best practice.

Thu, 23/04/2015 - 16:09

I've just checked with the CKAN tech committee today, and there was no mistake about their feelings. Four more core CKAN developers, personally responsible for implementing most of the big data portals around the world including the national portals for US, Canada, Sweden, EU institutions, pan-EU portal beta etc. were pretty taken aback about the idea to create a Dataset for every release in a time series. They raised points such as the massive duplication between every release, difficulty of including these dataset groups in search results, and it perverting the meaning of the word 'Dataset'.

One pointed out that if distributions were only for different formats of the same data, then it makes a mockery of the decision to add 'dct:license' only to the distribution: you would only need to license distributions differently if they were for different data or resolutions of data.

This issue is called "Grouping Datasets" which sounds benign enough, but it really is about changing the definition of Dataset to be "collection or part of a collection of data". I really think that would be outside the scope of DCAT-AP.

Mon, 11/05/2015 - 10:08

The aspect of time is hard to capture. Ideally, different versions of the same dataset should be published as a single dcat:Dataset (as most of you agree), and it should then be possible to capture their "timestamps".

A possible solution could be to rely on their provenance metadata (which is outside the scope of DCAT). Keeping track of the provenance of a dcat:Dataset/dcat:Distribution is possible, e.g. with PROV-O, and that could resolve the problem to a certain extent.

A high level example, as I'm not a provenance expert myself:
A version of a dcat:Distribution is a prov:Entity (ex:V1 a prov:Entity ) and its subsequent version a new prov:Entity which is a revision of the latest one (ex:V2 a prov:Entity ; prov:wasRevisionOf ex:V1).

To expose the previous versions, all prov:wasRevisionOf could be considered accompanied by their time-related metadata to indicate the time frames they were active. 
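
A minimal Python sketch of that idea follows, offered with the same caveat that it is not an expert treatment of provenance. The version identifiers ex:V1 to ex:V3 and their timestamps are hypothetical, and using prov:generatedAtTime as the time-related metadata is an assumption of this sketch, as is the rule that a version stays active until the next one is generated.

```python
REVISION_OF = "prov:wasRevisionOf"
GENERATED_AT = "prov:generatedAtTime"

# Hypothetical versions of one distribution, keyed by identifier.
versions = {
    "ex:V1": {GENERATED_AT: "2015-01-01T00:00:00Z"},
    "ex:V2": {REVISION_OF: "ex:V1", GENERATED_AT: "2015-02-01T00:00:00Z"},
    "ex:V3": {REVISION_OF: "ex:V2", GENERATED_AT: "2015-03-01T00:00:00Z"},
}


def revision_history(latest, versions):
    """Follow prov:wasRevisionOf links from the newest version to the oldest."""
    history, seen, current = [], set(), latest
    while current is not None and current not in seen:
        seen.add(current)  # guards against accidental cycles in the data
        record = versions.get(current, {})
        history.append((current, record.get(GENERATED_AT)))
        current = record.get(REVISION_OF)
    return history


def active_periods(latest, versions):
    """Return (version, active_from, active_until) tuples, newest first.
    active_until is None for the version that is still current."""
    periods, superseded_at = [], None
    for entity, generated_at in revision_history(latest, versions):
        periods.append((entity, generated_at, superseded_at))
        superseded_at = generated_at
    return periods


for period in active_periods("ex:V3", versions):
    print(period)
```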

On a broader level, relations among datasets and grouping them are a bit more complicated than what dct:hasPart and dct:isPartOf can capture in my opinion, unless it is made clear what the intention is.

Influenced by SKOS (which I admit is not directly applicable, but is still relevant on a higher level I think), relations might be hierarchical or associative.
A hierarchical link between two dcat:Datasets/dcat:Distributions might mean that one is more general than the other (in the case of SKOS: skos:broadMatch, skos:narrowMatch).
An associative link between two dcat:Datasets/dcat:Distributions might mean that the two are inherently "related", but in a more general way than purely hierarchical (in SKOS, skos:relatedMatch states an associative mapping link between two concepts, while skos:closeMatch links two concepts that are sufficiently similar to be used interchangeably in some information retrieval applications).

In this respect, what would be the exact scope of using dct:hasPart and dct:isPartOf? Or what type of relationships and grouping of datasets are encountered in the use cases?

Wed, 27/05/2015 - 15:31

In the two weeks since the call, there has been some further discussion off-list. The conclusion of the ISA team, the Publications Office and myself is that we will not be able to find solutions for all the different use cases. In fact, we have not so far been able to get a good grip on the various use cases. So the situation is still as it was, and therefore the proposal still stands that we will not make changes in the Application Profile to support data series. As a fall-back option, more specific relationships can be expressed using the more general property dct:relation, which was already agreed as an addition to Dataset.

Sun, 21/06/2015 - 11:18

Discussions on this issue have not resulted in consensus. The working group agreed in the meeting of 10 June 2015 to enable referencing ‘related’ datasets using dct:relation on Dataset. dct:hasPart and dct:isPartOf will not be added to Dataset.
