MI2: Dataset series

01/02/2016
  • How to model dataset series?
    • Approaches for time series, dataset slices, other relationships

During the revision process of DCAT-AP in 2015, it was noted that the DCAT specification only considers relationships between a catalogue and the datasets described in the catalogue, and between a dataset and the distributions that represent the manifestations of the dataset.

The specification of DCAT is silent on any relationships between catalogues, between datasets and between distributions.

In real-world implementations, such relationships do exist and may be modelled in different ways. One example is time series: in some implementations, they are modelled as distributions of a single dataset; in others, as separate datasets with or without links between them.

A common approach to modelling such relationships in DCAT-AP would improve interoperability among catalogues.

 

Component

Documentation

Category

improvement

Comments

Mon, 08/02/2016 - 13:00

To start the discussion, I went to datahub.io and searched for ‘budget’. A quick survey brought up five different existing approaches. What are the views of implementers of DCAT-AP? Do one or more of these approaches also exist in your implementations?

 

Approach 1: one Dataset per year with one or more files (Distributions) that contain Budget figures for that year.

Examples:

 

Approach 2: one Dataset for a range of years with data files for each of the individual years as Distributions with descriptions that contain the year covered.

Example:

 

Approach 3: one Dataset for a range of years with one or more data files (Distributions) that cover the whole range of years

 

Approach 4: one Dataset describing a Website where a user might search, browse or navigate in a collection of data with no data files (Distributions)

Example:

 

Approach 5: one Dataset for a Website with one or more data files (Distributions) that link to a SPARQL endpoint or Website where a user might search, browse or navigate in a collection of data

Examples:

Thu, 18/02/2016 - 14:15

Maybe a Dataset should be given the optional properties:

dct:hasPart

dct:isPartOf

 

Then it would be easy to model these relations between datasets...
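As a rough illustration of how this could look, the sketch below builds the two inverse links for a budget series with one Dataset per year (approach 1). The ex: identifiers and the helper names link_series and to_turtle are invented for the example; only dct:hasPart and dct:isPartOf come from the suggestion above.

```python
# Sketch only: the ex: identifiers are hypothetical, not taken from any catalogue.
DCT_HAS_PART = "dct:hasPart"
DCT_IS_PART_OF = "dct:isPartOf"


def link_series(series, members):
    """Return (subject, predicate, object) triples linking a series Dataset
    to its member Datasets in both directions."""
    triples = []
    for member in members:
        triples.append((series, DCT_HAS_PART, member))
        triples.append((member, DCT_IS_PART_OF, series))
    return triples


def to_turtle(triples):
    """Serialise the triples one statement per line (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = link_series("ex:budget", ["ex:budget-2014", "ex:budget-2015"])
print(to_turtle(triples))
```

Emitting both directions is a choice made for the sketch; a catalogue could equally store only one of the two properties and derive the inverse.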

Sun, 21/02/2016 - 19:46

Uwe, it would be interesting to know how people are doing this in practice. If approach 1 is prevalent, then indeed the use of dct:hasPart and dct:isPartOf works. However, it does not help with the other approaches.

 

Wed, 16/03/2016 - 15:02

Proposed resolution:

  • No common approach across data providers.
  • Suggestions for modelling:
    • Create multiple Distributions of a single Dataset if users are mostly interested in the collection as such
    • Create different Datasets if users are mostly interested in the individual members
    • If user expectations are difficult to determine, create separate Datasets and one combined Dataset with the members as Distributions.
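To make the three suggestions concrete, here is a minimal sketch that arranges the same yearly files in each of the three ways. The record layout, the file names and the ex: identifiers are assumptions made for the example and are not part of DCAT-AP.

```python
# Sketch only: file names and ex: identifiers are hypothetical.
def as_single_dataset(series_id, files_by_year):
    """Suggestion 1: one Dataset, each year's file is a Distribution."""
    return {
        series_id: {
            "distributions": [files_by_year[y] for y in sorted(files_by_year)]
        }
    }


def as_separate_datasets(series_id, files_by_year):
    """Suggestion 2: one Dataset per year, each with its own Distribution."""
    return {
        f"{series_id}-{year}": {"distributions": [files_by_year[year]]}
        for year in sorted(files_by_year)
    }


def as_both(series_id, files_by_year):
    """Suggestion 3: separate Datasets plus one combined Dataset."""
    catalogue = as_separate_datasets(series_id, files_by_year)
    catalogue.update(as_single_dataset(series_id, files_by_year))
    return catalogue


files = {2015: "budget-2015.csv", 2014: "budget-2014.csv"}
for dataset_id, record in as_both("ex:budget", files).items():
    print(dataset_id, record["distributions"])
```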

Dataset series - versioning

  • DCAT-AP allows relating datasets as ‘versions’ using dct:hasVersion/dct:isVersionOf but it is not clearly described in which cases to use these properties.
  • Putting a versioning scheme in place and assigning the version number as value to owl:versionInfo can be used for indicating precedence/sequence among different versions.
  • adms:versionNotes can be used for describing the differences between a version and its previous one, or for indicating that a newer version is more valid than an older one.
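As an illustration of the second point, the sketch below orders versions by their owl:versionInfo value. It assumes a dotted numeric versioning scheme such as "1.10"; DCAT-AP does not prescribe one, and the record layout and ex: identifiers are hypothetical.

```python
# Sketch only: assumes owl:versionInfo holds dotted numeric strings such as "1.10".
def version_key(version_info):
    """Turn "1.10" into (1, 10) so that versions compare numerically."""
    return tuple(int(part) for part in version_info.split("."))


def order_versions(datasets):
    """Sort dataset records from oldest to newest by their owl:versionInfo."""
    return sorted(datasets, key=lambda d: version_key(d["owl:versionInfo"]))


datasets = [
    {"id": "ex:budget-v1.10", "owl:versionInfo": "1.10"},
    {"id": "ex:budget-v1.2", "owl:versionInfo": "1.2"},
    {"id": "ex:budget-v1.9", "owl:versionInfo": "1.9"},
]
ordered = order_versions(datasets)
print([d["id"] for d in ordered])
```

Comparing the parts as integers matters here: a plain string comparison would place "1.10" before "1.2".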

Thu, 17/03/2016 - 11:34

Create multiple Distributions of a single Dataset if users are mostly interested in the collection as such

I do not agree that it is a good idea to create multiple distributions to represent a dataset-series. This will mean that a distribution also is a Dataset. We think it is a better idea to use the dct:type mechanism and define DatasetSeries as a type of dataset that may have relations to its instances. Alternatively one could create a new Entity called DatasetSeries to represent the semantics of a dataset series.

Thu, 17/03/2016 - 17:47

The current situation is that a lot of people model members of dataset series, e.g. time-series -- see approach 2 in comment #1 -- but also slices/extractions of larger data files, as Distributions for a single Dataset. Some people might think it is not a good idea, but it is existing practice.

 

In last year's revision process, there were two opposing opinions in the group, one for approach 1 and one for approach 2. We did not reach consensus.

 

So the guideline tries to identify under which circumstances you might want to do one or the other. I tried to look at it from the perspective of user expectations, rather than from a more theoretical modelling perspective.

 

We currently don't have DatasetSeries as a dataset type or as a class, so that approach would have to go on the list for future revisions of the profile.

Fri, 18/03/2016 - 14:32

From the DCAT spec:
dcat:Distribution represents an accessible form of a dataset as for example a downloadable file. 
A dcat:Distribution is not meant to provide subsets of a dataset. A dcat:Distribution aims to cover cases where different forms of the same dataset exist. Existing cases that misuse dcat:Distribution should not prevail as the de facto way of dealing with subsets of a dcat:Dataset (if dataset series can be considered as subsets of a single dcat:Dataset).

Perhaps expressing dataset series is even out of scope for the DCAT spec, because its semantics go beyond the description of the dataset per se. Describing the continuation might be treated as an independent issue whose semantics can be described on the side, since it is a particular case of a relationship between datasets (as also indicated in the introduction of the issue).

An alternative solution could be to address a dataset series the way pagination is addressed in Web APIs (i.e. something like previousDataset/nextDataset, in the same way that Web APIs use previousPage/nextPage). That is, each dataset of a dataset series remains a dcat:Dataset and their sequence is described with a property like ex:previousDataset/ex:nextDataset. See the Hydra spec for details [1]. In this case, the semantics of the property should clearly indicate that the sequence is chronological (unless the type of collection is explicitly defined, see below).

Considering dataset series as a way of grouping certain datasets is only one way of describing a collection of datasets that share certain properties. I am not convinced that such metadata needs to be defined so explicitly, as trying to describe all possible relationships might prove to be impossible, redundant or even out of scope. However, if it is, an ex:Collection might be an ex:TimeSeriesCollection (narrower than an ex:SeriesCollection concept, which is narrower than the ex:Collection concept), as PagedCollection is in the Hydra spec for Web APIs [1], and each dcat:Dataset may declare membership of a certain Collection.
[1] http://www.hydra-cg.com/spec/latest/core/
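A minimal sketch of the previous/next idea, assuming one dcat:Dataset per year: ex:previousDataset and ex:nextDataset are the hypothetical property names used in this comment, not terms defined by DCAT or Hydra, and the ex: dataset identifiers are invented.

```python
# Sketch only: ex:previousDataset / ex:nextDataset are hypothetical properties,
# and the dataset identifiers are made up for the example.
def chain_chronologically(datasets_by_year):
    """Link each Dataset to its chronological neighbours."""
    ordered = [datasets_by_year[y] for y in sorted(datasets_by_year)]
    triples = []
    for earlier, later in zip(ordered, ordered[1:]):
        triples.append((earlier, "ex:nextDataset", later))
        triples.append((later, "ex:previousDataset", earlier))
    return triples


datasets = {2016: "ex:budget-2016", 2014: "ex:budget-2014", 2015: "ex:budget-2015"}
for triple in chain_chronologically(datasets):
    print(*triple)
```

The first and last Datasets in the chain simply lack a previous or next link, in the same way that the first and last pages of a paged collection do.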

Thu, 24/03/2016 - 00:20

Øystein, Makx,

About using dct:type to denote a dataset as a series, this is actually what is done in GeoDCAT-AP, by using the relevant resource type from the INSPIRE Registry:

http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series

For the background and motivation of this approach, see the following comment https://joinup.ec.europa.eu/discussion/pr11-add-new-properties-dataset-express-haspart-and-ispartof-relationships#comment-16527
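As a small sketch of what this looks like in data, the code below tags a Dataset with the INSPIRE resource type quoted above and checks for it. The ex: identifiers, the tuple representation and the helper names are assumptions made for the example.

```python
# The URI is the one quoted above; the ex: identifiers are hypothetical.
INSPIRE_SERIES = "http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series"


def mark_as_series(dataset_id):
    """Return the dct:type triple that flags a Dataset as a series."""
    return (dataset_id, "dct:type", f"<{INSPIRE_SERIES}>")


def is_series(dataset_id, triples):
    """True if the Dataset carries the series resource type."""
    return mark_as_series(dataset_id) in triples


triples = [
    mark_as_series("ex:budget"),
    ("ex:budget-2015", "dct:title", '"Budget 2015"'),
]
print(is_series("ex:budget", triples), is_series("ex:budget-2015", triples))
```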

 

Thu, 24/03/2016 - 19:35

Andrea,

We could maybe include a reference to the GeoDCAT approach in the guideline under the option "Create different Datasets if users are mostly interested in the individual members".

 

I could include:

 

The GeoDCAT Application Profile proposes the following approach:

Is this statement accurate?

Wed, 30/03/2016 - 21:04

Andrea,

We could maybe include a reference to the GeoDCAT approach in the guideline under the option "Create different Datasets if users are mostly interested in the individual members".

I could include:

 

The GeoDCAT Application Profile proposes the following approach:

Is this statement accurate?

 

Actually, GeoDCAT-AP does not say how to link series to its children. The reason is twofold:

  1. This is not required in INSPIRE metadata, and it is not supported in the core profile of ISO 19115
  2. How to do this was not agreed upon in DCAT-AP 1.1

So, the only thing that can be said by referring to GeoDCAT-AP is:

Tue, 12/04/2016 - 14:19

We have worked on a use case about (periodical) dataset series. Attached is a Google Slides document, open for comments, with a description of the use case and a proposal for representing the datasets.

https://docs.google.com/presentation/d/1bSMi8ZzUcaFPMByWdp8PG3N4aMlXkl4bRo6u48rIKt0/edit?usp=sharing

Jean, on behalf of OP/OpenDataPortal

Wed, 13/04/2016 - 12:21

@Jean, thanks for sharing the presentation.

May I ask why you decided to use properties :memberOf / :hasMember, instead of dct:isPartOf / dct:hasPart?

Tue, 26/04/2016 - 05:59

@Andrea

I propose :hasMember to conform to FRBRoo, which has a good representation of complexWork such as series, but using dct:isPartOf might be simpler as the dct ontology is already used. Using dct:isPartOf will not differentiate a Dataset representing a series from another Dataset which really has parts. If we use dct:isPartOf, it could be useful to have another way to indicate that a Dataset is a serial work, without Distributions.
