How to model dataset series?
- Approaches for time series, dataset slices, other relationships
During the revision process of DCAT-AP in 2015, it was noted that the DCAT specification only considers relationships between a catalogue and the datasets described in the catalogue, and between a dataset and the distributions that represent the manifestations of the dataset.
The specification of DCAT is silent on any relationships between catalogues, between datasets and between distributions.
In real-world implementations, such relationships do exist and may be modelled in different ways. Time series are one example: in some implementations they are modelled as distributions of a single dataset; in others, as separate datasets with or without links between them.
A common approach to modelling such relationships in DCAT-AP would improve interoperability among catalogues.
Comments
To start the discussion, I went to datahub.io and searched for ‘budget’. A quick survey brought up five different existing approaches. What are the views of implementers of DCAT-AP? Do one or more of these approaches also exist in your implementations?
Approach 1: one Dataset per year with one or more files (Distributions) that contain Budget figures for that year.
Examples:
Approach 2: one Dataset for a range of years with data files for each of the individual years as Distributions with descriptions that contain the year covered.
Example:
Approach 3: one Dataset for a range of years with one or more data files (Distributions) that cover the whole range of years
Approach 4: one Dataset describing a Website where a user might search, browse or navigate in a collection of data with no data files (Distributions)
Example:
Approach 5: one Dataset for a Website with one or more data files (Distributions) that link to a SPARQL endpoint or Website where a user might search, browse or navigate in a collection of data
Examples:
Maybe a Dataset should be given the optional properties:
dct:hasPart
dct:isPartOf
Then it would be easy to model these relations between datasets...
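As a rough sketch of what this could look like for approach 1 (one Dataset per year), the snippet below links a parent Dataset to yearly Datasets in both directions. The example.org URIs and the helper name link_series are made up for illustration; triples are kept as plain tuples rather than using any RDF library.

```python
# Sketch only: triples are plain (subject, predicate, object) tuples.
DCT = "http://purl.org/dc/terms/"


def link_series(series_uri, member_uris):
    """Return dct:hasPart / dct:isPartOf triples linking a parent Dataset to its members."""
    triples = []
    for member in member_uris:
        triples.append((series_uri, DCT + "hasPart", member))
        triples.append((member, DCT + "isPartOf", series_uri))
    return triples


# Hypothetical URIs, for illustration only.
series = "http://example.org/dataset/budget"
members = [f"http://example.org/dataset/budget-{year}" for year in (2013, 2014, 2015)]
triples = link_series(series, members)

# Print the triples in an N-Triples-like form.
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

Each yearly Dataset would still carry its own Distributions; only the links between Datasets are shown here.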
Uwe, it would be interesting to know how people are doing this in practice. If approach 1 is prevalent, then indeed the use of dct:hasPart and dct:isPartOf works. However, it does not help with the other approaches.
Proposed resolution:
Dataset series - versioning
I do not agree that it is a good idea to create multiple distributions to represent a dataset series. This would mean that a distribution is also a Dataset. We think it is a better idea to use the dct:type mechanism and define DatasetSeries as a type of dataset that may have relations to its instances. Alternatively, one could create a new Entity called DatasetSeries to represent the semantics of a dataset series.
The current situation is that a lot of people model members of dataset series, e.g. time-series -- see approach 2 in comment #1 -- but also slices/extractions of larger data files, as Distributions for a single Dataset. Some people might think it is not a good idea, but it is existing practice.
In last year's revision process, there were two opposing opinions in the group, one for approach 1 and one for approach 2. We did not reach consensus.
So the guideline tries to identify under which circumstances you might want to do one or the other. I tried to look at it from the perspective of user expectations, rather than from a more theoretical modelling perspective.
We currently don't have DatasetSeries as a dataset type or as a class, so that approach would have to go on the list for future revisions of the profile.
Øystein, Makx,
About using dct:type to denote a dataset as a series, this is actually what is done in GeoDCAT-AP, by using the relevant resource type from the INSPIRE Registry:
http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series
For the background and motivation of this approach, see the following comment https://joinup.ec.europa.eu/discussion/pr11-add-new-properties-dataset-express-haspart-and-ispartof-relationships#comment-16527
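To make the dct:type option concrete, here is a minimal sketch that writes out a Turtle description of a Dataset typed with the INSPIRE resource type above. The dataset URI and the function name series_turtle are assumptions for illustration; the output is built as a plain string, not with an RDF library.

```python
INSPIRE_SERIES = "http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series"


def series_turtle(dataset_uri):
    """Return a Turtle description typing the Dataset as a series via dct:type."""
    return (
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
        "@prefix dct: <http://purl.org/dc/terms/> .\n"
        "\n"
        f"<{dataset_uri}> a dcat:Dataset ;\n"
        f"    dct:type <{INSPIRE_SERIES}> .\n"
    )


# Hypothetical dataset URI, for illustration only.
turtle = series_turtle("http://example.org/dataset/budget")
print(turtle)
```

Note that this only marks the Dataset as a series; it says nothing about how the series is linked to its members.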
Andrea,
We could maybe include a reference to the GeoDCAT approach in the guideline under the option "Create different Datasets if users are mostly interested in the individual members".
I could include:
The GeoDCAT Application Profile proposes the following approach:
one Dataset description is created with a dct:type of http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series, linking to the members in the series using dct:hasPart;
for the individual members of the series, separate Dataset descriptions are created that can link back to the series using dct:isPartOf.
Is this statement accurate?
Actually, GeoDCAT-AP does not say how to link a series to its children. The reason is twofold:
So, the only thing that can be said by referring to GeoDCAT-AP is:
We have worked on a use case about dataset series (periodical). Attached is a Google Slides document, open for comments, with a description of the use case and a proposal for representing the datasets.
https://docs.google.com/presentation/d/1bSMi8ZzUcaFPMByWdp8PG3N4aMlXkl4bRo6u48rIKt0/edit?usp=sharing
Jean, on behalf of OP/OpenDataPortal
@Jean, thanks for sharing the presentation.
May I ask why you decided to use properties :memberOf / :hasMember, instead of dct:isPartOf / dct:hasPart?
@Andrea
I proposed :hasMember to be conformant with FRBRoo, which has a good representation of complexWork such as series, but using dct:isPartOf might be simpler, as the dct ontology is already used. However, using dct:isPartOf will not differentiate a Dataset representing a series from another Dataset that really has parts. If we use dct:isPartOf, it would be useful to have another way to indicate that a Dataset is a serial work, without Distributions.
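One possible way to make that distinction, sketched below, is to combine the dct:type suggestion made earlier in this thread with a check that the Dataset has no Distributions. This is only an illustration: the use of the INSPIRE series URI as the type value, the example.org URIs and the helper name is_series are all assumptions, not something agreed for DCAT-AP.

```python
DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"
# Assumption: the INSPIRE resource type is reused as the marker for a series.
SERIES_TYPE = "http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series"


def is_series(dataset, triples):
    """True if the Dataset is typed as a series and has no Distributions."""
    typed = (dataset, DCT + "type", SERIES_TYPE) in triples
    has_distribution = any(
        s == dataset and p == DCAT + "distribution" for s, p, _ in triples
    )
    return typed and not has_distribution


# Hypothetical data: a series Dataset with one member that has a Distribution.
triples = {
    ("http://example.org/dataset/budget", DCT + "type", SERIES_TYPE),
    ("http://example.org/dataset/budget", DCT + "hasPart",
     "http://example.org/dataset/budget-2015"),
    ("http://example.org/dataset/budget-2015", DCT + "isPartOf",
     "http://example.org/dataset/budget"),
    ("http://example.org/dataset/budget-2015", DCAT + "distribution",
     "http://example.org/distribution/budget-2015-csv"),
}

print(is_series("http://example.org/dataset/budget", triples))       # True
print(is_series("http://example.org/dataset/budget-2015", triples))  # False
```

With this, dct:hasPart / dct:isPartOf can stay generic, and the type value carries the "serial work" meaning.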