PR5 - Add new property to relate Datasets in a time series

Anonymous (not verified)
Published on: 10/03/2015 Discussion Archived

Description

From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-February/000120.html

Relate datasets to be able to create time series.

 

Proposed solution

Add new property to relate Datasets in a time series

Component

Code

Category

improvement

Comments

Makx DEKKERS Wed, 08/04/2015 - 11:03

The question is whether this functionality can be part of the proposal at MO12, or whether it could be done with pointers between Datasets such as xhv:next and xhv:previous (see: http://www.w3.org/1999/xhtml/vocab). In addition, is this information readily available in existing metadata?
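
A minimal sketch of the pointer approach, assuming each Dataset record carries links equivalent to xhv:next and xhv:previous. The dataset identifiers and the `walk_series` helper are hypothetical and only illustrate how a client could reconstruct the series by following the links:

```python
# Hypothetical dataset identifiers. The "next" and "previous" keys stand in
# for xhv:next and xhv:previous links between Dataset records.
datasets = {
    "stats-2015-01": {"previous": None, "next": "stats-2015-02"},
    "stats-2015-02": {"previous": "stats-2015-01", "next": "stats-2015-03"},
    "stats-2015-03": {"previous": "stats-2015-02", "next": None},
}


def walk_series(datasets, start):
    """Follow 'next' links from `start` and return the identifiers in order."""
    order = []
    seen = set()  # guards against accidental cycles in the links
    current = start
    while current is not None and current not in seen:
        seen.add(current)
        order.append(current)
        current = datasets[current]["next"]
    return order


print(walk_series(datasets, "stats-2015-01"))
```

Note that a client has to traverse the chain one link at a time to see the whole series, and the links alone do not say that the ordering is temporal.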

Anonymous (not verified) Thu, 09/04/2015 - 22:44

If we combine dct:isVersionOf and dct:temporal, we have a solution for describing a time series. We can relate each individual dataset to a dataset describing the series, and organize the individuals in a timeline using the dct:temporal information.
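
As a rough sketch of this combination, assuming each record exposes its dct:isVersionOf target and the start and end of its dct:temporal coverage (the record layout, identifiers and `timeline` helper are illustrative, not part of DCAT-AP):

```python
from datetime import date

# Hypothetical records: "isVersionOf" stands in for dct:isVersionOf and
# "temporal" for the (start, end) of a dct:temporal period.
records = [
    {"id": "prices-2015-03", "isVersionOf": "prices",
     "temporal": (date(2015, 3, 1), date(2015, 3, 31))},
    {"id": "prices-2015-01", "isVersionOf": "prices",
     "temporal": (date(2015, 1, 1), date(2015, 1, 31))},
    {"id": "prices-2015-02", "isVersionOf": "prices",
     "temporal": (date(2015, 2, 1), date(2015, 2, 28))},
    {"id": "unrelated", "isVersionOf": None,
     "temporal": (date(2015, 1, 1), date(2015, 12, 31))},
]


def timeline(records, series_id):
    """Return the members of a series, ordered by the start of their coverage."""
    members = [r for r in records if r.get("isVersionOf") == series_id]
    return sorted(members, key=lambda r: r["temporal"][0])


print([r["id"] for r in timeline(records, "prices")])
```

Here the ordering is derived from the temporal coverage rather than stored as explicit links, so no next/previous pointers need to be maintained when a new member is added.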

Anonymous (not verified) Fri, 17/04/2015 - 13:45

The problem here, and with similar issues elsewhere, is that with such properties we may indeed be able to describe different series, but not the semantics of the series itself (i.e. temporal, etc.).

Anonymous (not verified) Fri, 17/04/2015 - 14:59

I strongly disagree with using different 'Datasets' for a time series. 90% of releases of public data are updates to an existing dataset - why on earth would you call that a new dataset?

 

For example, we have 93 updates to this data:

http://data.gov.uk/dataset/land-registry-monthly-price-paid-data

Do the authors of PR5 say we should split that dataset into 93 datasets, each with exactly the same author, but with a slightly different distribution URL and a different temporal-coverage? What a mess. And that's monthly data, but we have plenty of weekly data that runs to *hundreds*.

No, of course not, that's what distributions are for! Simply have 1 dataset and 93 distributions, each specifying its temporal-coverage.
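
A minimal sketch of that layout, assuming a temporal coverage is attached to each distribution. The title, the example.org URLs and the `distributions_covering` helper are all hypothetical, and putting dct:temporal on a distribution is the proposal here, not something the DCAT spec defines:

```python
from datetime import date

# One dataset, many distributions. Each distribution carries its own
# (start, end) coverage; all values below are illustrative placeholders.
dataset = {
    "title": "Monthly price paid data",
    "distributions": [
        {"url": "https://example.org/price-paid-2015-01.csv",
         "temporal": (date(2015, 1, 1), date(2015, 1, 31))},
        {"url": "https://example.org/price-paid-2015-02.csv",
         "temporal": (date(2015, 2, 1), date(2015, 2, 28))},
        {"url": "https://example.org/price-paid-2015-03.csv",
         "temporal": (date(2015, 3, 1), date(2015, 3, 31))},
    ],
}


def distributions_covering(dataset, day):
    """Return the distributions whose coverage includes the given day."""
    return [
        d for d in dataset["distributions"]
        if d["temporal"][0] <= day <= d["temporal"][1]
    ]


for dist in distributions_covering(dataset, date(2015, 2, 14)):
    print(dist["url"])
```

The dataset-level metadata (title, author and so on) is stated once, and each update only adds one entry to the list of distributions.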

 

Doing it with distributions also helps someone looking at your portal's activity feed - they are probably much more interested in new data appearing than in regular updates to long-standing datasets, yet if you create new datasets for each of these then it's more difficult to separate them off.

 

Indeed if you create a new dataset for every update then your number of datasets goes spiralling upwards!

 

Quite often we provide an aggregate of all the data in addition to the data split up. This is an update to the same data, the same dataset, the same metadata; it makes NO sense to put it into a separate dcat:Dataset.

 

I can see where this has come from. The people producing the release want to feel they have contributed something substantial on every release, and get the validation that a 'new dataset' gives them. But our job is to help the users, not the publishers, and users are helped by having a single simple dataset and multiple distributions with properties saying what the slice is, be that temporal, spatial or any other thing. So let's not introduce relations for time series, as it only legitimizes bad practice.

Makx DEKKERS Fri, 17/04/2015 - 16:02

David, the model is indeed that this week's stats are a different dataset from last week's because the data is different. Yes, it could have been done differently, but that is the model that is the basis for DCAT as I understand it. For example, information like temporal and spatial coverage is only defined for the Dataset, not for the Distributions. So you could declare the 2015 data to be the dataset and say that its temporal coverage is "2015", but there would be no way (within DCAT) to say that a particular file only covers the month of March.

Anonymous (not verified) Mon, 20/04/2015 - 18:48

Makx, I don't see where that interpretation comes from. The DCAT spec defines a Dataset as a "collection of data, published or curated by a single agent, and available for access or download in one or more formats" so it is pretty clear that an update or addition to the collection should go in the same dataset. It reinforces that here: "Examples of distributions include a downloadable CSV file, an API or an RSS feed" so clearly updates on an RSS feed are part of the same dataset. So let's stick to the W3C text here.

 

To be honest, I think W3C simply didn't consider time series at all carefully. And that's why they've not thought to suggest dct:temporal on a distribution, but there's no reason why we can't use it.

 

Time series are so crucial to data catalogues now that it is pretty important we model them well, and I just don't see any usefulness in fudging them so that each one is a dataset and you link them with 'next' and 'prev'. We have a few in data.gov.uk like this and users hate them: e.g. http://data.gov.uk/data/search?q=&publisher=nhs-barnsley-ccg

Makx DEKKERS Mon, 20/04/2015 - 21:09

David, the interpretation comes from my involvement in the development of DCAT at W3C. As far as I have understood all along, the intention was not that the mentioned "CSV file, API or RSS feed" would be distributions of the same Dataset but would be examples of distributions of different datasets.

Of course, I cannot say that the interpretation that all distributions of one dataset contain the same data in different formats is the correct one, but at least it is the one we have been using as the basis of the DCAT Application Profile since its inception. We have had this question for a similar case, and our answer was that, while everybody is free to do this any which way they want (and I think it is the way CKAN does this), the DCAT-AP export needs to adhere to the rule "different data, different dataset".

The W3C group that developed DCAT did not consider time series at all, period. More generally, relationships between datasets were not considered at all.

 

Anonymous (not verified) Tue, 21/04/2015 - 13:23

Makx, it's good to acknowledge that barely anything has been published in DCAT or DCAT-AP-2013 that says that time series should be split into separate Datasets. So this is a great opportunity to update ourselves on what is being used in the field and what makes sense, and to put some guidance in DCAT-AP.

 

CKAN is certainly in favour of putting a time-series in a single dataset. For a long time CKAN's resource has had a date field. Certainly I don't know any open data site that adds a new dataset for every release.

 

I've mentioned several reasons to do time series as distributions, and I'll ask you all again. UK's Land Registry 'price paid' dataset provides the data in several different ways for different use cases - annual archives for previous years, 'year to date' for current year and monthly, as well as an API. Are you seriously arguing for this to be split into 93 datasets with one distribution in each? And you want some sort of 'forward' and 'backward' links to navigate them somehow? Please explain.

http://data.gov.uk/dataset/land-registry-monthly-price-paid-data

 

Real time data - e.g. bus arrivals API - it changes every few seconds - are you suggesting that a whole new dataset is created every minute or something because it's "new data"?!

 

Seeing that splitting time series into separate datasets runs against DCAT's definition of a dataset, the majority of existing usage and user experience, can we agree to drop that idea?