PR15 - Add new properties to refer to the origin of the dataset and whether it was already harvested

10/03/2015

Description

From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-March/000125.html

If the same dataset is harvested from different sources, it should be possible to detect the resulting duplicates reliably so that they can be handled appropriately. Combined with the modification date, this allows several strategies for resolving duplicates, even when the copies have been changed at different locations/portals.

Proposed solution

Add new properties to refer to the origin of the dataset and whether it was already harvested

Component

Code

Category

improvement

Comments

Fri, 17/04/2015 - 13:22

We certainly have the need for this sort of thing in the UK.

 

Makx wrote in the Resolution proposals: "The dataset origin can be expressed with dct:source as proposed in the resolution of the provenance issue. The harvesting status is out of scope for the DCAT-AP."

 

I agree. Knowing only whether a record was harvested is a very shallow way to look at it, and doesn't help in most of the use cases.

What you need is a unique identifier for the source (dct:source is good). With that, you can avoid duplicates when you've harvested the same record from two different places.

 

Storing the last modification date of the record is also useful, so that you can pick the more recent one.
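
To illustrate the two points above, here is a minimal sketch of how a harvester might resolve duplicates. It assumes each harvested record is a plain dictionary carrying a dct:source URI and a dct:modified date; the function name deduplicate, the harvested_from key and the example.org URIs are all made up for illustration and are not part of DCAT-AP.

```python
from datetime import date


def deduplicate(records):
    """Keep one record per dct:source, preferring the latest dct:modified.

    Assumption: every record has both keys and dct:modified is comparable.
    """
    newest = {}
    for record in records:
        key = record["dct:source"]
        kept = newest.get(key)
        # Replace the kept record only if this one was modified more recently.
        if kept is None or record["dct:modified"] > kept["dct:modified"]:
            newest[key] = record
    return list(newest.values())


# Hypothetical records harvested from two portals; the first two describe
# the same original dataset because they share a dct:source.
harvested = [
    {"dct:source": "http://example.org/dataset/1",
     "dct:modified": date(2015, 3, 1), "harvested_from": "portal-a"},
    {"dct:source": "http://example.org/dataset/1",
     "dct:modified": date(2015, 4, 2), "harvested_from": "portal-b"},
    {"dct:source": "http://example.org/dataset/2",
     "dct:modified": date(2015, 1, 15), "harvested_from": "portal-a"},
]

unique = deduplicate(harvested)
```

Picking the most recent copy is only one possible strategy; a portal could equally prefer the copy from a trusted catalogue and use the date as a tie-breaker.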

 

And lastly, I'd quite like to include a reference to the catalogue it was sourced from. It is sometimes not straightforward for a machine to translate the dct:source URI into a catalogue name and home page, so a link to the catalogue metadata would be a bonus.

Mon, 27/04/2015 - 16:41
