DCAT-AP: How to manage duplicates?

Published on: 30/03/2016 Last update: 08/11/2017

How to manage duplicates?

 

Issue

Duplicates can occur when an aggregator harvests descriptions of datasets from various sources. Two situations are possible:

1. In the harvested data, there are two or more descriptions of the same physical data file or API/endpoint; in this case, the download or access URLs in the descriptions are identical (a sketch for detecting this case follows the list).

2. One or more of the harvested sources describe a copy of the data file or API/endpoint; in this case, the descriptions refer to different physical files.
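For the first situation, duplicates can be spotted mechanically by grouping descriptions on their download or access URLs. Below is a minimal sketch in Python, assuming the harvested descriptions have already been flattened into simple records; the sources, field names and URLs are purely illustrative and not part of DCAT-AP.

from collections import defaultdict

# Hypothetical flattened records: one per harvested dataset description,
# with the download/access URLs found in its distributions.
records = [
    {"source": "portal-a", "urls": {"http://data.example.eu/files/companies.csv"}},
    {"source": "portal-b", "urls": {"http://data.example.eu/files/companies.csv"}},
    {"source": "portal-c", "urls": {"http://data.example.eu/files/harbour.csv"}},
]

# Situation 1: descriptions of the same physical file or endpoint share a
# download/access URL, so grouping by URL exposes them.
by_url = defaultdict(list)
for record in records:
    for url in record["urls"]:
        by_url[url].append(record["source"])

for url, sources in by_url.items():
    if len(sources) > 1:
        print(f"possible duplicates for {url}: {sources}")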

Current situation

Many DCAT-AP implementers struggle with the identification and handling of duplicate datasets. Duplicates are particularly a problem when a central data portal or aggregator, for example at national level, harvests datasets from other data portals, for example regional data portals. When the same dataset exists on several regional portals and is not identified by the same stable identifier, it is difficult for the national data portal to detect the duplicates automatically.

 

Recommendation

  • Assign a stable identifier to the dataset in the catalogue where the dataset is first published. This should be the primary identifier of the dataset. Include this identifier as the value of dct:identifier.

  • In the case of duplicates, other locally minted identifiers or external identifiers, such as DataCite identifiers, DOIs, ELIs, etc., may be assigned to the dataset. As long as they are globally unique and stable, these identifiers should be included as values of adms:identifier.

  • Harvesting systems should not delete or change the values of adms:identifier; they should only use them to compare harvested metadata and detect duplicates (see the sketch after this list).
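As an illustration of the last point, here is a minimal sketch in Python of a harvester-side duplicate check. It is not part of any DCAT-AP specification: the record layout and field names (dct_identifier, adms_identifiers) are hypothetical stand-ins for whatever a harvester extracts from the metadata.

def identifiers(record):
    """All known identifiers of a description: the primary dct:identifier
    plus any adms:identifier notations carried along by harvesters."""
    return {record["dct_identifier"], *record.get("adms_identifiers", ())}

def find_duplicates(records):
    """Report pairs of descriptions that share at least one identifier."""
    seen = {}        # identifier -> index of the first record carrying it
    duplicates = []  # (index of original, index of duplicate)
    for i, record in enumerate(records):
        keys = identifiers(record)
        match = next((seen[k] for k in keys if k in seen), None)
        if match is not None:
            duplicates.append((match, i))
        else:
            for k in keys:
                seen[k] = i
    return duplicates

records = [
    {"dct_identifier": "http://data.city.eu/datasets/12345",
     "adms_identifiers": {"10.1000/182"}},
    {"dct_identifier": "http://data.city.eu/datasets/12345",
     "adms_identifiers": {"10.1000/182", "138472638"}},
]
print(find_duplicates(records))  # [(0, 1)]

Because every identifier a record has ever carried is kept, a match on any one of them is enough to link a newly harvested record to one seen earlier.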

Rationale

The existence of duplicate datasets within and across data portals leads to multiple interoperability-related issues. Since representations of one dataset exist on several portals, it is difficult for a data consumer to identify the original source, which may be necessary to find the original licence statements, provenance information, linked datasets, etc.

Example

The example below describes the same dataset on three data portals:

 

1. The original dataset, uploaded on the data portal of a city. As the dataset is first published here, a stable identifier is assigned as the value of dct:identifier.

<rdf:Description rdf:about="http://data.city.eu/datasets/12345">
<rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
<dct:title xml:lang="en">Companies located in the city harbour</dct:title>
<!-- stable primary identifier; reusing the dataset URI here is one possible choice -->
<dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
</rdf:Description>

 

2. The dataset as harvested by a regional data portal. A further identifier, for example one minted locally by the regional portal or an external identifier such as a DOI, can be added as a value of adms:identifier. The global identifier (dct:identifier) remains unchanged.

<rdf:Description rdf:about="http://data.region.eu/datasets/34567">
<rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
<dct:title xml:lang="en">Companies located in the city harbour</dct:title>
<!-- primary identifier, carried over unchanged from the original record -->
<dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
<adms:identifier rdf:parseType="Resource">
<skos:notation>10.1000/182</skos:notation>
</adms:identifier>
</rdf:Description>

 

3. A national data portal harvests both the city and the regional portal. As the global, primary identifier remains unchanged, the national portal can automatically detect that the two sources refer to the same dataset, and the dataset is not duplicated on the national portal. All previously assigned adms:identifier values are preserved.

<rdf:Description rdf:about="http://data.country.eu/datasets/56789">
<rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
<dct:title xml:lang="en">Companies located in the city harbour</dct:title>
<!-- primary identifier, carried over unchanged from the original record -->
<dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
<adms:identifier rdf:parseType="Resource">
<skos:notation>10.1000/182</skos:notation>
</adms:identifier>
<adms:identifier rdf:parseType="Resource">
<skos:notation>138472638</skos:notation>
</adms:identifier>
</rdf:Description>
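As a closing check that this pattern is machine-actionable, the sketch below parses the regional record with the rdflib Python library and extracts its adms:identifier notation. It wraps the excerpt in the namespace declarations that the examples above leave implicit; apart from that, nothing is added.

from rdflib import Graph, Namespace

ADMS = Namespace("http://www.w3.org/ns/adms#")
SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

# The regional record from step 2, with explicit namespace declarations.
data = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:adms="http://www.w3.org/ns/adms#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://data.region.eu/datasets/34567">
    <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
    <dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
    <adms:identifier rdf:parseType="Resource">
      <skos:notation>10.1000/182</skos:notation>
    </adms:identifier>
  </rdf:Description>
</rdf:RDF>"""

g = Graph()
g.parse(data=data, format="xml")

# Collect every adms:identifier notation in the graph; a harvester would
# compare these sets across harvested records to detect duplicates.
notations = {
    str(n)
    for ident in g.objects(None, ADMS.identifier)
    for n in g.objects(ident, SKOS.notation)
}
print(notations)  # {'10.1000/182'}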