DI6: Detecting and handling duplicates

Published on: 09/02/2016
  • How to detect and handle duplicates
    • E.g. when metadata is harvested from different portals (e.g. geoportal, open data portal)?

Thu, 11/02/2016 - 14:05

Duplicates can occur when an aggregator harvests descriptions of datasets from various sources.

There could be two situations:


  1. In the harvested data, there are two or more descriptions of the same physical data file or API endpoint – in this case, the download or access URLs in the descriptions are the same
  2. One or more of the harvested sources describe a copy of the data file or API endpoint – in this case, the descriptions refer to different physical files


In the first case, the same URL appearing in different descriptions may indicate multiple descriptions of the same dataset. As it may be hard to determine which of the multiple descriptions is the better one, a sensible strategy might be to merge the multiple descriptions, de-duplicating properties when they have the same value.
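This merge strategy can be sketched in a few lines of Python. The dict-based representation and the field names (e.g. "downloadURL", "title") are illustrative stand-ins for the harvested metadata, not DCAT-AP's actual serialisation:

```python
from collections import defaultdict

def merge_by_url(descriptions):
    """Group harvested dataset descriptions by their download URL and
    merge each group, de-duplicating property values that are identical."""
    groups = defaultdict(list)
    for desc in descriptions:
        groups[desc["downloadURL"]].append(desc)

    merged = []
    for url, group in groups.items():
        combined = {"downloadURL": url}
        for desc in group:
            for prop, value in desc.items():
                if prop == "downloadURL":
                    continue
                # Keep each distinct value once; identical values collapse.
                combined.setdefault(prop, [])
                if value not in combined[prop]:
                    combined[prop].append(value)
        merged.append(combined)
    return merged

harvested = [
    {"downloadURL": "http://example.org/data.csv", "title": "Air quality 2015"},
    {"downloadURL": "http://example.org/data.csv", "title": "Air Quality data"},
]
result = merge_by_url(harvested)
# One merged description remains; both distinct titles are preserved.
```

Note that merging keeps conflicting values side by side rather than choosing between them, which matches the observation that it is hard to determine which description is the better one.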


In the second case, duplicates can only be detected by analysing the content of the descriptions. This case also covers the situation where no distributions are associated with the dataset, for example when access is only possible through the landing page of the dataset. De-duplicating on the basis of similarity of metadata is not trivial. There may be properties that can be useful in determining similarity, such as the title, identifier or landing page, but care should be taken not to merge descriptions of datasets that are similar but different.
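A minimal sketch of such a similarity check, using only the title property and Python's stdlib `difflib.SequenceMatcher` as one possible similarity measure (the 0.9 threshold is an arbitrary assumption, and in line with the caution above, the output is a list of candidates for review, not an instruction to merge):

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Crude string similarity between two titles, normalised to [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def possible_duplicates(descriptions, threshold=0.9):
    """Flag pairs of descriptions whose titles are near-identical.
    The pairs are candidates for human review, not automatic merging."""
    candidates = []
    for i in range(len(descriptions)):
        for j in range(i + 1, len(descriptions)):
            score = title_similarity(descriptions[i]["title"],
                                     descriptions[j]["title"])
            if score >= threshold:
                candidates.append((i, j, score))
    return candidates

pairs = possible_duplicates([
    {"title": "Air quality 2015"},
    {"title": "Air Quality 2015 "},
    {"title": "Water levels"},
])
# Only the first two descriptions are flagged as a candidate pair.
```

A real implementation would combine several properties (identifier, landing page, publisher) rather than relying on titles alone.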


It would be interesting to know whether aggregators of dataset descriptions have developed strategies or tools to handle those cases.

Wed, 16/03/2016 - 16:05

Proposed resolution:

  • Assign a stable identifier to the dataset in the catalogue where the dataset is first published.
  • Include this identifier as the value of adms:identifier.
  • Either locally minted or external identifiers (e.g. DataCite, DOI, ELI) may be used, as long as they are globally unique and stable.
  • Harvesting systems should not delete or change the value of adms:identifier, and should use it only to compare harvested metadata in order to detect duplicates.
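The detection step of this resolution can be sketched as follows. Records are again represented as plain dicts with the stable identifier stored under an illustrative "adms:identifier" key; the DOI-style values are made up for the example:

```python
from collections import defaultdict

def detect_by_stable_id(records):
    """Group harvested records by the stable identifier carried in
    adms:identifier; any group with more than one record is a set of
    candidate duplicates describing the same dataset."""
    by_id = defaultdict(list)
    for record in records:
        stable_id = record.get("adms:identifier")
        if stable_id is not None:
            by_id[stable_id].append(record)
    return {sid: recs for sid, recs in by_id.items() if len(recs) > 1}

records = [
    {"title": "Air quality", "adms:identifier": "doi:10.1000/xyz123"},
    {"title": "Air quality (mirror)", "adms:identifier": "doi:10.1000/xyz123"},
    {"title": "Water levels", "adms:identifier": "doi:10.1000/abc999"},
]
dupes = detect_by_stable_id(records)
# The two records sharing doi:10.1000/xyz123 are grouped as duplicates.
```

Note how the comparison works only because harvesters preserve the identifier unchanged, as the resolution requires.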

Thu, 17/03/2016 - 12:25

- Assign a stable identifier to the dataset in the catalogue where the dataset is first published

+1 on this

- Include this identifier as the value of adms:identifier.

This may work technically, but it has some issues. The property adms:identifier has range adms:Identifier, which means that adms:identifier must point to an object represented as an adms:Identifier, in which the stable identifier is stored. Harvesting systems must therefore also collect the Identifier objects. From a semantic point of view, the problem is that the property is optional and is described as an alternative identifier. It would be better to create a new property that holds this stable identifier.
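The structural point of this objection can be made concrete. In ADMS, the identifier string sits on a separate adms:Identifier node (conventionally as its skos:notation), so a harvester cannot read it directly off the dataset; it must collect and traverse the intermediate node. Sketched with nested dicts (the JSON-LD-like keys are illustrative):

```python
# A dataset description where the stable identifier is not a direct
# literal value of adms:identifier, but lives inside a separate
# adms:Identifier node that harvesters must also collect.
dataset = {
    "@id": "http://example.org/dataset/123",
    "adms:identifier": {                        # points to an adms:Identifier
        "@type": "adms:Identifier",
        "skos:notation": "doi:10.1000/xyz123",  # the stable identifier itself
        "adms:schemeAgency": "DataCite",
    },
}

def stable_id(ds):
    """Dig the stable identifier out of the nested adms:Identifier node."""
    node = ds.get("adms:identifier") or {}
    return node.get("skos:notation")
```

The extra level of indirection is exactly what a hypothetical new property with a literal range would avoid.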

Thu, 17/03/2016 - 18:35

Your proposal to create a new property is out of scope for the work we're doing on the guidelines, which is restricted to giving advice on how to use the current DCAT-AP specification. We can of course put this on the list for the next revision of the profile, but then we have no solution for the short term.

Fri, 18/03/2016 - 15:49

Dataset URIs are already identifiers...


But perhaps it makes more sense to first aim to identify and explicitly state that there is a relationship between two datasets, and then let tools decide how to treat the datasets at the moment they 'parse' them. For instance, this can be done using the dct:relation property (which restricts neither the domain nor the range). (In my opinion, owl:sameAs might be too strict to be considered as an alternative.)
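This linking-instead-of-merging approach can be sketched briefly, again with illustrative dict records. The symmetric link mirrors the fact that dct:relation asserts a relationship without claiming identity, leaving interpretation to the consuming tool:

```python
def relate(ds_a, ds_b):
    """Record an explicit, symmetric dct:relation link between two dataset
    descriptions instead of merging them or asserting owl:sameAs."""
    ds_a.setdefault("dct:relation", []).append(ds_b["@id"])
    ds_b.setdefault("dct:relation", []).append(ds_a["@id"])

a = {"@id": "http://portal-a.example/dataset/1"}
b = {"@id": "http://portal-b.example/dataset/9"}
relate(a, b)
# Each description now carries a dct:relation link to the other.
```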


It might be dangerous to make assumptions about how identical two datasets are, because such an assessment holds only at a single moment and might not hold at a different time (given that versioning/evolution are not clearly addressed).

Thu, 24/03/2016 - 20:54

I do agree with Anastasia that any mechanism to detect potential duplicates needs to take into account that things might change between the time the original data was created and the time that a dataset description is harvested.

The proposed recommendation to use a special ("global") identifier to help detect potential duplicates does not provide a solution for the handling of such duplicates.

For example, even if two Dataset descriptions refer to the same global identifier, the metadata in the two descriptions might be different. What should be done with such differences, especially if they are conflicting? E.g. the two catalogues where the descriptions come from may use different classification schemes, or the description in one of the two catalogues contains corrections to the metadata.

Also, even if the two descriptions refer to the same physical file (dcat:downloadURL), do they refer to the exact same file, or was there a change to the file between the times the data was described in the two catalogues?

So the proposal may help with the detection of potential duplicates, but it is only a first step; further actions are needed to determine what to do with those duplicates.

Tue, 12/04/2016 - 16:23

We support this proposal.

Jean, on behalf of OP/OpenDataPortal

Tue, 06/09/2016 - 18:32