
PR14 - Add new property to express lineage

Anonymous (not verified)
Published on: 10/03/2015 Discussion Archived

Description

From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-February/000123.html

Data lineage is information about the data life cycle, e.g., the methodology and workflow used to collect or create the data. Lineage is usually present in the metadata of scientific datasets, and it is not uncommon in the public sector either. For example, specifying lineage is a legal requirement for INSPIRE metadata. It would be desirable to add "lineage" as an optional element in DCAT-AP 2.0. Lineage can be modelled using dct:provenance.

An example of how to use it is provided in the reference specification of the INSPIRE profile of DCAT-AP. In that example, lineage is specified through a free-text description. For a machine-readable representation, one option is to use PROV-O. This could be one of the use cases mentioned in issue JRC.8.4.
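As a rough illustration of the free-text option, the sketch below builds a Turtle fragment that attaches a lineage statement to a dataset via dct:provenance. The helper name `lineage_turtle`, the dataset URI, and the lineage text are all hypothetical. Wrapping the text in a dct:ProvenanceStatement with an rdfs:label is one common encoding and is an assumption here, not something this proposal prescribes.

```python
def lineage_turtle(dataset_uri: str, lineage_text: str, lang: str = "en") -> str:
    """Return a Turtle fragment with a free-text lineage statement (hypothetical helper)."""
    # Escape backslashes, double quotes and newlines so the text stays
    # a valid single-line Turtle string literal.
    escaped = (
        lineage_text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return (
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
        "@prefix dct:  <http://purl.org/dc/terms/> .\n"
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
        "\n"
        f"<{dataset_uri}> a dcat:Dataset ;\n"
        "    dct:provenance [\n"
        "        a dct:ProvenanceStatement ;\n"
        f'        rdfs:label "{escaped}"@{lang}\n'
        "    ] .\n"
    )


# Example with a made-up dataset URI and lineage description.
print(lineage_turtle(
    "http://example.org/dataset/1",
    "Derived from field survey data collected in 2014.",
))
```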

Proposed solution

Add new property to express lineage

Component

Code

Category

improvement

Comments

Anonymous (not verified) Fri, 10/04/2015 - 19:23

Yes, using dct:provenance for lineage info would align with GeoDCAT-AP developments.

dct:provenance, adms:version and adms:versionNotes would be quite expressive by themselves.

However, there would not be a formal link between datasets. PROV-O would provide this, but might be overkill. This links back to the 'relationship between datasets' discussion and the possible use of dct:isVersionOf.

In practice, distributions/resources are sometimes used as versions, especially for dynamic datasets: for example, one dataset with a different distribution for each year. Is this compliant with DCAT-AP?

Andrea PEREGO Wed, 15/04/2015 - 01:27

I wonder whether we could propose a number of possible options, depending on the use case.

For example:

  • "I can link to the input datasets": use dct:source.
  • "I have a textual description of the dataset's lineage": use dct:provenance.
  • "I have both": use dct:source and dct:provenance.
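The three cases above can be sketched as a small decision helper. This is only an illustration: the function name `lineage_properties` and its parameters are hypothetical, and an empty result is taken to mean that neither simple option applies.

```python
from typing import List, Optional


def lineage_properties(
    source_uris: Optional[List[str]] = None,
    lineage_text: Optional[str] = None,
) -> List[str]:
    """Pick the properties to use, given what the publisher has (hypothetical helper)."""
    props: List[str] = []
    # "I can link to the input datasets" -> dct:source
    if source_uris:
        props.append("dct:source")
    # "I have a textual description of the dataset's lineage" -> dct:provenance
    if lineage_text and lineage_text.strip():
        props.append("dct:provenance")
    return props


print(lineage_properties(["http://example.org/input"], "Aggregated from input."))
```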

For more complex use cases, the recommendation could be to use PROV. For instance, PROV can also link to the model used to generate the dataset from the input dataset(s), and it can represent a more complex workflow (e.g., one consisting of a model "chain"), data collected directly from instruments or sensors, etc.
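To make the model-based case a bit more concrete, here is a minimal sketch that lists the PROV-O statements one might emit for a dataset produced by running a model over input datasets. The helper `prov_triples`, the `ex:` identifiers, and the choice to treat the model as a plain prov:Entity used by the activity are assumptions for illustration; other PROV patterns are possible.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def prov_triples(output: str, inputs: List[str], model: str, activity: str) -> List[Triple]:
    """List PROV-O triples for a model run producing `output` (hypothetical helper)."""
    triples: List[Triple] = [
        (output, "rdf:type", "prov:Entity"),
        (activity, "rdf:type", "prov:Activity"),
        (output, "prov:wasGeneratedBy", activity),
        # Assumption: the model is described as an entity used by the activity.
        (model, "rdf:type", "prov:Entity"),
        (activity, "prov:used", model),
    ]
    for inp in inputs:
        triples.append((inp, "rdf:type", "prov:Entity"))
        triples.append((activity, "prov:used", inp))
        # Direct dataset-to-dataset link.
        triples.append((output, "prov:wasDerivedFrom", inp))
    return triples


for triple in prov_triples("ex:output", ["ex:input1", "ex:input2"], "ex:model", "ex:run1"):
    print(" ".join(triple), ".")
```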

In my understanding, PROV can be proposed as the reference vocabulary for data provenance / lineage, and it may be replaced by less complex "solutions" (dct:source, dct:provenance) for specific (and simple) use cases.

These options needn't be mutually exclusive, and there shouldn't be interoperability issues here, since we already have a mapping between dct:source and prov:wasDerivedFrom and between dct:provenance and prov:has_provenance (see http://www.w3.org/TR/prov-dc/).
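As a sketch of why the simple options should stay interoperable with PROV, the snippet below derives PROV statements from Dublin Core ones using the two mappings named in this comment. The mapping table is taken from the comment as written and has not been checked against the PROV-DC note here; the helper `add_prov_view` and the example triples are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Mappings as stated in the comment above (assumed, not verified here).
DC_TO_PROV: Dict[str, str] = {
    "dct:source": "prov:wasDerivedFrom",
    "dct:provenance": "prov:has_provenance",
}


def add_prov_view(triples: List[Triple]) -> List[Triple]:
    """Return the input triples plus the PROV statements implied by the mapping."""
    result = list(triples)
    for subject, predicate, obj in triples:
        mapped = DC_TO_PROV.get(predicate)
        # Add the implied PROV statement once, keeping the original triple.
        if mapped is not None and (subject, mapped, obj) not in result:
            result.append((subject, mapped, obj))
    return result


data = [
    ("ex:dataset", "dct:source", "ex:input"),
    ("ex:dataset", "dct:provenance", "ex:lineageStatement"),
    ("ex:dataset", "dct:title", "Example"),
]
for triple in add_prov_view(data):
    print(triple)
```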

Anonymous (not verified) Mon, 27/04/2015 - 18:41