PR27 - Indicate type of data in Dataset

07/04/2015

It could be useful to indicate the type of data in a Dataset, for example to signal that a dataset is a code list, textual information, spreadsheet data, etc. This could be done by using dct:type on the Dataset with a controlled vocabulary such as DCMI Type, the Dataset Type vocabulary of the European Union Open Data Portal, or the ADMS Asset Type vocabulary.

From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-February/000122.html

 

Component

Documentation

Category

improvement

Comments

Wed, 08/04/2015 - 10:13

I think we should be careful to only allow sub-types of a dataset, rather than other things. When looking through a data catalogue, it doesn't make sense to see a code-list. Although it is structured, it is not information - it is a list of arbitrary terms - so not a dataset. So I wouldn't want to see a code-list in a data catalogue. Nor would I want to see a natural language document in a data catalogue.

 

It would be good to see some examples in RDF of this proposal.

 

The suggestion of DCMI Type seems a bit odd, since one of its terms is http://purl.org/dc/dcmitype/Dataset and none of the others are datasets by definition.

 

I couldn't find the "Dataset Type vocabulary of the European Union Open Data Portal" - Jose, what's the link?

 

There seems to be one example in ADMS that appears to be what Jose is getting at:

:Fruit_02 a adms:Asset ; dcterms:type <http://purl.org/adms/assettype/CodeList> .

So if the assettype terms were defined somewhere, we could say if a dataset/thing was a "semantic asset" i.e. reference data (e.g. code lists, taxonomies, dictionaries, vocabularies). Which is good, but I still question whether these things should be in a data catalogue.

 

For example, weather data is defined by the WMO, and their spec contains dozens of codelists for every field - esoteric things like types of "Runway deposit" http://codes.wmo.int/bufr4/codeflag/0-20-086 . Surely the specification for talking about the weather doesn't live in a data catalog? It's clearly useful, but let's have a separate catalogue for this sort of thing.

Wed, 08/04/2015 - 10:49

It needs to be noted, though, that the definition of a Dataset in DCAT is very broad: "A collection of data, published or curated by a single agent, and available for access or download in one or more formats." The question came up in the W3C group discussions, and the conclusion was that in practice anything could be a dataset: PDF files, images, etc.

Moreover, adms:Asset is explicitly declared to be a subclass of dcat:Dataset (http://www.w3.org/TR/vocab-adms/#asset), so a code list, which is an adms:Asset, is by definition a Dataset.

So, if someone wanted to build a catalogue of code lists, they could create a dcat:Catalog in which the code lists are the datasets. 
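The subclass argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are modelled as plain tuples of prefixed names rather than with an RDF library, the resource :Fruit_02 is borrowed from the ADMS example quoted earlier in the thread, and the helper names superclasses and is_instance_of are made up for this sketch.

```python
# Triples as plain (subject, predicate, object) tuples of prefixed names.
# This is a toy model for illustration, not a real RDF store.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    # Subclass statement cited in the comment above (ADMS specification).
    ("adms:Asset", SUBCLASS_OF, "dcat:Dataset"),
    # Example resource borrowed from the ADMS snippet quoted in the thread.
    (":Fruit_02", RDF_TYPE, "adms:Asset"),
    (":Fruit_02", "dcterms:type", "<http://purl.org/adms/assettype/CodeList>"),
}


def superclasses(cls, triples):
    """Return cls together with every class reachable via rdfs:subClassOf."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(o for s, p, o in triples if s == current and p == SUBCLASS_OF)
    return seen


def is_instance_of(resource, cls, triples):
    """True if resource is declared as cls or as any subclass of cls."""
    declared = [o for s, p, o in triples if s == resource and p == RDF_TYPE]
    return any(cls in superclasses(d, triples) for d in declared)


print(is_instance_of(":Fruit_02", "dcat:Dataset", triples))
```

Under these assumptions the code list counts as a dcat:Dataset purely through the subclass statement, which is the formal point being made; whether DCAT-AP should encourage cataloguing such resources is a separate question.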

The question is: do we want the DCAT-AP to take into account such a use case or not?

Wed, 08/04/2015 - 14:25

We should also not mix Dataset types (e.g. a code list) with Distribution media types and/or formats as defined in DCAT. 

 

@David, I do not agree that a code list is not a dataset (cf. the definition of Dataset in DCAT). For example, a list of all recognised countries or currencies is, IMHO, a dataset of very high value.

Thu, 09/04/2015 - 18:39

Several questions come to mind:

 

1. What kinds of dataset types would we want to distinguish, given the fact that "dataset" has a very broad meaning?

 

2. Are there existing controlled vocabularies that we can use for this?

 

3. Do dataset creators/maintainers have, or are they willing to provide, information about the type of their datasets?

 

4. Do dataset catalogues usually have functionalities to filter on such a dataset type?

Thu, 09/04/2015 - 19:54

Typing the dataset seems a useful thing to do, as Dataset is quite vague. We all have examples of various types of datasets: statistical data, controlled vocabularies, directories, bibliographical records...

The type of a dataset is a business-related matter, not to be confused with format and mediaType.

At this stage we could add an object property. In a later release we could have a recommendation for a controlled vocabulary to use.

"Type" can be a useful facet to filter the datasets in an open data portal.
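As a rough sketch of what such a facet could look like in a portal, the Python below counts and filters catalogue records by type. Everything here is hypothetical: the records, the field names and the helper functions facet_counts and filter_by_type are invented for illustration, and the type labels are simply the examples mentioned above, not an agreed vocabulary.

```python
from collections import defaultdict

# Hypothetical catalogue records. "type" is kept separate from "format",
# and it is optional, so some records carry no type at all.
datasets = [
    {"title": "Population by region", "type": "statistical data", "format": "CSV"},
    {"title": "Country codes", "type": "controlled vocabulary", "format": "RDF/XML"},
    {"title": "Register of hospitals", "type": "directory", "format": "CSV"},
    {"title": "Library holdings", "type": "bibliographical records", "format": "MARC"},
    {"title": "Legacy upload", "format": "PDF"},
]


def facet_counts(records, key="type"):
    """Count records per facet value; records lacking the key are grouped together."""
    counts = defaultdict(int)
    for record in records:
        counts[record.get(key, "(untyped)")] += 1
    return dict(counts)


def filter_by_type(records, wanted):
    """Return the titles of the records whose type equals the wanted value."""
    return [record["title"] for record in records if record.get("type") == wanted]


print(facet_counts(datasets))
print(filter_by_type(datasets, "directory"))
```

Note that two records share the CSV format while having different types, which is why the type facet cannot be derived from format or mediaType.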

 

Jean

Thu, 16/04/2015 - 13:13

In the EU ODP we are using the ADMS asset type controlled vocabulary (see https://joinup.ec.europa.eu/svn/adms/ADMS_v1.00/ADMS_SKOS_v1.00.html) to classify the datasets. This list is not sufficient to classify the EU datasets.

There is a facet for classifying Datasets that is not covered today: the type of Dataset, which we define as statistical data, directory, controlled vocabulary, schema, knowledge base, and so on. This facet is not covered by Theme/Subject, which covers the domain of the dataset, nor by format/mediaType, which concerns the serialisation of the data and the way the dataset is delivered. This is why we propose to add the optional property dct:type to dcat:Dataset with the following description: "this property indicates the type of knowledge and knowledge organisation provided by the dataset".
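To make the separation of the three facets concrete, here is a minimal Python sketch. It rests on several assumptions: DatasetRecord and type_problems are hypothetical names, the term list is just the set of examples given above (no controlled vocabulary exists yet), and the media type is flattened onto the dataset record for brevity even though DCAT attaches it to the Distribution.

```python
from dataclasses import dataclass
from typing import List, Optional

# Placeholder terms copied from the examples in this comment.
# This is NOT an agreed controlled vocabulary.
DATASET_TYPES = {
    "statistical data",
    "directory",
    "controlled vocabulary",
    "schema",
    "knowledge base",
}


@dataclass
class DatasetRecord:
    title: str
    theme: str                          # domain of the dataset (Theme/Subject)
    media_type: str                     # serialisation (format/mediaType), simplified
    dataset_type: Optional[str] = None  # proposed dct:type, optional


def type_problems(record: DatasetRecord) -> List[str]:
    """Return a list of problems with the optional dataset type (empty if fine)."""
    if record.dataset_type is None:
        return []  # the property is optional, so absence is not an error
    if record.dataset_type.lower() not in DATASET_TYPES:
        return [f"unknown dataset type: {record.dataset_type!r}"]
    return []


ok = DatasetRecord("Country codes", "government", "text/csv", "Controlled vocabulary")
untyped = DatasetRecord("Air quality 2014", "environment", "text/csv")
bad = DatasetRecord("Holiday photos", "culture", "image/jpeg", "photo album")

for record in (ok, untyped, bad):
    print(record.title, type_problems(record))
```

The first two records share a media type but differ in theme and in type, which illustrates that the three fields vary independently.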

The Publications Office is willing to maintain a controlled vocabulary for this property.

Fri, 17/04/2015 - 14:26

Do not let the language lawyers win!! They can make the words "data" and "dataset" mean all sorts of things, but data catalogs will lose their value if they are not stuffed with *real* data. We must fight to keep out stuff that is only loosely related. It's the basic principle of doing one thing well, or keeping it tightly cohesive and loosely coupled. Or the whole project will be devalued.

 

We all know data when we see it. Saying that all PDFs and images can be included in the definition is not helpful. Of course a PDF with a table of numbers is often data, or a weather satellite image is data, but most PDFs and images are not and we must exclude them. Let's not help legitimize all these grey-area fringes like 'images' as 'data' by offering them their own category in our DCAT-AP.

 

I see the importance of listing vocabularies and code lists, but let's list them separately, because they are rather different beasts from datasets. They help describe and understand data, and they help re-use and comparison. It would be stupid to put them in the same pool as datasets. Whilst I agree there are some datasets which are also code lists or reference data, these edge cases should not distort the discussion.

@David, I do not agree that a code list is not a dataset (cf. the definition of Dataset in DCAT). For example, a list of all recognised countries or currencies is, IMHO, a dataset of very high value.

@Nicos I would regard that as 'reference data', along with lists of doctors, airports etc, although it is technically also a 'code list' because it is a set of terms. I think it is important to differentiate though, since we want reference data in our data catalogs, but not pure codelists. There is a bit of a grey area, but I don't think we should distort our catalogs. Are you really arguing for the inclusion of codelists en masse?

 

I mentioned the meteorological codelist - completely contrived for a particular purpose and it is simply not data. I recently worked on a huge school data vocabulary which had hundreds of codelists for things like how you record a child's reason for absence - this is not data. That sort of stuff will fill our data catalogs if we go down that road. The "Boolean codelist" which contains two values, True and False - that would be ridiculous in a data catalog - we would be laughing stocks for including it as a "dataset". No, let's differentiate data and metadata and draw a line about what is a dataset. Yes, there is a tricky grey area, but we can use our sense.

 

The one potential type that I think might be usefully highlighted is statistical data. And the reason is to exclude it - by virtue of it being a summary rather than the records themselves, it tends to exclude a whole category of analysis. But it's not something I've seen anyone else ask for, and I would rather not introduce dct:type just for this minor thing.

 

So I challenge this group to justify encouraging a wide, diffuse definition of data. And if we're comfortable with that, then perhaps we can have some examples of how users benefit from knowing that a dataset is of a specific dct:type.

Sun, 21/06/2015 - 11:38

The working group has not succeeded in creating a list of types over the past 2 months. Nevertheless, the working group agreed in its meeting of 10 June 2015 that a property dct:type on Dataset will be included in the draft for public review, with a remark that a controlled vocabulary will be established later.
