Requirements for Controlled Vocabularies

16/04/2013

Section 9 includes a table with several controlled vocabularies that MUST be used for the listed properties, but most of them are not dereferenceable (some don't have public URIs and others always return 404). This issue has also been raised in relation to EUROVOC https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/eurovo…, but IMO it should be extended to every required vocabulary.

We may need to establish some minimal requirements for a controlled vocabulary to be proposed as a MUST use, e.g.:

- Openly licensed.

- Dereferenceable.

- Operated/maintained by official EU bodies or institutions, or otherwise by a recognised standardisation body, etc. (see also https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/regist…).

- ... others?

Once we have the minimal requirements we may need to revisit this list and take the proper actions to replace, downgrade (not required) or try to improve several of the vocabularies. 
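
To make the "dereferenceable" requirement concrete, here is a minimal sketch of how one could test whether a term URI resolves. The helper names `classify` and `check_uri` are hypothetical, the `Accept` header is just one plausible choice, and nothing here is prescribed by the Application Profile.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status code (or None for no response) to a coarse verdict."""
    if status is None:
        return "unresolvable"
    if 200 <= status < 300:
        return "dereferenceable"
    if status == 404:
        return "not found"
    return "other"


def check_uri(uri, timeout=10):
    """Request a term URI, preferring RDF and accepting HTML as a fallback."""
    request = urllib.request.Request(
        uri,
        headers={"Accept": "text/turtle, application/rdf+xml, text/html;q=0.5"},
    )
    try:
        # urlopen follows redirects, so a 303 to a description still counts.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return classify(error.code)
    except (urllib.error.URLError, OSError):
        # No usable answer at all (DNS failure, refused connection, timeout).
        return classify(None)
```

Running `check_uri` over every term URI of a candidate vocabulary would give a rough picture of which vocabularies in the table currently meet this requirement.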

 

(See also https://joinup.ec.europa.eu/discussion/controlled-vocabulary-requirement-version)

 

 

Component: Documentation

Category: Controlled vocabulary

Comments

Wed, 17/04/2013 - 22:07

I agree with the criteria listed above (see also the 45+ CAMSS assessment criteria). Of course, the controlled vocabulary must in the first place suit our basic use case: to allow for a cross-portal search for governmental data sets. I think this also means that controlled vocabularies must meet the following basic requirements for search taxonomies:

  • Provide overview: categorise a huge collection of datasets (e.g. a metadata broker could easily collect more than 100K datasets from EU data portals) into a small number of (hierarchically structured) categories;
  • Easy to use: a small number of simple terms that everybody can understand and use as part of, for example, a faceted search;
  • Multilingual: ideally, the categories already have multilingual labels in all 23 official languages of the EU;
  • Complete: ideally each dataset can be categorised in at least one group (but no catch-all category should be allowed);
  • Mutually exclusive: ideally, no dataset should pertain to two groups;
  • Uniformly distributed: in a normal population of datasets, there are no categories that have a disproportionately large number of datasets;
  • ...
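
The "complete", "mutually exclusive" and "uniformly distributed" criteria above can be checked mechanically against a sample of categorised datasets. The following is only a sketch: the function name `taxonomy_report`, the dataset identifiers and the category names are all made up for illustration.

```python
from collections import Counter


def taxonomy_report(assignments):
    """Summarise a mapping of dataset id -> set of categories it is filed under."""
    counts = Counter(cat for cats in assignments.values() for cat in cats)
    total = sum(counts.values())
    return {
        # "Complete": these datasets fit no category at all.
        "uncategorised": sorted(d for d, cats in assignments.items() if not cats),
        # "Mutually exclusive": these datasets sit in more than one category.
        "multi_category": sorted(
            d for d, cats in assignments.items() if len(cats) > 1
        ),
        # "Uniformly distributed": share of assignments held by the biggest category.
        "largest_share": max(counts.values()) / total if total else 0.0,
    }


# Hypothetical sample data.
sample = {
    "ds1": {"transport"},
    "ds2": {"transport", "environment"},
    "ds3": set(),
    "ds4": {"health"},
}
report = taxonomy_report(sample)
```

What counts as a "disproportionately large" share is a judgement call that the thread does not settle, so the sketch only reports the number and leaves the threshold open.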

I hope that we will be able to reuse some of the controlled vocabularies that are currently used on open data portals. See also this ticket:

http://joinup.ec.europa.eu/discussion/which-vocabularies-are-used-your-data-portal

 

 

 

Thu, 18/04/2013 - 07:42

Version should be mandatory.

Mon, 22/04/2013 - 09:27

Could we clarify what "openly licensed" means? SKOS-HASSET is, I think, an important one to include, but it does require a licence, which can be obtained upon request (http://www.data-archive.ac.uk/find/hasset-thesaurus/hasset-licence). Here is the rationale behind the decision to license it: http://hassetukda.wordpress.com/2012/11/28/licensing-for-skos-hasset-wp…

 

 

Mon, 22/04/2013 - 11:47

Good question: what does openly licensed mean?

From a practical perspective the following three conditions should at least hold:

- anyone can search the vocabulary to find terms that they want to use in their instance metadata without the need to sign a licence agreement;

--> if it is difficult for people to find out what the terms are, then people will not use the vocabulary

- anyone can include the URIs of the terms in the vocabulary in instance metadata without restrictions and without the need to sign a licence agreement;

--> if there are restrictions for use of the URI, you can't use the terms

- anyone who receives such data should be able to follow the link to the term and get information about what the term means (including at least all labels and descriptions) again without signing licence agreements.

--> if you can't look up what a URI means, such a URI is useless.

 

 

Thu, 25/04/2013 - 21:37

The W3C Best Practices for Vocab Selection gives some useful selection criteria: http://www.w3.org/2011/gld/wiki/222_Best_Practices_for_Vocab_Selection. If I look at the controlled vocabularies we maintain in the Publications Office MDR (http://publications.europa.eu/mdr/authority/index.html), I think we meet the majority of the criteria, although there is still a lot of work to do:

  • Vocabularies should be self-descriptive: Not yet the case.
  • Vocabularies should be described in more than one language: Labels available in up to 24 official EU languages. Metadata definitions, comments and documentation in English.
  • Vocabulary reusability: A number of vocabularies are used in the data exchange between the European Institutions and the member states, the EU ODP, production systems at the Publications Office, ...
  • Vocabularies should be accessible for a long period: There is a long-term commitment from the Publications Office of the EU to maintain the vocabularies in the MDR.
  • Vocabularies should be published by a trusted group or organization: The MDR is maintained by the Publications Office of the EU. There is a metadata governance structure in place.
  • Vocabularies should have permanent URIs: The vocabularies in the MDR have permanent URIs that will not change; however, they are not dereferenceable yet. This is planned and should be available end of 2013/beginning of 2014.
  • Vocabularies should provide a versioning policy: New versions are regularly published with release notes, but only the latest version can be directly accessed through the MDR website. Previous versions are available in zip-packages. Direct access to versions should be available end of 2013/beginning of 2014.
  • Vocabularies should provide documentation: Work in progress.

For me the most important criteria would be the persistence of the URI, the stability of the organisation behind the vocabulary, and, given our European context, the multilingual aspect.

Sun, 28/04/2013 - 13:39

Here is a first attempt at summarising the requirements.

 

 

Controlled vocabularies should:
  • Have terms that are identified by URIs
  • Have term URIs that resolve to a description of the term
  • Have labels in multiple languages for their terms, ideally in all official languages of the European Union
  • Be operated and/or maintained by an institution of the European Union, or by recognised standards organisations
  • Be published under an open licence
  • Be maintained under publicly available persistence and versioning policies

 

Tue, 30/04/2013 - 09:38

I think that the most complete, mature and updated reference for this so far is again the Best Practices for Publishing Linked Data Document, which in fact is an evolution of the Best Practices for Vocab Selection document Willem has already pointed to.

If we have a look at the "Is your linked data vocabulary 5 star?" and the "Vocabulary selection criteria" sections, I would try to align as much as possible with them. More specifically, I think that the documentation requirement should also be included as a minimum one.

If we think in terms of the final user of the vocabulary, you can do little or nothing with a vocabulary that is not documented, and you are also prone to misuse it, with the corresponding consequences in terms of interoperability (note also that this is already the only "must" requirement in the BP document and also a 2-star level one).

With regards to the publisher requirement, I don't think it must be required to be an institution of the European Union, but just any trusted group or organization.

Wed, 01/05/2013 - 09:44

My second attempt:

 

 

 

Controlled vocabularies SHOULD:
  • Be published under an open licence
  • Be operated and/or maintained by an institution of the European Union, by a recognised standards organisation or another trusted organisation
  • Be properly documented
  • Have labels and descriptions in multiple languages, ideally in all official languages of the European Union
  • Contain a relatively small number of terms (e.g. 10-25) that are general enough to enable a wide range of resources to be classified
  • Have terms that are identified by URIs with each URI resolving to documentation about the term
  • Have associated persistence and versioning policies

These criteria do not intend to define a set of requirements for controlled vocabularies in general; they are only intended to be used for the selection of the controlled vocabularies that are proposed for this Application Profile.
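
As an illustration of how the list above could be applied when revisiting each vocabulary, here is a minimal checklist sketch. The class and field names are hypothetical shorthand for the seven criteria, and the example assessment is invented rather than a judgement on any real vocabulary.

```python
from dataclasses import dataclass, fields


@dataclass
class VocabularyAssessment:
    """One boolean per criterion in the list above (names are shorthand)."""
    open_licence: bool
    trusted_maintainer: bool
    documented: bool
    multilingual_labels: bool
    small_general_term_set: bool
    resolvable_term_uris: bool
    persistence_and_versioning_policy: bool


def unmet(assessment):
    """Return the names of the criteria that are not satisfied, in list order."""
    return [f.name for f in fields(assessment) if not getattr(assessment, f.name)]


# Invented example: a vocabulary lacking documentation and resolvable URIs.
example = VocabularyAssessment(
    open_licence=True,
    trusted_maintainer=True,
    documented=False,
    multilingual_labels=True,
    small_general_term_set=True,
    resolvable_term_uris=False,
    persistence_and_versioning_policy=True,
)
print(unmet(example))  # ['documented', 'resolvable_term_uris']
```

Since the criteria are SHOULDs, the output is meant as input for discussion rather than as an automatic pass/fail.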

Fri, 03/05/2013 - 07:18

I agree with the requirements, but the title of the section listing the vocabularies is somewhat misleading: 8.2. Proposed vocabularies => ... that MUST be used ...

So they are actually "Mandatory vocabularies" (MUST) ? (like "Mandatory properties" and "Mandatory classes" in the chapters 7.x)

Or only "Recommended" (= +/- Proposed) ?

Sat, 04/05/2013 - 17:13

I propose to change the section heading to "Controlled vocabularies to be used"

Sat, 04/05/2013 - 22:58

Thanks Carlos for the link to the Best Practices for Publishing Linked Data Document. The selection criteria mentioned are indeed a good reference.

I agree that documenting a vocabulary is essential, but when I look at the properties for which controlled vocabularies are suggested in section 8, I do not think that there will be many misunderstandings possible as to what a particular concept stands for (e.g. language, country, ...), so dereferencing is IMHO less critical than with other vocabularies.

I have some questions/remarks concerning the suggested use of controlled vocabularies in section 8:

1. dct:format/dcat:mediaType. I'm not sure I have understood the intended difference between file format and media type. An example under 7.3.2 would be useful.
2. dct:publisher. The suggested MDR Corporate bodies table covers only European institutions/bodies and some international organisations. A possible extension and the scope of this extension should be discussed.
3. By adding a DCAT profile "context" to the Named Authority Lists in the MDR, we could filter only those concepts that are relevant for the DCAT profile and thus get manageable controlled lists.

Sun, 05/05/2013 - 10:23

My view on your points:

1. In DCAT, there is a difference between the use of dct:format and dcat:mediaType. The latter "should be used when the media type of the distribution is defined in IANA, otherwise dct:format may be used with different values". As we are not using the IANA media types but the MDR NAL, which "is based on the IANA MIME Media type", maybe we should not be using dcat:mediaType?

 

2. Chapter 8 has a remark to reflect the fact that the MDR NAL only applies to European institutions. Is it realistic to assume that you can extend the list to include all organisations that are involved in publishing datasets?

 

3. We could indeed try to profile the controlled vocabularies to the concepts that are relevant for the application profile, but how would we go about identifying the subset?
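
On point 1, the DCAT rule quoted there boils down to a simple choice of property. The sketch below assumes that IANA-registered media types are identified by URIs under the prefix shown; that prefix, the function name `property_for` and the example MDR-style URI are illustrative assumptions and are not taken from the Application Profile.

```python
# Assumption: IANA media types are identified by URIs under this prefix.
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"


def property_for(format_uri):
    """Pick the property following the DCAT rule quoted in point 1."""
    if format_uri.startswith(IANA_PREFIX):
        return "dcat:mediaType"
    # Anything not defined in IANA falls back to dct:format.
    return "dct:format"


# Hypothetical URIs, used only to exercise both branches.
iana_example = IANA_PREFIX + "text/csv"
mdr_example = "http://example.org/mdr/file-type/FMX4"
```

Under this reading, values taken from the MDR file type list would end up under dct:format, which is the question raised in point 1.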

 

Sun, 05/05/2013 - 14:21

Here are some more ideas:

  • Controlled vocabularies should preferably be described in ADMS (as a recommendation rather than a requirement)
  • Controlled vocabularies provide more benefits if they are linked to other vocabularies
  • When selecting controlled vocabularies for a catalogue, choose well-known and already used ones first

Mon, 06/05/2013 - 22:02

We already have a good handful of requirements, but none of them is mandatory so far (should vs. must). We may also need to focus on defining a minimal set of required ones.

That is IMO especially important for those vocabularies required by the AP (table at 8.2; there is not much difference between "must be used" and the newly proposed "to be used"). Several of them don't fulfil the minimal desirable requirements (URIs that don't dereference, for both human and machine versions, or that don't even resolve; lack of documentation; etc.). I don't know whether all that is planned to change in the short term but, honestly, as of today that doesn't look like a portfolio of recommended best practices.

Tue, 07/05/2013 - 06:56

Carlos, the "to be used" was the proposed alternative for the section heading "proposed vocabularies". It was noted by Bart that the section heading did not match the MUST that the section contains.

As to the requirements, I agree that we may not be able to find vocabularies that satisfy all requirements. We will need to determine what in the practical circumstances is the best we can do.

 

Tue, 07/05/2013 - 17:36

Makx, thanks for the clarifications ...

1. When we created the File types table, we were not aware of the existence of stable URIs for IANA media types, and we have some file types that do not exist in IANA (Formex). So this table started as a table for internal purposes, but as mentioned in https://joinup.ec.europa.eu/discussion/registers-operated-ops-mdr-and-dcat-ap, we can add concepts upon request.

2. I must have worked on a previous version. I had not seen the remark, so forget about my question.

3. We have an attribute "use.context" where we indicate in which context a particular concept is used. We could add a value for the DCAT application profile (e.g. "DCAT_AP"). This would allow selecting only the relevant concepts. To be discussed ...
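
The filtering idea in point 3 could look roughly like the sketch below. The record layout, the key name `use_context` and the sample codes are hypothetical stand-ins; the real MDR tables and their "use.context" attribute are not reproduced here.

```python
# Hypothetical records standing in for concepts of a Named Authority List.
concepts = [
    {"code": "ENG", "use_context": ["DCAT_AP", "OTHER"]},
    {"code": "LAT", "use_context": ["OTHER"]},
    {"code": "FRA", "use_context": ["DCAT_AP"]},
]


def select_for_context(records, context):
    """Keep only the concepts flagged for the given context value."""
    return [r for r in records if context in r.get("use_context", [])]


dcat_concepts = select_for_context(concepts, "DCAT_AP")
print([c["code"] for c in dcat_concepts])  # ['ENG', 'FRA']
```

The open question from the earlier comment remains how the "DCAT_AP" flag would be assigned in the first place; the sketch only shows the selection step once it exists.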

Thu, 01/08/2013 - 15:30
