
Review of DCAT AP - Draft proposal of the list for the categorization of data

Anonymous (not verified)
Published on: 17/03/2015 Last update: 05/10/2017 Document Archived

The file contains the mapping of Eurovoc domains (21 top level categories) with the lists used for categorization of data by Eurostat, the open data portals of some Member States and international organizations.

The objective is to create a new list that could be used as a recommendation of the DCAT-AP for data portals on the basis of the experience gained so far.

The file also contains two proposals for the new list: one shorter and one longer.

It's a first draft open for discussion and/or complementary input.

Nature of documentation: Other

Categorisation

Type of document
Document
Licence
European Union Public Licence, Version 1.1 or later (EUPL) 

Comments

Anonymous (not verified) Tue, 24/03/2015 - 17:17

I just wanted to say that we very much appreciate this effort and that the themes suggested by the DataTank software (http://ns.thedatatank.com/dcat/themes) were largely derived from the themes used in the City of Ghent (http://data.gent.be). These are based on the organisational structure of the City. I therefore suspect (hope) that this may be in line with the needs of other local governments as well.

The only themes missing from the long list are Sports (categorised under Health) and Events (categorised under Culture), which may be indicative of typical local government services.

Kind regards,
Thimo

 

 
Anonymous (not verified) Wed, 25/03/2015 - 16:54

It's great to have this grid and lots of really good ideas for categories are in here. I want to give my thoughts based on experience designing the categories for data.gov.uk, and focussing not on the good bits, but on the bits that I'd look at changing.

 

I think the primary use case for showing categories is:

 

"to offer a choice of subjects to a general user browsing the site" (which is less intimidating than knowing what to search for)

 

However, there is a tendency to come at it from the opposite angle - "we have this particular bunch of datasets that don't fit neatly in an existing category, so let's invent a new category".

 

Clearly you need to find the natural groupings, avoiding splitting categories when there is a lot of overlap. And avoid the temptation to have categories that don't have many datasets - small ones are basically pointless, and clutter the choice. And may I suggest it is ok to *not* categorize a small proportion of outliers too, ones that really don't fit.

 

From a UX point of view it is bad to have lots of categories on offer without really justifying them. The user doesn't want to be offered more than 12 I'd say. 17 is pushing it and 22 seems utterly unwieldy and bureaucratic.

 

"Communication" - This seems very weak and isn't very popular among data portals. It also doesn't seem related to education, so it seems weird to include its name as part of the category name. Europa's Audiovisual & Media section says it is about "film, TV, video, radio"; surely there is not much government data on this, and it would be better in the "culture" section? Europa's Multilingualism is about "linguistic diversity", which can be measured and categorized under culture or society, and "learning of languages", which sounds like it might have only a few datasets and sits in Education. CESSDA says it is "advertising", which again seems very niche in terms of public data and might be better in "culture" or "business"; "information society" would be better under "technology"; "language and linguistics" again is "culture" or "education"; and "mass media" would be better under "culture".

 

"Transports" - I don't think this word makes sense with the 's' - can you call it "Transport"?

 

"Economy & Finance" - This (with Business) is one of data.gov.uk's smallest categories (10th out of 12), so I don't see the need to split Economy from Finance. Finance is just one part (although important) of the economy.

 

"Trade, industry, services" - This seems a very similar topic to Economy - why not merge them? They are both small - data.gov (US) has Business and Finance ranked 11th and 12th out of 16 topics. I imagine there is plenty of overlap too.

 

"Population and social conditions" and "Employment, job, consumer" - I think there is quite a lot of overlap so there's an argument to merge them. They both are measuring people's lives - their job/poverty, housing conditions/costs, income/outgoings are all so linked. Whilst the majority of datasets in this area do categorize between the two ok, a reasonable proportion are borderline and need some thought - e.g. census data, hours of work, earnings, housing need, work force locations vs job locations. However, data.gov.uk did merge these two categories and found it ended up pretty large (4th out of 12), so maybe that is an argument against.
 

"Research, innovation, technology" - Why is "research" named first? There is only a tiny amount of data on research so far on portals. Can you justify this? Why not just call it "Science and Technology"? Have you got a sample of the research datasets that we can see, to judge if it is broadly categorizable under "Science"?

 

"Geospatial information" - Sorry, but on the face of it it seems to be nonsense... - geospatial isn't a subject, it is a property of data on any subject, so shouldn't be on this list. e.g. If you have property price data, you might put it in "Cities", but if you then break it down by region, does it suddenly move into the "Geospatial" category? Clearly not.

 

"Regions; Cities" - This has some oddities in it in the spreadsheet - "Policy" from Europa and "statistics" from Estat seem much too broad to include - I guess you'd include some things from those, but the spreadsheet doesn't say which. I think you need to say that this category is centred around, say, housing, land use, industrial facilities and building. And do you include land ownership? General-purpose maps? Rural classifications? Voting regions? Address data? Hill contours & bathymetric maps? Aerial photographs?
 

 

BTW, the publicdata.eu categories on this spreadsheet are based on some very old ones from the UK, France and Italy that I believe none of those countries use any more. Indeed, less than 2% of the datasets use these categories. So I'd suggest you remove that from your spreadsheet.
 

Anonymous (not verified) Wed, 15/04/2015 - 17:56

Dear David and Thimo,

Thank you for the feedback and detailed analysis of the proposal. It was a first draft so we are very happy to receive such detailed comments.

Based on them we reviewed list "A" in the Excel sheet, changing some names of the categories and eliminating those that did indeed seem to overlap. The list has been modified by:

- adding "Culture and sport" to "Education"

- changing name of "Research, innovation, technology" to "Science and technology"

- removing "Trade, industry, services" (in the new proposal being part of "Economy and finance")

- removing "Employment, job, consumer" (in the new proposal being part of "Population and social conditions")

- removing "Geospatial information", which is indeed a type of data rather than a topic

- splitting "Government, Public sector" from "Regions, Cities"

 

Following your comments we would then propose a list containing 13 categories:

1. Agriculture, fisheries, forestry, food

2. Education, culture and sport

3. Environment

4. Energy

5. Transport

6. Science and technology

7. Economy and finance

8. Population and social conditions

9. Health

10. Government, public sector

11. Regions, cities

12. Justice, legal system, public safety

13. International issues

 

Further feedback and comments on the revised list are welcome.

 

Regards,

Agnieszka

Bart HANSSENS Thu, 06/08/2015 - 17:47

Nice list.

 

Some remarks though, or perhaps questions on how to map existing categories to this new list:

  • On the belgium.be portal, we also have a section on "Housing", but perhaps that could be part of 8. Population?
  • Likewise, would "Work" be considered part of 7. Economy?
  • Would "Tourism" be part of 11. Regions, or rather 7. Economy?
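
The mapping questions above can be written down as a small lookup table, which makes the undecided cases easy to spot. This is only a minimal sketch: the category names come from the 13-item list in this thread, while the `CANDIDATES` table, the helper names and the idea of flagging themes with more than one candidate are assumptions made for illustration, not an agreed mapping.

```python
# The 13 categories from the revised proposal, keyed by list number.
CATEGORIES = {
    1: "Agriculture, fisheries, forestry, food",
    2: "Education, culture and sport",
    3: "Environment",
    4: "Energy",
    5: "Transport",
    6: "Science and technology",
    7: "Economy and finance",
    8: "Population and social conditions",
    9: "Health",
    10: "Government, public sector",
    11: "Regions, cities",
    12: "Justice, legal system, public safety",
    13: "International issues",
}

# Hypothetical candidate mappings for local portal themes. These record
# the open questions raised above; they are not agreed decisions.
CANDIDATES = {
    "Housing": [8],
    "Work": [7],
    "Tourism": [11, 7],
}


def candidates_for(theme):
    """Return the names of the candidate categories for a local theme."""
    return [CATEGORIES[n] for n in CANDIDATES.get(theme, [])]


def is_ambiguous(theme):
    """True when a theme has no candidate or more than one."""
    return len(CANDIDATES.get(theme, [])) != 1


if __name__ == "__main__":
    for theme in CANDIDATES:
        flag = " (needs a decision)" if is_ambiguous(theme) else ""
        print(f"{theme}: {' or '.join(candidates_for(theme))}{flag}")
```

A theme that is missing from the table, or that lists two candidates such as "Tourism", is reported as needing a decision rather than being silently assigned.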

 

 

Anonymous (not verified) Fri, 17/04/2015 - 14:33

This was also extensively discussed while developing the first DCAT-AP version (see related issue at https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/invest…) where different member countries (Spain, Germany, Austria...) were already contributing their own approaches (see also https://docs.google.com/spreadsheet/ccc?key=0AtYBrl3GPikydEppeERJb2FxVD…), so I'm wondering why those are apparently not being taken into consideration for the current comparison document while others apparently less relevant for the work being developed here, such as the USA, are.

Could anybody please explain the criteria being followed for that? I really think that previous contributions from MSs that have been working on the issue for some time (and have already made their own implementations) should also be considered for this.

 

Anonymous (not verified) Mon, 20/04/2015 - 18:25

Carlos, Thanks for posting those links to the previous discussions. I think it is clear that none of the existing vocabularies looked at were/are any good for our purpose: not COFOG, NACE, nor the 22 Eurovoc domains that ended up in DCAT-AP-2013.

 

There was some talk / assumption about being able to use a hierarchy of hundreds of terms, to be able to group similar datasets from different countries. However it was pointed out that this was tremendously difficult and you need an excellent controlled vocabulary tree and expert people to classify datasets. Although this is a laudable aim, I just don't see it as feasible to look into with our resources right now. But I'd be interested to hear if there are any proponents for this at some point.

 

Hence this re-draft has aimed at getting right only a first step towards a hierarchical model, which is to classify datasets only on a simple top-level list of terms. This helps the user to see what sort of subjects are available and their relative size - which is a modest advantage. But let's not kid ourselves that by clicking on "Health" they are going to get much value from browsing through a list of 2,000 Health datasets - you can't do comparisons when a category contains such diverse data as heart disorder stats and doctor waiting times - that use case requires a hierarchy. By selecting Eurovoc terms for the names in this top-level list, we get onboard with a maintained framework with future potential for grafting in branches with multiple levels, and the bonus that the terms are already translated.

 

I scanned through the Austrian, Spanish and German categories on the Google doc and they look very similar to the 13 we have drafted, so I don't think we're far off. I'd be very pleased if more people can give input.

Hans OVERBEEK Fri, 24/04/2015 - 17:23

Presuming that the list of 13 categories in this discussion represents the current insight (2015-04-24), I think this is going the right way. This is a very workable list to which we can match our themes quite easily. There is one 'hot' theme, however, that I find difficult to match: "Migration and integration". It comes close to "International issues", but strangely enough "Migration and integration" is just the opposite of an international issue.

 

Any ideas?

Alessio DRAGONI Tue, 09/06/2015 - 18:04

The final 13 categories will have dereferenceable URIs and localized labels, right? Has it already been decided where these will be made available?
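
To make the question concrete, one category with a URI and language-tagged labels could be serialized as a SKOS concept in Turtle roughly as follows. This is a sketch under stated assumptions: the thread does not say where or in what form the list will be published, so the `example.org` namespace, the `transport` code, the `concept_turtle` helper and the sample labels are all hypothetical.

```python
# Hypothetical namespace: the thread does not say where the list will be published.
BASE_URI = "http://example.org/data-theme/"

# Standard SKOS namespace declaration, to be placed once at the top of the file.
PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def concept_turtle(code, labels):
    """Serialize one category as a SKOS concept with language-tagged labels.

    Assumes the labels contain no double quotes (no escaping is done).
    """
    if not labels:
        raise ValueError("at least one label is required")
    lines = [f"<{BASE_URI}{code}> a skos:Concept ;"]
    items = sorted(labels.items())
    for i, (lang, label) in enumerate(items):
        # The last statement ends with a full stop, the others with a semicolon.
        end = " ." if i == len(items) - 1 else " ;"
        lines.append(f'    skos:prefLabel "{label}"@{lang}{end}')
    return "\n".join(lines)


if __name__ == "__main__":
    print(PREFIX)
    # Illustrative labels only, not taken from an official translation.
    print(concept_turtle("transport", {"en": "Transport", "fr": "Transports", "de": "Verkehr"}))
```

Whether such URIs actually resolve depends entirely on where the list ends up being hosted, which is the open point in the question.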

Anonymous (not verified) Fri, 07/08/2015 - 17:45

Dear Mr Hanssens,

It's an important remark.

The Publications Office Eurovoc team will work on the mapping between the new list and the Eurovoc domains.

If a mapping is also required for another list (for example the one used by the Belgian portal), the list(s) of terms to be mapped should be submitted to OP. The list should be in one of the following formats: Excel, XML, RDF.

I think that the answers to your particular questions depend on the content of the datasets that are classified within the existing categories, so it would require knowledge of the current data stock. I don't work in the alignment team, but if you send us a request for alignment our colleagues can help with it.

Regards,

Agnieszka Zajac