The DCAT Application Profile for data portals in Europe (DCAT-AP) is the result of European efforts towards standards-based harmonisation of dataset and data catalogue specifications. It aims to create a European data ecosystem by increasing the discoverability and reuse of open government data and improving the interoperability of data portals. Achieving this would help public administrations fulfil their transparency goals, reduce barriers to cross-border business and allow consumers to develop innovative solutions based on data.
Today, almost 900,000 datasets from all over Europe are available for exploration and reuse by the public through a single point of access, the European Data Portal. The portal automatically harvests datasets from 77 European data catalogues. They are harmonised and linked together by a common cross-domain metadata language, DCAT-AP. The use of DCAT-AP prevents the coexistence of disconnected information islands of European data and the fragmentation of the open data landscape.
Moreover, 12 EU countries have implemented the DCAT Application Profile, using DCAT-AP natively, with extensions, in their open data portals, while most of the other catalogues are DCAT-AP compliant.
DCAT-AP has been further extended by GeoDCAT-AP for describing geospatial datasets, dataset series and services. Another extension, StatDCAT-AP, aims at delivering specifications and tools that enhance interoperability between descriptions of statistical datasets within the statistical domain and between statistical data and open data portals.
By complying with DCAT-AP and its extensions, data publishers automatically comply with other existing international standards and specifications such as INSPIRE, ISO 19115, SDMX and Schema.org. The last of these makes datasets easier to discover through search engines such as Google Dataset Search.
Using DCAT-AP for the data of the EU institutions and agencies - the EU Open Data Portal’s point of view
DCAT-AP is a reference metadata standard for open data of the EU institutions, agencies and bodies as published on the EU Open Data Portal. The portal, created by Commission Decision 2011/833/EU, is the central point of access to over 13,500 datasets coming from 79 institutional data providers.
Describing multidisciplinary datasets from authoritative European Union sources comes with a number of high-level requirements.
From the point of view of data publishers, the model should be flexible enough to accommodate cross-domain datasets by providing a framework of elements common to all metadata records.
From the point of view of data consumers, the model should cover the essential elements that allow them to understand how the data has been produced and processed, what its quality is, what related data exist, how it can be reused and what limitations apply to that reuse.
DCAT-AP, a specification based on the Data Catalog Vocabulary (DCAT) developed by W3C, makes use of various well-known vocabularies. As a result, it brings together the essential elements for describing datasets and their distributions and covers multiple needs. Moreover, because it is not only a metadata standard but an application profile, DCAT-AP adds constraints, such as a minimum set of required metadata fields and the use of controlled vocabularies in dataset descriptions. This is necessary for ensuring further harmonisation, interoperability and high quality of metadata.
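The idea of a minimum set of required fields can be illustrated with a small check. The following is only a sketch: the mandatory properties listed are an assumption based on commonly cited DCAT-AP releases and should be verified against the release being targeted, and the record layout and the function name missing_mandatory are hypothetical.

```python
# Mandatory properties per class.
# ASSUMPTION: taken from commonly cited DCAT-AP releases; verify against
# the release you target before relying on this table.
MANDATORY = {
    "dcat:Dataset": ["dct:title", "dct:description"],
    "dcat:Distribution": ["dcat:accessURL"],
}


def missing_mandatory(record: dict) -> list:
    """Return the mandatory properties that are absent or empty in a record."""
    required = MANDATORY.get(record.get("@type", ""), [])
    return [prop for prop in required if not record.get(prop)]


# Hypothetical record in which dct:description is deliberately left out.
dataset = {
    "@type": "dcat:Dataset",
    "dct:title": {"en": "Example dataset"},
}

print(missing_mandatory(dataset))  # ['dct:description']
```

A real portal would more likely express such constraints declaratively, but the principle is the same: a record is accepted only once every required field is present.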
The EU Open Data Portal makes extensive use of controlled vocabularies, referring to 19 lists of multilingual terms, translated into 24 languages and available on the EU Vocabularies site of the Publications Office of the EU. Their use helps strengthen the multilingualism of the EU institutions and supports the requirement to provide dataset descriptions, where possible, in all official EU languages, including multilingual search facilities. It also raises the general quality of metadata by avoiding misspellings, typos and different ways of expressing the same terms, and eases the automatic annotation of datasets.
Furthermore, DCAT-AP recommends a list of data themes for classifying datasets by high-level topic. Grouping datasets under these generic themes allows the EU Open Data Portal to offer an alternative way of exploring them and improves their discoverability through browsing. The “controlled” keywords from the multilingual thesaurus EuroVoc serve a similar function at a more granular level. The EU Open Data Portal annotates datasets with the EuroVoc micro thesaurus in 23 languages, which consists of over 6,000 terms tailored for EU content. The 15 existing alignments between EuroVoc and a number of well-known domain-specific thesauri, such as INSPIRE, Agrovoc, Gemet, Unesco and others, make it easy to transform annotations that exist at the data providers’ premises into the EU Open Data Portal’s EuroVoc keywords. DCAT-AP is also flexible enough to accommodate a combination of multiple thesauri and vocabularies. Therefore, the EU Open Data Portal envisages using some of the above-mentioned thesauri natively to enrich annotations, targeting professional re-users accustomed to domain-specific terminology.
An important requirement, also mentioned in the Commission reuse decision, is to offer metadata and data in machine-readable formats. DCAT-AP is expressed in RDF, the linked open data format. This makes metadata machine-readable and, because resolvable Uniform Resource Identifiers (URIs) are expected for naming the elements, also linkable to other metadata and data, potentially building a web of interlinked datasets. All datasets published on the EU Open Data Portal are assigned a persistent URI from the URI registry of EU institutions and bodies operated by the Publications Office. This approach ensures that resources can be linked and that metadata records remain available in the long term, even if the existing infrastructure is migrated.
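To make the role of URIs more concrete, here is a minimal sketch of a dataset description serialised as JSON-LD, one of the RDF syntaxes. The dcat and dct namespace URIs are the published ones; the dataset URIs under example.org, the titles and the helper describe_dataset are hypothetical and do not reflect the portal's actual URI scheme.

```python
import json

# Published namespace URIs for DCAT and Dublin Core terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(uri, title, related=()):
    """Build a JSON-LD node for a dataset identified by a URI."""
    node = {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": {"@value": title, "@language": "en"},
    }
    if related:
        # Links are plain URI references, so they can point at any other
        # record, including records held in a different catalogue.
        node["dct:relation"] = [{"@id": r} for r in related]
    return node


record = describe_dataset(
    "http://example.org/dataset/air-quality",  # hypothetical URI
    "Air quality measurements",
    related=["http://example.org/dataset/monitoring-stations"],
)
print(json.dumps(record, indent=2))
```

Because the dataset and the resource it relates to are both named by URIs, the link survives a change of storage or software as long as the URIs themselves stay stable.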
The EU Open Data Portal stores all of its datasets in an open source linked data store. It exposes the catalogue metadata through a graphical user interface for human consumption and, in RDF format, through an API and a SPARQL endpoint for machine access. RDF is a flexible format that allows metadata to be exported into various other formats, such as CSV, Excel, JSON and others. Moreover, managing datasets in the linked open data format makes the EU Open Data Portal independent of any specific vendor or software. The model can be modified or adjusted just by manipulating the data itself, without additional coding.
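As an illustration of machine access and export, the sketch below converts a response in the SPARQL 1.1 Query Results JSON Format into CSV using only the Python standard library. The query is shown for context and is not executed; the sample response is made up, and the function name sparql_json_to_csv is hypothetical. No assumption is made here about the portal's actual endpoint address.

```python
import csv
import io
import json

# A query that could be sent to a SPARQL endpoint (not executed here).
QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset .
  OPTIONAL { ?dataset dct:title ?title }
} LIMIT 10
"""

# Made-up response in the SPARQL 1.1 Query Results JSON Format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["dataset", "title"]},
    "results": {"bindings": [
        {
            "dataset": {"type": "uri", "value": "http://example.org/dataset/1"},
            "title": {"type": "literal", "xml:lang": "en",
                      "value": "Example dataset one"},
        },
        {
            # No title is bound for this row.
            "dataset": {"type": "uri", "value": "http://example.org/dataset/2"},
        },
    ]},
})


def sparql_json_to_csv(payload: str) -> str:
    """Flatten SPARQL JSON results into CSV text, one row per binding."""
    data = json.loads(payload)
    columns = data["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for binding in data["results"]["bindings"]:
        # Unbound variables are absent from a binding; emit an empty cell.
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()


csv_text = sparql_json_to_csv(SAMPLE_RESPONSE)
print(csv_text)
```

The conversion drops the type and language information carried by each binding, which is acceptable for a spreadsheet export but is the kind of detail that RDF itself preserves.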
The full added value of using DCAT-AP can be unleashed in the linked data context. The revised version of DCAT will propose even more means for making datasets easier to find and reuse. For instance, it will propose more semantics to describe different relation types between datasets. Using these semantics to interlink datasets will allow the EU Open Data Portal to enhance its services, for example by suggesting related datasets and content to users.
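How typed relations could drive such suggestions can be sketched with a handful of triples. This is an illustration only: it uses two existing Dublin Core properties (dct:relation and dct:isVersionOf) rather than whatever relation types the revised DCAT defines, and the dataset identifiers and the function related_datasets are hypothetical.

```python
# Metadata as (subject, predicate, object) triples with prefixed names.
# All "ex:" identifiers are hypothetical.
TRIPLES = [
    ("ex:budget-2018", "dct:isVersionOf", "ex:budget"),
    ("ex:budget-2017", "dct:isVersionOf", "ex:budget"),
    ("ex:budget-2018", "dct:relation", "ex:procurement-2018"),
    ("ex:budget-2018", "dct:title", "Budget 2018"),
]

# Predicates treated as dataset-to-dataset relations in this sketch.
RELATION_PREDICATES = {"dct:relation", "dct:isVersionOf"}


def related_datasets(uri, triples=TRIPLES):
    """Return (relation, other dataset) pairs linked to `uri` in either direction."""
    suggestions = []
    for subject, predicate, obj in triples:
        if predicate not in RELATION_PREDICATES:
            continue
        if subject == uri:
            suggestions.append((predicate, obj))
        elif obj == uri:
            suggestions.append((predicate, subject))
    return suggestions


print(related_datasets("ex:budget-2018"))
# [('dct:isVersionOf', 'ex:budget'), ('dct:relation', 'ex:procurement-2018')]
```

Keeping the relation type alongside each suggestion lets an interface distinguish, say, "other versions" from "related content" instead of presenting one undifferentiated list.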
The DCAT-AP standard can become a tool for internal information governance purposes
Despite all the advantages of a common metadata standard, many European governmental applications and databases still operate in silos, although a successful data ecosystem needs seamless sharing and exchange of data across governmental organisations.
Governments managing public, restricted and non-public datasets could make more extensive use of DCAT-AP to annotate their repositories and data collections. This would increase government-to-government reuse and would help to achieve the once-only principle. As a result, governmental data would become interoperable inside organisations, and eventually a higher number and a better quality of open government datasets would be available to the public.
If internal governmental systems become more interoperable by using commonly accepted data and metadata definitions for their data collections, the data portals, which still have to rely on many manual or semi-automatic mappings and transformations, will be able to pull the datasets automatically. In the long term, this is the only sustainable way of keeping them alive.
To achieve this broader uptake of DCAT-AP, governments would need, apart from political will, a better overview of the existing tools. It might also be helpful to have a set of new tools for extracting metadata automatically into DCAT-AP, for implementing and managing DCAT-AP as a native standard, or for facilitating mappings with the native metadata standards already in place in organisations.
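A mapping tool of the kind described could, in its simplest form, look like the sketch below. The native field names (name, summary, tags, owner, internal_code), the mapping table and the function to_dcat_ap are all hypothetical; the target properties are existing DCAT and Dublin Core terms. A real mapping would also have to handle value conversion, languages and controlled vocabularies, which are left out here.

```python
# Hypothetical mapping from an organisation's native field names
# to DCAT / Dublin Core properties.
FIELD_MAP = {
    "name": "dct:title",
    "summary": "dct:description",
    "tags": "dcat:keyword",
    "owner": "dct:publisher",
}


def to_dcat_ap(native: dict):
    """Map a native record to DCAT-AP properties.

    Returns the mapped record and the list of native fields that have
    no mapping, so that gaps are reported rather than silently dropped.
    """
    mapped = {"@type": "dcat:Dataset"}
    unmapped = []
    for field, value in native.items():
        target = FIELD_MAP.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, unmapped


native_record = {
    "name": "Road accidents 2018",
    "summary": "Accidents recorded on national roads.",
    "tags": ["transport", "safety"],
    "internal_code": "RA-18",  # no DCAT-AP counterpart in this sketch
}

mapped, unmapped = to_dcat_ap(native_record)
print(mapped)
print(unmapped)  # ['internal_code']
```

Reporting unmapped fields explicitly gives the organisation a concrete list of where its native standard and DCAT-AP diverge.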