MO6 - Add new class :TranslatedText with properties for language, method and quality

Anonymous (not verified)

Published on: 10/03/2015 Discussion Archived

Description

From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-March/000125.html

In multilingual environments, usually the translated parts of the metadata are presented to the user. It might be important for the user to know whether a text was translated by a machine or by a real person. If there is a way to measure the quality of a translation, this could be useful information as well.

Proposed solution

Add new class :TranslatedText with properties for language, method and quality

Component

Code

Comments

Makx DEKKERS Tue, 24/03/2015 - 20:51

From the request, it is not clear what the proposed solution would be. It seems that the proposal is to create a new datatype as a subclass of rdfs:Literal. The proposal could be to (a) apply this new datatype to all textual data in existing properties, which might compromise interoperability with other DCAT implementations that will commonly use simple language-tagged string values, or (b) create new properties such as :translatedTitle and :translatedDescription that use the new datatype.

Andrea PEREGO Tue, 24/03/2015 - 23:50

If I understand the use case correctly, the objective is to be able to attach text translations, along with information on their provenance.

If this is the case, an option would be to use the Open Annotation ontology and/or the PROV ontology.

About how to specify the "quality" of the translation, I guess this would be very much dependent on the evaluation methodology.

Bert VAN NUFFELEN Tue, 07/04/2015 - 16:43

Hi,

the question for me is whether this kind of aspect is part of the "core" of DCAT-AP or whether it is part of an extension or a need for a specific kind of Open Data portal (i.e. harvesting portals that are not the prime source of the dataset description).

In general, this provenance aspect can be applied to each field: for instance, a description can be manually entered or be created from a legacy database.

We should reflect on whether it is our objective to include provenance tracking of the metadata as part of the DCAT-AP specification or whether this is left to each implementation. For instance, today there is some global tracking through the notion of the catalog record, in which the modified date indicates when some metadata has been changed. If translation predicates are added, the modified date of the catalog record possibly changes each time the automated translation is executed.

I have the feeling that the request is aimed too much at a single present-day use case rather than at a common situation.

A possible solution in the middle would be to add an optional property "AutomaticTranslatedLanguages" at the catalog level. That can indicate that texts in the mentioned languages may be the outcome of an automated translation process.

In this way consumers of the catalog information are informed that there is automated translation taking place in the catalog.

Another option is to follow the same approach as SKOS which defined SKOS-XL. They created an extension to SKOS that objectified each "string". Using that mechanism they added additional provenance to a value.

Anonymous (not verified) Thu, 09/04/2015 - 11:09

The use case was simply that we have to develop the pan-European open data portal, which is highly multilingual. Additionally, the Commission has made clear that we have to indicate on the frontend (and I guess via API/SPARQL as well) whether a text was translated by a machine or not. In any case, we end up having to know, for each property that can be translated, whether it was translated by a human or by a machine. Hell, we even have to know it for each individual language of each property. In our original change request we naively thought literals could have additional attributes beyond lang and type. But that's not true. Actually, I don't see a solution except defining a new class, but this was not our intention.

Anonymous (not verified) Tue, 14/04/2015 - 18:08

I think there are more use cases that need this property.

In the EU ODP we intend to integrate machine translation of the title, description (on the level of dataset and distribution) and keywords. As we have some metadata that are translated by DGT (high quality translation), we definitely need to distinguish them from the metadata that will be translated automatically. Therefore, an optional property as proposed would be very useful. The solution proposed by Bert ("AutomaticTranslatedLanguages") seems to be reasonable and should not complicate the model.

The other use cases are the national portals: many of them are bilingual (official language of the country plus usually English). As the translation is costly they might want to integrate an automatic translation system for their metadata, too. In all cases like that there is a need to inform users that the translation is done by machines.

Makx DEKKERS Tue, 14/04/2015 - 20:22

If I understand correctly, the 'SKOS extension' approach and the Open Annotation approach are similar as they both 'objectify' Literals; SKOS by allowing relations between labels and the Open Annotation by creating a class cnt:ContentAsText. PROV-O would need to create an Entity for the text.

The proposal for a property on the Catalog level is simpler, but it is not very clear whether the relatively vague semantics would help. Bert suggests "the mentioned languages may be the outcome of an automated translation process" and "informed that there is automated translation taking place", but what could a consumer (machine or human) do with that information? It does not say anything about a particular piece of text you're looking at. Some text in a particular language could have been original, some translated by a human, some automatically translated. This might actually confuse the user.

I am just wondering whether the use case is not mostly local, i.e. a catalogue in Spain harvests metadata from Finland and applies machine translation from Finnish to Spanish, and then needs to tell its human users that a particular piece of text was translated by a machine. This can (and maybe should) be done entirely outside of DCAT in some local way -- including applying SKOS-XL, Open Annotation or PROV-O to every piece of text. You can do locally whatever you want.

The question is whether we really need to consider what happens if a portal in, say, Germany harvests the data from Spain, and gets a mix of original and translated Spanish?

Bart HANSSENS Thu, 16/04/2015 - 12:07

IMHO it is a bit overkill...

As Makx mentioned, I don't think the user actually benefits from this information, and if needed this can/should be done outside the DCAT exchange.

Anonymous (not verified) Thu, 16/04/2015 - 17:21

Well, the information about the translation method is more a kind of meta-meta information and therefore should probably be located in the catalog record (I don't know how to achieve that). But it says something about the quality of the metadata and whether the user can consider the translation as "correct" or should handle it with care. If it is only manageable outside DCAT-AP, I can only indicate translations that I made, and not translations made by others.

I still think it is valuable information for the end user, especially if the user is an app developer and has to present the metadata (or parts of it) to someone else. Remember, app developers are a big target group of open data in general.

Anonymous (not verified) Mon, 27/04/2015 - 12:19

I have been asked to provide a solution. I looked at the annotation OWL, but to be honest, I didn't understand it well. Annotation in general would be a good solution in my opinion, but I cannot present a solution based on it. The same goes for the PROV ontology.

Besides that, I see three approaches:

1. Using, for any related Literal, another subclass which defines the necessary properties. I know that many other domains use this approach, as Literal is very often not sufficient. But I don't see that DCAT-AP can go this way without breaking compliance with DCAT.

2. Using a specific predicate for each property and each language, e.g. dcat:title_in_english_is_automatic_translated (of course, it could be shorter). This would result in a huge list of additional properties. Very ugly.

3. Probably the easiest way would be to use special language indicators like lang='en_auto'. Although it would not break DCAT or RDF, it weakens discoverability.

I am not a real expert in ontologies, so any help would be highly appreciated, if there is still a commitment that we should honor this change request ;-)

Makx DEKKERS Mon, 27/04/2015 - 16:57

Simon, option 1 breaks DCAT, option 2 is indeed very ugly.

Option 3 can be done in a standard way. Language tags to be used with rdfs:Literal are defined by BCP47, which allows the use of the "t" extension for text transformations defined in RFC6497 with the field "t0" indicating a machine translation.

A tag will look like: "en-t-es-t0-abcd", which conveys the information that the string is in English, translated from Spanish by machine translation using a tool named "abcd".

This would still be interoperable with applications that do not care whether text is translated; these would still be able to understand that the text is in English. Applications that care about translations would understand that the English was translated from Spanish, and applications that care about how the translation was done would know that the English was translated from Spanish using a tool "abcd".
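The three levels of consumer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names describe_tag and is_machine_translated are hypothetical, and the pattern handles only tags of the simple shape shown in this thread (a bare language subtag, optionally followed by "-t-" plus a source language and a "t0" field). It is not a full BCP47 parser.

```python
import re

# Simplified pattern for tags shaped like "en-t-es-t0-abcd".
# Assumption: only bare 2-3 letter language subtags are supported;
# scripts, regions, variants and other extensions are not handled.
T_EXTENSION = re.compile(
    r"^(?P<lang>[a-z]{2,3})"                  # language of the text itself
    r"(?:-t-(?P<source>[a-z]{2,3})"           # optional: source language
    r"(?:-t0-(?P<tool>[a-z0-9]{3,8}))?)?$",   # optional: translation tool
    re.IGNORECASE,
)


def describe_tag(tag):
    """Return the text language, source language and tool of a tag."""
    match = T_EXTENSION.match(tag)
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    # Language tags are case-insensitive, so normalise to lower case.
    return {key: value.lower() if value else None
            for key, value in match.groupdict().items()}


def is_machine_translated(tag):
    """True when the tag carries a t0 field (hypothetical helper)."""
    return describe_tag(tag)["tool"] is not None


if __name__ == "__main__":
    print(describe_tag("en-t-es-t0-abcd"))
    print(is_machine_translated("en"))
```

An application that does not care about translation would read only the "lang" entry; one that does care can additionally inspect "source" and "tool".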

Anonymous (not verified) Tue, 28/04/2015 - 09:28

Thanks a lot, Makx. Exactly what we are looking for, and I definitely vote for this solution.

Anonymous (not verified) Wed, 29/04/2015 - 15:40

Thank you Makx.

The proposed solution sounds good to us, too.

Regards,

Agnieszka

Makx DEKKERS Sun, 03/05/2015 - 19:35

This solution is included in the section on multilinguality in Draft 3.

Makx DEKKERS Sun, 07/06/2015 - 12:34
