Processing Components

AWS Batch

Doris uses AWS Batch to dynamically allocate computational resources based on the volume of feedback and the specific resource requirements of the submitted batch computing tasks. AWS Batch plans, schedules, and executes batch workloads within a queue. Each workload can be assigned a priority that defines its execution order within the allocated queue. The number of batch jobs running concurrently depends on the number of CPUs allocated to the queue. In theory, this parallelism allows large loads to be processed in a short period and gives high scalability: if the load increases, it is enough to increase the number of CPUs available. In practice, the limits of Doris computing lie in the services called from within the batch jobs.

AWS Batch jobs are generated from a preconfigured template called a job definition. The job definition specifies how the jobs need to run. Although each job must refer to a job definition, many parameters specified in the job definition can be overridden at execution time. Job definition parameters range from the number of CPUs, the container image and the IAM role to the retry policy and the Linux command.
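As an illustration of overriding job definition parameters at submission time, the following minimal sketch builds the arguments for the AWS Batch SubmitJob call. The queue, job definition, job name and command are hypothetical placeholders, not the actual Doris configuration.

```python
from typing import Any, Dict, List


def build_submit_job_request(
    job_name: str,
    job_queue: str,
    job_definition: str,
    command: List[str],
    vcpus: int = 2,
    attempts: int = 3,
) -> Dict[str, Any]:
    """Build keyword arguments for the AWS Batch SubmitJob call."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Values given here take precedence over the job definition defaults.
        "containerOverrides": {
            "command": command,
            "resourceRequirements": [{"type": "VCPU", "value": str(vcpus)}],
        },
        "retryStrategy": {"attempts": attempts},
    }


# All names below are hypothetical placeholders.
request = build_submit_job_request(
    job_name="process-feedback-chunk-0",
    job_queue="doris-processing-queue",
    job_definition="doris-feedback-processor",
    command=["python", "run_processor.py", "--chunk", "0"],
)

# With boto3 installed and AWS credentials configured, the request could be
# submitted with: boto3.client("batch").submit_job(**request)
```

Building the request separately from the submission keeps the override logic easy to inspect without AWS access.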

Furthermore, each batch job in Doris+ is set up to process 1000 feedbacks, and the process relies on AWS Comprehend, AWS Elasticsearch and eTranslation. The APIs of these three services are throttled, meaning that there is a limit to the number of calls per second. If this limit is not respected, the service returns an error or blocks the connection. To avoid these issues, Doris+ limits the number of calls per second and the number of batch jobs running concurrently. More specifically, the number of available CPUs per queue is limited to 10, with two CPUs per batch job, so at most five batch jobs run at the same time.
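The figures above can be worked through in a short sketch: splitting feedback identifiers into jobs of 1000, deriving the number of concurrent jobs from the CPU limits, and spacing out API calls. The `Throttle` class is a hypothetical illustration of client-side rate limiting, not the mechanism actually used in Doris+.

```python
import time
from typing import Iterator, List, Optional, Sequence

FEEDBACKS_PER_JOB = 1000  # from the text
QUEUE_MAX_CPUS = 10       # from the text
CPUS_PER_JOB = 2          # from the text


def max_concurrent_jobs(queue_cpus: int = QUEUE_MAX_CPUS,
                        cpus_per_job: int = CPUS_PER_JOB) -> int:
    """Number of batch jobs that fit in the queue at the same time."""
    return queue_cpus // cpus_per_job


def split_into_jobs(feedback_ids: Sequence[str],
                    size: int = FEEDBACKS_PER_JOB) -> Iterator[List[str]]:
    """Yield consecutive chunks of feedback ids, one chunk per batch job."""
    for start in range(0, len(feedback_ids), size):
        yield list(feedback_ids[start:start + size])


class Throttle:
    """Hypothetical helper: allow at most `max_per_second` calls per second."""

    def __init__(self, max_per_second: float) -> None:
        self._min_interval = 1.0 / max_per_second
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to respect the minimum interval."""
        if self._last_call is not None:
            remaining = self._min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()
```

Calling `throttle.wait()` before each API request keeps a single job under the chosen rate; the limit across jobs is then bounded by the number of concurrent jobs.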

The scripts for the processing components are found in doris-python-code/shared/processors/:

  • Batch_feedback_processor.py for general logic and processing closed questions.
  • Batch_feedback_processor_file_upload.py is a child class with the same functionality plus specific functions for processing file_upload feedbacks.
  • Batch_feedback_processor_free_text.py is also a child class, with specific functions for processing free_text feedbacks.
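The parent and child relationship between these scripts might look roughly like the sketch below. The class names, method names and feedback fields are hypothetical; only the inheritance structure is taken from the description above.

```python
from typing import Any, Dict


class BatchFeedbackProcessor:
    """General logic; closed questions need no extra processing."""

    def process(self, feedback: Dict[str, Any]) -> Dict[str, Any]:
        processed = dict(feedback)
        processed.update(self.enrich(feedback))
        return processed

    def enrich(self, feedback: Dict[str, Any]) -> Dict[str, Any]:
        # Closed questions are stored as they are.
        return {}


class FreeTextProcessor(BatchFeedbackProcessor):
    """Child class with free_text specific steps (hypothetical fields)."""

    def enrich(self, feedback: Dict[str, Any]) -> Dict[str, Any]:
        return {"needs_translation": feedback.get("language") != "en"}


class FileUploadProcessor(BatchFeedbackProcessor):
    """Child class with file_upload specific steps (hypothetical fields)."""

    def enrich(self, feedback: Dict[str, Any]) -> Dict[str, Any]:
        return {"has_attachment": bool(feedback.get("file_path"))}
```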

Closed Questions

For the closed question type no special processing is needed, so the feedback is fetched from the ‘feedbacks’ collection in DocumentDB and stored directly in the ‘feedbacks_processed’ collection.

For reporting, however, closed feedbacks need information from the consultation, because the feedback itself contains only answerId values. In addition, the metadata is combined for filtering purposes in the Kibana dashboards.
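A minimal sketch of this enrichment step is shown below. The document layout (`questions`, `answers`, `label`, `answerIds`, `metadata`) is assumed for illustration and may differ from the real collections.

```python
from typing import Any, Dict


def answer_labels(consultation: Dict[str, Any]) -> Dict[str, str]:
    """Map every answerId in the consultation to its readable label."""
    return {
        answer["answerId"]: answer["label"]
        for question in consultation["questions"]
        for answer in question["answers"]
    }


def enrich_closed_feedback(feedback: Dict[str, Any],
                           consultation: Dict[str, Any]) -> Dict[str, Any]:
    """Add answer labels and combined metadata for reporting."""
    labels = answer_labels(consultation)
    processed = dict(feedback)
    # Unknown ids are kept unchanged rather than dropped.
    processed["answers"] = [labels.get(a, a) for a in feedback["answerIds"]]
    # Feedback metadata wins over consultation metadata on key clashes.
    processed["metadata"] = {
        **consultation.get("metadata", {}),
        **feedback.get("metadata", {}),
    }
    return processed
```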

Open Questions

For open questions, processing is done in two phases because of eTranslation. In phase 1, AWS Comprehend first detects the language of the text, which is then sent for translation. In phase 2, the translated English text undergoes further analysis with key phrase, sentiment and entity detection. The figure below shows the order of calls in the phases. The same order applies to attachments.

The reporting part also filters those fields further so that the results can be visualised in word clouds and similar charts.

Figure: eTranslation and AWS Comprehend components in processing
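The two phases can be sketched as two functions around a Comprehend client. The functions take the client as a parameter (for example `boto3.client("comprehend")`); the shape of the returned dictionary is an assumption made for this example.

```python
from typing import Any, Dict


def detect_language(comprehend: Any, text: str) -> str:
    """Phase 1: return the most likely language code of the text."""
    response = comprehend.detect_dominant_language(Text=text)
    best = max(response["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"]


def analyse_english_text(comprehend: Any, text: str) -> Dict[str, Any]:
    """Phase 2: key phrase, sentiment and entity detection on English text."""
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")
    return {
        "key_phrases": [p["Text"] for p in key_phrases["KeyPhrases"]],
        "sentiment": sentiment["Sentiment"],
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
    }
```

Between the two calls, the text is sent to eTranslation and the job waits for the translated English text to arrive.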

Attachments

Attachments require more complicated processing. First, the attachment feedbacks are sorted to determine whether they have actual attachments, whether the attachments have been extracted and, if so, in which language, and whether a translation text file already exists. To do this, the S3 attachment bucket is scanned and compared to the file path given in the feedback. The translation is then also done with eTranslation, but the results are stored on S3 instead of in a DocumentDB collection. An excerpt of the translation is still stored in the feedback itself.
After translation, topics detection is run on the texts. For the reporting part, URLs to S3 are constructed so that the viewer can click through to the original file or the translation .txt file.
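The comparison against the bucket and the construction of the links could be sketched as follows. The bucket listing is passed in as a set of keys, and the `.EN.txt` suffix for translation files and the default region are hypothetical conventions chosen for this example.

```python
from typing import Any, Dict, Optional, Set
from urllib.parse import quote


def s3_object_url(bucket: str, key: str, region: str = "eu-west-1") -> str:
    """Virtual-hosted-style URL of an S3 object (region is an assumption)."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def translation_key(file_path: str) -> str:
    """Hypothetical naming convention for the translated text file."""
    return f"{file_path}.EN.txt"


def classify_attachment(bucket: str, file_path: Optional[str],
                        existing_keys: Set[str]) -> Dict[str, Any]:
    """Compare the feedback's file path with the keys found in the bucket."""
    has_attachment = file_path is not None and file_path in existing_keys
    t_key = translation_key(file_path) if has_attachment else None
    has_translation = t_key is not None and t_key in existing_keys
    return {
        "has_attachment": has_attachment,
        "has_translation": has_translation,
        "original_url": s3_object_url(bucket, file_path) if has_attachment else None,
        "translation_url": s3_object_url(bucket, t_key) if has_translation else None,
    }
```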

eTranslation

eTranslation is an online machine translation service provided by the European Commission. The service is available to EC information systems through an API. It is highly secure, runs within the Commission firewall, and can translate all official EU languages. Furthermore, it is optimised for texts on EU matters.

There are 2 services of eTranslation used:

  • Translate documents - upload one or more documents
  • Translate text - text snippet

Translation of text snippets is used for open questions, while base64 encoding is used for the attachments of file upload feedbacks. For the latter, eTranslation accepts the following formats: .txt, .doc, .docx, .odt, .ott, .rtf, .xls, .xlsx, .ods, .ots, .ppt, .pptx, .odp, .otp, .odg, .otg, .htm, .html, .xhtml, .h, .xml, .xlf, .xliff, .sdlxliff, .rdf, .tmx and .pdf. In addition, for attachments, the translated document keeps the format of the original document.
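A simple check against this list of formats, before an attachment is encoded and sent, might look like the sketch below. The function name is hypothetical; the extensions are those listed above.

```python
import os

# Extensions taken from the list of formats accepted by eTranslation above.
ACCEPTED_EXTENSIONS = frozenset({
    ".txt", ".doc", ".docx", ".odt", ".ott", ".rtf", ".xls", ".xlsx",
    ".ods", ".ots", ".ppt", ".pptx", ".odp", ".otp", ".odg", ".otg",
    ".htm", ".html", ".xhtml", ".h", ".xml", ".xlf", ".xliff",
    ".sdlxliff", ".rdf", ".tmx", ".pdf",
})


def is_accepted_format(file_name: str) -> bool:
    """Return True if the file extension is in the accepted list."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in ACCEPTED_EXTENSIONS
```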

After opening an account in eTranslation, two different REST calls are made to the eTranslation API using HTTP Digest authentication. The translation is then received asynchronously, with the API gateway as the entry point:

  1. Text snippet:

{
    "externalReference": _id,
    "callerInformation": {
        "application": self.application,
        "username": self.username
    },
    "textToTranslate": text,
    "sourceLanguage": source_language_code,
    "targetLanguages": ["EN"],
    "domain": self.domain,
    "requesterCallback": self.request_callback,
    "errorCallback": self.error_callback,
}

In this API call, we provide the identifier of the feedback to be translated, the application to which we have access, and the source language of the feedback that needs to be translated into English. Finally, we add the requester callback, the HTTP address of our API gateway to which the translation is sent afterwards. Before the request is sent, the identifier of the feedback is entered in a DocDB collection with the status untranslated. Upon reception of the text, a lambda function updates the DocDB status to translated and inserts the translation.

  2. Base64-encoded attachment:

{
    "externalReference": external_reference,
    "callerInformation": {
        "application": self.application,
        "username": self.username
    },
    "documentToTranslateBase64": {
        "content": encoded_string,
        "format": "txt",
        "fileName": external_reference
    },
    "sourceLanguage": source_language_code,
    "targetLanguages": ["EN"],
    "domain": self.domain,
    "errorCallback": callback,  # self.error_callback
    "destinations": {"httpDestinations": [callback]}
}

In this API call, we provide the identifier of the feedback attachment to be translated, the application to which we have access, the source language of the feedback to be translated into English, and the document encoded in base64. Finally, we add the HTTP destination, the HTTP address of our API gateway to which the translation is sent afterwards. Before the request is sent, the identifier of the feedback is entered in a DocDB collection with the status untranslated. Upon reception of the translation, a lambda function decodes the base64-encoded document and stores it in an S3 bucket. The lambda also updates the DocDB status to translated.
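Sending either payload with HTTP Digest authentication can be sketched with the Python standard library alone. The endpoint URL and credentials are parameters here because they are not given in this section; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def make_digest_opener(url: str, username: str,
                       password: str) -> urllib.request.OpenerDirector:
    """Create an opener that answers HTTP Digest challenges for `url`."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, url, username, password)
    handler = urllib.request.HTTPDigestAuthHandler(password_manager)
    return urllib.request.build_opener(handler)


def build_translation_request(url: str,
                              payload: Dict[str, Any]) -> urllib.request.Request:
    """Wrap a payload such as the two shown above in a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, assuming `endpoint`, `user`, `secret` and `payload` are defined:
#   opener = make_digest_opener(endpoint, user, secret)
#   with opener.open(build_translation_request(endpoint, payload)) as response:
#       request_id = response.read().decode("utf-8")
```

The response to this call is only an acknowledgement; the translation itself arrives later at the callback address.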

Upon reception of the translation to the API gateway
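The status bookkeeping around a translation request can be illustrated with plain dictionaries standing in for the DocDB collection and the S3 bucket. The function names and the `.txt` key suffix are hypothetical; in Doris+ this logic runs in a lambda function behind the API gateway.

```python
import base64
from typing import Any, Dict


def register_request(statuses: Dict[str, Dict[str, Any]],
                     external_reference: str) -> None:
    """Before sending: record the feedback as untranslated."""
    statuses[external_reference] = {"status": "untranslated"}


def on_text_translation(statuses: Dict[str, Dict[str, Any]],
                        external_reference: str,
                        translated_text: str) -> None:
    """Text snippet received: store the translation with the new status."""
    statuses[external_reference] = {
        "status": "translated",
        "translation": translated_text,
    }


def on_document_translation(statuses: Dict[str, Dict[str, Any]],
                            bucket: Dict[str, bytes],
                            external_reference: str,
                            encoded_document: str) -> None:
    """Document received: decode it, store it, then update the status."""
    bucket[f"{external_reference}.txt"] = base64.b64decode(encoded_document)
    statuses[external_reference] = {"status": "translated"}
```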

One of the limitations of eTranslation is that the priority of our requests in the translation queue drops to 0 if too many calls are sent to the tool simultaneously. Although no translation is lost and the tool does not return an error, the time needed to receive the translations increases with the number of calls. We have therefore developed a scalable waiting system between the phases that require translation, as well as a resend system. In addition, a small share of translations, between 0.01% and 1%, is sometimes not received and has to be resent.
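One possible shape for such a waiting and resend mechanism is sketched below: the wait grows with the number of outstanding requests, and requests that remain untranslated past a timeout are selected for resending. All numeric values and field names are hypothetical, and the actual Doris+ mechanism may differ.

```python
from typing import Any, Dict, List


def wait_seconds(pending_count: int, seconds_per_request: float = 0.5,
                 minimum: float = 30.0, maximum: float = 3600.0) -> float:
    """Scale the wait with the number of outstanding translations."""
    return max(minimum, min(maximum, pending_count * seconds_per_request))


def select_for_resend(statuses: Dict[str, Dict[str, Any]], now: float,
                      timeout_seconds: float) -> List[str]:
    """Return references still untranslated after the timeout."""
    return [
        reference
        for reference, entry in statuses.items()
        if entry["status"] == "untranslated"
        and now - entry["sent_at"] > timeout_seconds
    ]
```

Selecting only the timed-out references keeps the number of resent calls small, which matters because sending many calls at once is what lowers the request priority in the first place.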

For more information on eTranslation calls, see the eTranslation documentation.

Important considerations:

  • Maximum 4000 characters for ordinary text; for our purposes, effectively unlimited when using base64 encoding (for attachments).
  • Only a single request was made per feedback.

AWS Comprehend

There are 5 services of AWS Comprehend used:

  • Language detection
  • Key phrase detection
  • Sentiment detection
  • Entities detection
  • Topics detection

Important considerations with AWS Comprehend services:

  • The length of the text snippets must be checked in bytes (max 5000 bytes).
  • The number of simultaneous calls that can be made determines whether to use single requests, batches of 25, or the jobs API for each service.
    • Topics detection only has a jobs API, which needs a folder on S3 to fetch input from and store output in.
    • The other four services were called through the batch API with 25 requests per call.
  • De-duplication of results is important for key phrases and entities, so that too many similar words in a single text do not skew the aggregate results.
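The three considerations above can be sketched as small helpers: truncating text to the byte limit, grouping texts into batches of 25, and de-duplicating detected phrases. Truncation is only one way to respect the byte limit, and the case-insensitive de-duplication rule is an assumption for this example.

```python
from typing import Iterator, List, Sequence

MAX_BYTES = 5000   # byte limit per text, from the list above
BATCH_SIZE = 25    # documents per batch API call, from the list above


def truncate_utf8(text: str, max_bytes: int = MAX_BYTES) -> str:
    """Cut the text so that its UTF-8 encoding fits in `max_bytes`."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # A multi-byte character split by the cut is dropped, not corrupted.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def batches(texts: Sequence[str],
            size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def deduplicate(phrases: Sequence[str]) -> List[str]:
    """Keep the first occurrence of each phrase, ignoring case and padding."""
    seen = set()
    unique = []
    for phrase in phrases:
        key = phrase.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(phrase.strip())
    return unique
```

Each batch could then be passed to a Comprehend batch call such as `batch_detect_key_phrases(TextList=batch, LanguageCode="en")`.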