Skip to main content

Storage Components

There are three places where data is (temporarily) stored in this processing engine for different reasons: AWS DocumentDB, AWS S3 and AWS Elasticsearch. In DocumentDB, data is loaded by the processing engine after it was stored from the source extractions. Depending on the type of question of the feedback, additional services and storage will be used to store (intermediate) results in the processing. AWS S3 is used for specific purposes only as explained below. All processed feedbacks will finally be stored in Elasticsearch to be visualized by Kibana.

The relevant and reusable code snippets can be found in doris-python-code/shared/connectors/:

  • documentdb_connector.py
  • elastic_connector.py
  • s3_connector.py

Every code script or Lambda function that reads or writes data to these databases makes use of the below connector classes. The connections are managed by Application.py in in doris-python-code/shared/config.

AWS S3

Doris uses 2 buckets on AWS S3 for storing (intermediate) results: “doris-attachments” and “feedbacks-processing”. Other buckets are backups of the data stored in DocumentDB and the bucket “codepipeline-eu-west-1-818570815782” is used for the code pipeline of the code repository.

In doris-attachments all the original attachments (.pdf, .docx …), extracted text files (.txt) and translated text files belonging to feedbacks of question type FILE_UPLOAD are stored. If an extracted file without English translation is discovered, it is also translated during the processing and stored in the right place afterwards.

Structure in doris-attachments: {consultation_alias} / {ExtractedFiles || UploadedFiles || TranslatedFiles} / {feedback_id} / file_name.txt

In feedbacks-processing there are two folders used in the processing: “to-be-processed-ids” and “topics-detection-jobs”.

The first folder is used for the general orchestration of dividing ids of feedbacks for the different batches to be processed in different containers per question type.

The topics-detection-folder, on the other hand, is used as a staging ground for topics detection jobs. These jobs are for analysing the text of attachments (FILE_UPLOAD). Here the English text files are stored per consultation as an input for one job. The output is also first stored here and then extracted and processed in the feedback itself in the field attachment_topics.

AWS DocumentDB

Following collections are found in DocumentDB: 'consultations', 'feedbacks',  'feedbacks_processed',  'feedbacks_staged',  'feedbacks_staged_attachments',  'feedbacks_translated', 'feedbacks_translated_attachments' and 'monitoring'.

For consultations, feedbacks and feedbacks_processed the data model is the same as the one provided by Everis through the ETL flow where feedbacks_processed contains the extra text analysis fields like sentiment and key phrases. Therefore, they will not be listed again here.

The other collections are used in the processing of the feedbacks. 'feedbacks_staged',  'feedbacks_staged_attachments',  'feedbacks_translated', 'feedbacks_translated_attachments' are staging/intermediate collections for the translations with the eTranslation service because translations do not happen instantly. After the translation phase, these feedbacks are combined again and further processed and stored in feedbacks_processed. The collections 'feedbacks_staged' and 'feedbacks_staged_attachments' have the same data model as ‘feedbacks’ and therefore will not be listed here again. The 'monitoring' collection is related to the steps functions. In order to register the last runs and the results of those. The collections in bold have an updated data model, included below, that is different from the data model provided in earlier documentation of Doris+.

Overview of DocumentDB collections:

collection

explanation

consultations

see consultation collections ETL. Reference Confluence

feedbacks

see feedbacks collections ETL Reference Confluence

feedbacks_processed

see feedbacks collections ETL with possible text analysis fields Reference Confluence

feedbacks_staged

intermediate staging collection, data model see feedbacks collections ETL Reference Confluence

feedbacks_staged_attachments

intermediate staging collection, data model see feedbacks collections ETL Reference Confluence

feedbacks_translated

Storage collection of eTranslation results for free_text

feedbacks_translated_attachments

Storage collection of eTranslation results for file_upload

monitoring

used to register the last runs and the results of Steps orchestration functions. It is updated by ‘lambda_process_done’

Monitoring collection fields:

Field name

example

type

explanation

_id

ObjectId('5ea97d8fe91404571d2b17e0')

objectId

standard field

run_id

2ba90eaa-788f-6490-ee26-af9afec7850c

string

id or name given when starting execution of Step function

type

FREE_TEXT_EMPTY

string

type of feedback and content (empty or not)

upload_start_date

datetime.datetime(1900, 1, 1, 0, 0)

date

date of earliest possible feedback

upload_end_date

datetime.datetime(2020, 4, 7, 5, 22, 9, 729000)

date

date of latest uploaded feedback to be processed

run_start_date

datetime.datetime(2020, 4, 29, 13, 12, 37, 739000)

date

start date of the run

number_items

95396

integer

number of feedbacks included in the run

result

PARTIAL_LOAD_SUCCEED_PART_1

string

fail or succes message

run_end_date

datetime.datetime(2020, 4, 29, 13, 25, 45, 447000)

date

end date of the run

feedbacks_translated collection fields:

Field name

example

type

explanation

_id

ObjectId('5ea97d8fe91404571d2b17e0')

objectId

standard field

text

Deze zaken zouden bij het individu moeten...

string

text to beTranslationd

status

TRANSLATED

string

indicates if text was translated

time_sent

datetime.datetime(2020, 4, 24, 8, 36, 22, 452000)

date

time of sending translation request

request-id

20956402

string

request id of the e translation service

target-language

EN

string

language in which it needs to beTranslationd

translated-text

These issues should lie in the individual...

string

the actual translation

external-reference

5e70a93455bae80624848626

string

the object id of the feedback, for joining the text and feedback again

time_received

datetime.datetime(2020, 4, 28, 8, 16, 52, 166000)

date

time of receiving translation from eTranslation

feedbacks_translated_attachments fields:

Field name

example

type

explanation

_id

ObjectId('5ea97d8fe91404571d2b17e0')

objectId

standard field

text_base64

Y2VydGFpbmVzIG1lc3VyZX

base64 string

text encoded in base64

status

TRANSLATED

string

indicates if text was translated

time_sent

datetime.datetime(2020, 4, 24, 8, 36, 22, 452000)

datetime

time of sending translation request

request-id

20956402

string

request id of the e translation service

target-language

EN

string

language in which it needs to beTranslationd

translated-text-base64

MSBQcmVzaWRlbnQgRnJhbmsgTW9yZWVscyBWa

base64 string

the base64 encoded translation

translated-text

These issues should lie in the individual...

string

the actual translation

external-reference

350280_OpenPublicConsultationCCDEvaluation/ExtractedFiles/

5d80e482cfb1a19d046978d9/EXT_pdf_Bausparkassenverb_nde

_opinion_CCD_evaluation.txt

string

the path of the unique original text .txt file, for joining the text and feedback again

time_received

datetime.datetime(2020, 5, 11, 14, 20, 46, 866000)

datetime

time of receiving translation from eTranslation

 

AWS Elasticsearch

After all feedbacks have been processed, they are stored in Elasticsearch. This is done in the reporting part of the step functions. The data is filtered so that only relevant data for visualisation and analysis is saved under the index ‘feedbacks’. Kibana uses this index to visualize consultations and feedbacks based on the fields below.

feedbacks index data model Elasticsearch:

Name

example

type

explanation

_id

332885_5d95ff4acfb1a19d04826146_091

string

 a new composed id (unique identifier) in the format: <consultationId>_<feedbackId>_<questionOrder> to avoid duplicates (with diferent objectId) when loading from DocumentDB.

_index

feedbacks

string

name of the index in elasticSearch

_score

0

string

standard field in Elasticsearch index

_source

 

string

standard field in Elasticsearch index

_type

_doc

string

standard field in Elasticsearch index

attachment_english_text_filepath

https://ec2-52-16-9-29.eu-west-1.compute.amazonaws.com/s3/340922_Design2018/ExtractedFiles/5d95fb35cfb1a19d048244b3/EXT_pdf_0326_BUSINESSEUROPE_Comments_on_evaluation_of_EU_legislation_on_design_protection_consultation.txt

string

url link to the english .txt file on S3, extracted and/or translated from the original attachment (.pdf)

attachment_english_text_kw

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis ...

string

text excerpt of the attachment in English

attachment_english_text_t

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis ...

string

text excerpt of the attachment in English

attachment_language_code

en

string

language code of the original attachment file

attachment_name

0326_BUSINESSEUROPE_Comments_on_evaluation_of_EU_legislation_on_design_protection_consultation.pdf

string

original name of the attachment

attachment_original_filepath

https://ec2-52-16-9-29.eu-west-1.compute.amazonaws.com/s3/340922_Design2018/UploadedFiles/5d95fb35cfb1a19d048244b3/0326_BUSINESSEUROPE_Comments_on_evaluation_of_EU_legislation_on_design_protection_consultation.pdf

string

url link to the original attachment file on S3

attachment_original_text

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis ...

string

text excerpt of the attachment in original language

attachment_topics

businesseurope, national, term, evaluation, company, comment, design

string

up to 30 terms, in order of importance, that were associated with the attachment text.

attachment_topics_all

'businesseurope', 'national', 'term', 'evaluation', 'company', 'comment', 'design'

string

up to 30 terms, in order of importance, that were associated with the attachment text.

closed_answer_ids

d41d8cd98f00b204e9800998ecf8427e

string

the id contained in a feedback which associates with a answer text enclosed in the consultation document

closed_answer_ids_options

d41d8cd98f00b204e9800998ecf8427e, c4ca4238a0b923820dcc509a6f75849b, eccbc87e4b5ce2fe28308fd9f2a7baf3, c81e728d9d4c2f636f067f89cc14862c

string

array all the possible ids on the posed question

closed_answer_texts

1

string

the text associated with the answer id

closed_answer_texts_options

, 1, 3, 2

string

array of all the possible textual answers

consultation_author_email

-

string

email of the author of the consultation

consultation_author_name

null null

string

name of the author of the consultation

consultation_end_date

-

string

the date that respondents could last fill the consultation, i.e. it was closed.

consultation_id

340922

string

the id of the consultation

consultation_kind

EUSurvey

string

source of the consultation either EUSurvey or BRP

consultation_short_name

Design2018

string

the shorter name of the consultation title

consultation_start_date

-

string

the date that respondents could first fill the consultation, i.e. it was opened.

consultation_title_kw

Consultation on Design Protection in the EU

string

the title of the consultation, on this title you can search the right consultation in Kibana, if it is missing, short_name is used.

consultation_title_t

Consultation on Design Protection in the EU

string

 

consultation_title_url

Consultation on Design Protection in the EU

string

 

consultation_type

OPC_LAUNCHED

string

type of consultation, mostly OPC_LAUNCHED sporadically a xxx_DRAFT or others

consultation_units

SECRETARIAT-GENERAL

string

the DG units that have access to the consultations, data & visualisations

dq_approved

 False

boolean

Flag of approval concerning the data quality of the feedback

dq_description

Text is longer than 5000 bytes

string

Description of what is wrong with the data quality of the feedback.

feedback_country

Belgium

string

country where the respondent/user lives

feedback_country_1MM_div_population

0.087

number

1 million divided by the population of the respondent's country. Used to get insight in relative responsiveness per country

feedback_country_iso_code

BE

string

the ISO 2 code of the country

feedback_id

5d95fb35cfb1a19d048244b3

string

 

freetext_answer

En nuestra opinión, un aumento de los requisitos de protección, concretamente del requisito de visibilidad (a partir del Asunto T-49/12 Biscuits Poult SAS)

string

answer in the original language

freetext_answer_english_kw

In our view, an increase in protection requirements, in particular the visibility requirement (based on Case T-49/12 Biscuits poult SAS)

string

answer after translation to english

freetext_answer_english_t

In our view, an increase in protection requirements, in particular the visibility requirement (based on Case T-49/12 Biscuits poult SAS)

string

answer after translation to english

freetext_answer_length

185

number

the character length of the answer, this can be maximum 5000 bytes (for the services we use) or it will be truncated.

freetext_entities

Many products, Italian, European, Basmati, zero-duty, European Union, India

string

all entities together detected by AWS Comprehend, below they are ordered per category. They are all lists of individual items.

freetext_entities_.COMMERCIAL_ITEM

Many products, Basmati

string

 

freetext_entities_.DATE

1 September 2020

string

 

freetext_entities_.EVENT

Volkswagen scandal, diesel

string

 

freetext_entities_.LOCATION

India

string

 

freetext_entities_.ORGANIZATION

European Union, India

string

 

freetext_entities_.OTHER

Italian, Basmati, European

string

 

freetext_entities_.PERSON

Ursula von der Leyen

string

 

freetext_entities_.QUANTITY

second industry, first generation, both, first

string

 

freetext_entities_.TITLE

Test Procedure, Type, Market Surveillance Ordinance

string

 

freetext_key_phrases

ENVIRONMENTAL IMPACT, INFORMATION, RECYCLING OPTIONS, REPAIR, MANY PRODUCTS, A DELIBERATE ATTEMPT

string

usually snippets of 1 to 5 words with the most meaning their sentences.

freetext_language_code

en

string

 

freetext_sentiment

negative

string

the sentiment the text has with choices [neutral, positive, negative, mixed]

question_group_id

1

string

the id of the group of questions or section where the question belongs to.

question_order

229

string

the order of the question in the consultation.

question_parent_choices

['Seville', 'Karlsruhe', '', 'Petten', 'Geel']

 string

Sometimes a question is not clear without the context of the previous question  (i.e. "Please explain")
  In that case the parent question and answers are fetched and added to the question_text.

question_parent_text

Please specify:

 string

Sometimes a question is not clear without the context of the previous question  (i.e. "Please explain")
  In that case the parent question and answers are fetched and added to the question_text.

question_text

Preceding question: Please specify:
With answers: ['Seville', 'Karlsruhe', '', 'Petten', 'Geel']
Please specify:

string

In the given example, the real question text is "Please specify:". However, a parentQuestionId was linked in the feedback and thus the question_parent_text and choices were fetched and inserted respectively after "Preceding question:" and "With answers:"

question_type

FREE_TEXT

string

the type of question with choices in order of frequency: [SINGLE_CHOICE, FREE_TEXT, MULTIPLE CHOICE, FILE_UPLOAD ]

user.companySize

MICRO

string

If the user is a company, the size is asked. Choices: [MICRO, SMALL, MEDIUM, LARGE]

user.id

461983

string

Usually a six-digit number, sometimes less.

user.type

ORGANIZATION

string

the type of respondent with choices: [CITIZEN, ORGANIZATION, ADMINISTRATION, OTHER]