There are three places where data is (temporarily) stored in this processing engine for different reasons: AWS DocumentDB, AWS S3 and AWS Elasticsearch. In DocumentDB, data is loaded by the processing engine after it was stored from the source extractions. Depending on the type of question of the feedback, additional services and storage will be used to store (intermediate) results in the processing. AWS S3 is used for specific purposes only as explained below. All processed feedbacks will finally be stored in Elasticsearch to be visualized by Kibana.
The relevant and reusable code snippets can be found in doris-python-code/shared/connectors/:
- documentdb_connector.py
- elastic_connector.py
- s3_connector.py
Every code script or Lambda function that reads or writes data to these databases makes use of the below connector classes. The connections are managed by Application.py in in doris-python-code/shared/config.
AWS S3
Doris uses 2 buckets on AWS S3 for storing (intermediate) results: “doris-attachments” and “feedbacks-processing”. Other buckets are backups of the data stored in DocumentDB and the bucket “codepipeline-eu-west-1-818570815782” is used for the code pipeline of the code repository.
In doris-attachments all the original attachments (.pdf, .docx …), extracted text files (.txt) and translated text files belonging to feedbacks of question type FILE_UPLOAD are stored. If an extracted file without English translation is discovered, it is also translated during the processing and stored in the right place afterwards.
Structure in doris-attachments: {consultation_alias} / {ExtractedFiles || UploadedFiles || TranslatedFiles} / {feedback_id} / file_name.txt
In feedbacks-processing there are two folders used in the processing: “to-be-processed-ids” and “topics-detection-jobs”.
The first folder is used for the general orchestration of dividing ids of feedbacks for the different batches to be processed in different containers per question type.
The topics-detection-folder, on the other hand, is used as a staging ground for topics detection jobs. These jobs are for analysing the text of attachments (FILE_UPLOAD). Here the English text files are stored per consultation as an input for one job. The output is also first stored here and then extracted and processed in the feedback itself in the field attachment_topics.
AWS DocumentDB
Following collections are found in DocumentDB: 'consultations', 'feedbacks', 'feedbacks_processed', 'feedbacks_staged', 'feedbacks_staged_attachments', 'feedbacks_translated', 'feedbacks_translated_attachments' and 'monitoring'.
For consultations, feedbacks and feedbacks_processed the data model is the same as the one provided by Everis through the ETL flow where feedbacks_processed contains the extra text analysis fields like sentiment and key phrases. Therefore, they will not be listed again here.
The other collections are used in the processing of the feedbacks. 'feedbacks_staged', 'feedbacks_staged_attachments', 'feedbacks_translated', 'feedbacks_translated_attachments' are staging/intermediate collections for the translations with the eTranslation service because translations do not happen instantly. After the translation phase, these feedbacks are combined again and further processed and stored in feedbacks_processed. The collections 'feedbacks_staged' and 'feedbacks_staged_attachments' have the same data model as ‘feedbacks’ and therefore will not be listed here again. The 'monitoring' collection is related to the steps functions. In order to register the last runs and the results of those. The collections in bold have an updated data model, included below, that is different from the data model provided in earlier documentation of Doris+.
Overview of DocumentDB collections:
collection |
explanation |
consultations |
see consultation collections ETL. Reference Confluence |
feedbacks |
see feedbacks collections ETL Reference Confluence |
feedbacks_processed |
see feedbacks collections ETL with possible text analysis fields Reference Confluence |
feedbacks_staged |
intermediate staging collection, data model see feedbacks collections ETL Reference Confluence |
feedbacks_staged_attachments |
intermediate staging collection, data model see feedbacks collections ETL Reference Confluence |
feedbacks_translated |
Storage collection of eTranslation results for free_text |
feedbacks_translated_attachments |
Storage collection of eTranslation results for file_upload |
monitoring |
used to register the last runs and the results of Steps orchestration functions. It is updated by ‘lambda_process_done’ |
Monitoring collection fields:
Field name |
example |
type |
explanation |
_id |
ObjectId('5ea97d8fe91404571d2b17e0') |
objectId |
standard field |
run_id |
2ba90eaa-788f-6490-ee26-af9afec7850c |
string |
id or name given when starting execution of Step function |
type |
FREE_TEXT_EMPTY |
string |
type of feedback and content (empty or not) |
upload_start_date |
datetime.datetime(1900, 1, 1, 0, 0) |
date |
date of earliest possible feedback |
upload_end_date |
datetime.datetime(2020, 4, 7, 5, 22, 9, 729000) |
date |
date of latest uploaded feedback to be processed |
run_start_date |
datetime.datetime(2020, 4, 29, 13, 12, 37, 739000) |
date |
start date of the run |
number_items |
95396 |
integer |
number of feedbacks included in the run |
result |
PARTIAL_LOAD_SUCCEED_PART_1 |
string |
fail or succes message |
run_end_date |
datetime.datetime(2020, 4, 29, 13, 25, 45, 447000) |
date |
end date of the run |
feedbacks_translated collection fields:
Field name |
example |
type |
explanation |
_id |
ObjectId('5ea97d8fe91404571d2b17e0') |
objectId |
standard field |
text |
Deze zaken zouden bij het individu moeten... |
string |
text to beTranslationd |
status |
TRANSLATED |
string |
indicates if text was translated |
time_sent |
datetime.datetime(2020, 4, 24, 8, 36, 22, 452000) |
date |
time of sending translation request |
request-id |
20956402 |
string |
request id of the e translation service |
target-language |
EN |
string |
language in which it needs to beTranslationd |
translated-text |
These issues should lie in the individual... |
string |
the actual translation |
external-reference |
5e70a93455bae80624848626 |
string |
the object id of the feedback, for joining the text and feedback again |
time_received |
datetime.datetime(2020, 4, 28, 8, 16, 52, 166000) |
date |
time of receiving translation from eTranslation |
feedbacks_translated_attachments fields:
Field name |
example |
type |
explanation |
_id |
ObjectId('5ea97d8fe91404571d2b17e0') |
objectId |
standard field |
text_base64 |
Y2VydGFpbmVzIG1lc3VyZX |
base64 string |
text encoded in base64 |
status |
TRANSLATED |
string |
indicates if text was translated |
time_sent |
datetime.datetime(2020, 4, 24, 8, 36, 22, 452000) |
datetime |
time of sending translation request |
request-id |
20956402 |
string |
request id of the e translation service |
target-language |
EN |
string |
language in which it needs to beTranslationd |
translated-text-base64 |
MSBQcmVzaWRlbnQgRnJhbmsgTW9yZWVscyBWa |
base64 string |
the base64 encoded translation |
translated-text |
These issues should lie in the individual... |
string |
the actual translation |
external-reference |
350280_OpenPublicConsultationCCDEvaluation/ExtractedFiles/ 5d80e482cfb1a19d046978d9/EXT_pdf_Bausparkassenverb_nde _opinion_CCD_evaluation.txt |
string |
the path of the unique original text .txt file, for joining the text and feedback again |
time_received |
datetime.datetime(2020, 5, 11, 14, 20, 46, 866000) |
datetime |
time of receiving translation from eTranslation |
AWS Elasticsearch
After all feedbacks have been processed, they are stored in Elasticsearch. This is done in the reporting part of the step functions. The data is filtered so that only relevant data for visualisation and analysis is saved under the index ‘feedbacks’. Kibana uses this index to visualize consultations and feedbacks based on the fields below.
feedbacks index data model Elasticsearch:
Name |
example |
type |
explanation |
_id |
332885_5d95ff4acfb1a19d04826146_091 |
string |
a new composed id (unique identifier) in the format: <consultationId>_<feedbackId>_<questionOrder> to avoid duplicates (with diferent objectId) when loading from DocumentDB. |
_index |
feedbacks |
string |
name of the index in elasticSearch |
_score |
0 |
string |
standard field in Elasticsearch index |
_source |
|
string |
standard field in Elasticsearch index |
_type |
_doc |
string |
standard field in Elasticsearch index |
attachment_english_text_filepath |
string |
url link to the english .txt file on S3, extracted and/or translated from the original attachment (.pdf) |
|
attachment_english_text_kw |
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis ... |
string |
text excerpt of the attachment in English |
attachment_english_text_t |
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis ... |
string |
text excerpt of the attachment in English |
attachment_language_code |
en |
string |
language code of the original attachment file |
attachment_name |
0326_BUSINESSEUROPE_Comments_on_evaluation_of_EU_legislation_on_design_protection_consultation.pdf |
string |
original name of the attachment |
attachment_original_filepath |
string |
url link to the original attachment file on S3 |
|
attachment_original_text |
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis ... |
string |
text excerpt of the attachment in original language |
attachment_topics |
businesseurope, national, term, evaluation, company, comment, design |
string |
up to 30 terms, in order of importance, that were associated with the attachment text. |
attachment_topics_all |
'businesseurope', 'national', 'term', 'evaluation', 'company', 'comment', 'design' |
string |
up to 30 terms, in order of importance, that were associated with the attachment text. |
closed_answer_ids |
d41d8cd98f00b204e9800998ecf8427e |
string |
the id contained in a feedback which associates with a answer text enclosed in the consultation document |
closed_answer_ids_options |
d41d8cd98f00b204e9800998ecf8427e, c4ca4238a0b923820dcc509a6f75849b, eccbc87e4b5ce2fe28308fd9f2a7baf3, c81e728d9d4c2f636f067f89cc14862c |
string |
array all the possible ids on the posed question |
closed_answer_texts |
1 |
string |
the text associated with the answer id |
closed_answer_texts_options |
, 1, 3, 2 |
string |
array of all the possible textual answers |
consultation_author_email |
- |
string |
email of the author of the consultation |
consultation_author_name |
null null |
string |
name of the author of the consultation |
consultation_end_date |
- |
string |
the date that respondents could last fill the consultation, i.e. it was closed. |
consultation_id |
340922 |
string |
the id of the consultation |
consultation_kind |
string |
source of the consultation either EUSurvey or BRP |
|
consultation_short_name |
Design2018 |
string |
the shorter name of the consultation title |
consultation_start_date |
- |
string |
the date that respondents could first fill the consultation, i.e. it was opened. |
consultation_title_kw |
Consultation on Design Protection in the EU |
string |
the title of the consultation, on this title you can search the right consultation in Kibana, if it is missing, short_name is used. |
consultation_title_t |
Consultation on Design Protection in the EU |
string |
|
consultation_title_url |
Consultation on Design Protection in the EU |
string |
|
consultation_type |
OPC_LAUNCHED |
string |
type of consultation, mostly OPC_LAUNCHED sporadically a xxx_DRAFT or others |
consultation_units |
SECRETARIAT-GENERAL |
string |
the DG units that have access to the consultations, data & visualisations |
dq_approved |
False |
boolean |
Flag of approval concerning the data quality of the feedback |
dq_description |
Text is longer than 5000 bytes |
string |
Description of what is wrong with the data quality of the feedback. |
feedback_country |
Belgium |
string |
country where the respondent/user lives |
feedback_country_1MM_div_population |
0.087 |
number |
1 million divided by the population of the respondent's country. Used to get insight in relative responsiveness per country |
feedback_country_iso_code |
BE |
string |
the ISO 2 code of the country |
feedback_id |
5d95fb35cfb1a19d048244b3 |
string |
|
freetext_answer |
En nuestra opinión, un aumento de los requisitos de protección, concretamente del requisito de visibilidad (a partir del Asunto T-49/12 Biscuits Poult SAS) |
string |
answer in the original language |
freetext_answer_english_kw |
In our view, an increase in protection requirements, in particular the visibility requirement (based on Case T-49/12 Biscuits poult SAS) |
string |
answer after translation to english |
freetext_answer_english_t |
In our view, an increase in protection requirements, in particular the visibility requirement (based on Case T-49/12 Biscuits poult SAS) |
string |
answer after translation to english |
freetext_answer_length |
185 |
number |
the character length of the answer, this can be maximum 5000 bytes (for the services we use) or it will be truncated. |
freetext_entities |
Many products, Italian, European, Basmati, zero-duty, European Union, India |
string |
all entities together detected by AWS Comprehend, below they are ordered per category. They are all lists of individual items. |
freetext_entities_.COMMERCIAL_ITEM |
Many products, Basmati |
string |
|
freetext_entities_.DATE |
1 September 2020 |
string |
|
freetext_entities_.EVENT |
Volkswagen scandal, diesel |
string |
|
freetext_entities_.LOCATION |
India |
string |
|
freetext_entities_.ORGANIZATION |
European Union, India |
string |
|
freetext_entities_.OTHER |
Italian, Basmati, European |
string |
|
freetext_entities_.PERSON |
Ursula von der Leyen |
string |
|
freetext_entities_.QUANTITY |
second industry, first generation, both, first |
string |
|
freetext_entities_.TITLE |
Test Procedure, Type, Market Surveillance Ordinance |
string |
|
freetext_key_phrases |
ENVIRONMENTAL IMPACT, INFORMATION, RECYCLING OPTIONS, REPAIR, MANY PRODUCTS, A DELIBERATE ATTEMPT |
string |
usually snippets of 1 to 5 words with the most meaning their sentences. |
freetext_language_code |
en |
string |
|
freetext_sentiment |
negative |
string |
the sentiment the text has with choices [neutral, positive, negative, mixed] |
question_group_id |
1 |
string |
the id of the group of questions or section where the question belongs to. |
question_order |
229 |
string |
the order of the question in the consultation. |
question_parent_choices |
['Seville', 'Karlsruhe', '', 'Petten', 'Geel'] |
string |
Sometimes a question is not clear without the context of the previous question (i.e. "Please explain") |
question_parent_text |
Please specify: |
string |
Sometimes a question is not clear without the context of the previous question (i.e. "Please explain") |
question_text |
Preceding question: Please specify: |
string |
In the given example, the real question text is "Please specify:". However, a parentQuestionId was linked in the feedback and thus the question_parent_text and choices were fetched and inserted respectively after "Preceding question:" and "With answers:" |
question_type |
FREE_TEXT |
string |
the type of question with choices in order of frequency: [SINGLE_CHOICE, FREE_TEXT, MULTIPLE CHOICE, FILE_UPLOAD ] |
user.companySize |
MICRO |
string |
If the user is a company, the size is asked. Choices: [MICRO, SMALL, MEDIUM, LARGE] |
user.id |
461983 |
string |
Usually a six-digit number, sometimes less. |
user.type |
ORGANIZATION |
string |
the type of respondent with choices: [CITIZEN, ORGANIZATION, ADMINISTRATION, OTHER] |