Technology stack

The project implementation relies on several technologies. Wherever possible, priority is given to open-source solutions. The chosen technologies are widely supported by different software communities on the Internet.

The technologies used in each block of the architecture are described below.

ETL processing
Core Solution Component
Visualisation Component
Database Management System
AWS Services

1. ETL processing

Talend is an open-source data integration platform that provides software and services for data integration, data management, enterprise application integration, data quality, cloud storage and Big Data. It helps transform heterogeneous data into homogeneous data, easing analysis and any other processing the user wants to perform on it.

It is multiplatform software (Windows and macOS only) whose license depends on the edition: Apache License 2.0 for the free edition and a proprietary license for the paid ones.

2. Core Solution Component

Python

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together.

Its philosophy emphasizes code readability and its syntax allows users to express complex concepts in fewer lines of code than in other languages, such as Java or C.

The Python interpreter and its extensive standard library are available in source or binary form free of charge for all major platforms and can also be freely distributed.

Selenium

Selenium is a framework for testing web applications. Its main use is automating the execution of web application tests, but its tools can also simulate human-like web navigation, giving the user full interaction with a web page.

It can be run against most modern web browsers and it is multiplatform.

It is open-source software under an Apache 2.0 license.

Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree from which data can be extracted, which makes it very useful for web scraping. Its main capabilities are:

It provides a few simple methods for navigating, searching, and modifying a parse tree.
It automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so there is no need to worry about encodings.
It sits on top of popular Python parsers like lxml and html5lib, so different parsing strategies can be chosen.
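As a minimal sketch of the capabilities listed above, the following example parses a small HTML fragment and extracts link texts and targets from the parse tree. The HTML content, the `item` class and the `extract_links` helper are hypothetical and only serve as illustration. The built-in `html.parser` backend is assumed here, so that no additional parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, standing in for a scraped document.
HTML = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item"><a href="/p/1">Lamp</a></li>
    <li class="item"><a href="/p/2">Desk</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every link inside an 'item' list entry."""
    # Build the parse tree with Python's built-in HTML parser.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Search the tree for <li class="item"> elements, then navigate to their <a>.
    for entry in soup.find_all("li", class_="item"):
        anchor = entry.find("a", href=True)
        if anchor is not None:
            links.append((anchor.get_text(strip=True), anchor["href"]))
    return links


if __name__ == "__main__":
    for text, href in extract_links(HTML):
        print(text, href)
```

Passing `"lxml"` or `"html5lib"` instead of `"html.parser"` would switch the underlying parser, provided that package is installed.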

Pandas

Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is useful, for example, for loading stop-word dictionaries in text mining processes.
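The stop-word use case could look roughly like the sketch below. It assumes the dictionary is stored as a CSV file with a single `word` column. That layout, the sample words and both helper functions are hypothetical. An in-memory buffer replaces the file so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical stop-word dictionary; in practice this would be a CSV file on disk.
STOP_WORDS_CSV = "word\nthe\nand\nof\na\n"


def load_stop_words(source):
    """Load a one-column CSV of stop words into a set of lowercase strings."""
    # keep_default_na=False prevents words such as "null" or "NA" from being
    # read as missing values instead of text.
    frame = pd.read_csv(source, keep_default_na=False)
    return set(frame["word"].str.lower())


def remove_stop_words(text, stop_words):
    """Split text on whitespace and drop every token found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]


if __name__ == "__main__":
    stop_words = load_stop_words(io.StringIO(STOP_WORDS_CSV))
    print(remove_stop_words("The price of a lamp and desk", stop_words))
```

`load_stop_words` accepts a file path as well as a buffer, since `pd.read_csv` handles both.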

Google Custom Search

Google Custom Search is a platform provided by Google that allows web developers to feature specialized information in web searches, refine and categorize queries and create customized search engines, based on Google Search.

It is a proprietary commercial product.
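One way to query a customized search engine from Python is the Custom Search JSON API. Whether the project uses this API is an assumption. The sketch below only builds the request URL, using the API's `key`, `cx`, `q` and `num` parameters. The key and engine identifier values are placeholders and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, num=10):
    """Build the GET URL for one page of results from a custom search engine."""
    if not 1 <= num <= 10:
        # The API returns at most 10 results per request.
        raise ValueError("num must be between 1 and 10")
    params = {
        "key": api_key,   # API key (placeholder value in this sketch)
        "cx": engine_id,  # identifier of the customized search engine
        "q": query,       # search terms
        "num": num,       # number of results to return
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("open source etl", "YOUR_API_KEY", "YOUR_ENGINE_ID", num=5))
```

The resulting URL can be fetched with any HTTP client, and the response is a JSON document.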

3. Visualisation Component

Logstash

Logstash is a server-side data processing pipeline that ingests data from different sources simultaneously, transforms it, and then sends it to a selected stash (usually an Elasticsearch instance).

It is open-source software licensed under Apache License 2.0.

Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it stores data centrally so that it can be searched and analyzed.

Some parts of the software are open source, mostly under the Apache License, while other parts are commercial.

It is part of the Amazon Elasticsearch Service used in this project.
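Because Elasticsearch exposes its functionality over REST, a search is a JSON document sent to an index's `_search` endpoint. The sketch below prepares such a request with a simple `match` query but does not send it. The base URL, the `pages` index, the `title` field and the `build_search_request` helper are hypothetical. A real Amazon Elasticsearch Service domain would likely also require authentication, which is omitted here.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, field, text, size=10):
    """Prepare (without sending) a full-text search request for one index."""
    # Query DSL body: a match query on one field, limited to `size` hits.
    body = {
        "size": size,
        "query": {"match": {field: text}},
    }
    return Request(
        url=f"{base_url.rstrip('/')}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical local cluster, index and field names.
    request = build_search_request("http://localhost:9200", "pages", "title", "lamp")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would execute the search against a reachable cluster.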

Kibana

Kibana is a data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed in an Elasticsearch cluster. It supports multiple types of graphical representation for large volumes of data, such as bar charts, line and scatter plots, pie charts and maps.

It is open-source software licensed under Apache License 2.0.

It is part of the Amazon Elasticsearch Service used in this project.

4. Database Management System

PostgreSQL

PostgreSQL is a relational database management system. It is developed by a worldwide team of volunteers. No private organization controls it, and its source code is available free of charge.

It is a very popular database that supports text, images, sounds, and video.

PostgreSQL is open-source, multiplatform software released under its own license (the PostgreSQL License), a permissive open-source license similar to the MIT and BSD licenses.

pgAdmin

pgAdmin is a visual management tool for PostgreSQL and derivative relational databases. It may be run either as a web or desktop application.

It is the most widely used solution for managing PostgreSQL databases. pgAdmin has a desktop mode and a server mode: the first is aimed at local use, and the second allows multiple users to access it over the web.

It is multiplatform (Windows, Linux and macOS) open-source software licensed under the PostgreSQL License.

5. AWS Services

Amazon Cognito

Amazon Cognito is a scalable access control service that provides authentication, authorization and user management tools for web applications, simplifying integration with the services used in the project (it is the recommended user service to use with Kibana).

Amazon Elasticsearch

Amazon Elasticsearch Service (Amazon ES) is a managed service that eases the deployment, operation and scaling of Elasticsearch clusters in the AWS Cloud. Amazon ES is Amazon's own implementation of the Elasticsearch software, offering automated cluster scaling and other infrastructure management options. This service includes the Elasticsearch and Kibana software.
