Entrada, a Big Data platform for Internet domain name registries, available as open source

The R&D team of the Dutch Internet domain name registry SIDN has recently made its Entrada platform available as open source. Entrada — an acronym for ENhanced Top-level domain Resilience through Advanced Data Analysis — is an experimental Big Data platform specifically developed for building applications to detect botnets and other malicious systems. By increasing the security and stability of the .nl domain, Entrada makes the (Dutch) Internet a safer place.

The deployment at SIDN Labs stores about 145 billion records containing information on the DNS queries received by the authoritative name servers of the .nl zone, and the answers they provided. Analysis of this information allows the registry to detect botnets, pop-up/burner domains used to sell stolen, fake or illegal goods and drugs, and other malicious clients.

We have conducted many analyses using Entrada. One of the applications we have developed is the resolver reputation service, which automatically detects botnet clients. We share this information with AbuseHUB, who passes it on to the appropriate Internet Access Provider. AbuseHUB is a collaboration of seven Dutch access providers, SIDN and SURFnet. It fights botnets by collecting information about infected computers and sharing this with the associated access providers.

The .nl domain is one of the safest top-level domains out there. That is another reason why sharing this software is a good idea. Generally, top-level domains become less secure as registration becomes cheaper. That implies that there are many other registries and DNS operators who will probably detect much more malicious activity using Entrada than we do.

Policy context

SIDN is the registry for the national .nl top-level domain (TLD) of the Netherlands. It is responsible for administering 5.6 million domain names, registered through about 1400 registrars. As such, SIDN calls itself a private foundation with a public task.

We process about 1.3 billion DNS queries per day, says Maarten Wullink, Research Engineer at SIDN Labs and technical lead for the Entrada project. Most of these queries are legit, but some of them come from botnets or are the result of other malicious activities. Entrada was specifically developed to find such requests, which requires a system that is able to analyse large volumes of network data. It provides a platform that allows us to evaluate Big Data applications to:

  • safeguard the stability of .nl,
  • increase the safety of the (Dutch) Internet, and
  • detect botnets and abuse.

Resolver reputation

Detecting DNS queries from malicious systems is one thing; finding out who is behind these queries is far more difficult, Wullink says. As a registry we have a global view, and we see mostly queries from large service providers like Google Public DNS and Internet Access Providers (IAPs). They run caching name servers for their customers, and these servers hide most of the user population from us. We estimate that 99.6 percent of the queries we receive come from shared caching resolvers run by DNS service providers and Internet Access Providers.

Because of this organisation of the DNS infrastructure, access providers have a more detailed view, though only a local one. They could, however, use the Entrada system to detect suspicious queries coming from their customers and directly relate these activities to the IP addresses of their users.

An example application of Entrada that we are working on involves connecting our system to the anti-spam lists published by the Spamhaus Project. If we see a relatively high number of queries for MX records [which contain the mail entrance for a domain name] from IP address ranges that are not supposed to send e-mail, we could notify the Spamhaus Project or the public security intelligence provider. Information like that can be used to create profiles of the reliability (reputations) of DNS resolvers.
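
The MX-record check described above could be sketched roughly as follows. This is an illustration only, not SIDN's implementation: the function name, the input layout (tuples of source IP and query type) and both thresholds are assumptions made for the example.

```python
from collections import Counter
from ipaddress import ip_address, ip_network


def suspicious_mx_sources(queries, non_mail_ranges, min_queries=100, mx_share=0.5):
    """Return source IPs in non_mail_ranges with a high share of MX queries.

    queries: iterable of (source_ip, query_type) tuples (hypothetical layout).
    non_mail_ranges: CIDR strings for ranges not expected to send e-mail.
    min_queries and mx_share are illustrative thresholds, not SIDN values.
    """
    networks = [ip_network(r) for r in non_mail_ranges]
    total, mx = Counter(), Counter()
    for src, qtype in queries:
        total[src] += 1
        if qtype == "MX":
            mx[src] += 1

    flagged = []
    for src, count in total.items():
        if count < min_queries:
            continue  # too little traffic to judge
        if mx[src] / count < mx_share:
            continue  # MX queries are not dominant for this source
        addr = ip_address(src)
        if any(addr in net for net in networks):
            flagged.append(src)
    return sorted(flagged)


# Example input using documentation address ranges.
sample = (
    [("192.0.2.10", "MX")] * 80 + [("192.0.2.10", "A")] * 20
    + [("198.51.100.5", "MX")] * 90 + [("198.51.100.5", "A")] * 10
    + [("192.0.2.20", "A")] * 200
)
flagged = suspicious_mx_sources(sample, ["192.0.2.0/24"])
```

A real deployment would run this kind of aggregation inside the query engine rather than in application code; the sketch only shows the shape of the rule.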

Description of target users and groups

Focus

Our deployment of Entrada is currently focused on the security and stability of the .nl zone, and further increasing our understanding of it, says Wullink. In the future we may extend that to operational, commercial and marketing applications. But for now we have explicitly decided to stay away from the latter because of the privacy implications.

Since the IP address of the computer that sends a query is sensitive information, we have developed a privacy framework [1, 2] that integrates the legal, technical and organisational aspects of privacy management relating to the use of DNS data.

To comply with this framework, we have to write a separate privacy policy for each Entrada application. This describes:

  • the purpose for which the data is used,
  • what filters will be applied, e.g. anonymisation, pseudonymisation, aggregation,
  • who has access to the data,
  • the type of application, i.e. research or production, and
  • other security measures.

These policies have to be approved by our Privacy Board and will be published online within the next few months.

The DNS query data for the .nl zone will always remain at SIDN, Wullink emphasises. Within SIDN, it is only accessible to a limited number of people, and only parts of the data are available — under strict rules — to university researchers, AbuseHUB and security providers. Sharing data requires a separate agreement and privacy policy. If a government organisation wants some information, it needs a warrant.

Description of the way to implement the initiative

The experimental Entrada setup at SIDN Labs currently consists of two control systems and six data systems. The cluster contains 7 Tbyte of data, which is stored in triplicate in a distributed system based on Hadoop [the most popular storage and processing platform for Big Data]. The data is acquired from two of SIDN's authoritative name servers. The 145 billion records make up 25 percent of the total traffic over the last two years. In the current setup, records older than 18 months are anonymised by removing their IP addresses. A production deployment of Entrada would probably have a shorter data retention period — just long enough to support the applications using it.
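
The retention rule mentioned above (removing IP addresses from records older than 18 months) might look like this in outline. The record layout, the field names `timestamp` and `src_ip`, and the approximation of 18 months as 548 days are assumptions for the sake of the example; the article does not describe how Entrada performs this step.

```python
from datetime import datetime, timedelta

# Roughly 18 months; the exact cut-off arithmetic is an assumption.
RETENTION = timedelta(days=548)


def anonymise_old_records(records, now):
    """Yield records, blanking the source IP of those older than RETENTION.

    records: iterable of dicts with hypothetical keys "timestamp" and "src_ip".
    Input dicts are not modified; old records are copied before the IP is removed.
    """
    cutoff = now - RETENTION
    for rec in records:
        if rec["timestamp"] < cutoff:
            rec = {**rec, "src_ip": None}
        yield rec


records = [
    {"timestamp": datetime(2014, 1, 1), "src_ip": "192.0.2.1", "qname": "example.nl"},
    {"timestamp": datetime(2015, 12, 1), "src_ip": "192.0.2.2", "qname": "example.nl"},
]
cleaned = list(anonymise_old_records(records, now=datetime(2016, 1, 1)))
```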

We use an efficient format to store all of this information, Wullink says. We keep almost all the data from the DNS queries: the IP address (but not the reverse domain name), the country from which the query originated, the Autonomous System (AS, a network identification), who owns this AS, and how long it took us to answer the query. From the answers we store only the meta-data [data about data] and the DNS status flags. If we really needed to, we could regenerate the actual DNS answers from our historic Domain name Registration System (DRS).

Technology solution

Entrada is built on various open-source components, mostly software projects of the Apache Software Foundation (ASF).

We have left the underlying open source software components as they are, Wullink explains. The Entrada platform basically consists of these generic components and a set of specific scripts and program code that configures and connects all the parts after these standard packages have been installed. We created new code for automatic conversion from the pcap format to the Parquet format, and to coordinate the workflow for storing in Entrada. The source code is available on GitHub. It contains scripts to install the database definitions and to run data extraction and conversion jobs using cron.

Although the Hadoop software is freely available from Apache, we are using Cloudera's release of Hadoop, which we recommend because of its excellent management features and because you will have to install Cloudera Impala anyway. Of course, you will need to be familiar with Hadoop, but I would say that Entrada is very accessible to skilled system administrators and users familiar with SQL. There is nothing in the repository that would confuse a Unix system manager.
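
To give a feel for the kind of SQL a user might write, here is a small self-contained sketch. It uses Python's built-in sqlite3 module purely as a stand-in for Impala so that it runs anywhere, and the table name `dns_queries` and its columns are hypothetical; the actual Entrada schema is defined in the project's repository and may differ.

```python
import sqlite3

# In-memory database standing in for the real query engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dns_queries (qname TEXT, qtype TEXT, src_ip TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO dns_queries VALUES (?, ?, ?, ?)",
    [
        ("example.nl", "A", "192.0.2.1", "NL"),
        ("example.nl", "MX", "192.0.2.1", "NL"),
        ("example.nl", "A", "192.0.2.2", "NL"),
        ("example.nl", "A", "198.51.100.1", "US"),
    ],
)

# Queries and distinct resolvers per country of origin.
rows = conn.execute(
    """
    SELECT country,
           COUNT(*) AS queries,
           COUNT(DISTINCT src_ip) AS resolvers
    FROM dns_queries
    GROUP BY country
    ORDER BY queries DESC, country
    """
).fetchall()
```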

Main results, benefits and impacts

AbuseHUB

We have conducted many analyses using Entrada, says Wullink. One of the applications we have developed is the resolver reputation service, which automatically detects botnet clients. We share this information with AbuseHUB, who passes it on to the appropriate Internet Access Provider. AbuseHUB is a collaboration of seven Dutch access providers, plus SIDN and SURFnet [the organisation responsible for the Dutch ICT infrastructure for education and research]. It fights botnets by collecting information about infected computers and sharing this with the associated access providers.

According to the latest Phishing Trends Report from the Anti-Phishing Working Group (APWG), the Netherlands ranks number four in the list of spam-sending countries, this being a consequence of its large hosting sector.

Entrada provides a more direct and faster way to detect malicious clients than traditional methods like spam traps [honeypots to collect spam], says Marco Davids, also a Research Engineer at SIDN Labs. Being a so-called 'reliable notifier' in AbuseHUB speak, we report infected computers to AbuseHUB when they abuse .nl domain names in their spam/phishing messages. That way we are helping Internet Access Providers fight malware infections on their users' computers.

Phishing

In another project, Giovane Moura, Data Scientist at SIDN Labs, used Entrada to develop a system that automatically detects new domains that are potentially being used for phishing, says Wullink. The system checks the DNS traffic pattern for newly registered domain names and looks for typical patterns. Normally, it takes some time for a new domain to be developed and for a website to attract visitors. A phishing domain often shows different behaviour, generating a lot of traffic from many different countries and networks within a few days.
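
The traffic pattern described above can be expressed as a simple rule of thumb. The sketch below is not the detection system built at SIDN Labs; the function name, the three-day window and all thresholds are assumptions chosen only to illustrate the idea of "a lot of traffic from many countries and networks shortly after registration".

```python
from datetime import datetime, timedelta


def looks_like_phishing(registered_at, queries, window=timedelta(days=3),
                        min_queries=1000, min_countries=10, min_networks=20):
    """Flag a new domain whose early traffic is unusually heavy and widespread.

    queries: iterable of (timestamp, country, asn) tuples (hypothetical layout).
    All thresholds are illustrative defaults, not values used by SIDN.
    """
    end = registered_at + window
    early = [(country, asn) for ts, country, asn in queries
             if registered_at <= ts <= end]
    countries = {country for country, _ in early}
    networks = {asn for _, asn in early}
    return (len(early) >= min_queries
            and len(countries) >= min_countries
            and len(networks) >= min_networks)


# Synthetic example: 40 queries in the first 40 hours, 5 countries, 8 networks.
registered = datetime(2016, 1, 1)
traffic = [(registered + timedelta(hours=h), f"C{h % 5}", h % 8) for h in range(40)]
flag_strict = looks_like_phishing(registered, traffic)
flag_relaxed = looks_like_phishing(registered, traffic, min_queries=30,
                                   min_countries=5, min_networks=8)
```

In practice the thresholds would need to be tuned against known phishing and benign domains, since a legitimate campaign site can also attract broad traffic quickly.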

We are currently running a pilot with a few large registrars who receive a notification when we detect a domain that is potentially being used for phishing. Registrars have a strong interest in keeping malicious domains out of their portfolios, because they often also operate hosting businesses. These domains are often paid for using stolen credit card data, for example.

Finally, we regularly work with universities on projects where the Entrada platform is being used for research purposes. Delft University of Technology, for example, receives data from our Entrada system on a daily basis. This data is used in the REMEDI3S-TLD project.

Return on investment description

Wullink has put about one year of work into developing Entrada, albeit not full-time. He estimates the total investment in the platform to be 200-250 thousand euro. Adding the cost of the hardware for the current deployment at SIDN gives a total of 300 thousand euro. A new deployment based on the existing software would cost a lot less than that, of course.

Track record of sharing

According to Wullink, the decision to make Entrada available as open source was approved by the management team. It fits our goal of making the Internet in general a safer place. Cost sharing was not the main factor in this case; instead we hope to build a community of users who will create new ideas and applications, and share these with us and others.

Entrada is not a commercial product, but we may need to charge anyone who needs a significant amount of support or specific functionality. Feature requests that benefit the whole community can be implemented at our own expense.

Use cases

Sharing and re-use of the Entrada software is still limited at the moment. Since this system addresses a very specific problem, the potential user group is relatively small, Wullink says. We have been in touch with researchers from an Italian university, who have already provided some patches to the software. Several peer registries have shown an interest in Entrada and we know that one of them has actually started using it. At this stage, we are focused on promoting the Entrada project at Internet conferences in order to increase its installed base.

The .nl domain is one of the safest top-level domains out there. We detect only a few dozen infected Dutch IP addresses per day, and a few thousand addresses from other countries. For the latter, it is difficult to forward these alerts to an authority abroad that has the capability to clean up these infections. The discovery of a command-and-control server is a rare event. Still, we want to keep tabs on things because we have a reputation to uphold.

That brings us to another reason why sharing this software is a good idea. Generally, top-level domains become less secure as registration becomes cheaper. That implies that there are many other registries and DNS operators, for instance in the Middle East and Asia, who will probably detect much more malicious activity using Entrada than we do.

Case Info

Acronym: Entrada
Website URL:
Start date: 2013

Information

Highlight: Open source observatory
Case status: Research
Case type: Open source case study
Geographic coverage: Netherlands
Implementation cost: €49-299,000
Themes: Communication (infrastructure), Education, Science and Research
Scope: International, National
Type of service: IT Infrastructures and products
Technology choice: Mainly (or only) open standards, Open source software
Type of initiative: Project or service