A Comparative Study of the Hadoop Distribution Hortonworks with Amazon Web Services (AWS) and Microsoft Azure

Kawser Ahmed Pinto
Nov 21, 2020

Kawser Ahmed Pinto
School of Computer Science
Universiti Sains Malaysia
Penang, Malaysia
kawser098@gmail.com

Abstract: Today, organizations large and small hold big data, and its volume is growing exponentially as more companies computerize their operations. Cloud services make it easier for organizations to store and analyze big data and to build predictive models from it. Organizations that handle big data therefore need cloud services, and they have a wide range of offerings to choose from. They can also obtain support and training from Hortonworks, a data software company that provides open-source software for managing, analyzing, and storing large volumes of data. Among the many cloud service providers available, the author regards Microsoft Azure and Amazon Web Services (AWS) as the best options. This paper gives a brief description of Hortonworks, AWS, and Microsoft Azure. Choosing the best cloud service for big data can be difficult for non-professionals, so the paper presents a comparative study of one Hadoop distribution against these two cloud services to help users make a better-informed choice.

Keywords: Amazon Web Services (AWS), Hortonworks, Microsoft Azure, Big Data, Hadoop Distribution, Cloud Service

Introduction (Brief Background)
In the current era, most new technologies generate large volumes of data that need to be collected, categorized, and stored. We are living in the age of big data, where data is growing exponentially, and most companies now benefit from big data technology. Many distributions for managing big data on the Hadoop architecture are available in the market, such as MapR, Cloudera, and Hortonworks [1]. Cloud computing is comparable to grid computing, in which the unused processing cycles of all computers in a network are harnessed to solve problems too demanding for any stand-alone machine [2]. It provides important resources that can be shared over the internet [3]. Only a few years ago, a web-based company had to host and maintain its own billing system and payment gateway over the long term, which meant expensive contracts with credit card companies, banks, security companies, and so on [4].

Amazon Web Services (AWS) has officially been in operation since 2006 and passed its 10-year anniversary in 2016. Some might call it old, while others would say that Amazon took its time to enter this competition. One of its competitors, Microsoft Azure, was introduced to the world in 2010.

In the Infrastructure as a Service (IaaS) model, the provider supplies the hardware components found in a data center: storage, networking hardware, servers, and the hypervisor layer. Software as a Service (SaaS) removes the need for a company to install and run applications in its own data center or on its own servers, and users can discontinue a SaaS subscription at any time. AWS and Microsoft Azure are the most popular and leading cloud service providers [5]. In this article, we present a comparative study of one Hadoop distribution, Hortonworks, and two cloud services, Amazon Web Services and Microsoft Azure.

I. HISTORY AND EVOLUTION OF THE DISTRIBUTION / SERVICES
A. Hortonworks
Hortonworks was formed in 2011 as an independent company by the Yahoo team members who had been in charge of the Hadoop project. Hortonworks is a major contributor to Hadoop; its business model is not to sell licenses but to sell support and training [6]. Hortonworks was built to provide data services and applications, including for the Internet of Things (IoT). It was a data software company that supported open-source software for managing big data. The Hortonworks Data Platform (HDP) includes Apache Hadoop for storing, processing, and analyzing very large data sets. In January 2019, Hortonworks completed its merger with Cloudera. Hortonworks implements Hadoop technologies such as Hive, MapReduce, Pig, the Hadoop Distributed File System, HBase, ZooKeeper, and so on [7]. Its initial chief executive was Eric Baldeschwieler, formerly of Yahoo; its chief operating officer was Rob Bearden, formerly of SpringSource; and Benchmark partner Peter Fenton was a board member. Because the symbol of Hadoop is an elephant, the company took its name from Horton the Elephant [8] [9].

Figure 1: Hortonworks in the Hadoop platform (HDP) [14]

B. Amazon Web Services (AWS)
The AWS platform was launched in July 2002. In its early days, it consisted of only a few tools and services. In 2003, the AWS concept was reformulated when Chris Pinkham and Benjamin Black presented a paper describing a retail computing infrastructure for Amazon that was standardized, automated, and fully reliant on web services [11].
In 2006, Amazon began offering its IT services to businesses as web services, which is now known as cloud computing. Amazon Web Services provides four kinds of products: Compute, Storage, Database, and Networking services. One of the main benefits of cloud computing is that it replaces high up-front expenses with low costs, which can boost a business [10]. On March 14, 2006, AWS was relaunched, combining three initial service offerings: S3 cloud storage, Simple Queue Service (SQS), and EC2. Andy Jassy, founder and vice president of AWS in 2006, said at the time that Amazon S3 frees developers from worrying about where they will keep their data and whether it will be safe and secure [12].

Figure 2: Components of Amazon EC2 [4]

C. Microsoft Azure
In the mid-2000s, Microsoft Azure began as Project Red Dog. By then Amazon had already launched its cloud computing services, so Microsoft was working to reach the same stage as Amazon. In 2008, at the Microsoft Professional Developers Conference, two years after AWS went live with its Simple Storage Service, Microsoft's chief software architect Ray Ozzie announced that the company planned to launch its own cloud computing service, called Windows Azure. Microsoft planned to offer five key categories of cloud services: Windows Azure; .NET services for developers; Live services for file sharing; Microsoft SQL services for databases; and the Microsoft SharePoint services and Dynamics CRM services SaaS offerings. Ray Ozzie told the audience that this was a transformation of Microsoft's strategy and software.
Microsoft began showing preview versions of its cloud services, and in February 2010 it made the Windows Azure platform available. Microsoft has improved Azure over time, and analysts frequently compare it with AWS. Microsoft supports open-source software, and Azure is a leading choice for enterprises. Microsoft Azure competes with Google Cloud, IBM, and AWS; in the opinion of cloud leaders, Azure is distinctive among these competitors and could become the top cloud service provider [13]. Microsoft Azure offers Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). Customers can use a wide range of programming languages, tools, and frameworks through its marketplace of services [26].

Figure 3: Cloud Service Models [4]

II. Highlights of the distributions/services and their components

A. Hortonworks [14]

Highlight

o Hortonworks' economic model is to sell support and training, not licenses.

o It is a major contributor to Hadoop.

o High performance for ETL through Pig and Tez

o Data lifecycle management through Apache Falcon

o Embeds Hadoop in existing data platforms

Components

o Apache Pig

o Apache HBase

o Apache Oozie

o Apache Sqoop

o Apache Falcon

B. Amazon Web Services [15]

Highlights

o Mobile Friendly Access

o Create restricted user accounts

o Allows users to scale up/down by using Auto Scaling

o Low-cost storage
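The Auto Scaling highlight above can be illustrated with a minimal sketch of a threshold-based scaling rule. This is plain Python, not the AWS API: the function name `desired_capacity`, the CPU thresholds, and the size limits are all assumptions chosen for illustration, and a real Auto Scaling policy is configured through AWS itself.

```python
def desired_capacity(current, avg_cpu, min_size=1, max_size=10,
                     scale_out_at=70.0, scale_in_at=30.0):
    """Return the instance count a simple threshold policy would request.

    All parameters are hypothetical: `current` is the number of running
    instances and `avg_cpu` is their average CPU utilization in percent.
    """
    if avg_cpu > scale_out_at:
        target = current + 1      # load is high: add one instance
    elif avg_cpu < scale_in_at:
        target = current - 1      # load is low: remove one instance
    else:
        target = current          # load is in the comfortable band
    # Never go below the minimum or above the maximum group size.
    return max(min_size, min(max_size, target))


if __name__ == "__main__":
    for cpu in (85.0, 50.0, 10.0):
        print(cpu, "->", desired_capacity(current=2, avg_cpu=cpu))
```

The clamp on the last line of the function is the part that keeps cost bounded: however high the load, the group never grows past `max_size`.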

Components

o Amazon EC2

o Amazon S3

o Amazon SimpleDB

o Amazon RDS

o Amazon SQS

C. Microsoft Azure [15]

Highlights

o Deploying and managing services

o Microsoft HDInsight manages the Hadoop service in the Azure cloud using HDP

o It has the flexibility to store and retrieve unstructured data.

Components

o Azure Virtual Machine

o SQL Server Database

o Relational Database

o Cloud Service

o Mobile Service

III. Comparison among the distributions/services and their components (Analysis/Results) [18]

A comparative study was conducted between one Hadoop distribution (Hortonworks) and Amazon Web Services (AWS) and Microsoft Azure. The easiest way to compare the major cloud offerings is by the services they provide. We focus on the services listed below:

· Compute Services

· Database Services

· Security Services

· Storage Services

· Pricing Model

A. Compute Services

Table 1: Compute Service comparison on Hortonworks, AWS, and Microsoft Azure

B. Database Services

Table 2: Database Service comparison on Hortonworks, AWS, and Microsoft Azure

C. Security Services

Table 3: Security Service comparison on Hortonworks, AWS and Microsoft Azure

D. Storage Services

Table 4: Storage Service comparison on Hortonworks, AWS and Microsoft Azure

E. Pricing Model

Table 5: Pricing Model comparison on Hortonworks, AWS, and Microsoft Azure

IV. Advantages and disadvantages of Hortonworks, Amazon Web Services (AWS), and Microsoft Azure

Hortonworks Advantages & Disadvantages [24]

Advantages:

- Hortonworks is the only Hadoop distribution that supports the Windows platform.

Disadvantages:

- The Ambari management interface of the Hortonworks Data Platform (HDP) is very simple, and HDP lacks high-quality features.

Amazon Web Services (AWS) Advantages & Disadvantages [25]

Advantages:

- The method of licensing is simpler

- AWS has stronger support for BI analytics

- Better DevOps support

- Provides low latency and data centers for availability

Disadvantages:

- Less hybrid-cloud-friendly

- The large number of products makes the selection process much harder

- Hybrid strategy is weak and incompatible

- Users who are unfamiliar with technical terminology can be confused by the range of choices AWS offers.

Microsoft Azure Advantages & Disadvantages [25]

Advantages:

- Ability for users and developers to create, maintain and deploy applications

- Azure fully supports Microsoft legacy applications.

- Azure supports mixed Linux/Windows environments

- Better understanding of enterprise requirements.

Disadvantages:

- Compared to AWS, Azure is less flexible for non-Windows platforms.

- Pay-as-you-go usage incurs extra charges.

- Fixing problems requires spending extra money.

V. DISCUSSION
Our comparative study of one Hadoop distribution (Hortonworks) with Amazon Web Services (AWS) and Microsoft Azure focuses on how these offerings serve big data workloads. Comparing them service by service, across compute services, database services, security services, storage services, and pricing model, gives a clear picture of each. First, we found that Hortonworks was built to provide data services and applications, including for the Internet of Things (IoT); it was a data software company that supported open-source software for managing big data. Amazon Web Services, by contrast, provides a range of infrastructure services for deploying web-scale solutions, such as Amazon Simple Storage Service (S3) and Elastic Compute Cloud (EC2). Microsoft Azure, for its part, provides the three main types of cloud service: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

Today's education sector is flooded with data about students, faculty, courses, and results. Studying and analyzing this data can provide insights that help improve the sector. One real-world example is the University of Alabama, whose 38,000 or more students generate a large volume of data. Previously, without a practical way to analyze this much data, it went unused. Now the university administration can analyze and visualize the data to find patterns in its student data, which helps with recruitment, operations, and so on [27].

The second real-world example is healthcare. The healthcare industry holds a large volume of data. Wearable devices and sensors, which are electronic devices that can record health data, have been introduced to the industry by companies such as Apple. Apple has brought out HealthKit and CareKit, which let iPhone users access and store their real-time health records on their phones [27].

The last example is transportation, where big data has been used to make the industry more effective and convenient. Analyzing big data helps operators understand users' needs by route and use that knowledge to improve service, for example by finding shorter routes and reducing waiting times. Grab and Uber are two of the best examples: both generate a large volume of data on drivers, locations, vehicles, and every trip made by every vehicle. All of this data is analyzed to predict demand, fares, and driver locations [27].
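As a small illustration of the kind of analysis described above, the following sketch counts trip requests per zone and hour, which is the simplest possible starting point for estimating demand. It is not how Grab or Uber actually work: the record layout (`zone`, `hour`), the function names, and the sample trips are all hypothetical.

```python
from collections import Counter

# Hypothetical trip records; a real system would read these from storage.
sample_trips = [
    {"zone": "Airport", "hour": 8},
    {"zone": "Airport", "hour": 8},
    {"zone": "Airport", "hour": 8},
    {"zone": "Downtown", "hour": 8},
    {"zone": "Downtown", "hour": 18},
    {"zone": "Downtown", "hour": 18},
    {"zone": "Campus", "hour": 18},
]


def demand_by_zone_hour(trips):
    """Count trip requests for every (zone, hour) pair."""
    return Counter((trip["zone"], trip["hour"]) for trip in trips)


def busiest_zones(trips, hour, top_n=3):
    """Return the top_n zones with the most requests in the given hour."""
    counts = Counter(trip["zone"] for trip in trips if trip["hour"] == hour)
    return counts.most_common(top_n)


if __name__ == "__main__":
    print(demand_by_zone_hour(sample_trips))
    print(busiest_zones(sample_trips, hour=18))
```

At real scale the same grouping would typically run as a distributed job rather than in a single process, but the aggregation logic is the same.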

RECOMMENDATIONS
Based on our analysis of the Hadoop distribution (Hortonworks) alongside Amazon Web Services (AWS) and Microsoft Azure, we recommend that organizations choose multi-vendor cloud service providers (CSPs). An organization that wants better SaaS features must delegate them to the CSP. Hortonworks should improve its security service, which is less secure than those of AWS and Azure, as well as its storage service, and above all it should revisit its pricing model, which is much more expensive than those of the other two services. Comparative studies of this kind help identify the main features of Hadoop distributions for big data and provide a clearer picture of big data for future work.

CONCLUSION
In conclusion, Hortonworks has merged with Cloudera, which is now its parent company. Hortonworks is a data software company, whereas Amazon Web Services (AWS) and Microsoft Azure are cloud platforms. Both provide cloud services with high security: Microsoft delivers its security through the Azure platform, while Amazon Web Services delivers its security through EC2. When it comes to choosing a cloud service provider, there is no single best one; it all comes down to which best matches your requirements.
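The point that the best provider depends on your requirements can be made concrete with a weighted decision matrix over the five comparison criteria used in this paper. The sketch below is illustrative only: the option names, the 1 to 5 scores, and the weights are placeholders invented for the example, not measurements of Hortonworks, AWS, or Azure.

```python
CRITERIA = ("compute", "database", "security", "storage", "pricing")


def weighted_score(scores, weights):
    """Weighted average of per-criterion scores (higher is better)."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


def rank_options(all_scores, weights):
    """Return option names ordered from best to worst weighted score."""
    return sorted(all_scores,
                  key=lambda name: weighted_score(all_scores[name], weights),
                  reverse=True)


# Placeholder scores on a 1-5 scale for two hypothetical options.
placeholder_scores = {
    "Option A": {"compute": 4, "database": 3, "security": 5,
                 "storage": 4, "pricing": 2},
    "Option B": {"compute": 3, "database": 4, "security": 3,
                 "storage": 3, "pricing": 5},
}

# Two hypothetical organizations with different priorities.
security_first = {"compute": 1, "database": 1, "security": 5,
                  "storage": 1, "pricing": 1}
budget_first = {"compute": 1, "database": 1, "security": 1,
                "storage": 1, "pricing": 5}

if __name__ == "__main__":
    print(rank_options(placeholder_scores, security_first))
    print(rank_options(placeholder_scores, budget_first))
```

With the same scores, the two weightings produce opposite rankings, which is the sense in which no provider is best in the abstract.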

References

[1] Vanika, Aman Kumar Sharma, “A Comparative Study of Hadoop-Based Big Data Architectures” International Journal for Science and Advance Research In Technology — Volume 4 Issue 8, ISSN [Online]: 2395–1052, August-2018

[2] http://www.webopedia.com/TERM/C/cloud computing

[3] “Introduction to Cloud Computing”, Dialogic, 2010.

[4] Prof Vaibhav A Gandhi, Dr. C K Kumbharana, “Comparative Study of Amazon EC2 and Microsoft Azure cloud Architectures” International Journal of Advance Networking Applications, ISSN No.: 0975–0290, page: 117–123, September 2018

[5] “http://www.tomsitpro.com/articles/azure-vs-aws-cloud-comparison, 2–870” By William Van Winkle, January 31, 2015.

[6] Giles Stephen, Khan Umair, “Pro Hortonworks Data Platform: Harness the Power and Promise of Big Data with HDP”, Apress, ISBN 978–1–4842–0668–3, 2015.

[7] Joab Jackson (November 1, 2011). “HortonWorks Hones a Hadoop Distribution”. PC World. Retrieved November 2, 2019

[8] Charles Babcock (June 29, 2011). “Hadoop Big Data Startup Spins Out Of Yahoo”. Information Week.

[9] Cade Metz (June 28, 2011). “Yahoo! seeds Hadoop startup on open-source dream: Hortonworks hears a Big Data revolution”. The Register. Retrieved October 27, 2019

[10] Sajee Mathew “Overview of Amazon Web Services” November 2014

[11] Benjamin Black — EC2 Origins”. Blog.b3k.us. January 25, 2009. Retrieved November 2, 2019

[12] “Amazon — Press Room — Press Release”. phx.corporate-ir.net. Retrieved November 2, 2019

[13] Cynthia Harvey (May 23, 2017) Microsoft Azure: https://www.datamation.com/cloud-computing/microsoft-azure.html

[14] Allae Erraissi, Abderrahim Tragha, Abdessamad Belangour, “A Comparative Study of Hadoop-based Big Data Architectures” International Journal of Web Application, Volume 9 Number 4, December 2017.

[15] Prof Vaibhav A Gandhi, Dr. C K Kumbharana, “Comparative Study of Amazon EC2 and Microsoft Azure cloud Architectures” International Journal of Advance Networking Applications, ISSN No.: 0975–0290, page: 117–123, September 2018

[16] “A Comparative Study on Microsoft Azure vs. Amazon Web Services” Sysfore Technologies Pvt. Ltd.

[17] Pranay Dutta, Prashant Dutta, “Comparative study of cloud services offered by Amazon, Microsoft & Google” International Journal of Trend in Scientific Research and Development (IJTSRD), Volume: 3, Issue: 3, March-April 2019

[18] https://www.educba.com/aws-vs-azure/

[19] https://docs.aws.amazon.com/whitepapers/latest/aws-overview/security-services.html

[20] https://www.business.com/articles/azure-vs-aws-cloud-comparison/

[21] https://blog.purestorage.com/flasharray-now-certified-hortonworks-data-platform/

[22] https://www.insight.com/en_US/shop/product/JP408A/HEWLETT%20PACKARD%20ENTERPRISES/JP408A/HDP%20ENTPL%204N%2050TB%20RW%20STRG%201YR%2024X7%20LTU/#

[23] Dan Kangas, Weixu Yang, Ajay Dholakia, Brian Finley “Lenovo Big Data Validated Design for Hortonworks Data Platform Using Think System Servers” Lenovo Press, 14 December 2017, https://lenovopress.com/lp0828.pdf

[24] https://www.dezyre.com/article/cloudera-vs-hortonworks-vs-mapr-hadoop-distribution-comparison-/190

[25] https://www.guru99.com/azure-vs-aws.html#5

[26] What is Microsoft Azure and Why Use It? Available online: https://www.sumologic.com/resource/whitepaper/what-is-microsoft-azure-and-why-use-it/ (Retrieved on 10 November 2019).

[27] https://intellipaat.com/blog/7-big-data-examples-application-of-big-data-in-real-life/
