While the world of big data has a variety of tools and applications to enable users to process and analyse their data, much of the attention has focused on the Hadoop platform and solutions based on it.
There are several reasons for Hadoop’s popularity. It is an open-source project maintained by the Apache Software Foundation, so users are at less risk of being locked into a proprietary platform for their big data infrastructure.
However, the chief reason is that Hadoop’s architecture makes it easy to scale out, by adding nodes, to match the requirements of a specific problem, and it can handle data that is not easily indexed and stored in a traditional database-management system.
According to Amr Awadallah, co-founder and chief technology officer of Cloudera, enterprise customers just want something capable of solving their data problems, and Hadoop is the most appropriate tool in most cases.
“The key driver of why Hadoop has been successful is that it solves a real problem, and that problem is the ability to store data regardless of type, and then do queries and analytics on top of that,” he said.
Storing and managing different types of data in one place to make a data lake – or an enterprise data hub, as Cloudera prefers to call it – is the problem that Hadoop has been built to solve.
Cloudera is one of the leading developers of Hadoop solutions, and it hit the headlines earlier this year when its platform became the one favoured by IT giant Intel, which invested $740m in the firm. As part of the deal, Intel is promoting Cloudera’s Hadoop distribution to customers as its preferred version, while Cloudera is in turn optimising its platform to run on Intel Xeon hardware.
Hadoop is an open-source framework for building a distributed system to store and process big data, rather than a ready-made solution in itself. It is built around clusters of compute nodes, typically x86 servers, which store the data in a distributed file system spread across all the nodes. Because the data is distributed in this way, Hadoop can also process it in parallel on the nodes where it resides.
This is a key part of Hadoop’s success, because it is what enables the platform to scale out to match the requirements of the application. It also means that customers need to invest in lots of x86 servers to process the data, which explains Intel’s interest in the Hadoop ecosystem. “You can’t scale by simply buying faster and faster CPUs,” said Awadallah.
“The only way to solve big data problems is ‘divide and conquer’, which means that we spread out the data on multiple servers, and as somebody submits a query or a job, we break down that job into pieces and then we fan it out to run on all of the servers where data is stored.
“We try and run these applications as close as possible to the data, so the data is not being copied or moved over the network because this causes delays, and this ‘divide and conquer’ is really why Hadoop scales very well.”
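The ‘divide and conquer’ approach Awadallah describes can be illustrated with a minimal Python sketch. It is not Hadoop code: the node names, the `BLOCKS` layout and the `run_local_task` and `run_job` functions are all hypothetical, and a plain dictionary stands in for a real cluster. The point is that each task runs against the data held on one node and only a small partial result is sent back to be combined.

```python
# Hypothetical cluster layout: each node holds one block of the data set.
BLOCKS = {
    "node-a": [3, 1, 4, 1, 5],
    "node-b": [9, 2, 6],
    "node-c": [5, 3, 5, 8],
}


def run_local_task(records):
    """Stand-in for work done on the node that stores the records.

    Only this small (sum, count) pair would travel over the network,
    not the records themselves.
    """
    return sum(records), len(records)


def run_job(blocks):
    """Split a 'compute the mean' job into one task per node, then combine."""
    partials = {node: run_local_task(records) for node, records in blocks.items()}
    total = sum(part_sum for part_sum, _ in partials.values())
    count = sum(part_count for _, part_count in partials.values())
    return total / count, partials


mean, partials = run_job(BLOCKS)
print(f"mean = {mean:.2f}")
for node, (part_sum, part_count) in partials.items():
    print(f"{node}: sum={part_sum}, count={part_count}")
```

In a real cluster the tasks would run concurrently on separate machines; the sketch runs them one after another purely to keep the example self-contained.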
The Hadoop platform thus comprises the Hadoop Distributed File System (HDFS) for storing data; the YARN (Yet Another Resource Negotiator) framework for job scheduling and cluster resource management; and the MapReduce framework for running the actual algorithm to process the data.
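To give a flavour of the MapReduce programming model, the sketch below counts words in plain Python. It does not use Hadoop’s own API, which is Java-based; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative assumptions, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values seen for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data moves slowly"])
print(counts)
```

On a cluster, the map calls would run in parallel on the nodes holding each block of input and the grouping step would move data between nodes, but the shape of the computation is the same.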
However, MapReduce is just one way to process and analyse data, and it is restricted to batch processing of jobs, which may make it unsuitable for some applications, such as those where data changes rapidly. Consequently, the number of applications available to run atop Hadoop has expanded to include newer ones such as Apache Spark, a high-speed, in-memory analytics engine that is being positioned as a successor to MapReduce, in part because it can also process streaming data in near real time.
Others include Impala, which is a massively parallel SQL query engine developed by Cloudera; Apache Cassandra, an open-source distributed database-management system; and Apache SolrCloud, for distributed search of documents such as emails and log files.
“Hadoop has also been opened up so that other applications can run inside the platform. SAS, for example, can run its analytics engine natively inside Hadoop. Informatica, which is an ETL (extract, transform and load) tool, runs natively, and so does Splunk,” said Awadallah.
Hadoop is open source, so there is nothing stopping an organisation from downloading the Hadoop software library from Apache and building a deployment from scratch. As Awadallah points out, some users are willing to take this route, but the majority of organisations “just want to do analytics”, and this is where Hadoop vendors such as Cloudera come in.
“You can compare what we do with Red Hat: you can get the Linux kernel direct from the Linux Foundation, but Red Hat takes that kernel and puts a lot of pieces of software around it to make a useful operating system that you can use in a production environment, and that’s what we do with Hadoop,” Awadallah explained.
Cloudera customers get not only a tested Hadoop distribution, but also extra tools, software maintenance and intellectual property (IP) indemnification against any claims arising from their use of the Hadoop platform. The extra software Cloudera integrates includes Flume, a tool for bringing data into Hadoop from unstructured data sources, and Sqoop, which does the same for structured sources such as relational databases. Cloudera also includes proprietary tools for monitoring and auditing, according to Awadallah.
“When you are running the system in production, you need to be able to do troubleshooting when something goes wrong, and monitoring, security and auditing, and all these things come as part of the tools you only get when you are a subscriber of Cloudera,” he said.
Other Hadoop distributors offer a similar value-add that they wrap around the basic Hadoop platform to deliver a useful package for customers. Along with other vendors, Cloudera is looking to make Hadoop more robust, scalable, reliable and secure, while cloud is also an area of interest.
“Right now, most of our customers are on-premise, and a big concern is how to make Hadoop play better between on-premise and cloud. If you have a cluster on Amazon or Microsoft Azure and a local cluster, how does data move between the two?” Awadallah said.
With interest and investment in big data continuing to grow, it looks like Hadoop is a platform that many organisations will be getting more familiar with in future.