BUDAPEST: Big data may be one of the hottest topics in business computing, but users are finding big data projects difficult to implement successfully. The Apache open-source community is seeking to address this with numerous efforts aimed at making big data easier.
The Apache: Big Data Europe conference saw Canonical founder Mark Shuttleworth outline some of the key problems in starting a big data project, such as simply finding engineers with the skills needed just to build the infrastructure for operating tools such as Hadoop.
"Analytics and machine learning are the next big thing, but the problem is there are just not enough 'unicorns', the mythical technologists who know everything about everything," he explained in his keynote address, adding that the blocker is often just getting the supporting infrastructure up and running.
Shuttleworth went on to demonstrate how the Juju service orchestration tool developed by Canonical could solve this problem. Juju enables users to describe the end configuration they want, and will automatically provision the servers and software and configure them as required.
This could be seen as a pitch for Juju, but Shuttleworth's message was that the open-source community is delivering tools that can manage the underlying infrastructure so that users can focus on the application itself.
"The value creators are the guys around the outside who take the big data store and do something useful with it," he said.
"Juju enables them to start thinking about the things they need for themselves and their customers in a tractable way, so they don't need to go looking for those unicorns."
The Apache community is working on a broad range of projects, many of which are focused on specific big data problems. These include Flume, for handling large volumes of log data, and Flink, a processing engine that, like Spark, is designed to replace MapReduce in Hadoop deployments.
"We think of [Spark] as the analytics operating system. Never before have so many capabilities come together on one platform," said Anjul Bhambrhi, vice president of big data products at IBM, during her keynote at the conference.
Spark is a key project because of its speed and ease of use, and because it integrates seamlessly with other open-source components, Bhambrhi explained.
"Spark is speeding up even MapReduce jobs, even though they are batch oriented, by two to six times. It's making developers more productive, enabling them to build applications in less time and with fewer lines of code," she claimed.
IBM is working with Nasa and Seti to analyse radio signals for signs of extra-terrestrial intelligence, using Spark to process the 60Gbit of data generated per second by various receivers, according to Bhambrhi.
Other applications IBM is working on with Spark include genome sequencing for personalised medicine via the Adam project at UC Berkeley in California, and early detection of conditions such as diabetes by analysing patient medical data.
"At IBM, we are certainly sold on Spark. It forms part of our big data stack, but most importantly we are contributing to the community by enhancing it," Bhambrhi said.
However, it is still early days for big data platforms, and it seems that much work needs to be done before the technology will see widespread adoption in businesses and other organisations.