IBM has seen the future and it is Apache Spark. The firm has announced major backing for the open source data processing framework, saying it will integrate Spark into its own products and open source its own machine learning technology to boost Spark's capabilities, as well as invest in training and education.
Announced at the Spark Summit in San Francisco, IBM's plans for the platform are based on a belief that Spark will become increasingly important for driving analytics across a whole range of applications, including commerce, science and engineering, and analysing data from the Internet of Things.
IBM will integrate Spark into its analytics and commerce platforms, as well as offer Spark as a service on its Bluemix developer cloud.
The firm is also making its SystemML machine learning technology available as open source and partnering with Databricks to integrate this with Spark, and committing resources to educate data scientists.
"I like to think of Spark as an analytics operating system for the modern enterprise," said Rob Thomas, IBM's vice president of product development for big data and analytics, comparing it with the importance that Linux has attained over the past two decades.
"It's a foundational component on which developers can build all sorts of applications in an enterprise of any size. My view is that anyone using data in the future is going to be leveraging Spark at a significant scale. It changes the speed at which applications can be built and analytics done in an enterprise," he told V3.
For this reason, Spark is set to be offered as a service on IBM's Bluemix platform-as-a-service cloud for developers, who will be able to use the service in their applications.
The service is currently in closed beta, and interested parties can apply for access. However, IBM will open it up as a public beta shortly, Thomas said.
A major part of IBM's Spark move is the contribution of its SystemML technology to the open source community. The aim is to significantly boost Spark's machine learning capabilities, which IBM sees as one of the project's key differentiators.
SystemML has been in development and research for nearly a decade, and IBM said it is working with Databricks, a company founded by the creators of Apache Spark, to integrate the two.
"Databricks pointed out to us that the biggest weakness right now is around machine learning, specifically in the MLlib part of the project, so we are going to be contributing our SystemML technology to augment that part of the ecosystem," Thomas said.
"The secret sauce here is that this is not just an algorithm library; this is a declarative engine so you can understand the language of algorithms and new ones can be developed easily. So as new problems and data sets come along, you can apply those same algorithms to it," he explained.
But one of the major stumbling blocks to broader adoption of big data projects and analytics is a shortage of the necessary skills, and IBM is also moving to address this by helping to educate more data scientists and engineers.
This is being achieved through partnerships with AMPLab at the University of California, Berkeley and the online Big Data University, where IBM launched a Spark Fundamentals course about a month ago. IBM is also working with partners such as the Galvanize project to provide face-to-face courses in the US.
Overall, IBM is putting a lot of investment behind Spark, which it sees as a key technology for many areas of computing in the future.
"This is the fastest growing open source project in history. It only became a top-level project in 2014, yet it is the most active project by far in the data ecosystem," Thomas said.
"As we've started to work with a number of clients, we've started to see that this is really going to change how organisations will do analytics at scale.
"We're living in a world where there are two billion smartphones and growing. The data that is going to be generated in organisations is astronomical, and it's going to take a new approach. We believe Spark is that new approach."