Microsoft has unveiled a machine learning library for Apache Spark that, it claims, will help make data scientists more productive when using the popular open-source big data processing tool.
The software and, increasingly, cloud services company aims to increase the rate of experimentation by data scientists and enable them to make better use of new machine learning techniques, including deep learning, on very large datasets.
Microsoft said that while its customers have found Spark to be a powerful platform for building scalable machine learning models, they've struggled with low-level APIs to index strings, assemble feature vectors and coerce data into a layout expected by machine learning algorithms.
It said its library simplifies many of these tasks for building models in PySpark, enabling data scientists to be more productive and focus on the data science aspect of machine learning. The library takes care of tokenising strings, converting them into numerical vectors, assembling the numerical vectors together and indexing the label column.
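To make those four steps concrete, here is a minimal sketch in plain Python. It is not MMLSpark's API and does not use Spark; the function names, the 16-slot hashed vector and the sample rows are all assumptions made for illustration, and a real Spark job would apply equivalent transformations to DataFrame columns.

```python
import zlib
from collections import Counter


def tokenise(text):
    """Lower-case a string and split it on whitespace."""
    return text.lower().split()


def hash_tokens(tokens, num_features=16):
    """Turn tokens into a fixed-length count vector using the hashing trick."""
    vec = [0.0] * num_features
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def assemble(*parts):
    """Concatenate several numeric vectors into one feature vector."""
    out = []
    for part in parts:
        out.extend(part)
    return out


def index_labels(labels):
    """Map string labels to numeric indices, most frequent label first.

    Ties are broken alphabetically; that is a choice made for this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping


# Hypothetical sample rows: (review text, star rating, sentiment label)
rows = [
    ("good fast service", 3.0, "positive"),
    ("slow and rude", 1.0, "negative"),
    ("good value", 4.0, "positive"),
]

features = [
    assemble(hash_tokens(tokenise(text)), [rating])
    for text, rating, _ in rows
]
label_idx, mapping = index_labels([label for _, _, label in rows])
```

Each row ends up as a 17-element feature vector (16 hashed text slots plus the rating) with a numeric label, which is the layout most learning algorithms expect. This is the kind of repetitive preparation Microsoft says the library handles for the user.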
In addition, Microsoft said that the library, called MMLSpark, provides Python APIs that operate on Spark DataFrames and are integrated into the SparkML pipeline model.
"By using these APIs, you can rapidly build image analysis and computer vision pipelines that use the cutting-edge DNN algorithms," it said.
One capability of MMLSpark is using a pre-trained neural network to extract features from images and then pass these features on to traditional machine learning models such as logistic regression or decision forests. Another is training a DNN model when a pre-trained model is too domain-specific and therefore unsuitable.
"You can use Spark worker nodes to pre-process and condense large datasets prior to DNN training, then feed the data to a GPU VM for accelerated DNN training, and finally broadcast the model to worker nodes for scalable scoring," it said.
Finally, data scientists can use OpenCV-based image transformations to read in and prepare their data.
MMLSpark has been released as an open-source project on GitHub. Microsoft is welcoming contributions, both from people who can provide feedback, request features and report bugs, and from those who can contribute documentation, new features and bug fixes.