BUDAPEST: The Hadoop framework that underpins many big data projects is to get improvements, including erasure coding and better queue handling in compute nodes, to avoid deadlock and thus improve reliability.
At the Apache: Big Data Europe conference, Akira Ajisaka, one of the "committers" who oversee development of Hadoop, detailed changes coming in release 2.8 of the distributed processing framework, many of which cover the Hadoop Distributed File System employed by the platform.
Ajisaka, who is an engineer at NTT Data in Japan deploying Hadoop for customers, said that the upcoming Hadoop 2.8 release will support Erasure Coding for low-cost data redundancy, and include a feature called RPC Congestion Control designed to provide greater reliability.
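To illustrate why erasure coding is cheaper than keeping full copies, the following is a minimal Python sketch of the simplest possible scheme, a single XOR parity block. It is not Hadoop's implementation: the block contents and the `xor_blocks` helper are made up for illustration, and the comparison assumes a replication factor of three.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three equal-sized data blocks (toy values, not real HDFS blocks).
data_blocks = [b"hadoop--", b"erasure-", b"coding!!"]
parity = xor_blocks(data_blocks)

# Simulate losing block 1, then rebuild it from the survivors plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

# Storage cost: 3 data blocks + 1 parity block, versus three full copies.
ec_overhead = (len(data_blocks) + 1) / len(data_blocks)  # about 1.33x
replication_overhead = 3.0                               # assumes 3 replicas
```

Here `rebuilt` equals the lost block, yet the scheme stores only about a third more data rather than three times as much. A single parity block survives only one lost block per group; stronger codes tolerate more failures at a somewhat higher storage cost.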
RPC Congestion Control addresses the problem of a single stuck queue bringing down an entire Hadoop cluster. It implements fair scheduling with exponential backoff, and will be enabled by default in release 2.8.
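The backoff half of that idea can be sketched in a few lines of Python. This is a generic client-side illustration rather than Hadoop's code: the `ServerBusy` exception, the `call_with_backoff` helper and the default timings are all hypothetical, and the server-side fair scheduling is not modelled here.

```python
import random
import time

class ServerBusy(Exception):
    """Hypothetical signal that the server's call queue is full."""

def backoff_ceiling(attempt, base=0.1, cap=5.0):
    """Upper bound in seconds on the wait before retry `attempt` (0-based)."""
    return min(cap, base * 2 ** attempt)

def call_with_backoff(call, max_retries=5, sleep=time.sleep, rng=random.random):
    """Retry `call` while the server reports it is busy, waiting longer each time."""
    for attempt in range(max_retries):
        try:
            return call()
        except ServerBusy:
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(rng() * backoff_ceiling(attempt))
    return call()  # final attempt; a ServerBusy raised here reaches the caller
```

Because the ceiling doubles on every retry, clients that keep hitting a full queue slow down quickly instead of piling more requests onto it, while the cap stops the wait from growing without limit.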
Many features of 2.8 are still under development, so the team overseeing it cannot say when it will be ready for release, Ajisaka said.
However, as Hadoop is an upstream component in many big data solutions, the new features will eventually filter through into Hadoop-based platforms such as those developed by Hortonworks, Cloudera and even Microsoft with its Azure Data Lake.
Other issues being addressed in Hadoop include support for heterogeneous storage implementations that mix solid state drives (SSDs) with hard drives for better performance. Introducing storage type definitions and block placement policies means that users can configure Hadoop to write 'hot' data to SSDs first.
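As a rough illustration of how a placement policy might steer data to SSDs first, here is a toy Python model. The policy names, the `choose_storage` function and the capacity figures are hypothetical and do not mirror HDFS's built-in policy names or behaviour.

```python
# Hypothetical policy table: storage types listed in order of preference.
PLACEMENT_POLICIES = {
    "hot": ["SSD", "DISK"],  # try SSD first, fall back to hard drives
    "cold": ["DISK"],        # hard drives only
}

def choose_storage(policy, free_bytes, block_size):
    """Return the first storage type in the policy with room for the block."""
    for storage_type in PLACEMENT_POLICIES[policy]:
        if free_bytes.get(storage_type, 0) >= block_size:
            return storage_type
    raise RuntimeError(
        f"no storage with {block_size} free bytes for policy {policy!r}"
    )

# Made-up free space on one node: 64 MiB of SSD, 4 TiB of hard drive.
free = {"SSD": 64 * 1024**2, "DISK": 4 * 1024**4}
block = 128 * 1024**2  # a 128 MiB block
```

With these numbers, a 128 MiB 'hot' block does not fit on the SSD and falls back to the hard drive, while a 32 MiB 'hot' block lands on the SSD.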
End-to-end encryption of data is now also supported, so files are already encrypted when they are written to the data node. Each file requires a unique key, requested from a key management server running in the cluster.
"It's a complicated process, but this way it is done transparently so you don't need to rewrite applications," Ajisaka said. The overhead is also very low: the TeraSort benchmark took 49 seconds for 1GB of encrypted data, as opposed to 47 seconds for the same data in an unencrypted state.
Ajisaka rounded off with a plea to the Hadoop user community to get involved with the development process, especially if they want to see a particular feature implemented.
"Anyone who wants a feature can join and contribute to make development faster. You can contribute by creating or testing patches, as well as reporting bugs," he said.