Twitter has revealed the secrets of its real-time technology, which allows it to display relevant search results within 10 minutes of a news story breaking.
While much of the IT industry has been enraptured by the possibilities of so-called big data, Twitter has been grappling with a far thornier challenge: fast data.
According to Gilad Mishne and his team of researchers at Twitter, the demand for real-time information was such that Twitter found it couldn't rely on Hadoop.
Hadoop, they said, was written for big data, but was ill-suited for fast data.
Because of the lightning pace at which Twitter operates, it has to be able to make rapid connections between potential search terms which – until a big event happens – may not have been linked, the group said.
“Prior to Marissa Mayer's appointment as Yahoo's CEO, the query 'Marissa Mayer' had little semantic connection to the query 'Yahoo'; but following news of that appointment, the connection is immediate and obvious,” they wrote in a newly published research paper.
But understanding the deluge of data that comes from Twitter's firehose is no simple task. According to Mishne's team, 17 percent of the top 1,000 query terms in one hour will no longer be in the top 1,000 an hour later.
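That churn figure is straightforward to compute: rank the query terms seen in each hour, then measure what fraction of one hour's top set is missing from the next. The sketch below is a hypothetical illustration of that measurement, not Twitter's actual pipeline.

```python
from collections import Counter

def top_terms(queries, n=1000):
    """Return the set of the n most frequent query terms in a stream."""
    return {term for term, _ in Counter(queries).most_common(n)}

def churn(hour_a, hour_b, n=1000):
    """Fraction of hour_a's top-n terms that fall out of hour_b's top-n."""
    top_a, top_b = top_terms(hour_a, n), top_terms(hour_b, n)
    return len(top_a - top_b) / len(top_a)

# Toy data with n=2: 'mayer' drops out of the top set between hours.
hour1 = ["yahoo", "yahoo", "mayer", "mayer", "olympics"]
hour2 = ["yahoo", "yahoo", "olympics", "olympics", "mayer"]
print(churn(hour1, hour2, n=2))  # → 0.5
```

On real traffic the same calculation over consecutive hourly windows, with n=1000, yields the roughly 17 percent figure the team reports.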
So Twitter has established a petabyte-scale analytics platform, based on the Hadoop framework, but also incorporating components such as Pig, HBase, ZooKeeper and Vertica.
But even with what the team described as 'careful software engineering', this big data setup was simply not quick enough to make related query suggestions for news events.
“Hadoop was simply not designed for jobs that are latency sensitive,” they wrote. “By the time we begin to make relevant related query suggestions, the breaking news event might have already passed us by.”
To ensure its users get relevant alternative search suggestions for time-sensitive events, or even get spelling suggestions on the fly, the team developed a new approach to searching fast data.
This was based on an in-memory processing engine that is fed by Twitter's firehose and its so-called Blender query hose. Blender is a front-end search broker for Twitter's web client and Twitter's family of search services – searching for user accounts, tweets and so forth.
“We learned from the Hadoop implementation that two signals (tweets and search sessions) were sufficient to generate good results,” they explained.
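One way to picture an in-memory engine built on those two signals is a sliding window of co-occurrence counts: queries issued close together in the same search session become candidate suggestions for one another, and counts older than the window are evicted so suggestions track breaking news. The class below is a heavily simplified, hypothetical sketch of that idea (the names, window size and session-based signal are assumptions; the paper's engine also uses tweet text).

```python
import time
from collections import Counter, defaultdict, deque

class RelatedQueries:
    """Sliding-window co-occurrence counts between queries seen in the
    same search session. Old observations expire, so suggestions follow
    whatever users are searching for right now."""

    def __init__(self, window_secs=600):
        self.window_secs = window_secs
        self.events = deque()               # (timestamp, query, co-query)
        self.counts = defaultdict(Counter)  # query -> Counter of co-queries

    def observe(self, session_queries, now=None):
        """Record all ordered pairs of distinct queries from one session."""
        now = time.time() if now is None else now
        self._expire(now)
        for a in session_queries:
            for b in session_queries:
                if a != b:
                    self.counts[a][b] += 1
                    self.events.append((now, a, b))

    def _expire(self, now):
        """Drop counts contributed by events older than the window."""
        while self.events and now - self.events[0][0] > self.window_secs:
            _, a, b = self.events.popleft()
            self.counts[a][b] -= 1
            if self.counts[a][b] == 0:
                del self.counts[a][b]

    def suggest(self, query, now=None, k=3):
        """Return up to k queries most often co-occurring with `query`."""
        now = time.time() if now is None else now
        self._expire(now)
        return [q for q, _ in self.counts[query].most_common(k)]

rq = RelatedQueries(window_secs=600)
rq.observe(["marissa mayer", "yahoo"], now=0)
print(rq.suggest("marissa mayer", now=60))    # → ['yahoo']
print(rq.suggest("marissa mayer", now=1000))  # → [] (window has expired)
```

The expiry mechanism is what makes the Mayer example work: before the appointment the pair has no counts, and minutes after the news breaks the association dominates the window.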
Of course, Twitter is far from calling time on Hadoop, which is still integral to its data architecture. And indeed, Mishne's team recognised that their solution to real-time searches is something of a stopgap measure.
But they hope that by sharing their experience, they can focus attention on the different computing models and processing platforms that fast data requires.
“It would be desirable to build a generic data processing platform capable of handling both 'big data' and 'fast data',” they concluded.