Twitter lifts the lid on its real-time data mastery

30 Oct 2012

Twitter has revealed the secrets of its real-time technology, which allows it to display relevant search results within 10 minutes of a news story breaking.

While much of the IT industry has been enraptured by the possibilities of so-called big data, Twitter has been grappling with a far thornier challenge: fast data.

According to Gilad Mishne and his team of researchers at Twitter, demand for real-time information was such that the company found it couldn't rely on Hadoop.

Hadoop, they said, was written for big data, but was ill-suited for fast data.

Because of the lightning pace at which Twitter works, it has to be able to make rapid connections between potential search items which, until a big event happens, may not have been linked, the group said.

“Prior to Marissa Mayer's appointment as Yahoo's CEO, the query 'Marissa Mayer' had little semantic connection to the query 'Yahoo'; but following news of that appointment, the connection is immediate and obvious,” they wrote in a newly published research paper.

But understanding the deluge of data that comes from Twitter's firehose is no simple task. According to Mishne's team, 17 percent of the top 1,000 query terms in one hour will not be in the top 1,000 an hour later.
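To make the churn figure concrete, here is a minimal sketch of how such a number could be computed from two hourly rankings. The function name and the sample queries are hypothetical and only illustrate the arithmetic; they are not taken from the research paper.

```python
def top_k_churn(previous_top, current_top):
    """Return the fraction of previous_top's terms that are absent from current_top."""
    previous = set(previous_top)
    if not previous:
        return 0.0
    dropped = previous - set(current_top)
    return len(dropped) / len(previous)


# Hypothetical miniature example using a top 4 rather than a top 1,000.
hour_one = ["election", "storm", "football", "iphone"]
hour_two = ["election", "storm", "football", "yahoo ceo"]

# One of the four terms from hour one has dropped out an hour later.
churn = top_k_churn(hour_one, hour_two)
```

On this scale, a churn of 0.17 across a top 1,000 would mean roughly 170 query terms dropping out of the ranking every hour.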

So Twitter has established a petabyte-scale analytics platform, based on the Hadoop framework, but also incorporating components such as Pig, HBase, ZooKeeper and Vertica.

But even with what the team described as 'careful software engineering', this big data setup was simply not quick enough to make related query suggestions for news events.

“Hadoop was simply not designed for jobs that are latency sensitive,” they wrote. “By the time we begin to make relevant related query suggestions, the breaking news event might have already passed us by.”

To ensure its users get relevant alternative search suggestions to time-sensitive events, or even get spelling suggestions on the fly, the team developed a new approach to searching fast data.

This was based on an in-memory processing engine fed by Twitter's firehose and its so-called Blender query hose. Blender is a front-end search broker for Twitter's web client and Twitter's family of search services, which cover searches for user accounts, tweets and so forth.

“We learned from the Hadoop implementation that two signals (tweets and search sessions) were sufficient to generate good results,” they explained.
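The article does not describe how the engine works internally, so the following is only a rough sketch of one way an in-memory store could relate queries to each other. It assumes that both signals (tweets and search sessions) can be reduced to small groups of terms seen together, and that older evidence is discounted with an exponential decay. The class name, the 10-minute half-life and the scoring rule are all illustrative assumptions, not Twitter's actual design.

```python
import math
from collections import defaultdict
from itertools import combinations


class RelatedQueryIndex:
    """Hypothetical in-memory co-occurrence store with exponential time decay."""

    def __init__(self, half_life_seconds=600.0):
        # Assumption: evidence loses half its weight every half_life_seconds.
        self.decay_rate = math.log(2) / half_life_seconds
        # weights[a][b] = (score, time of last update)
        self.weights = defaultdict(dict)

    def _decayed(self, score, last, now):
        return score * math.exp(-self.decay_rate * max(0.0, now - last))

    def _bump(self, a, b, now):
        score, last = self.weights[a].get(b, (0.0, now))
        self.weights[a][b] = (self._decayed(score, last, now) + 1.0, now)

    def observe(self, terms, now):
        """Record that these terms appeared together in one tweet or search session."""
        for a, b in combinations(sorted(set(terms)), 2):
            self._bump(a, b, now)
            self._bump(b, a, now)

    def suggest(self, query, now, k=3):
        """Return up to k terms most strongly associated with query at time now."""
        scored = [
            (self._decayed(score, last, now), other)
            for other, (score, last) in self.weights.get(query, {}).items()
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [other for _, other in scored[:k]]


index = RelatedQueryIndex(half_life_seconds=600)
index.observe(["marissa mayer", "google"], now=0)     # older association
index.observe(["marissa mayer", "yahoo"], now=3600)   # breaking news, an hour later
index.observe(["marissa mayer", "yahoo"], now=3610)
suggestions = index.suggest("marissa mayer", now=3620)
```

In this toy run the recent "yahoo" association outranks the hour-old "google" one, because the older count has decayed to almost nothing. That is the kind of rapid shift the researchers' Marissa Mayer example describes.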

Of course, Twitter is far from calling time on Hadoop, which is still integral to its data architecture. And indeed, Mishne's team recognised that their solution to real-time searches is something of a stopgap measure.

But they hope that by sharing their experience, they can focus attention on the need for different computing models and processing platforms for fast data.

“It would be desirable to build a generic data processing platform capable of handling both 'big data' and 'fast data',” they concluded.
