Twitter lifts the lid on its real-time data mastery

30 Oct 2012

Twitter has revealed the secrets of its real-time technology, which allows it to display relevant search results within 10 minutes of a news story breaking.

While much of the IT industry has been enraptured by the possibilities of so-called big data, Twitter has been grappling with a far thornier challenge: fast data.

According to Gilad Mishne and his team of researchers at Twitter, demand for real-time information was such that the company found it couldn't rely on Hadoop.

Hadoop, they said, was written for big data, but was ill-suited for fast data.

Because of the lightning pace at which Twitter works, it has to be able to make rapid connections between potential search terms, which – until a big event happens – may not have been linked, the group said.

“Prior to Marissa Mayer's appointment as Yahoo's CEO, the query 'Marissa Mayer' had little semantic connection to the query 'Yahoo'; but following news of that appointment, the connection is immediate and obvious,” they wrote in a newly published research paper.

But understanding the deluge of data that comes from Twitter's firehose is no simple task. According to Mishne's team, 17 percent of the top 1,000 query terms in one hour will not be in the top 1,000 an hour later.
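The churn figure can be made concrete with a small sketch. The code below is not Twitter's method; it is a minimal illustration, assuming each hour's queries are available as a plain list of strings, of how the share of top-ranked terms that drop out of the ranking an hour later could be measured. The function names `top_k` and `churn` and the sample data are hypothetical.

```python
from collections import Counter


def top_k(queries, k):
    """Return the set of the k most frequent queries in one time window."""
    return {query for query, _ in Counter(queries).most_common(k)}


def churn(prev_window, next_window, k):
    """Fraction of the earlier window's top-k queries absent from the later top-k."""
    prev_top = top_k(prev_window, k)
    next_top = top_k(next_window, k)
    if not prev_top:
        return 0.0
    return len(prev_top - next_top) / len(prev_top)


# Made-up sample data: two consecutive hours of search queries.
hour_1 = ["yahoo"] * 5 + ["olympics"] * 4 + ["weather"] * 3 + ["election"] * 2
hour_2 = ["yahoo"] * 5 + ["marissa mayer"] * 4 + ["weather"] * 3 + ["olympics"] * 1

print(churn(hour_1, hour_2, k=3))
```

With k set to 1,000 and real query logs, a result of 0.17 would correspond to the figure the researchers reported.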

So Twitter has established a petabyte-scale analytics platform, based on the Hadoop framework, but also incorporating components such as Pig, HBase, ZooKeeper and Vertica.

But even with what the team described as 'careful software engineering', this big data setup was simply not quick enough to make related query suggestions for news events.

“Hadoop was simply not designed for jobs that are latency sensitive,” they wrote. “By the time we begin to make relevant related query suggestions, the breaking news event might have already passed us by.”

To ensure its users get relevant alternative search suggestions for time-sensitive events, or even get spelling suggestions on the fly, the team developed a new approach to searching fast data.

This was based on an in-memory processing engine that is fed by Twitter's firehose and its so-called Blender query hose. Blender is a front-end search broker for Twitter's web client and its family of search services, which cover searches for user accounts, tweets and so forth.

“We learned from the Hadoop implementation that two signals (tweets and search sessions) were sufficient to generate good results,” they explained.
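As a rough illustration of the in-memory idea, the sketch below keeps query co-occurrence weights in a dictionary and updates them as search sessions arrive. It is not the engine described in the paper: the class name `RelatedQueries`, the exponential decay and the ten-minute half-life are all assumptions made for the example, and only the search-session signal is modelled. Tweets could be fed in the same way by treating the terms in each tweet as a session.

```python
from collections import defaultdict
from itertools import combinations


class RelatedQueries:
    """Hypothetical in-memory store of time-decayed query co-occurrence weights."""

    def __init__(self, half_life_s=600.0):
        # Assumed half-life: a weight halves after this many seconds.
        self.half_life_s = half_life_s
        # query -> {other_query: (weight, timestamp of last update)}
        self.weights = defaultdict(dict)

    def _decayed(self, weight, last_ts, now):
        """Apply exponential decay for the time elapsed since the last update."""
        return weight * 0.5 ** ((now - last_ts) / self.half_life_s)

    def _bump(self, a, b, now):
        weight, last_ts = self.weights[a].get(b, (0.0, now))
        self.weights[a][b] = (self._decayed(weight, last_ts, now) + 1.0, now)

    def observe_session(self, queries, now):
        """Record that these queries were issued in the same search session."""
        for a, b in combinations(sorted(set(queries)), 2):
            self._bump(a, b, now)
            self._bump(b, a, now)

    def suggest(self, query, now, n=3):
        """Return up to n related queries, strongest decayed weight first."""
        scored = [
            (self._decayed(weight, ts, now), other)
            for other, (weight, ts) in self.weights.get(query, {}).items()
        ]
        scored.sort(reverse=True)
        return [other for _, other in scored[:n]]


engine = RelatedQueries(half_life_s=600.0)
engine.observe_session(["yahoo", "yahoo mail"], now=0)
engine.observe_session(["yahoo", "marissa mayer"], now=3000)
engine.observe_session(["marissa mayer", "yahoo"], now=3060)
print(engine.suggest("yahoo", now=3100))
```

Because older associations fade while fresh ones accumulate, a link that barely existed an hour earlier can outrank a long-standing one within minutes, which is the behaviour a batch-oriented Hadoop job struggles to deliver.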

Of course, Twitter is far from calling time on Hadoop, which is still integral to its data architecture. Indeed, Mishne's team recognised that their solution to real-time search is something of a stopgap measure.

But they hope that by sharing their experience, they can focus attention on the need for different computing models and processing platforms for fast data.

“It would be desirable to build a generic data processing platform capable of handling both 'big data' and 'fast data',” they concluded.
