
Twitter lifts the lid on its real-time data mastery

30 Oct 2012

Twitter has revealed the secrets of its real-time technology, which allows it to display relevant search results within 10 minutes of a news story breaking.

While much of the IT industry has been enraptured by the possibilities of so-called big data, Twitter has been grappling with a far thornier challenge: fast data.

According to Gilad Mishne and his team of researchers at Twitter, such was the demand for real-time information that Twitter found it couldn't rely on Hadoop.

Hadoop, they said, was written for big data, but was ill-suited for fast data.

Because of the lightning pace at which Twitter works, it has to be able to make rapid connections between potential search terms, which – until a big event happens – may not have been linked, the group said.

“Prior to Marissa Mayer's appointment as Yahoo's CEO, the query 'Marissa Mayer' had little semantic connection to the query 'Yahoo'; but following news of that appointment, the connection is immediate and obvious,” they wrote in a newly published research paper.

But understanding the deluge of data that comes from Twitter's firehose is no simple task. According to Mishne's team, 17 percent of the top 1,000 query terms in one hour will not be in the top 1,000 an hour later.
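To make the churn figure concrete, here is a minimal sketch of how such a number could be computed from two hourly tallies of query counts. The function name `top_k_churn` and the sample counts are invented for illustration; this is not taken from the team's paper, and it simply measures the share of one hour's top terms that drop out of the next hour's top list.

```python
from collections import Counter


def top_k_churn(prev_counts, next_counts, k=1000):
    """Fraction of the top-k terms in one window missing from the next window's top-k."""
    prev_top = {term for term, _ in Counter(prev_counts).most_common(k)}
    next_top = {term for term, _ in Counter(next_counts).most_common(k)}
    if not prev_top:
        return 0.0
    # Terms that were popular in the first hour but fell out of the top-k.
    return len(prev_top - next_top) / len(prev_top)


# Made-up hourly query counts, using k=3 to keep the example small.
hour_1 = {"yahoo": 50, "election": 40, "weather": 30, "football": 20}
hour_2 = {"yahoo": 45, "marissa mayer": 60, "weather": 25, "football": 5}

churn = top_k_churn(hour_1, hour_2, k=3)
print(f"{churn:.0%} of the top 3 terms dropped out an hour later")
```

In this toy data only "election" falls out of the top three, so the churn is one in three. On Twitter's real traffic the equivalent figure, per the team, is 17 percent for the top 1,000.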

So Twitter has established a petabyte-scale analytics platform, based on the Hadoop framework, but also incorporating components such as Pig, HBase, ZooKeeper and Vertica.

But even with what the team described as 'careful software engineering', this big data setup was simply not quick enough to make related query suggestions for news events.

“Hadoop was simply not designed for jobs that are latency sensitive,” they wrote. “By the time we begin to make relevant related query suggestions, the breaking news event might have already passed us by.”

To ensure its users get relevant alternative search suggestions for time-sensitive events, or even spelling suggestions on the fly, the team developed a new approach to searching fast data.

This was based on an in-memory processing engine that is fed by Twitter's firehose and its so-called Blender query hose. Blender is a front-end search broker for Twitter's web client and its family of search services, which cover searches for user accounts, tweets and so forth.

“We learned from the Hadoop implementation that two signals (tweets and search sessions) were sufficient to generate good results,” they explained.
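As a rough illustration of how two signals could drive related query suggestions from memory, the sketch below keeps co-occurrence weights in plain dictionaries. Everything in it is an assumption made for the example: the class name `RelatedQueryIndex`, its methods, the substring matching of queries in tweets and the decay step are hypothetical and are not Twitter's engine or API.

```python
from collections import defaultdict
from itertools import combinations


class RelatedQueryIndex:
    """Toy in-memory co-occurrence store (hypothetical, not Twitter's engine)."""

    def __init__(self, decay=0.5):
        self.decay = decay  # assumed decay factor applied on each tick
        self.weights = defaultdict(lambda: defaultdict(float))

    def _link(self, a, b, amount):
        # Store the association in both directions.
        if a != b:
            self.weights[a][b] += amount
            self.weights[b][a] += amount

    def observe_session(self, queries, amount=1.0):
        """Signal 1: queries issued within the same search session."""
        for a, b in combinations(set(queries), 2):
            self._link(a, b, amount)

    def observe_tweet(self, text, known_queries, amount=1.0):
        """Signal 2: known query strings that appear together in one tweet."""
        lowered = text.lower()
        present = [q for q in known_queries if q in lowered]
        for a, b in combinations(present, 2):
            self._link(a, b, amount)

    def tick(self):
        """Shrink every weight so that stale associations fade over time."""
        for neighbours in self.weights.values():
            for other in neighbours:
                neighbours[other] *= self.decay

    def suggest(self, query, n=3):
        """Return up to n related queries, strongest association first."""
        neighbours = self.weights.get(query, {})
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        return [q for q, _ in ranked[:n]]


index = RelatedQueryIndex()
index.observe_session(["yahoo", "yahoo stock"])    # older association
index.tick()                                       # time passes
index.observe_session(["marissa mayer", "yahoo"])  # breaking news
index.observe_tweet("Marissa Mayer named Yahoo CEO", ["marissa mayer", "yahoo"])

suggestions = index.suggest("yahoo")
print(suggestions)
```

Because the older link has been decayed and the newer one is reinforced by both a session and a tweet, "marissa mayer" ranks ahead of "yahoo stock" for the query "yahoo". Holding the weights in memory is what lets a fresh association surface as soon as the events arrive, rather than after a batch job completes.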

Of course, Twitter is far from calling time on Hadoop, which is still integral to its data architecture. Indeed, Mishne's team recognised that their solution to real-time search is something of a stopgap measure.

But they hope that by sharing their experience, they can focus attention on the different computing models and processing platforms that fast data requires.

“It would be desirable to build a generic data processing platform capable of handling both 'big data' and 'fast data',” they concluded.
