It is a fact of life that startups that centre themselves around social media have a fairly tough time when it comes to handling the onslaught of data required to make a robust, reliable and meaningful service. The up-front costs required to do this are far beyond what most small businesses could dream of.
Two-year-old social analytics startup DataSift is a prime example of such a model, initially living off the trust and cash of venture capitalists, but confident that the expensive business model would quickly pay off. Indexing everything possible in social media, from tweets to Facebook posts and even Wikipedia edits, is not cheap.
In a bid to ensure up to two billion new objects per day have enough value – be it for a brand that wants to know what its customers are saying about it behind its back, or a large corporation wanting to find out the general populace's opinions on oranges (this has happened) – DataSift sticks a lot of added value onto every single object, including natural language analysis, sentiment analysis and a lot more.
This results in a database that's around four petabytes in size, growing at a rate of four terabytes per day. Simply storing that data is enough of a challenge, but sifting through 500 billion archived items – each with plenty of added metadata – at a moment's notice creates a whole new set of difficulties.
Nick Halstead, founder and CTO of DataSift, told V3 about the challenges his company faces on a daily basis. "You generally have to be a year ahead of the game," he said. "With the hardware we purchase it can take a month to order hundreds of machines – you can't just pop down the road and buy a couple of servers.
"We're in Amazon territory where you have to think about your systems sending the job off to multiple data centres, and choosing the one which has the most capacity, taking into account the locality of where the customer data is."
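The routing decision Halstead describes, weighing spare capacity against where the customer's data lives, can be illustrated with a minimal Python sketch. This is not DataSift's scheduler: the `DataCentre` record, the `free_slots` capacity unit, the `locality_bonus` weighting and the sample names are all hypothetical, chosen only to show the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCentre:
    name: str
    free_slots: int       # spare job capacity (hypothetical unit)
    datasets: frozenset   # customer datasets held at this site


def choose_data_centre(centres, dataset, locality_bonus=100):
    """Return the centre with the best capacity score for a job.

    A centre that already holds the dataset gets `locality_bonus`
    added to its score, so it wins unless another site has much
    more spare capacity. The weighting is an assumption.
    """
    candidates = [c for c in centres if c.free_slots > 0]
    if not candidates:
        raise ValueError("no data centre has spare capacity")

    def score(centre):
        bonus = locality_bonus if dataset in centre.datasets else 0
        return centre.free_slots + bonus

    return max(candidates, key=score)


# Illustrative sites only; names and numbers are made up.
centres = [
    DataCentre("dc-a", 120, frozenset({"customer-1"})),
    DataCentre("dc-b", 180, frozenset()),
    DataCentre("dc-c", 40, frozenset({"customer-2"})),
]

print(choose_data_centre(centres, "customer-1").name)
print(choose_data_centre(centres, "customer-2").name)
```

With these made-up figures, the job for "customer-1" stays with the site that holds its data, while the job for "customer-2" is sent elsewhere because its local site is nearly full. A real system would also have to account for the cost of moving data between sites, which this sketch ignores.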
It is not just the technology that is a challenge; it is also finding the people to run it all. "You can't hire people to do what we do at this scale," explained Halstead. "We hire people who can grow and understand."
The structure of DataSift's databases has had to grow, too. In a previous life as social analytics firm TweetMeme, Halstead and his crew were using MySQL databases, which worked fine when Twitter was a fairly small operation.
DataSift's custom analytics layer now sits on top of a Hadoop Distributed File System (HDFS) running across tens of thousands of machines. The advantage of using Hadoop is that firms such as Halstead's can do what they do on "commodity" hardware, although he is quick to point out that he does not consider DataSift's hardware to be a commodity.
"It's a very expensive setup but it does very rapidly become cost effective once you have multiple customers on it. If you look at our margins – purely on hardware – pretty quickly it became very viable."
As the amount of data ramps up, so does the effort required to mine data for DataSift's customers. Halstead believes there is currently no need to delete objects that many might consider useless; working out whether they are useless is in itself a difficult task.
With an eye on keeping its storage under control, Halstead says he will be taking DataSift to a stock market flotation. Some market analysts have placed the value of a stock market listing at more than $1bn. But the next step for the service is indexing content such as audio, video and images – which will undoubtedly create a whole new storage conundrum.