Many firms suffer from a lack of flexibility and scalability in their IT estate, but few have so many regular change requirements as Paddy Power Betfair.
The company grew quickly, partly by customer acquisition and partly by merger, and had an IT estate that wasn't scalable or flexible enough for its needs. Worse, the development and production environments didn't match, making it hard for changes to be tested properly.
So the betting company decided to put some of its applications in the cloud as part of a broader IT refresh. The betting industry is bound by strict rules, but when the firm went to the regulators to check what it could and couldn't do, the answer was effectively ‘What's a cloud?’.
Eventually, Paddy Power Betfair was told that websites can be hosted wherever, but important functions such as random number generation have to be performed on a box hosted in the country in which the service is being consumed.
"So we went hybrid. But we had big problems with dependencies, as there were so many projects on the go," explained Stephen Lowe, director of technology at the firm, addressing Computing's recent Cloud and Infrastructure Summit.
Paddy Power Betfair has in the region of 1,000 people in its technology team, of whom 800 are developers or members of delivery teams.
"They're pushing out 500 to 600 changes a week to the production estate. The poor old infrastructure guys get left until the last minute, so a last-minute network change comes in and the whole process gets stuck," said Lowe.
He explained that many development changes have a dependency on work that needs to be performed by the network team, but that the team has its own backlog, making it hard to accommodate new changes.
"Say a change needs security. That takes time. So you complete that and do all the handovers, then you've lost three to four days' work out of your two-week sprint, which slows us down. The network team was a huge bottleneck for us," said Lowe.
Another problem was that changes worked fine in development but fell over when moved to production. The reason, according to Lowe, was the speed of the firm's growth.
"Over time we grew rapidly. We had 20 per cent growth of the customer base every year for 20 years. The dev estate didn't keep up with the production estate, so what you test in dev is nothing like what the production estate looks like. We needed one estate which was mirrored across both environments," he explained.
The final problem was capacity, especially with the firm's loads proving extremely spiky. Paddy Power Betfair's data centres largely run at around 20-30 per cent capacity, until Saturday when the figure more than doubles with the football and horse racing schedules.
"For one hour on Saturday afternoons everything's really hot. And then Grand National day is a point of pride. Staying up on that day is one of the big challenges. It's three to four times as big as anything else we do," said Lowe.
But scaling all of the firm's infrastructure for those points in time is inefficient, so the decision was made to adopt a hybrid strategy, renting capacity in the public cloud rather than having a large data centre consuming power and needing cooling all year round.
Lowe and his teams came up with four principal reasons to build a cloud environment: more stability, better testing, faster delivery and better scalability. He also wanted to give the dev team more control over infrastructure.
"If the dev team could control the bits they need, they can make the changes they require without bothering the network team. So we decided to make everything code. You can check in a firewall change as a piece of code, then use continuous delivery processes," he said.
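As a rough illustration of what checking in a firewall change as code might look like, the sketch below describes a rule as data and validates it automatically, the kind of check a continuous delivery pipeline could run before a change is applied. The `FirewallRule` class, its fields and the `validate` function are hypothetical names invented for this example; the article does not describe the tooling Paddy Power Betfair actually uses.

```python
import ipaddress
from dataclasses import dataclass


# Hypothetical rule format, invented for illustration only.
@dataclass(frozen=True)
class FirewallRule:
    name: str
    source_cidr: str      # network allowed to connect, e.g. "10.0.0.0/8"
    dest_port: int        # destination port to open
    protocol: str = "tcp"


def validate(rule: FirewallRule) -> list[str]:
    """Return a list of problems found in the rule (empty if it looks sane)."""
    errors = []
    try:
        # Raises ValueError for malformed networks.
        ipaddress.ip_network(rule.source_cidr)
    except ValueError:
        errors.append(f"{rule.name}: invalid source network {rule.source_cidr!r}")
    if not 1 <= rule.dest_port <= 65535:
        errors.append(f"{rule.name}: port {rule.dest_port} out of range")
    if rule.protocol not in ("tcp", "udp"):
        errors.append(f"{rule.name}: unsupported protocol {rule.protocol!r}")
    return errors


if __name__ == "__main__":
    # A developer would commit a rule like this alongside the application code.
    rule = FirewallRule(name="allow-web", source_cidr="10.0.0.0/8", dest_port=443)
    problems = validate(rule)
    print("OK" if not problems else "\n".join(problems))
```

Because the rule lives in version control, it can be reviewed, tested and rolled back like any other change, rather than waiting in the network team's queue.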
But in order for that to be viable, you can't have differing test and production environments. Standardisation is needed, so that was the next job on the schedule of works.
"Now we can run almost everything on x86. There is still some custom hardware, but if everything is x86 it's easier to scale up. You don't need to worry about specialist vendors, or weird firmware upgrades, and you can manage it all as code," said Lowe.
This standardised the environments and resolved the capacity problems, and the next challenge was adopting a DevOps working model.
"The ops team's mantra used to be ‘We keep the lights on’ and the dev team's was ‘We change stuff’ and never the two shall meet. Then the big DevOps revolution came. Everyone told me different definitions of DevOps, but we created a team anyway without really knowing what it was," Lowe explained.
One of the first attempts was to stick some devs and ops staff in a single team and see what happened.
"It was good but didn't solve the problem. It just moved the bottlenecks to the DevOps team. We didn't get anything through the pipeline faster, but it did improve the communication," said Lowe.
So the firm decided to train the developers to help them understand operations. The training covered how code runs in production, what it interfaces with, how load balancing works and how traffic moves around the network.
"So the devs were all trained with ops skills, and were now much more independent. That was good. But ops still thinks that its job is to keep the lights on. And the developers say if we're now doing all this infrastructure work, we're responsible for it," said Lowe.