A culture in which blame is not pinned on individuals helps Google to quickly and correctly identify and fix the root causes of faults and other technical problems.
That's according to Philip Beevers, site reliability manager at the internet giant, who was speaking at the launch of the new UK region of Google's cloud platform, where Telegraph CTO Toby Wright also described the benefits of the improved speed it would bring.
Beevers explained that this philosophy is at the core of his site reliability engineering (SRE) team at Google.
"We have a blameless post-mortem culture," said Beevers. "So after any kind of failure or outage we try to understand the root cause of the problem to stop it happening again. The idea is it's blameless, I can't stress that enough. We genuinely believe that it's processes not people that fail, and we assume all engineers act with the best of intentions.
"It means there's no fear of career consequences with investigation issues," he continued. "We're not just looking for prevention, but also the ability to detect similar problems in future. So we design to mitigate outages more quickly, and we look for ways to mitigate the impact [of outages] more rapidly than in the incident we already had."
The goal of the SRE function is to make Google's services more reliable for its customers. But Google's approach is a little different from that of most firms, Beevers said.
"Our SRE function is what you get when you ask software engineers to design an operations function. This is very different to a traditional ops function. These engineers have the same skills as the teams that build our services, but with a different domain of application: reliability and scalability.
"So there's a parity of skills between product developers and the site reliability engineers, and that changes the relationship. It encourages people to transfer between the two groups, and means that there's a free exchange of ideas and principles," said Beevers.
Another difference, he added, is the way Google measures and calibrates its reliability. Instead of purely trying to keep faults to a minimum, Google works out how many issues it can have before its customers start to feel the pain.
"That gives us an error budget," said Beevers. "It's the amount of errors you can have before you inflict undue pain on your customers.
"So if you have some error budget left then it's okay to keep launching stuff, but if not, you have to stop to avoid causing pain. We use data to take the emotion out of the decision, so there's no longer a confrontation between people wanting to launch new products and people wanting to improve reliability," he argued.
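The error budget rule Beevers describes can be illustrated with a little arithmetic. The Python below is only a minimal sketch of the idea, not Google's actual tooling: the `ErrorBudget` class, its field names and the figure of 1,000 tolerated errors per million requests are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one measurement window."""

    allowed_errors_per_million: int  # assumed tolerance, not a real Google figure
    total_requests: int              # requests served in the window
    failed_requests: int             # requests that counted as errors

    @property
    def allowed_failures(self) -> int:
        # Total number of errors the window can absorb before customers feel pain.
        return self.total_requests * self.allowed_errors_per_million // 1_000_000

    @property
    def remaining(self) -> int:
        # Budget left over; negative means the budget is overspent.
        return self.allowed_failures - self.failed_requests

    def can_launch(self) -> bool:
        # Launches continue only while some budget is left.
        return self.remaining > 0


# Illustrative numbers only.
healthy = ErrorBudget(allowed_errors_per_million=1000,
                      total_requests=2_000_000,
                      failed_requests=1_500)
overspent = ErrorBudget(allowed_errors_per_million=1000,
                        total_requests=2_000_000,
                        failed_requests=2_500)
```

In this sketch the launch decision falls out of the numbers alone, which is the point Beevers makes about taking emotion out of the argument between launching and improving reliability.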