A culture in which blame is not pinned on individuals helps Google to quickly and correctly identify and fix the root causes of faults and other technical problems.
That's according to Philip Beevers, site reliability manager at the internet giant, who was speaking at the launch of the new UK region of Google's cloud platform, where Telegraph CTO Toby Wright also described the speed improvements the region would bring.
Beevers explained that this philosophy is at the core of his site reliability engineering (SRE) team at Google.
"We have a blameless post-mortem culture," said Beevers. "So after any kind of failure or outage we try to understand the root cause of the problem to stop it happening again. The idea is it's blameless, I can't stress that enough. We genuinely believe that it's processes not people that fail, and we assume all engineers act with the best of intentions.
"It means there's no fear of career consequences with investigating issues," he continued. "We're not just looking for prevention, but also the ability to detect similar problems in future. So we design to mitigate outages more quickly, and we look for ways to mitigate the impact [of outages] more rapidly than in the incident we already had," Beevers explained.
The goal of the SRE function is to make Google's services more reliable for its customers. But Google's approach differs a little from that of most firms, Beevers said.
"Our SRE function is what you get when you ask software engineers to design an operations function. This is very different to a traditional ops function. These engineers have the same skills as the teams that build our services, but with a different domain of application: reliability and scalability.
"So there's a parity of skills between product developers and the site reliability engineers, and that changes the relationship. It encourages people to transfer between the two groups, and means that there's a free exchange of ideas and principles," said Beevers.
Another difference, he added, is the way Google measures and calibrates its reliability. Instead of purely trying to keep faults to a minimum, Google works out how many issues it can have before its customers start to feel the pain.
"That gives us an error budget," said Beevers. "It's the amount of errors you can have before you inflict undue pain on your customers.
"So if you have some error budget left then it's okay to keep launching stuff, but if not, you have to stop to avoid causing pain. We use data to take the emotion out of the decision, so there's no longer a confrontation between people wanting to launch new products and people wanting to improve reliability," he argued.
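The launch decision Beevers describes can be pictured with a short Python sketch. It is purely illustrative and is not Google's tooling: the `ErrorBudget` class, the 99.9% reliability target and the request counts are all assumptions chosen for the example. The budget is taken to be the number of failed requests a service could tolerate while still meeting its target.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one measurement window."""

    slo_target: float      # assumed target, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        # The budget: failures the target tolerates over this many requests.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures - self.failed_requests

    def can_launch(self) -> bool:
        # Launches continue only while some budget remains.
        return self.remaining > 0


# Illustrative numbers only: one million requests against a 99.9% target
# gives a budget of roughly 1,000 failed requests.
healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)

print(healthy.can_launch())    # budget left, so launches can continue
print(exhausted.can_launch())  # budget spent, so launches pause
```

Reducing the decision to a single comparison is what allows the data, rather than the two sides of the argument, to settle whether a launch goes ahead.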