Having a blameless culture... or not
A contentious topic follows - I reckon there may be disagreement with my opinions - feel free to respond :)
I’m currently reading Site Reliability Engineering, which describes the DevOps practices at Google. It’s an interesting book - one topic it covers is their blameless culture when responding to failures. I’ve never worked at Google, so I can’t say how much of their reputation for teams of A-list engineers is deserved and how much is good PR, but the assumptions that make those practices effective at Google’s development sites don’t hold in the general case (at least in my experience). Specifically, the assumption that one can rely on the conscientiousness and goodwill of all engineers.
One of the reasons contractors get brought into companies is to provide a short-term skills boost for projects in trouble. We get brought in to put out fires, or to staff a project that has already gone off the rails. Sometimes we get to see the companies where everything isn’t working well. If companies had software engineering as right as Google seems to, there would be no need for contractors. They would have exactly the right number of highly skilled, highly motivated, competent and conscientious engineers. Wouldn’t that be nice?
The reality isn’t as shiny. Most companies, while they have teams of pretty good engineers on average, also have to deal with engineers at the bottom end of the bell curve. These cover a range of issues - the work-to-rule jobsworths who do the minimum required to get paid, the well-intentioned but careless engineers who don’t check their work and make mistakes, through to the people who knowingly do sub-standard work because they quite frankly couldn’t give a #### about their job, team, product or company. Most of us have seen or had to deal with each of these at some point. If you honestly haven’t noticed anyone like that during your career, I’ve got some bad news for you… ;)
Having a blameless culture is all fine when errors are honest mistakes that slipped through despite the best efforts of everyone involved. I’ve made some howlers in my time, so I am in no way claiming infallibility. In those cases everyone is rightly focused beyond who screwed up, on how it was possible to screw up, and how to prevent it happening (in that way) again.
However, when the mistakes occur because someone (or, more realistically, a whole chain of someones) didn’t do their job right, then the correct response in my opinion is to ensure those responsible get their collective asses handed to them (depending on the severity of the failure, and how often those particular people make mistakes), while simultaneously doing the rest of the postmortem to reduce the risk of recurrence.
The difficulty here is ensuring that the blame is correctly allocated. Indeed, that is often so difficult that it motivates the blameless culture in the first place - better to let a mistake caused by carelessness (at best) risk recurrence, and allow an environment where the careless, incompetent, or outright lazy can thrive, than to throw blame at the wrong people.
The approach described in the book stresses focussing on events rather than people during postmortems. In a well-functioning company or development environment that makes perfect sense. But what about the cases where the systemic cause of failures is the people? Not cases where the system lacked the checks and balances that would let careful teams (in the sense that they care about the product) detect issues early, but cases where avoiding those issues just wasn’t a priority for the people involved.
I’ve had to deal with some utter cowboys in the extended teams I work with - people who knew they were screwing up, but also knew they could happily get away with it. It’s not pleasant. On some projects it required us to construct safety nets within and around the team to catch bugs and mistakes before they could do significant damage: monitoring everything the bad actors touched, building walls of defensive tests to protect our work from dodgy external (to the team) subsystems, and checking the work of teams our team interacted with because we just couldn’t trust them to do the job correctly.
It was a resource drain, and a cause of considerable ill-feeling within the team and larger enterprise. The better solution would be to remove the under-performing or rogue elements from the project, but I’ve rarely seen that happen. I have a lot of respect for the team leads and managers that did actually bite the bullet and restructure teams to remove the problem elements.
Paradoxically, one of the things I like about XP in practice is that it doesn’t cope well with bad actors within a team. Yes, pairing and the other practices mitigate the issue in the short term, and potentially motivate the under-performers to up their game - team spirit is strong with XP. However, if there isn’t a turnaround, it is common practice to remove such developers from the team. There’s little scope for someone to coast within an XP team.
At the other end of the scale, having a culture where people live in fear of getting blamed for a screw-up (rightly or wrongly) is also unhealthy. I’ve been in teams and environments where that was the working practice. It doesn’t work out well in the long term: the culture is unpleasant, and the more mobile, skilled developers leave for somewhere nicer.
There has to be a ‘happy’ middle ground, where conscientious engineers can happily get their work done, while also maintaining enough accountability that bad actors cannot thrive in the company.
I’d be interested to know how others deal with these issues. It’s something that has recurred over the years, particularly in the larger organisations I’ve worked with. Smaller companies have less scope for these problems - there’s only so much dead weight a small company can carry before going bust, and bad actors can’t hide in the system as easily in a small team or company.
Shared at https://www.linkedin.com/pulse/having-blameless-culture-donal-stewart