Frequent releases reduce risk

This post expands on a train of thought initiated by Dan North in his talk “Kicking the Complexity Habit” at NDC London 2014.

“Frequent releases reduce risk” – this is something you hear all the time in conversations about Continuous Delivery. How exactly is this the case? It sounds counter-intuitive. Surely releasing more often introduces more volatility into Production? Isn’t it less risky to hold off releasing for as long as possible and take your time with testing to guarantee confidence in the package?

Let’s think about what we mean by risk.

What is risk?

Risk is a function of the likelihood of a failure occurring combined with the worst case impact of that failure:

Risk = Likelihood of failure x Worst case impact of failure

Therefore an extremely low risk activity is one where failure is incredibly unlikely to happen and the impact of a failure is negligible. Low risk activities also include those where one of these factors is so low that it severely reduces the effect of the other.

Playing the lottery is low risk – the chance of failing (i.e. not winning) is very high, but the impact of failing (i.e. losing the cost of the ticket) is minimal, which is why many people play the lottery.

Flying is also low risk because the factors are balanced the opposite way. The chance of a failure is extremely low – flying has a very strong safety record – but the impact of a failure is extremely high. We fly often because we consider the risk to be very low.

High risk activities are those where both factors are high – a high likelihood of failure and a high impact – for example extreme sports such as free solo climbing and cave diving.
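
To make the arithmetic concrete, here is a rough sketch of the formula in Python. The likelihood and impact numbers are invented purely for illustration – they are not real statistics – but they show how the two factors combine.

    # Rough sketch of Risk = likelihood of failure x worst case impact.
    # All numbers are invented for illustration only.

    def risk(likelihood_of_failure: float, worst_case_impact: float) -> float:
        return likelihood_of_failure * worst_case_impact

    activities = {
        "playing the lottery": risk(0.9999999, 2),         # almost certain to lose, trivial impact
        "commercial flight":   risk(0.0000001, 1_000_000),  # extremely unlikely, catastrophic impact
        "free solo climbing":  risk(0.01, 1_000_000),       # both factors high
    }

    for name, score in sorted(activities.items(), key=lambda kv: kv[1]):
        print(f"{name:20} risk score: {score:,.2f}")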

Large, infrequent releases are riskier

Rolling a set of changes into a single release package increases the likelihood of a failure occurring – a lot of change is happening all at once.

The worst case impact of a failure includes the release causing an outage and severe data loss. Each change in a release could cause this to happen.

The reaction to try to test for every failure is a reasonable one, but it is impossible. We can test for the known scenarios, but we can’t test for scenarios we don’t know about until they are encountered (“unknown unknowns”). This is not to say that testing is pointless; on the contrary, it provides confidence that the changes have not broken expected, known behaviour. The tricky part is balancing the desire for thorough testing against the likelihood of those tests finding a failure and the time taken to perform and maintain them.

Build up an automated suite of tests which protect against the failure scenarios you know about, and each time a new one is encountered add it to the test suite – a sketch of what such a test might look like follows below. Increase your suite of regression tests, but keep them light, fast and repeatable.

No matter how much you test, Production is the only place where it counts.
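
As an illustration of that kind of regression suite, here is a minimal pytest-style sketch. The parse_order_total helper and the failing input are hypothetical – the point is simply that each newly encountered failure becomes a small, fast, repeatable check.

    # Minimal pytest-style sketch: a regression test added after a production
    # incident. parse_order_total and the failing input are hypothetical.

    def parse_order_total(raw: str) -> float:
        """Parse a currency amount, tolerating a currency symbol and thousands separators."""
        return float(raw.lstrip("£$").replace(",", ""))

    def test_parses_plain_amount():
        # Known, expected behaviour we already rely on.
        assert parse_order_total("19.99") == 19.99

    def test_parses_amount_with_thousands_separator():
        # Added after an incident where "£1,099.00" caused a failure in Production.
        # Keeping it in the suite stops that failure from being reintroduced.
        assert parse_order_total("£1,099.00") == 1099.00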

Small, frequent releases reduce the likelihood of a failure

Releasing often, with each release containing as small a change as possible, reduces the likelihood that any given release will contain a failure.
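
A back-of-the-envelope sketch shows why. Suppose, purely as an assumption for illustration, that each change carries an independent 2% chance of introducing a failure:

    # Probability that a release contains at least one failure, assuming each
    # change independently fails with probability p_per_change (2% is an
    # invented figure for illustration).

    def p_release_fails(changes: int, p_per_change: float = 0.02) -> float:
        return 1 - (1 - p_per_change) ** changes

    print(f"1-change release:  {p_release_fails(1):.0%} chance of containing a failure")
    print(f"20-change release: {p_release_fails(20):.0%} chance of containing a failure")
    # With the assumed figures: roughly 2% versus 33%.

The same total amount of change ships either way; what falls is the chance that any single release is the one that contains a failure.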

There’s no way to reduce the impact of a failure – the worst case is still that a release could bring the whole system down and incur severe data loss – but we lower the overall risk with smaller releases.

Release small changes often and reduce the likelihood of a failure and therefore the risk.

One Comment

  1. Hi Chris

    This is a great introduction to the counter-intuitive truth that smaller, more frequent changes reduce risk.

    You are entirely right that smaller changes reduce the probability of a failure, as splitting a large experiment into smaller independent experiments reduces variation in outcomes – i.e. by dividing the risk associated with a release across smaller releases, we can more accurately estimate where failure probability truly resides and act upon that new information.

    However, saying “there is no way to reduce the impact of a failure” is incorrect – smaller, more frequent changesets can reduce the cost of failure as well as the probability of failure. I outline why in my Release Testing Is Risk Management Theatre talk (see 35:17 of https://skillsmatter.com/skillscasts/5394-release-testing-is-risk-management-theatre). The cost of a failure is also two-dimensional – the economic impact of failure and the duration of failure. The former is hard to influence, let alone control, due to external market forces, but duration is easy to control thanks to Little's Law – less WIP means a lower lead time, which means a shorter failure duration, which means a lower cost.

    That is why production defect fixes always go out in small changesets – to minimise the opportunity cost of the current defect, *and* to reduce the probability of causing further defects with the fix.

    S
