What We Call Security: Recovery-based Risk Management (4/7)

This might just be my most controversial instalment in this series, for security practitioners anyway.

I’m going to come straight out and say it: I don’t like how we do “Risk Management” in Information Security. I think that in its current guise a lot of it is of very little real value, especially from a strategic or long-term standpoint.

Let me try to clarify that somewhat. As a CISO, is it my job to “manage” risk, or to sustainably reduce how much we have and create? I prefer the latter. Long-term, it’s a lot less work for me, and a lot better for the business.

I feel we too often operate with an assumption that the business wanting to do or achieve X means Y risk, but the reality is that how we go about doing X greatly affects Y.

In short, it’s usually possible to have the outcomes the business wants with a lot less risk (and I mean before throwing a lot of mitigating security resource at it), but it involves building the business processes with risk in mind rather than retroactively managing the resulting unnecessary or excess risk.

This is something that can be done with the proactive strategies and concepts I laid out in the Security as Quality instalment.

In other words, I don’t like how the security status quo has scoped “Risk Management” and the approach that is typically used for it. Together, at a macro level, they might even contribute to us staying stuck where we are: constantly “managing” new risks rather than stopping their creation and reducing, over time, the total number of risks we have to manage at any one time.

In my opinion, a lot of it stems from the following two things:

1.     We do not approach Risk Management from a fundamental business process angle. I.e.: We do not tend to focus on changing the processes responsible for continuously generating new risks, instead only dealing with the resulting risks without reducing the flow.

Risk Management is often a firefighting and reactive mitigation function rather than one that makes sustainable improvements to the responsible business process in the first place.

2.     The fact that even in the limited scope (as per above) of status quo Risk Management, it is risk scoring that serves as the basis on which most other risk management actions are taken. This would be fine were it not for the fact that, whether the scoring is qualitative or quantitative, I don’t feel we’re particularly good at being accurate.

We are, as humans, often hilariously bad at determining risks and the actual causes of risks (correlation). E.g.: Pigs kill more people than sharks, and cars kill more people than guns.

We also tend to simplify correlations to where A causes B, but there are often so many other factors involved that our simplified conclusion can be way off or even completely backwards. This can lead to not just poor risk calculations, but proposed actions that have little effect and can sometimes make things worse.


·     Most risks are assigned arbitrary values, or even 1-5 numbers, that are disconnected from a financial business impact.

·     "Quantitative" assessments rarely are. They are still based on arbitrary assumptions that can be significantly off, whether the assessor realises it or not. I find quantitative numbers rarely consider the full complexity of the situation.

·     We often lack the technical understanding of what could happen for each potential scenario. In other words, we do not know every part of every system (IPs, ports, services, operating systems, versions, installed software, patch levels, exposure windows, etc.), and all the possibilities they provide at any given time for a potential compromise.

·     We rarely understand the business context and therefore the actual business impact of systems: which business process would be stopped, degraded, or corrupted by a certain system being hit, what the knock-on impact to other systems would be, and what the associated financial cost is. We then need to repeat this assessment exercise for every possible permutation of systems being hit for any given hypothetical breach.

·     There are likely many unknowns that would affect potential attack chains, leading to risk estimates being wrong by orders of magnitude. Very few are aware of every single asset of every single type they have, let alone their dependencies, interactions, roles, access profiles, what vulnerabilities they have (or could eventually have), and how that may impact lateral movement to other things.

·     The state of our environment, the threat actors, and the array of exploitable situations is forever in flux. So are your risk scenarios and associated potential impacts.

·     If you want to get granular, there are literally millions of potential attack vectors in even a mid-sized company. Too many to track.

·     Perhaps most important of all: Good luck explaining all this stuff to a Board in under a minute!

Not surprisingly, I received strong pushback from practitioners on these views. But the fact is that in the dozens of breaches I’ve investigated, the main cause of the breach had usually not even been identified in the Risk Register. When it was, the risk measurements, including quantitative ones, were way off.

The reality of what happened and the realised impacts (which can be counted in the aftermath of a breach) typically did not line up with what was in the risk assessment.

I’m sure many readers can anecdotally relate to what I’m saying here, especially those that have suffered a breach first-hand.

In short, I don’t believe our current approach is effective and I would like to propose an alternative. Let’s call it Recovery-Based Risk Management.

The basic premise is this: If you've sorted your recovery procedures properly, then your maximum recovery time is a known quantity, let’s say X.

The business likely knows, or can readily calculate, the loss figure for X amount of downtime.

This could be calculated for the whole business or at a department or function level, but in either case you have a maximum single incident risk impact figure.
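As a rough illustration of the arithmetic, the sketch below multiplies a maximum recovery time by an hourly loss figure supplied by the business. The class name, function names, departments, and numbers are all hypothetical, chosen only to show the shape of the calculation. Summing the per-function figures into a whole-business figure is also an assumption (that a worst-case incident takes every function down at once).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessFunction:
    """A business function with figures supplied by the business (all values hypothetical)."""
    name: str
    max_recovery_hours: float  # X: the tested, worst-case recovery time
    loss_per_hour: float       # cost of downtime per hour, as stated by the business


def max_incident_impact(fn: BusinessFunction) -> float:
    """Upper bound on the loss from a single incident affecting this function."""
    return fn.max_recovery_hours * fn.loss_per_hour


functions = [
    BusinessFunction("Order processing", max_recovery_hours=4, loss_per_hour=25_000),
    BusinessFunction("Payroll", max_recovery_hours=8, loss_per_hour=2_000),
    BusinessFunction("Customer support", max_recovery_hours=2, loss_per_hour=5_000),
]

impact_by_function = {fn.name: max_incident_impact(fn) for fn in functions}

# Assumption: a worst-case incident takes every function down at once,
# so the whole-business cap is the sum of the per-function caps.
whole_business_cap = sum(impact_by_function.values())

for name, impact in impact_by_function.items():
    print(f"{name}: {impact:,.0f}")
print(f"Whole business: {whole_business_cap:,.0f}")
```

The point of the sketch is that neither input comes from security guesswork: the recovery time comes from tested recovery procedures, and the hourly loss comes from the business itself.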

Instead of trying to work out every possible combination of risk, threat, and vulnerability, current and future, on systems we are not fully familiar with (if at all), and whose full impact on the business we don’t know, we can work backwards from that maximum impact figure and dramatically simplify the equation.

Thanks to our recovery capability, we now have a maximum impact, which should have been set within the business’ risk tolerance. If there is an incident, the threat vector used, the vulnerabilities targeted, the sequence of the attack, and how many components of the business process were affected all effectively become irrelevant. We can just assume the maximum recovery time for each specific business function and associated system(s).

We don’t even need to calculate the impact in business (financial) terms ourselves because we can ask the business. After all, it’s their job to know!

Importantly, this approach is also infinitely easier to explain to management, which makes it easier to get support for. Having a single maximum risk figure is also a level of detail more suitable for executive reporting.

At this point, since we’ve essentially capped the possible impact of incidents, the main objective becomes reducing their frequency. Something that, as covered previously, is best done at a strategic level where the aim is to reduce the amount of vulnerable surface we generate as a business, rather than endlessly ramping up the reactive capacity to try and mitigate every attack we’ve made possible.

I can then apply traditional score-based practices at this broader level (rather than for individual technical risks and controls) where, due to having a definitive measurable maximum impact, they work far more effectively.

For example, rather than focusing on mitigating (“managing”) my, say, 1,500 most significant vulnerabilities, I can prioritise the remediation of the handful of issues in my IT and business processes that ultimately caused them.

Are most of my issues caused by bad code? Is it my architectural practices? What about my IAM? Is it all due to lacking management support? These tend to be relatively easy to answer and prioritise compared to thousands of individual technical vulnerabilities.

It only takes a quick glance at the types of vulnerabilities we have to give a usefully accurate qualitative assessment of how many are caused by each of these root issues. And it is with these more fundamental, less technical issues that we can use the more traditional score-based Risk Management approaches effectively, because there are far fewer variables.

They make sense when dealing with a handful of broad issues, far less so with thousands of technical risks with countless potential combinations and dependencies.
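To make the idea concrete, here is a minimal sketch of tallying findings by root cause. The findings, the root-cause labels, and the tagging itself are hypothetical; in practice the tagging is the quick qualitative judgement described above, not something a tool produces for you.

```python
from collections import Counter

# Hypothetical findings: each vulnerability is tagged with the process issue
# judged (qualitatively) to have caused it. Labels and data are illustrative only.
findings = [
    ("SQL injection in billing app", "coding practices"),
    ("Hard-coded credentials in script", "coding practices"),
    ("Flat network allows lateral movement", "architecture"),
    ("Shared admin account", "identity and access management"),
    ("Stale leaver accounts", "identity and access management"),
    ("Unpatched legacy server", "lack of management support"),
    ("Reflected XSS in portal", "coding practices"),
]

by_root_cause = Counter(cause for _, cause in findings)

# Rank root causes by how many findings they explain.
for cause, count in by_root_cause.most_common():
    share = count / len(findings)
    print(f"{cause}: {count} findings ({share:.0%})")
```

However long the list of findings gets, the list of root causes stays short, and that short list is what gets scored and prioritised.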

To give an extreme example, in terms of total risk reduction over time, it may even be mathematically better to completely ignore all the problems we have with our current assets and only focus on the processes that will produce new systems. After all, the vulnerabilities we have now will age out along with the systems they are found on, and eventually disappear. But the real point is that if we do not focus at least part of our efforts on fixing the processes that lead to those vulnerabilities existing in the first place, we will never be able to decrease how many total issues we need to “risk manage.”
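The toy simulation below is one way to picture that trade-off. Everything in it is an assumption made for illustration: vulnerabilities arrive in yearly cohorts, systems are retired after a fixed lifetime, the “firefighting” strategy fixes a fixed number of existing issues per year, and the “process” strategy fixes nothing but cuts the rate of new vulnerabilities by a fixed percentage each year. It ignores severity entirely and is not a model of any real environment.

```python
from collections import deque


def simulate(years, lifetime, new_per_year, fix_capacity=0, improvement=0.0):
    """Return open vulnerability counts at the end of each year.

    Toy model: vulnerabilities arrive in yearly cohorts and disappear when the
    systems carrying them are retired after `lifetime` years.
    """
    # Start in steady state: one full cohort for each year of system lifetime.
    cohorts = deque([float(new_per_year)] * lifetime)
    rate = float(new_per_year)
    history = []
    for _ in range(years):
        cohorts.popleft()             # oldest systems are retired
        rate *= (1.0 - improvement)   # process fixes reduce new vulnerabilities
        cohorts.append(rate)          # this year's newly created vulnerabilities
        budget = float(fix_capacity)  # remediation effort spent on existing issues
        # Spend the budget on the newest cohorts first: they would live longest.
        for i in range(len(cohorts) - 1, -1, -1):
            fixed = min(cohorts[i], budget)
            cohorts[i] -= fixed
            budget -= fixed
        history.append(round(sum(cohorts)))
    return history


# Hypothetical inputs: 5-year system lifetime, 300 new vulnerabilities a year.
firefighting = simulate(years=10, lifetime=5, new_per_year=300, fix_capacity=200)
process_fix = simulate(years=10, lifetime=5, new_per_year=300, improvement=0.25)

print("Firefighting only:", firefighting)
print("Process fix only: ", process_fix)
```

With these assumed inputs, firefighting gets ahead early but then plateaus, because the flow of new vulnerabilities never changes, while the process-focused strategy starts behind and keeps falling. Different inputs will move the crossover point, which is exactly why the numbers should be treated as illustrative.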

When Italy was the kidnapping capital of the world in the 1980s, the government did something radical to stop the problem: They made it illegal to pay the kidnappers, even freezing the assets of the victim’s friends and family so that they could not raise the ransom.

When this law first passed, it was not a good time for the victims that had already been kidnapped and could not be ransomed home. But kidnappings immediately became an ineffective way to make money and all but stopped.

I am not advocating that we completely stop addressing the technical issues that we have in our environments today, but rather that some of them can be left to the safety net of recovery so that we can shift resource to where it can do more strategic good. The more we shift, the fewer risks the business will produce for us to manage, allowing ever more resource to be shifted to proactive causes, accelerating the positive trend ever further.

In simpler words, the safety net of effective recovery allows us to shift resource from chasing technical problems, to fixing the business processes that produce them. To focus on the things that will have a lasting effect on lowering the curve.

One argument I hear against this approach, where we leverage the recovery safety net to “forego” firefighting to some degree in order to focus on root causes, is that when it comes to the CIA triad, it might work with issues relating to Integrity and Availability but not Confidentiality.

This is absolutely correct.

But that doesn’t mean we shouldn’t do it, because this “Recovery-Based Risk Management” approach can also indirectly help drive significant improvements to the confidentiality side of things.

Firstly, resource is freed due to the simplification of the Risk Management process and the lower number of risks needing mitigation (because their impacts have been mitigated instead through recovery). That freed resource can be refocused on issues that can’t be addressed with recovery, namely those possible breaches of confidentiality.

Secondly, a solid recovery plan requires knowing where your data is. That means that as part of the exercise of setting up your recovery capability, you tend to find out where the sensitive data is located.

This not only tells you where you might need confidentiality controls, but also where you don’t. Systems housing data subject to confidentiality concerns make up only a fraction of most environments.

In other words, while a “Recovery-Based Risk Management” approach means treating the Confidentiality part of the CIA triad separately, it helps you decrease its scope and gives you additional resource to tackle it.

So, in summary, recovery gives us some “provocative” things to think about when it comes to how we calculate risks and prioritise resource to remediate in the most effective way, for the business and over time. I have yet to find a situation where what I’ve referred to here as “Recovery-Based Risk Management” couldn’t at the very least augment the status quo, if not improve it dramatically, and I invite everyone to consider what it might mean for them.

And once you appreciate that, consider the advantages and changes to risk impacts of having the fastest recovery solution in the world: Hitachi Vantara and VM2020’s CyberVR. More on that in a future instalment!

Goodbye for now but please join us next time when we have a look at trends in the operational resilience space, and how rapid recovery is critical to meeting incoming regulation and the associated liability. Take care!

Next article: What We Call Security: Recovery & Regulation (5/7)
