What We Call Security: The Importance of Recovering Well (7/7)

Welcome to the final instalment of this series on how recovery can enable us to progress how we do information security.

We’ve seen how we can get proactive and shift focus and resources towards that approach thanks to the safety net of recovery, how Digital Twins can tackle some of our toughest legacy challenges and accelerate forensics, what the regulatory considerations are, how speed really matters, and much more.

Most of these capabilities have one particular caveat though: Your recovery capability needs to be good. It must work well at recovering your business processes in the real world. All of them. You cannot work towards implementing a proactive programme tomorrow if you aren’t confident in your ability to recover today. The distraction will consume you.

And this is where we address the more traditional or conventional aspects of backups and recovery. You may recall that one of the first things I said in this series was that I was not an expert in storage and recovery. Fortunately, there are plenty of best practices out there for us to work from.

And this being 2023, what blog wouldn’t be complete without us asking ChatGPT a few things? Let’s have a little fun with our favourite artificial friend and get their thoughts on recovery:

Organisations rely heavily on digital data and systems to run their operations. Losing this data due to system failure, cyberattacks, or natural disasters can have a devastating impact on the organisation's ability to function. Therefore, it is essential for organisations to have an effective backup and recovery plan in place.

Well thanks for that, Captain Obvious. So how do we achieve that? What’s the process?

1.     Identify critical data and systems.

The first step in planning for effective backups and recovery is to identify the critical data and systems. Not all data and systems are equally important to the organisation. Therefore, it is essential to identify the critical data and systems that need to be backed up and prioritised for recovery in case of a disaster.

Organisations can use the following criteria to identify critical data and systems:

•     Importance to business operations: Identify the data and systems that are essential for business operations. This could include customer data, financial data, inventory data, and other data that are crucial to the organisation's day-to-day operations.

•     Regulatory requirements: Identify the data and systems that are required by law or regulation. For example, the organisation may be required to keep certain financial data for a specific period of time.

•     Recovery time objective (RTO): Identify the data and systems that have a short RTO. RTO is the maximum amount of time that an organisation can afford to be without a system or data before it starts to impact business operations.

•     Recovery point objective (RPO): Identify the data and systems that have a short RPO. RPO is the maximum amount of data loss that an organisation can afford in case of a disaster.


Ok, that sounds pretty decent, applying long-established common-sense principles. Though it’s important to note that just because the IT supporting a process is down doesn’t mean the process can’t be performed manually. And it’s not just how quickly something starts having an impact, but how big that impact is. The $10,000,000 impact at midnight due to a failed batch job is a bigger worry to me than the $10 per minute we’re losing on some live activity, for example! So do consider those things in your prioritisation. It sounds basic, but I’ve seen it done improperly in a number of cases.
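To make that concrete, here’s a minimal sketch of what impact-based prioritisation might look like; the processes and figures are entirely hypothetical:

```python
from dataclasses import dataclass

@dataclass
class BusinessProcess:
    name: str
    impact_per_hour: float  # estimated loss rate while the process is down
    fixed_impact: float     # one-off loss if a deadline is missed (e.g. a batch job)

def total_impact(p: BusinessProcess, outage_hours: float) -> float:
    """Rank by total estimated impact over the outage, not just the loss rate."""
    return p.fixed_impact + p.impact_per_hour * outage_hours

processes = [
    BusinessProcess("overnight settlement batch", impact_per_hour=0, fixed_impact=10_000_000),
    BusinessProcess("live order entry", impact_per_hour=600, fixed_impact=0),  # ~$10/min
]

# For a hypothetical 8-hour outage, the failed batch job dwarfs the live activity.
for p in sorted(processes, key=lambda p: total_impact(p, 8), reverse=True):
    print(f"{p.name}: ${total_impact(p, 8):,.0f} estimated impact")
```

The point of the `fixed_impact` term is exactly the batch-job case above: a process can be cheap per minute yet catastrophic the moment a deadline passes.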

Another thing I would like to point out is how ChatGPT keeps bundling together “Data and Systems.”

When it comes to backups, I would personally advise focusing first on the data (there may not be a one-to-one correlation between data and systems), then on the business process(es) it is used in, and then on the systems supporting those processes.

When it comes to finding the data, however, looking at the business process first is likely your best option. Experience tells me this is the best way, because the business process dictates all.

That doesn’t mean the process isn’t collecting excess and unnecessary data. But that’s a separate architectural (and possibly liability) issue beyond the scope of our discussion here.

Ok, next step?

2.     Determine backup frequency and retention period.

Once the critical data and systems have been identified, the organisation needs to determine the backup frequency and retention period. Backup frequency refers to how often the data and systems are backed up, while retention period refers to how long the backup data is kept.

The backup frequency and retention period should be determined based on the criticality of the data and systems, RTO, and RPO. For example, if the critical data and systems have a short RTO and RPO, they should be backed up more frequently and kept for a longer retention period.

Organisations can use the following guidelines for determining backup frequency and retention period:

•         Daily backup: For data and systems that are critical to business operations and have a short RTO and RPO, daily backups should be performed.

•         Weekly backup: For data and systems that are less critical, weekly backups may be sufficient.

•         Monthly backup: For data and systems that are not critical, monthly backups may be sufficient.

The retention period should be determined based on regulatory requirements and business needs. For example, if the organisation is required to keep financial data for seven years, the retention period for financial data backups should be at least seven years.

More common sense here. But again, we see the generalised mention of “data and systems”, which should probably be replaced by a consideration of the business process that needs the data. In fact, ChatGPT seems to have a habit of glossing over the business and putting other things first. This is a symptom of the status quo, and something we need to change if we are to understand our organisations well enough to protect them.

One consideration ChatGPT hasn’t mentioned is the timeliness of a specific type or subset of data. Names and birthdays may be static, but if your organisation’s role is to track trends, then historical data, with a full record or a high enough sampling rate, may also be important. This touches on the subject of full backups versus incremental ones, and the possible use of transaction logs to not just bring data up to date but preserve its full history.

Something that may also be worth pointing out is that in some cases we may need copies of data significantly fresher than one day old, the most frequent cadence suggested above. In fact, even minutes of loss could be too much. Ensure you will have the data you need for each purpose.

Each of these cases could require a different approach and different planning to ensure we have a copy of the relevant data to support the recovery strategy for each business process.
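As a rough illustration of that per-process thinking, here’s a small sketch that maps RPOs to backup approaches; the process names and thresholds are made up for the example:

```python
# Hypothetical RPOs per business process, in minutes.
rpo_minutes = {
    "trend analytics feed": 5,   # near-real-time history matters
    "customer orders": 60,
    "internal wiki": 24 * 60,    # daily is fine
}

def backup_approach(rpo: int) -> str:
    """Pick an approach whose worst-case data loss stays within the RPO."""
    if rpo <= 15:
        return "continuous replication plus transaction log shipping"
    if rpo <= 4 * 60:
        return f"incremental backups every {rpo} minutes"
    return "daily full or incremental backups"

for process, rpo in rpo_minutes.items():
    print(f"{process}: RPO {rpo} min -> {backup_approach(rpo)}")
```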

Now we need to pick a solution (or solutions) that meets our needs and start working out and documenting the backup [and recovery] processes accordingly.

3.     Implement backup and recovery policies and procedures.

Backup and recovery policies and procedures should include the following:

•         Backup frequency and retention period: Document the backup frequency and retention period for critical data and systems.

•         Backup and recovery methods: Document the backup and recovery methods used, including tape backup, disk backup, cloud backup, or hybrid backup.

•         Disaster recovery plan: Document the disaster recovery plan and the steps to be taken in case of a disaster.

•         Roles and responsibilities: Document the roles and responsibilities of the backup and recovery team and other stakeholders.

Backup and recovery policies and procedures ensure that the backup and recovery plan is followed consistently and help to reduce the risk of data loss or system downtime.
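One way to keep those four elements consistent across teams is to capture the policy in a structured, machine-readable form. A minimal sketch, with hypothetical values throughout:

```python
# A hypothetical backup policy captured as structured data, so the documented
# frequency, retention, methods, DR plan, and ownership live in one place and
# can be validated or reported on automatically.
backup_policy = {
    "process": "financial reporting",
    "backup": {
        "frequency": "daily",
        "method": "disk + cloud (hybrid)",
        "retention_years": 7,  # regulatory driver for financial records
    },
    "disaster_recovery": {
        "rto_hours": 4,
        "rpo_minutes": 60,
        "runbook": "dr-runbook-financial-reporting.md",  # hypothetical document
    },
    "roles": {
        "owner": "Head of Finance Operations",
        "recovery_team": "Infrastructure & Storage",
    },
}

# Example automated check against a regulatory requirement:
assert backup_policy["backup"]["retention_years"] >= 7, \
    "financial data must be retained for at least seven years"
```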

After all that we should now have a documented backup and recovery process.

But how effective is it?

Next up, we need to test whether what we’ve thought up and documented will actually work in terms of recovering business processes. You’d be staggered at how many organisations have done this poorly, or not at all, only to get caught out in a big way.

It’s also critical to define not just how we will recover systems and data, but how the people performing the business processes will resume using them. There’s nothing quite like having worked all night to get systems ready again for the business day, only to have the entire workforce twiddling their thumbs in the morning because no one knows how to start or connect to the business application; no one has ever had to do it in living memory, because it was always just “on”.

4.     Test backups and recovery.

Once the backup and recovery solutions have been implemented, it is crucial to test them regularly. Testing ensures that the backups are valid and can be used for recovery in case of a disaster. Testing also helps to identify any issues or gaps in the backup and recovery plan.

Organisations can test backups and recovery in the following ways:

•         Partial restore: Restore a portion of the data to ensure that the backups are valid and can be restored.

•         Full restore: Restore all the data to ensure that the backups are valid and can be restored to the original system.

•         Simulated disaster: Simulate a disaster to test the recovery process and identify any issues or gaps in the backup and recovery plan.

•         Tabletop exercise: Conduct a tabletop exercise to test the backup and recovery plan and identify any issues or gaps in the plan.


Here I’d add that we should test not just the recovery solution but also the plan, including the business processes and the people running them, to make sure things align with the business and not just the IT.

Identify any issues, adjust your plan and documentation, and retest until everything goes smoothly.
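For the “partial restore” and “full restore” tests above, the simplest automated check is comparing restored files against checksums recorded at backup time. A minimal sketch, assuming a manifest of path-to-checksum pairs exists from the backup run (the `load_manifest` helper is hypothetical):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large restores don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(manifest: dict[str, str], restore_root: Path) -> list[str]:
    """Return the relative paths that are missing or corrupted after a restore."""
    failures = []
    for relative_path, expected in manifest.items():
        restored = restore_root / relative_path
        if not restored.is_file() or sha256_of(restored) != expected:
            failures.append(relative_path)
    return failures

# Hypothetical usage after a partial or full restore:
# failures = verify_restore(load_manifest("backup-manifest.json"), Path("/mnt/restore"))
# if failures:
#     raise SystemExit(f"Restore test failed for: {failures}")
```

Note that a clean checksum run proves the data came back intact; it says nothing about whether the business process on top of it actually works, which is why the plan and people need testing too.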

There’s a slight issue the status quo best practices don’t always mention: testing this stuff is really hard. It’s hard because it’s traditionally extraordinarily time-consuming, disruptive, and limited by spare resources, both human and computing.

Testing requires significant computing and storage resources the business likely doesn’t have spare, which can result in partial and fragmented testing that may let us down in a real-world scenario. Having to proceed carefully because of the associated risks also exacerbates the human effort required.

But it’s another great use case for Thin Digital Twins like those leveraged by VM2020’s CyberVR. Full-scale recovery can be tested with far less worry, meaning you can not only make sure you test everything, but also test it faster, quickly reset test cycles, and repeat until your recovery process is proven. (Not to mention perform your forensics in parallel in the case of a breach.)

Remember that all these principles apply to all data and systems, whether physical or virtual, on-premises or cloud. While people usually assume Digital Twinning solutions only work on virtual machines, I’m glad to say CyberVR gives us options for all of these scenarios.

Remember when we said that Hitachi Vantara’s Protector, when used in conjunction with VM2020’s CyberVR, could recover 1,500 VMs with a petabyte of data in 70 minutes?

Well, I thought I’d ask ChatGPT how long it thought that would take. “Days” was the answer I got back. And once I added the elements of immutability, validation, forensics, and the mix of cloud, virtual, and physical on-premises infrastructure, it became “Several days or weeks.”

You can see why this is exciting, and a big deal when it comes to recovery. And, as a CISO, how it enables me to deliver a strategic and proactive security programme that improves my organisation’s inherent resilience (where we don’t get knocked down in the first place), thanks to the safety net and the shift in priorities that Recovery-Focused Risk Management allows.

Furthermore, from a security perspective, it’s essential for your backups to be not only encrypted but also immutably stored, so that they cannot be compromised from either a confidentiality or an integrity standpoint.
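As a sketch of what that might look like in practice: encrypt the backup before it leaves the host, then write it to write-once storage. The example below uses the cryptography package and S3 Object Lock as one possible immutability mechanism; the bucket and key names are hypothetical, and the bucket would need Object Lock enabled at creation:

```python
from datetime import datetime, timedelta, timezone

import boto3
from cryptography.fernet import Fernet

# Encrypt the backup before it leaves the host. In practice the key would come
# from a KMS/HSM and must never be stored alongside the backup itself.
key = Fernet.generate_key()
with open("backup.tar", "rb") as f:
    ciphertext = Fernet(key).encrypt(f.read())

# Write the encrypted backup to write-once storage. S3 Object Lock in
# COMPLIANCE mode cannot be shortened or removed, even by administrators;
# the bucket (hypothetical here) must have Object Lock enabled at creation.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="backups-immutable",
    Key="backup-2023-04-01.tar.enc",
    Body=ciphertext,
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=90),
)
```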

Another important element, one that also significantly impacts our ability to do Recovery-Focused Risk Management with quantitative accuracy, is how consistently we can recover from incidents (and we can even test this thanks to Thin Digital Twins).

This accuracy means we can provide more assurance to our Board and makes us far more likely to meet our recovery targets. It helps us not just correctly prioritise (and budget) business resources for risk management according to the financial business risk, but also improves our ability to set and meet regulatory targets should the worst happen.
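As a simple illustration, if Thin Digital Twins let us run recovery tests cheaply and often, even basic statistics over the measured recovery times tell us how confidently we can promise an RTO; the figures below are invented:

```python
import statistics

# Invented recovery times (minutes) from repeated full-scale test cycles.
test_runs = [68, 72, 70, 75, 69, 71, 74, 70, 73, 70]
rto_minutes = 90

mean = statistics.mean(test_runs)
spread = statistics.stdev(test_runs)
within_rto = sum(t <= rto_minutes for t in test_runs) / len(test_runs)

print(f"mean {mean:.1f} min, stdev {spread:.1f} min, "
      f"{within_rto:.0%} of runs within the {rto_minutes}-minute RTO")
```

A tight spread across many runs is what turns “we should be fine” into a number you can put in front of the Board.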

One thing I want to highlight is that there is a difference between traditional backups and the kind of recovery we have been discussing in this blog, as the distinction may not be clear to security practitioners who don’t live and breathe storage.

Traditional backups are great for recovering files and data, but don’t provide a capability around recovering systems and business processes. In other words, they do not focus on recovering business services, which has traditionally required a lot of additional effort.

That doesn’t mean traditional backups are “bad”. On the contrary, sometimes all you want to do is back up and restore files, and that is something those methods are time-tested and exceedingly good at. Just keep in mind the difference between recovering critical business services and merely recovering or restoring files.

The chart below can serve as a guide which highlights and explains the differences:

Some parting words to my fellow security practitioners in this, the final instalment, of this blog:

Like me, most of us are not experts in storage and recovery. But we must accept and appreciate that it is a highly complex discipline, likely rivalling that of security.

This instalment only covers the utmost basics, and I would recommend leveraging expertise from specialised consultants (such as those at Hitachi Vantara), given the experience needed at a business and IT operations level, as well as the depth of product and technology knowledge required to get it right.

But whether we use internal or external expertise to get it right, once we do, we gain immense possibilities in changing how we do Risk Management and how we approach the work of securing and reducing risk to our organisations: moving away from mitigating and firefighting technical issues caused by business processes (including IT), and instead affecting those processes themselves, building security into them rather than adding it on wherever possible, and creating sustainable improvement in our organisation’s security posture, similar to the trend we saw in the aviation sector in this blog series’ introduction.

As technology becomes ever more important in everyone’s lives, we have an opportunity to make a lasting mark, one that truly matters.

Thank you for reading.

Make the shift today towards proven cyber resilience

If you’re ready to prove the impact your cyber initiatives are having in a business context through evidence-based solutions, we’re ready to show you.

Request Demo