Why is a disaster recovery plan necessary for SaaS data?
We hear this question at least once a day. When businesses develop their own software, they have control over data privacy, security, and reliability. Cloud applications take away much of the responsibility of hosting your own solutions, but not all of it.
This split is called the shared responsibility model. It's often associated with AWS, but the concept governs all cloud computing, including Atlassian's cloud products.
Essentially, you and the cloud provider share the responsibility of protecting your data. The SaaS provider takes care of the infrastructure and the application itself, but not user access or user data. You are on the hook for safeguarding those things.
While some SaaS products offer backup and restore capabilities, they may lack fine-grained recovery controls, comprehensive data visibility through detailed audit logs, or guarantees of data safety. Therefore, having a disaster recovery plan for each SaaS product you depend on is essential.
Step one: Define your RTO and RPO
The two key metrics that should drive your recovery strategy are the Recovery Point Objective (RPO), which defines how much data loss you can tolerate, and the Recovery Time Objective (RTO), which sets the target for how quickly you must be back up and running.
The RPO metric revolves around one fundamental question: "How much data can you afford to lose?" If you back up data once every 24 hours at midnight and a disaster strikes at 11:59 PM, you lose nearly an entire day's worth of data!
That level of risk will be tolerable for some businesses but unacceptable for others. Defining your risk tolerance for data loss will guide your technical decisions in meeting these requirements effectively.
To determine your RPO, you'll want to consider the criticality of your data and the impact of potential data loss on your organization. Different services or applications may have varying RPO thresholds. For instance, mission-critical systems may demand real-time replication with zero or near-zero data loss, while non-critical systems might tolerate a longer interval between backups.
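As a rough worked example of the midnight-backup scenario above, the sketch below computes how much data would be lost for a given disaster time and checks a backup interval against an RPO target. The function names, the example dates, and the 4-hour RPO are illustrative assumptions, not values from any particular product.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, disaster_time):
    """Return the time between the last completed backup and the disaster."""
    earlier = [t for t in backup_times if t <= disaster_time]
    if not earlier:
        raise ValueError("no backup exists before the disaster")
    return disaster_time - max(earlier)


def meets_rpo(backup_interval, rpo):
    """Worst-case loss equals the backup interval, so it must not exceed the RPO."""
    return backup_interval <= rpo


# Hypothetical schedule: one backup every night at midnight.
backups = [datetime(2024, 1, 1), datetime(2024, 1, 2)]
disaster = datetime(2024, 1, 2, 23, 59)

print(data_loss_window(backups, disaster))  # 23:59:00
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=4)))  # False
print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4)))   # True
```

The point of the second function is that the backup interval, not the average case, sets your worst-case loss: a daily backup can never satisfy a 4-hour RPO.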
The RTO metric focuses on how quickly you can recover from a disaster and resume normal operations.
Imagine a scenario where a meteorite strikes your data center. How long would it take to get your systems back up and running? This involves factors such as procuring alternate infrastructure or restoring from backups. The time required for recovery varies depending on the service in question.
Some services can achieve rapid recovery times, potentially within minutes, while others may take considerably longer, perhaps an entire day or more. Understanding the unique recovery timelines for each service or application is essential for effective disaster recovery planning.
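One way to sanity-check an RTO is to add up the steps a recovery would actually involve and compare the total with the target. The sketch below does that for a single service. The step names and durations are made-up placeholders, and it assumes the steps run one after another rather than in parallel.

```python
from datetime import timedelta

# Hypothetical recovery steps and durations for one service.
recovery_steps = {
    "detect and declare the incident": timedelta(minutes=30),
    "provision alternate infrastructure": timedelta(hours=2),
    "restore latest backup": timedelta(hours=3),
    "verify data and reopen access": timedelta(minutes=45),
}


def estimated_recovery_time(steps):
    """Sum step durations, assuming the steps run sequentially."""
    return sum(steps.values(), timedelta())


def rto_gap(steps, rto):
    """Return how far the estimate exceeds the RTO (zero if within target)."""
    return max(estimated_recovery_time(steps) - rto, timedelta())


print(estimated_recovery_time(recovery_steps))  # 6:15:00
print(rto_gap(recovery_steps, rto=timedelta(hours=4)))  # 2:15:00
print(rto_gap(recovery_steps, rto=timedelta(hours=8)))  # 0:00:00
```

A non-zero gap means either the target is unrealistic or the recovery process needs to get faster, which is a useful thing to bring to stakeholders.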
A key way to get started is by engaging with stakeholders across your organization to ensure their buy-in and agreement on the defined RPO and RTO targets. This collaboration will foster a shared understanding of the potential impacts of a disaster and the required recovery timelines.
Step two: Choose a recovery strategy
Choosing the right strategy boils down to a trade-off of robustness versus cost. Here's a graphic from AWS that also applies to products provided by Atlassian and helps illustrate the spectrum of choices:
Let's start on the left side with the "Backup & Restore" solution. It's the simplest and most affordable option. However, the recovery time can take hours, or even longer. Essentially, you're restoring your latest backup(s) to your disaster recovery location.
Moving along, we have the "Pilot Light" option. With this approach, you keep only the essential core services running at reduced capacity in the DR location. The remaining services are deployed but scaled to zero until you need them. Code or application updates are pushed to the DR location just as they are to your primary location.
Next up is the "Warm Standby" strategy. Here, everything is up and running, albeit at a smaller capacity compared to your primary environment. It's similar to the "Pilot Light" option, but all services are operational with at least some capacity and nothing is scaled to zero.
Finally, we have the active/active solution. This is the most comprehensive approach where you run full services in two parallel streams, allowing you to switch between them in near real-time. However, it's worth noting that this option comes with doubled costs, making it less feasible for some companies.
Your choice of recovery strategy depends on your risk tolerance and how much you're willing to invest to mitigate that risk. Depending on your industry, you'll have core systems that form the foundation of your operations, as well as peripheral systems. A full day of downtime for a core system is painful for most businesses, but the impact may be far less significant for a secondary service, such as a tool for running marketing programs or gathering statistics.
This means you'll need distinct disaster recovery plans for various systems within your organization. While there may be overlapping elements between service disaster recovery plans, it's important to review each service's plan individually, because RPOs and RTOs can vary from one service to the next.
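To make the per-service idea concrete, here is a minimal sketch that maps each service's tightest target (RTO or RPO) to one of the four strategies described above. The cut-off values and the example services are assumptions chosen purely for illustration. They are not AWS or Atlassian recommendations, and you would tune them to your own budget and risk tolerance.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Service:
    name: str
    rto: timedelta
    rpo: timedelta


# Assumed cut-offs, ordered from most to least demanding.
STRATEGY_THRESHOLDS = [
    (timedelta(minutes=5), "active/active"),
    (timedelta(hours=1), "standby"),
    (timedelta(hours=8), "pilot light"),
]


def suggest_strategy(service):
    """Return the strategy for the first (tightest) bucket the service fits into."""
    tightest = min(service.rto, service.rpo)
    for limit, strategy in STRATEGY_THRESHOLDS:
        if tightest <= limit:
            return strategy
    return "backup & restore"


# Hypothetical services with hypothetical targets.
services = [
    Service("issue tracker", rto=timedelta(minutes=30), rpo=timedelta(minutes=15)),
    Service("marketing analytics", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
]

for s in services:
    print(s.name, "->", suggest_strategy(s))
```

In practice the decision also depends on cost and on what each vendor supports, so treat the output as a starting point for discussion rather than a verdict.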
Step three: Test your disaster recovery plan
To test your disaster recovery plan effectively, use this checklist to focus your efforts:
- A tabletop test is a great way to start by putting all the ideas and methods in front of everyone. It allows all stakeholders to have a say in how you should proceed with your disaster recovery efforts.
- Walk through any internal and external dependencies that could potentially hinder or even prevent your disaster recovery plan from being effective. It's vital to address and resolve these dependencies before implementing the plan.
- The tabletop meeting forms the foundation of your disaster recovery plan, so document everything thoroughly. It's a good idea to get participants to sign off on the documentation to avoid any confusion later on.
- Create accountability lists to identify who is on call if a disaster disrupts your business. These people will be responsible for executing different phases of the DR plan. The list should be clearly outlined and updated so newer team members are aware of who does what in an emergency.
- While you can't plan for every possible disaster, it's important to assess and prioritize the risks you face. Whether it's malware attacks, data center outages, or third-party provider outages, pick which ones to prepare for.
- When SaaS tools are vital to processes and workflows, outages can drain productivity and cash. Atlassian's fourteen-day outage in April 2022 is one such example; it affected over 50,000 users. Customers lost access to critical SaaS products like Jira, Confluence, and Opsgenie. The outage also resulted in the loss of data. By Atlassian's own calculation, the average cost of downtime to customers is $5,600 per minute, but the actual cost varies from business to business.
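The risk-prioritization item in the list above can be sketched with a simple likelihood-times-impact score, a common way to rank risks. The 1 to 5 scales and the scores below are placeholder assumptions. Replace them with your own assessment.

```python
# Hypothetical risks scored 1 (low) to 5 (high); the scores are placeholders.
risks = [
    {"name": "malware attack", "likelihood": 3, "impact": 5},
    {"name": "data center outage", "likelihood": 1, "impact": 5},
    {"name": "third-party provider outage", "likelihood": 4, "impact": 3},
]


def prioritize(risks, top_n):
    """Rank risks by likelihood x impact and keep the top_n to plan for."""
    ranked = sorted(risks, key=lambda r: r["likelihood"] * r["impact"], reverse=True)
    return ranked[:top_n]


for risk in prioritize(risks, top_n=2):
    print(risk["name"], risk["likelihood"] * risk["impact"])
# malware attack 15
# third-party provider outage 12
```

Capping the list with `top_n` reflects the idea that you can't plan for everything, so you deliberately choose which scenarios get a tested plan.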
Disaster recovery checklist
At a glance, your checklist for testing your disaster recovery plan should look something like this:
- Understand why you need a SaaS disaster recovery plan.
- Set your RPO and RTO.
- Confirm with relevant stakeholders what business functions you want to protect and secure.
- Decide on the internal and external tooling required to carry out a disaster recovery plan, taking into consideration the security and privacy of your SaaS data.
- Create clear and relevant accountability lists that illustrate who will be on call for what and in what situations.
- Scope the types of disasters you're planning for because you can't plan for everything.
- Document the plan, make it easily accessible, and have the proper stakeholders sign off on it.
Published: Jul 19, 2023
Updated: Oct 28, 2024