High Availability and Disaster Recovery Overview
More and more companies rely on cloud infrastructure, so availability is essential. This article provides an overview of what High Availability means in practice and what you need to consider. I will use Microsoft Azure’s technology and technical capabilities (related to High Availability) as a general example.
What is High Availability?
Critical systems that cannot tolerate downtime must be implemented with high availability, meaning they keep operating even if some components fail. Otherwise, service interruptions could cause significant damage or financial loss.
What are the three critical components of a Highly Available environment?
The following three principles are used to design a Highly Available environment:
- No single point of failure – A single point of failure is a component that, if it fails, causes the entire system to fail. Ensure that every mission-critical element has a redundant component that can take over if the worst happens
- Failure detectability – Collect data from running systems and detect when components fail or become unresponsive
- Reliable crossover – Redundancy alone is not enough; you also need a mechanism that automatically switches from the currently active component to the redundant component when monitoring indicates that the active component has failed (a minimal sketch of this follows below)
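As a rough illustration of failure detectability and reliable crossover, the following Python sketch polls a health endpoint on the active component and switches to a standby after several consecutive failed checks. The URLs, thresholds, and intervals are invented for the example; a real setup would rely on your platform’s probes and failover mechanisms.

```python
import time
import urllib.request
import urllib.error

def check_health(url: str, timeout_s: float = 2.0) -> bool:
    """Failure detectability: probe a health endpoint and report success."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def monitor(active_url: str, standby_url: str,
            max_failures: int = 3, interval_s: float = 5.0) -> None:
    """Poll the active component and cross over to the redundant one
    after `max_failures` consecutive failed health checks."""
    failures = 0
    while True:
        if check_health(active_url):
            failures = 0                      # healthy again, reset the counter
        else:
            failures += 1
            if failures >= max_failures:      # reliable crossover
                print(f"{active_url} is unhealthy, switching to {standby_url}")
                active_url, standby_url = standby_url, active_url
                failures = 0
        time.sleep(interval_s)
```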
Technical components enabling high availability
- Backup and recovery – A system automatically backs up data to a secondary location and restores it to the source or another destination on request. A high-availability setup can share the same architecture as backup; the main difference is replication frequency, which determines how quickly the system is available again and how much data can be lost. The most demanding form is real-time data mirroring
- Load Balancing – A load balancer manages traffic and routes it between multiple systems that can serve that traffic. It can detect the failure of one of the target systems and redirect traffic to another available system; the same effect can also be achieved through DNS changes (or multi-target zone files) or connection redirection. A simple sketch of this pattern follows after this list
- Clustering – A cluster contains multiple nodes (computers) that act as a single server to provide continuous uptime. Users typically access the cluster as if it were a single node. If a node fails, its workload fails over to another node; because the nodes share the same data store, the remaining nodes take over the load. A high-availability cluster can contain a large number of nodes
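Here is a minimal sketch of the load-balancing pattern described above: requests are distributed round-robin over a set of backends, and any backend whose health probe fails is skipped. The backend addresses and the /health path are made up for the example.

```python
import itertools
import urllib.request
import urllib.error

# Example backend addresses; in practice these come from your infrastructure.
BACKENDS = ["http://10.0.0.4:8080", "http://10.0.0.5:8080", "http://10.0.0.6:8080"]
_rotation = itertools.cycle(BACKENDS)

def is_healthy(backend: str, timeout_s: float = 1.0) -> bool:
    """Health probe: a backend counts as healthy if /health returns HTTP 200."""
    try:
        with urllib.request.urlopen(f"{backend}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_backend() -> str:
    """Round-robin over the backends, skipping any that fail the probe."""
    for _ in range(len(BACKENDS)):
        candidate = next(_rotation)
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("No healthy backends available")

# Example: route a single request to the next healthy backend.
# target = pick_backend()
# print(f"Forwarding request to {target}")
```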
The Big 5 High Availability Checklist
Step 1: Define the requirements
Based on the business risk, identify the critical components of your processes and determine the maximum acceptable impact of an outage. Use this to decide which components should be made Highly Available.
The impact can be expressed with the following metrics (a small worked example follows the list):
- Uptime % – the percentage of time your system is operational
- Mean time to repair (MTTR) – the average time needed to repair a failed component and return it to service
- Mean time between failures (MTBF) – the average time a component operates before failing
- Recovery Time Objective (RTO) – the maximum acceptable time between a failure and the restoration of service
- Recovery Point Objective (RPO) – the maximum amount of data loss that’s tolerable
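To make these figures concrete, here is a small, illustrative calculation (the numbers are invented) that derives uptime % from MTBF and MTTR and translates an uptime target into an allowed downtime budget per month:

```python
# Illustrative figures only.
mtbf_hours = 1000.0   # mean time between failures
mttr_hours = 2.0      # mean time to repair

# Steady-state availability follows from MTBF and MTTR.
availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Uptime: {availability:.4%}")                      # ~99.80%

# Translate an uptime target into an allowed-downtime budget.
target = 0.999                    # "three nines"
hours_per_month = 730
allowed_downtime_min = (1 - target) * hours_per_month * 60
print(f"{target:.1%} uptime allows ~{allowed_downtime_min:.0f} min of downtime per month")
```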
Azure publishes a Service Level Agreement (SLA) for each service. However, you can add redundancy to achieve a higher effective SLA than a single service offers. I strongly suggest consulting with vendors, including your Managed Service Providers, regarding SLAs for all critical components.
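As a rough illustration of why redundancy raises the effective SLA (the percentages below are examples, not Azure’s published figures): services that depend on each other multiply their SLAs, while redundant instances only fail when all of them fail at once.

```python
# Example SLAs; not Azure's published numbers.
vm_sla = 0.999
db_sla = 0.9995

# Services in a chain: the composite SLA is the product (lower than either part).
chain_sla = vm_sla * db_sla
print(f"VM + database in series: {chain_sla:.4%}")        # ~99.85%

# Two redundant VMs: the pair is down only if both fail at the same time.
redundant_vm_sla = 1 - (1 - vm_sla) ** 2
print(f"Two redundant VMs: {redundant_vm_sla:.6%}")       # ~99.9999%
```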
Step 2: Plan your High Availability Architecture
- Start with an inventory of all the critical components – Identify the possible failure types, the impact of each, and the corresponding recovery strategies, and make sure your IT department and partner(s) support them. Determine the degree of redundancy required for every component, avoid single points of failure, and use load balancing to distribute requests between resources
- Replicate your data – Make sure your data is replicated to another location (preferably offsite); you cannot meet RPO and RTO requirements if up-to-date data is not available. A simple RPO check is sketched after this list
- Test your HA setup and disaster plan – Ensure that the HA scenario has been tested and that critical processes continue to run in the event of a disruption. Provide additional capacity and schedule regular recurring testing. Verify that business activities can continue during an emergency
- Consider costs – Remember that every layer of redundancy adds to your cloud costs (at least while the redundant components are active or in place). Also confirm that you have all the necessary licenses and infrastructure capacity (storage, networking and bandwidth) to support the additional redundant instances
- Document everything – Document the steps required for automatic or manual failover, and make sure the documentation is updated regularly and is clear to everyone
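The sketch below illustrates the replication point: it compares the timestamp of the most recent replicated copy against the RPO target. The `latest_replica_timestamp()` helper is hypothetical; in a real setup it would query the metadata of your replication target.

```python
from datetime import datetime, timezone, timedelta

RPO = timedelta(minutes=15)   # example target: lose at most 15 minutes of data

def latest_replica_timestamp() -> datetime:
    """Hypothetical helper: return the timestamp of the newest replicated
    copy at the secondary location. Replace with a query of your replica."""
    return datetime.now(timezone.utc) - timedelta(minutes=5)   # placeholder value

def check_rpo() -> bool:
    """Alert when replication lag exceeds the RPO target."""
    lag = datetime.now(timezone.utc) - latest_replica_timestamp()
    if lag > RPO:
        print(f"ALERT: replication lag {lag} exceeds RPO {RPO}")
        return False
    return True

# check_rpo() could run on a schedule and feed your monitoring system.
```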
Step 3: Test, test, test
It’s essential to test systems regularly under realistic conditions; doing so ensures reliability and surfaces issues before they cause damage. Procedures should be exercised under different faults (including multiple simultaneous defects) to understand how recovery behaves, and testers should measure how long critical processes take to recover after a failure.
Additional tests that can be useful:
- Test regularly – Run a real load test until the system fails and observe how the failover mechanism behaves. Plan this test every couple of weeks
- Test your Load Balancer probes – Make sure that your (Load Balancer) health probes are working correctly
- Conduct disaster recovery exercises – Run planned or unplanned failure scenarios and have your team work through your disaster recovery plan under realistic conditions; measuring how quickly service returns is part of the exercise (see the sketch after this list)
- Test the monitoring system – Regularly check that the data coming from the monitoring system is correct, so that you can catch errors in time
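To make the recovery-time measurement in such exercises repeatable, the sketch below polls a service endpoint after a fault has been injected and reports how long the service stayed down. The URL and the fault-injection step are placeholders for your own environment.

```python
import time
import urllib.request
import urllib.error

def wait_until_healthy(url: str, timeout_s: float = 2.0, give_up_s: float = 600.0) -> float:
    """Poll `url` until it answers with HTTP 200 again and return the
    measured recovery time in seconds (or raise if it never recovers)."""
    start = time.monotonic()
    while time.monotonic() - start < give_up_s:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(1)
    raise TimeoutError(f"{url} did not recover within {give_up_s} seconds")

# Typical exercise: inject a fault (e.g. stop a node), then measure.
# recovery_s = wait_until_healthy("http://myapp.example.com/health")
# print(f"Recovered in {recovery_s:.0f} s; compare this against your RTO")
```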
Step 4: Deploy and manage your environment consistently
- Any change can lead to failure – Errors can occur whenever new Azure services are introduced, new application code is deployed, or existing configurations are changed. Using an automated deployment process reduces the chance of introducing errors into your system and makes it easier to recover from outages or mistakes later
- Consider availability in the approval process – Design your release process to allow updates with minimal service disruption; aim for continuous updates of critical components that do not require downtime. For example, you can use a blue-green deployment or a similar strategy to run multiple versions of the production environment at the same time and switch between them to migrate to the new version (a rough sketch follows after this list)
- Plan rollback – Design a rollback process that automatically restores your system to a previously working version. Additionally, keep an automated deployment of the complete environment that represents your last known good configuration
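A rough sketch of the blue-green switch and its rollback path, assuming a hypothetical `point_traffic_at()` call that updates whatever routes production traffic (a load balancer rule, DNS record, or traffic-manager profile):

```python
import urllib.request
import urllib.error

def is_healthy(url: str) -> bool:
    """Smoke test: the environment counts as healthy if /health returns HTTP 200."""
    try:
        with urllib.request.urlopen(f"{url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def point_traffic_at(environment_url: str) -> None:
    """Hypothetical: update the load balancer / DNS so that production
    traffic goes to this environment."""
    print(f"(placeholder) routing production traffic to {environment_url}")

def blue_green_switch(blue_url: str, green_url: str) -> str:
    """The new release is assumed to be deployed to `green_url`; switch
    traffic to it only if it is healthy, otherwise stay on (or roll back
    to) the last known good environment."""
    if is_healthy(green_url):
        point_traffic_at(green_url)
        return "green"
    point_traffic_at(blue_url)     # rollback path
    return "blue"
```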
And finally, Step 5: Monitor Application Health
- Find failures in time using probes and inspections – Timely detection of failures is critical for high availability. For example, you can implement Azure health probes and probe functions to get up-to-date service availability data in Azure. It is best to run these audit functions outside the application itself
- Be aware of subscription restrictions – Errors may occur if you exceed the limits of any cloud platform service (for example, Azure). Therefore, make sure you understand the storage, compute, throughput, and other limits of each cloud service you use
- Use logging and auditing – Use the monitoring and logging capabilities whenever possible. Use system and audit logs and don’t forget performance data (latency, throughput, errors)
- Watch for declining health metrics – Don’t just pay attention to complete failures; deteriorating health indicators can be an early warning of impending disruption. Build an early-warning system by identifying key indicators of application health and alerting operators when the system crosses problematic thresholds (a small illustration follows after this list)
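As a small illustration of such an early-warning check (the thresholds and metric sources are invented), the sketch below keeps a sliding window of request latencies and errors and flags degradation before the service fails outright:

```python
from collections import deque

LATENCY_THRESHOLD_MS = 500      # example warning threshold
ERROR_RATE_THRESHOLD = 0.05     # warn above 5% errors

class HealthWatcher:
    """Keep the last N requests and warn when the averages drift
    toward the problematic thresholds."""

    def __init__(self, window: int = 100):
        self.latencies = deque(maxlen=window)
        self.errors = deque(maxlen=window)

    def record(self, latency_ms: float, failed: bool) -> None:
        self.latencies.append(latency_ms)
        self.errors.append(1 if failed else 0)

    def check(self) -> list[str]:
        warnings = []
        if self.latencies and sum(self.latencies) / len(self.latencies) > LATENCY_THRESHOLD_MS:
            warnings.append("average latency is degrading")
        if self.errors and sum(self.errors) / len(self.errors) > ERROR_RATE_THRESHOLD:
            warnings.append("error rate is rising")
        return warnings

# watcher = HealthWatcher()
# watcher.record(latency_ms=620, failed=False)
# for w in watcher.check():
#     print(f"WARNING: {w}")    # page an operator before a full outage
```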