Cloud basics 101: High Availability

Understanding basic cloud architecture concepts for resilient systems

Abderrahmane Smimite
3 min read · Oct 31, 2018
Credits: (picture is not mine, I’ve lost track of the original author)

Understand High Availability (HA) concepts

This is an extract from my series on Cloud basics. This set of slides explains some general concepts on how to build resilience and ensure high availability for your cloud applications.

Keep in mind that I am presenting this content for a non-tech-savvy audience, so there might be some debatable inaccuracies :)

This post is part of my lightning talks. It describes how we manage HA design and gives a quick overview of Kubernetes.

“Everything fails, all the time.” (Werner Vogels, CTO, amazon.com)

Summary

My ten steps to design highly available systems:

Business drives HA requirements

Define your SLA (Service Level Agreement) and how many nines you actually need. Everything else will depend on it (provider, regions, availability zones, multi-provider, etc.). Here’s a simple calculator, https://uptime.is, to match the nines to the actual tolerable downtime: one nine (90%) allows a maximum downtime of 36.5 days per year, whereas five nines (99.999%) allow only about 5.26 minutes.

One piece of advice: don’t overdo it, at least not at the beginning. You’ll always have room for improvement.
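The arithmetic behind the nines is simple enough to sketch yourself. The snippet below is a minimal illustration, assuming a 365-day year; the function name `max_downtime_minutes` is made up for this example and is not taken from the calculator above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year (no leap days)


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


targets = [
    ("one nine", 90.0),
    ("two nines", 99.0),
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]

for label, percent in targets:
    minutes = max_downtime_minutes(percent)
    # 1440 minutes in a day, so dividing gives the budget in days
    print(f"{label:>11} ({percent}%): {minutes:10.2f} min/year ({minutes / 1440:.3f} days)")
```

Each extra nine divides the downtime budget by ten, which is a good reminder of why every extra nine tends to cost much more than the previous one.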

Define your objectives and resources

Based on the previous item, you need to identify your Recovery Time Objective (RTO) (→ how quickly must the system recover?) and your Recovery Point Objective (RPO) (→ how much data can you afford to lose?), and, accordingly, how much money and time you can invest to meet these objectives.
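To make RTO and RPO concrete, here is a rough sketch that checks a backup strategy against both objectives. It assumes the simplest possible model: the worst-case data loss equals the time between two backups, and recovery time equals the restore time. The names `RecoveryPlan` and `meets_objectives` and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_min: float  # time between two backups (or replication syncs)
    restore_time_min: float     # time needed to bring the service back from a backup


def meets_objectives(plan: RecoveryPlan, rto_min: float, rpo_min: float) -> dict:
    """Compare a plan with the objectives under a simplified worst-case model."""
    # Worst case: the failure happens just before the next backup would have run.
    worst_case_data_loss_min = plan.backup_interval_min
    return {
        "rpo_ok": worst_case_data_loss_min <= rpo_min,
        "rto_ok": plan.restore_time_min <= rto_min,
    }


# Hypothetical example: nightly backups, two-hour restore,
# against an RTO of 4 hours and an RPO of 1 hour.
nightly = RecoveryPlan(backup_interval_min=24 * 60, restore_time_min=120)
print(meets_objectives(nightly, rto_min=240, rpo_min=60))
```

In this made-up case the restore is fast enough for the RTO, but nightly backups cannot satisfy a one-hour RPO, so you would either back up more often or relax the objective.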

Keep-it-simple strategy: fewer layers, more resilience

Complex systems tend to:

  • make it harder for people to understand, manage and troubleshoot
  • increase the risk of failures
  • make fault recovery more difficult and, in some cases, riskier

→ Less is more
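One way to see why layers hurt: when a request has to cross every layer, the availabilities multiply. The sketch below assumes that layers fail independently and that all of them are required, which is a simplification; `series_availability` is a name chosen for this example.

```python
from math import prod


def series_availability(layer_availabilities: list[float]) -> float:
    """Availability of a chain where every layer must be up (independent failures assumed)."""
    return prod(layer_availabilities)


# Hypothetical stacks where each layer is available 99.9% of the time.
three_layers = series_availability([0.999] * 3)
six_layers = series_availability([0.999] * 6)

print(f"3 layers: {three_layers:.4%}")
print(f"6 layers: {six_layers:.4%}")
```

With three layers at 99.9% each, the whole chain is already down to about 99.70%, and with six layers to about 99.40%.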

Understand your application and environment

Consider deployment constraints from the first design stages and keep improving progressively (DevOps). Among the things you need to keep in mind: built-in fault tolerance, your system’s scalability, whether your services are stateful or stateless, and the overall recoverability approach. Implement redundancy where possible to prevent a single failure from bringing down the entire system. My golden rule is:

“Assume everything fails, and design backwards”
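The effect of redundancy can be estimated with a few lines of code. This sketch assumes that replicas fail independently and that one healthy replica is enough to serve traffic; real failures are often correlated (same zone, same bug), so treat the result as an optimistic upper bound. The function name is hypothetical.

```python
def redundant_availability(single_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas when any single one can serve traffic.

    Assumes independent failures: the system is down only if all replicas are down.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - single_availability) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each: {redundant_availability(0.99, n):.6%}")
```

Under these assumptions, two replicas at 99% each already reach about 99.99%, which is why removing single points of failure is usually the first thing to look at.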

Keep a close eye on the state of the art (including providers’ capabilities)

Understand design patterns, clustering, replication, containerization/virtualization, orchestration, etc. Maybe you need multi-cloud, maybe you don’t. Get a good understanding of the cloud providers’ capabilities and cross-check them against your application’s requirements.

Know when to use managed services

Do not reinvent the wheel, and focus your effort on your actual business value, unless there is a really good reason to do otherwise. One basic case is database as a service.

Build resilience instead of strength

Your system will fail eventually, and you will miss something, as I did for the ‘t’. Think about how your system can self-heal and recover rather than tightening everything.
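A very small building block of self-healing is retrying a failed call with exponential backoff instead of crashing on the first error. The sketch below is one possible illustration, not a complete resilience strategy; `call_with_retries` and `flaky_service` are hypothetical names, and the delays are arbitrary.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            # Random jitter keeps many clients from retrying at the same instant.
            sleep(delay + random.uniform(0, delay))


# Hypothetical flaky dependency: fails twice, then recovers.
attempts = {"count": 0}


def flaky_service():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service temporarily unavailable")
    return "ok"


# The sleep is replaced by a no-op so the demo does not actually wait.
print(call_with_retries(flaky_service, sleep=lambda seconds: None))
```

The `sleep` parameter is injectable only to make the sketch easy to try and test; in a real service you would also limit which exceptions are retried and add monitoring around the retries.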

Keep up with technical updates and team training

Cloud technologies keep changing and evolving, and this is the fun part of it. Stay tuned for updates on providers’ services, third-party tools and communities such as the CNCF, and keep up with training (and hands-on practice).

Define your just-enough process

Whenever I talk about this point, I picture that episode of The IT Crowd where there is a fire, but Moss is sending an email because it’s the process 😉

Keep challenging your design

Cloud technology is evolving like crazy, so keep reviewing your design to see how you can optimize its cost and performance.

At ERCOM, we build highly available and scalable systems for secure communications as SaaS solutions. Check out the website for more: https://www.ercom.com

Abderrahmane Smimite

Ph.D, CISSP, SPC | Simple solutions to complex problems | Cyber Security, Platform Engineering, SRE, Data and Product Management