Cloud basics 101: High Availability
Understanding basic cloud architecture concepts for resilient systems
Understand High Availability (HA) concepts
This is an extract of my series on Cloud basics. This set of slides explains some general concepts on how to build resilience and ensure high availability for your cloud applications.
Keep in mind that I am presenting this content for a non-tech-savvy audience, so there might be some debatable inaccuracies :)
This post is part of my lightning talks and describes how we manage HA design as well as a quick overview of Kubernetes.
“Everything fails, all the time.” (Werner Vogels, CTO, amazon.com)
Summary
My ten steps to design highly available systems:
Business drives HA requirements
Define your SLA (Service Level Agreement) and how many nines you actually need. Everything else will depend on it (provider, regions, availability zones, multi-provider, etc.). Here’s one simple calculator: https://uptime.is to match the nines to the actual tolerable downtime (one nine, i.e. 90%, is equivalent to a maximum downtime of 36.5 days per year, whereas five nines, 99.999%, allows only about 5.26 minutes).
One piece of advice: don’t overdo it, at least not at the beginning. You’ll always have room for improvement.
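The nines-to-downtime conversion above can be sketched in a few lines (a minimal illustration; the function name is my own):

```python
def max_downtime_minutes(nines: int) -> float:
    """Maximum tolerable downtime per year for a given number of nines."""
    availability = 1 - 10 ** (-nines)   # e.g. 3 nines -> 0.999
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability)

# One nine (90%) allows roughly 36.5 days of downtime per year;
# five nines (99.999%) allows only about 5.26 minutes.
```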
Define your objectives and resources
Based on the previous items, you need to identify your Recovery Time Objective (RTO) (→ how quickly must the system recover?) and your Recovery Point Objective (RPO) (→ how much data can you afford to lose?), and, accordingly, how much money/time you can invest to meet these objectives.
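One way to make these objectives concrete is to check each incident against them. A minimal sketch (the function and the example figures are hypothetical, for illustration only):

```python
from datetime import timedelta

def meets_objectives(downtime: timedelta, data_lost: timedelta,
                     rto: timedelta, rpo: timedelta) -> bool:
    """Did an incident stay within the recovery objectives?

    downtime  -> time until service was restored (compared to RTO)
    data_lost -> window of data that could not be recovered (compared to RPO)
    """
    return downtime <= rto and data_lost <= rpo

# Hypothetical objectives: recover within 1 hour, lose at most 15 min of data.
rto, rpo = timedelta(hours=1), timedelta(minutes=15)
ok = meets_objectives(timedelta(minutes=30), timedelta(minutes=10), rto, rpo)
```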
Keep-it-simple strategy: fewer layers, more resilience
Complex systems tend to:
- are harder for people to understand, manage, and troubleshoot
- increase the risk of failures
- can make fault recovery more difficult and, in some cases, riskier
→ Less is more
Understand your application and environment
Consider deployment constraints during the first design stages and keep improving progressively (DevOps). Among the things to keep in mind are built-in fault tolerance, your system’s scalability, whether your services are stateful or stateless, and the overall recoverability approach. Implement redundancy when possible to prevent a single failure from bringing down the entire system. My golden rule is:
“Assume everything fails, and design backwards”
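Redundancy pays off because independent failures multiply. A quick sketch of the standard parallel-availability formula (function name is mine):

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n independent redundant replicas, where the service
    stays up as long as at least one replica survives.

    If each replica fails independently with probability (1 - a), all n
    failing at once has probability (1 - a) ** n.
    """
    return 1 - (1 - a) ** n

# Two replicas at 99% availability each yield 1 - 0.01**2 = 0.9999,
# i.e. four nines from two "two-nines" components.
```

The caveat, of course, is independence: replicas sharing a power feed, a rack, or an availability zone fail together, which is why providers spread zones apart.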
Keep a close look at the state of the art (including providers capabilities)
Understand design patterns, clustering, replication, containerization/virtualization, orchestration, etc. Maybe you need multi-cloud, maybe you don’t. Get a good understanding of cloud providers’ capabilities and cross them with your applications’ requirements.
Know when to use managed services
Do not reinvent the wheel; keep your effort focused on your actual business value, unless there is a really good reason to do otherwise. One basic case is database-as-a-service.
Build resilience instead of strength
Your system will fail, eventually, and you will miss someting, as I did for the ‘t’. Think about how your system can self-heal and recover rather than tightening everything.
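A common self-healing building block is retrying transient failures with exponential backoff and jitter. A minimal sketch (names and defaults are my own, not from any particular library):

```python
import random
import time

def call_with_retries(operation, attempts=5, base_delay=0.5):
    """Retry a flaky operation, backing off exponentially between tries.

    The random jitter spreads retries out so that many clients recovering
    at once don't hammer the service simultaneously (thundering herd).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

In practice you would retry only errors known to be transient (timeouts, 5xx responses), and pair this with circuit breakers so a hard-down dependency isn't retried forever.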
Keep up with technical updates and team training
Cloud technologies are changing and evolving, and this is the fun part of it. Stay tuned for updates on providers’ services, 3rd-party tools, and communities such as the CNCF, and keep up with training (and hands-on practice).
Define your just-enough process
Whenever I talk about this point, I picture the episode of The IT Crowd where there is a fire but Moss sends an email, because that’s the process 😉
Keep challenging your design
Cloud technology is evolving like crazy, so keep reviewing your design to see how to optimize its cost and performance.
At ERCOM, we build highly available and scalable systems for secure communications as SaaS solutions. Check out the website for more: https://www.ercom.com