Technology news has recently carried a few Disaster Recovery (DR) stories that remind us, as both consumers and professionals, of the need to be confident in the service that we provide to you.
The expectations for digital products have never been higher — we all want intuitive insights and functions that are always on, matched with vast data sources, alongside high performance, across multiple devices.
For any technology company there needs to be a laser focus on the service and the data within that service. I regularly ask myself:
- Is the service stable, and operating within its expectations?
- Is the service performant and able to serve its clients?
- Is the data secure?
- Is the data backed up?
And further… if we lose the whole stack — could we re-create it? (really quickly!)
DR Fire Drills
As part of a rolling calendar of ‘fire drills’, the whole DevOps team at E Fundamentals run these failure scenarios to ensure that we have a common understanding of the service and all the elements required to provide it. By taking down each part of the service in turn, we practice, learn and improve.
Each time we run the process, we learn as a wider team, and we also find out what doesn’t work or has stopped working. Our natural pace of development means that our platform is always changing, and our ability to restore services by script needs to keep pace.
Finally, we look to improve. We benchmark on two core values: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). In basic terms, that means: how quickly can you restore the service, and how much of it? In product terms, it means how long the service is unavailable and, following a failure, how much data is restored to the re-created system. A pioneering DevOps team’s target will be the same day for RTO and everything (all the data used by your clients) for RPO. And whilst all this is going on behind the scenes, we also set up processes to ensure that data is still accurate and accessible for our clients during the drill.
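To make the two measures concrete, here is a minimal sketch of how a single drill could be scored against them. It is an illustration only, not our tooling: the `DrillResult` class, the `meets_objectives` function, the timestamps and the one-hour RPO are all hypothetical, and the eight-hour RTO is just one reading of ‘the same day’.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps captured during one drill (all names are hypothetical)."""
    failure_at: datetime      # when the service was taken down
    restored_at: datetime     # when it was serving clients again
    last_backup_at: datetime  # timestamp of the newest data that was restored

    @property
    def recovery_time(self) -> timedelta:
        # How long the service was unavailable (compared against the RTO).
        return self.restored_at - self.failure_at

    @property
    def data_loss_window(self) -> timedelta:
        # How much recent data was not restored (compared against the RPO).
        return self.failure_at - self.last_backup_at


def meets_objectives(result: DrillResult, rto: timedelta, rpo: timedelta) -> bool:
    """True when the drill stayed within both objectives."""
    return result.recovery_time <= rto and result.data_loss_window <= rpo


# Made-up example values, not measurements from a real drill.
drill = DrillResult(
    failure_at=datetime(2017, 3, 1, 9, 0),
    restored_at=datetime(2017, 3, 1, 9, 40),
    last_backup_at=datetime(2017, 3, 1, 8, 30),
)

print(drill.recovery_time)     # 0:40:00
print(drill.data_loss_window)  # 0:30:00
print(meets_objectives(drill, rto=timedelta(hours=8), rpo=timedelta(hours=1)))  # True
```

Keeping the two numbers separate matters: a fast restore from an old backup can meet the RTO and still miss the RPO.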
Practice Makes Perfect
In a recent fire drill, we deleted and restored both the core back-end data gather system and the client reports/dashboards system. In running this across environments (test and production systems), we involved all hands and made it an immersive experience, with process documents open, whiteboards ready and, of course, the stopwatch! In a rolling ownership model, one developer followed a detailed procedure step by step while the rest of the team shadowed. We not only rewrote the document as we went, but also sought to increase the automation at every step, updating scripts along the way.
It was pleasing not only to see a positive result, but also to see the restoration times fall with each iteration (through environments). Our first run of 2hrs+ was reduced to 40 minutes for our data gather system (which creates our daily insights across thousands of products).
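The stopwatch can itself be scripted. The sketch below times each step of a restore procedure so that runs can be compared from one iteration to the next. It is a hedged illustration rather than the procedure we followed: `timed_step`, `run_drill` and the step names are hypothetical, and the steps are placeholders that do nothing.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name, log):
    """Record how long the wrapped block takes, in seconds, under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the step raises, so failed runs are timed too.
        log[name] = time.perf_counter() - start


def run_drill(steps):
    """Run (name, action) pairs in order and return the duration of each step."""
    log = {}
    for name, action in steps:
        with timed_step(name, log):
            action()
    return log


# Placeholder steps: a real drill would call the actual restore scripts here.
example_steps = [
    ("recreate infrastructure", lambda: None),
    ("restore data from backup", lambda: None),
    ("verify dashboards", lambda: None),
]

durations = run_drill(example_steps)
for step_name, seconds in durations.items():
    print(f"{step_name}: {seconds:.3f}s")
```

Saving the returned dictionary after every run gives a per-step history, which shows where the next round of automation would save the most time.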
Amazon’s Tech Failure
Within the same week we heard two stories of DR process failure, Amazon S3 and GitLab, that act as a reminder to be prepared at all times to the best of your abilities. In both cases great products were temporarily lost, not because of any weakness in the underlying product and platform, but because of human error and a slip in process; a typo in a command brought down the whole system.
We also learned a thing or two about boundaries in our own technology stack that reach far beyond our ‘walls’. During one cycle, we overlapped with the Amazon S3 outage. But we mainly use Google Cloud, so that’s okay, right? Well, not if parts of your process rely on Amazon S3, which may well include 3rd party dependency services and Docker container storage.
The exciting world of cloud makes many things possible, but it also creates a web of dependencies that you need to be well aware of. So, if ‘they’ are down, so are you…
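One way to become ‘well aware’ of that web is to write it down and query it. The following is a minimal sketch, assuming you can list what each part of your process depends on directly; the service names in `DEPENDS_ON` and the `affected_by` function are hypothetical and do not describe our actual stack.

```python
from collections import deque

# Hypothetical map: each service -> the things it depends on directly.
DEPENDS_ON = {
    "restore-script": {"container-registry", "google-cloud"},
    "container-registry": {"amazon-s3"},
    "third-party-api": {"amazon-s3"},
    "data-gather": {"third-party-api", "google-cloud"},
    "dashboards": {"data-gather"},
}


def affected_by(outage, depends_on):
    """Return every service that depends on `outage`, directly or indirectly."""
    # Invert the map: provider -> services that use it directly.
    used_by = {}
    for service, providers in depends_on.items():
        for provider in providers:
            used_by.setdefault(provider, set()).add(service)

    # Breadth-first walk outwards from the failed provider.
    affected = set()
    queue = deque([outage])
    while queue:
        current = queue.popleft()
        for service in used_by.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(affected_by("amazon-s3", DEPENDS_ON)))
```

In this made-up map nothing we own talks to Amazon S3 directly, yet every entry turns out to be affected, because the container registry and the third-party API sit in between.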
DR and system protection may not be the most exciting part of the Product Creation process, but always ask of your product and your team: what if we lose the whole stack? And whilst you’re rallying around that cause, do remember what the two services mentioned above did brilliantly during their experiences: communicate, communicate, communicate. A void is always filled with pessimism, and transparency is the natural antidote.
Are you prepared? The clock is ticking…
Adrian Butter wrote this article whilst working as the CTO of e.fundamentals https://www.efundamentals.com/