In the latest of our series of blog posts about the infrastructure that supports our digital services, Tim Britten talks about moving to an Active-Active model.
The graph above shows the total number of HTTP requests flowing through the HMRC Tax Platform, by infrastructure supplier, on the 5th busiest day of HMRC’s year, the 26th of January. It illustrates an exercise in which we set out to prove that the new active side of the Tax Platform, running with our new second supplier, was capable of taking 100% of our customers’ traffic, making us entirely resilient to a supplier failure.
Although the change above would have been indiscernible to our customers, it marks a significant advance for the HMRC Tax Platform: it now runs on multiple clouds, provided by multiple suppliers, using different technologies. This achievement is substantial, and this blog is about how we did it and why it’s important.
2015 was a busy year for HMRC Digital and, watching it unfold from the Web Operations team who build, develop and keep the Tax Platform running, it was fairly daunting. Each week a new service launched onto the platform, along with at least 50 incremental releases (yes, 50!), so it was apparent that the Tax Platform would soon be hosting the majority of HMRC’s critical services, and the consequences of downtime were becoming more and more frightening. At the same time, the requirement for resilience grew.
At the beginning of the year we only had one infrastructure supplier on the books. They provided a VMware-hosted cloud in which we had built the first iterations of the Tax Platform. We had developed an active/passive failover capability in their second data centre to give us some geographical resilience, but this was only akin to carrying a spare tyre: it takes about 20 minutes to change, it’s embarrassing when you get a flat, and your passengers (customers) still get annoyed with you while they wait. We started looking at new designs and suppliers which we hoped would give us a level of reliability able to cope with a full data centre failure without the customer even noticing.
We had several objectives. Firstly, to add a second, identical and active side to the Tax Platform. By actively running traffic through both sides we know that, at any point in time, both are healthy and able to receive traffic. With this model, if one side does fail and you have to route traffic away to the healthy side, the change is only significant in terms of volume for the surviving side. Because you are forced to keep both sides healthy and always ready for traffic, you have a better chance of keeping the stack running during a failure than with an active/passive arrangement, where the sudden introduction of traffic to a cold stack is inevitably riskier.
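The failover behaviour described above can be sketched in a few lines. This is a minimal illustration, not our production routing; the side names and the shape of the health-check data are hypothetical:

```python
def route(request_id: int, healthy: dict[str, bool]) -> str:
    """Send traffic to both active sides while both are healthy;
    if one fails, all traffic shifts to the survivor."""
    active = [side for side, ok in healthy.items() if ok]
    if not active:
        raise RuntimeError("no healthy side available")
    # Alternate requests between the healthy sides (round-robin).
    return active[request_id % len(active)]

# Both sides healthy: traffic alternates between them.
print(route(0, {"side_a": True, "side_b": True}))  # side_a
print(route(1, {"side_a": True, "side_b": True}))  # side_b
# side_a fails: everything flows to side_b.
print(route(2, {"side_a": False, "side_b": True}))  # side_b
```

The key point is that the surviving side was already warm and serving traffic, so the only thing that changes on failure is volume.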
Secondly, we wanted our new side in a different geographical location. This ensures we are not susceptible to scenarios affecting a single data centre, a power cut being the typical example (inevitably it’s something less exciting but equally annoying).
Our third objective was to remove our reliance on any one supplier or technology. In an industry where innovation moves so quickly, this is really important. Using new suppliers with different underlying cloud technologies gives us the best chance of future-proofing our investment.
Using a second supplier also has the advantage that they will use different partners. Having, for example, different hardware, DDoS protection and internet service providers means you are less likely to discover a single point of failure duplicated in both data centres. You are also less likely to be suddenly left exposed to a security vulnerability found in your cloud provider’s technology stack, as you can simply move live traffic away until they can patch it.
With these objectives in mind we set about procuring our new supplier. This wasn’t just about finding the right product and supplier, but about building in our stringent security and data sovereignty requirements from the start, something that comes hand in hand with handling the whole population’s tax data. Throughout the year we ran small discoveries with a number of suppliers from G-Cloud, and it was not until October that we started working with a second supplier who provide a secure, UK-based OpenStack cloud. By December we had built out a new performance testing environment with silos in both suppliers. Testing went well, so in January we progressed the new design into production, starting with 100% of traffic through the tried-and-tested original supplier. From there we gradually increased the proportion of traffic to our second supplier while carefully watching the logs for any unusual activity. Everything looked healthy, so on the evening of the 26th we ramped up to 100% and, for the first time since the platform went live, had no traffic running through our first supplier.
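A gradual ramp like the one above amounts to sending each request to the new side with some probability, then stepping that probability up as confidence grows. The sketch below is a simplified illustration of that idea only; the supplier names and the weight schedule are hypothetical, not our actual rollout values:

```python
import random

def pick_supplier(weight_b: float) -> str:
    """Route one request: weight_b is the fraction of traffic
    (0.0 to 1.0) sent to the new supplier B; the rest goes to A."""
    return "supplier_b" if random.random() < weight_b else "supplier_a"

# Step the weight up gradually and check the observed split.
for weight in (0.0, 0.1, 0.25, 0.5, 1.0):
    sample = [pick_supplier(weight) for _ in range(10_000)]
    share = sample.count("supplier_b") / len(sample)
    print(f"target {weight:.0%} -> observed {share:.1%}")
```

In practice the weighting sits in the traffic-routing layer rather than application code, but the principle is the same: at weight 1.0, no traffic reaches the original supplier.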
For us this not only makes us more resilient but also puts us in a much more versatile position. Depending on advances in VMware and OpenStack, we can move down either route, or both, in the coming years. We can also avoid downtime by upgrading one silo at a time while its partner handles the traffic.
Looking ahead, we want to expand from an active-active model to a multi-active model. By bringing on a third supplier we can actually reduce our infrastructure footprint. Currently we run 2N, where N is the amount of infrastructure needed to handle 100% of our traffic; if one data centre fails we still have 1N and are therefore resilient. With a third supplier we could run at 1.5N, with each supplier providing a third of the total infrastructure, on the assumption that two suppliers don’t fail at the same time. This also gives us the same technological and commercial advantages described above.
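The arithmetic above generalises: to survive the loss of any one supplier, each of k suppliers needs to carry N/(k-1), so the total footprint is k/(k-1) × N. A quick sketch of that calculation:

```python
def total_capacity(n_suppliers: int, n: float = 1.0) -> float:
    """Total infrastructure needed so that losing any one supplier
    still leaves n (i.e. N): each supplier runs n / (n_suppliers - 1)."""
    per_supplier = n / (n_suppliers - 1)
    return n_suppliers * per_supplier

print(total_capacity(2))  # 2.0 -> the current active-active 2N arrangement
print(total_capacity(3))  # 1.5 -> three suppliers at 0.5N each
print(total_capacity(4))  # approaches 1N as suppliers are added
```

Each additional supplier shrinks the total footprint further, though with diminishing returns, and each addition still rests on the assumption that only one supplier fails at a time.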
This blog has concentrated on the broader concepts behind making the Tax Platform multi-active. We plan to publish a more technical post going into more detail about the changes we made to components of the Tax Platform, how we organise routing, how we replicate data, and much more.