I’m Tim Britten, Digital Service Manager for Infrastructure and Operations. I’m working on HMRC’s Tax Platform – and specifically infrastructure that supports our digital services. In his latest post, Kalbir Sohi talked about how we have been exploring cloud computing and automation. I’m going to share a bit more about how this is already improving how we work.
The UK tradition of legislative annual tax deadlines, combined with the very human tendency to leave things until the last minute (among other reasons), has meant that HMRC is used to dealing with peaks and troughs in our customer traffic. When you set a legislative obligation for your users, part of the deal is that you provide a reliable and easy process with which they can meet that obligation. As 94% of our transactions are dealt with digitally, our peak events have received a fair bit of focus within HMRC Digital and its predecessors. Downtime is never an option.
Before cloud-based computing, when preparing for a peak, organisations would have to physically add computers, storage, and networking to improve the performance and stability of a service. This approach is not easy, and it certainly is not cheap. The major downside financially is that it’s nigh on impossible to loan out those machines to someone else when you’re not using them, which means you're paying all year round for kit designed to deal with the busiest hour on the busiest day. That is not the only problem. You also have to anticipate accurately, months ahead, the likely volume and timing of peak traffic to ensure you order, install and configure the machines in time.
Scaling up, but only when we need it
When we started building HMRC’s Tax Platform we understood that by using cloud infrastructure we could actually solve some of these problems. In a cloud environment we can scale dynamically to peaks. With a cloud supplier you pay for compute and storage resource on demand, by the hour you use it. This means you can tune your platform to order only the infrastructure it needs in real time.
This has two major advantages. First, we can scale up incredibly quickly to increase the performance behind our digital services and so meet our user demand. This improves both our users’ experience and our platform’s stability. Secondly, we can scale down during the rest of the year so we’re not paying for kit we’re not using.
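To make the cost argument concrete, here is a minimal sketch comparing the two ways of paying. All the figures are made up for illustration: the hourly demand profile, the price per instance-hour and the function names are assumptions, not HMRC's real numbers or tooling.

```python
# Illustrative only: hypothetical demand and pricing, not real HMRC figures.

def fixed_cost(hourly_demand, pence_per_instance_hour):
    """Cost when you own enough kit for the busiest hour, every hour."""
    peak = max(hourly_demand)
    return peak * len(hourly_demand) * pence_per_instance_hour


def on_demand_cost(hourly_demand, pence_per_instance_hour):
    """Cost when you pay only for the instances you run in each hour."""
    return sum(hourly_demand) * pence_per_instance_hour


# Instances needed in each of ten consecutive hours, with one sharp peak.
demand = [2, 2, 2, 3, 10, 4, 2, 2, 2, 2]
price = 10  # hypothetical price in pence per instance-hour

fixed = fixed_cost(demand, price)
flexible = on_demand_cost(demand, price)
print(f"Provisioned for peak: {fixed}p, on demand: {flexible}p")
```

The sharper and rarer the peak, the bigger the gap between the two totals, which is why annual deadlines suit this model so well.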
The Web Operations team at HMRC has designed and built the Tax Platform to make the most of these characteristics. We didn’t want peaks to remain the dates in the diary that our predecessors greeted with great trepidation after months of planning. We are developing a platform which scales rapidly, and in the future automatically, in response to changes in traffic. Through this approach our peak events are becoming “business as usual”, requiring less human monitoring. In meeting this aspiration we have achieved our primary goal of a good experience for all users, but we have also saved on the traditional support and service management overheads that came with paying people to watch less intelligent systems day and night.
Case study: tax credits renewals
In its relatively young life the platform has already dealt with a few peaks. The most recent ended on 31 July - the deadline for people to renew their tax credits. More than 750,000 people renewed online during the peak, with over 50,000 on the final day. For those of us who work in operations it was our most boring peak to date, which is a good measure of its success!
We have been using real-time and historical analytics to scale our infrastructure with our traffic profile, which has meant we kept our performance high and level throughout. While we may have kept a closer eye on our dashboards than in most months, we relied on our monitoring and alerting and didn’t increase our support costs by having engineers work in the evenings.
In Web Operations we love a good graph, so we thought we would share some of the more interesting ones from the tax credits renewals peak to demonstrate some of the points covered in this blog.
Our microservice infrastructure allowed us to increase only the instances of the components required for the renewals peak, rather than the entire platform. The number of instances was increased by 150% at the beginning of July as traffic started to ramp up. This increased our application-level resilience and ensured that our performance stayed flat throughout the month despite the increasing traffic. Our server-side response time remained flat and always below 35ms throughout the peak. Our page analytics timings also stayed flat, at a mean of 2 seconds.
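As a rough illustration of what scaling with a traffic profile can look like, the sketch below sizes a service from two inputs: a forecast based on historical analytics and the load observed in real time. The function name, the headroom figure, the per-instance capacity and the minimum instance count are all assumptions made for this example, and it is not a description of the Tax Platform's actual scaling logic.

```python
import math

# Illustrative only: names, thresholds and capacities are hypothetical.

def instances_needed(forecast_rps, observed_rps, capacity_rps,
                     headroom=0.25, minimum=2):
    """Return how many instances to run for the expected load.

    forecast_rps: requests per second predicted from historical analytics
    observed_rps: requests per second seen in real time
    capacity_rps: requests per second one instance can serve comfortably
    headroom:     spare capacity to keep, as a fraction of expected load
    minimum:      never run fewer than this, for resilience
    """
    # Plan for whichever is higher: what history predicts or what we see now.
    expected = max(forecast_rps, observed_rps)
    target = expected * (1 + headroom)
    return max(minimum, math.ceil(target / capacity_rps))


# Quiet month: the resilience minimum wins.
print(instances_needed(forecast_rps=50, observed_rps=40, capacity_rps=100))
# Deadline week: real-time traffic is running ahead of the forecast.
print(instances_needed(forecast_rps=400, observed_rps=520, capacity_rps=100))
```

Taking the higher of the forecast and the observed load is a deliberately cautious choice: it scales up early when history says a peak is coming, and still reacts if real traffic outruns the forecast.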
Looking at our metrics you can see that as traffic builds throughout the month we don’t see any considerable rise in CPU or memory usage. The platform metrics remain fairly flat as we scale to deal with the increased load.
Immediately after 31 July we had an opportunity to review our production infrastructure and made reductions at a server level. Our changes will make significant savings on our monthly infrastructure costs.
We’re looking to blog in more detail about the design of the Tax Platform infrastructure soon, so you can get an idea of how some of the concepts talked about here are implemented by the Tax Platform. As ever, please do leave a comment or ask a question on the blog using the comment feature below.