https://hmrcdigital.blog.gov.uk/2015/08/14/peaks-and-clouds-improving-how-we-work/

Peaks and clouds: improving how we work

I’m Tim Britten, Digital Service Manager for Infrastructure and Operations. I’m working on HMRC’s Tax Platform – and specifically infrastructure that supports our digital services. In his latest post, Kalbir Sohi talked about how we have been exploring cloud computing and automation. I’m going to share a bit more about how this is already improving how we work.

The UK tradition of legislative annual tax deadlines, combined with the very human tendency to leave things until the last minute (among other reasons), has meant that HMRC is used to dealing with peaks and troughs in our customer traffic. When you set a legislative obligation for your users, part of the deal is that you provide a reliable and easy process with which they can meet that obligation. As 94% of our transactions are dealt with digitally, our peak events have received a fair bit of focus within HMRC Digital and its predecessors. Downtime is never an option.

Before cloud-based computing, when preparing for a peak, organisations would have to physically add computers, storage, and networking to improve the performance and stability of a service. This approach is not easy, and it certainly is not cheap. The major downside financially is that it’s nigh on impossible to loan out those machines to someone else when you’re not using them, which means you’re paying all year round for kit designed to deal with the busiest hour on the busiest day. That is not the only problem. You also have to anticipate accurately, months ahead, the possible weight and timing of peak traffic to ensure you order, install and configure the machines in time.

Scaling up, but only when we need it

When we started building HMRC’s Tax Platform we understood that by using cloud infrastructure we could actually solve some of these problems. In a cloud environment we can scale dynamically to peaks. With a cloud supplier you pay for compute and storage resource on demand, by the hour you use it. This means you can tune your platform to order only the infrastructure it needs, in real time.

This has two major advantages. First, we can scale up to increase the performance behind our digital services and therefore meet our user demand incredibly quickly, which improves both our users’ experience and our platform’s stability. Second, we can scale down during the rest of the year so we’re not paying for kit we’re not using.
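As a rough illustration of the cost argument, here is a minimal sketch in Python. Every figure in it (the hourly price, the server counts, the length of the peak) is a made-up assumption rather than an HMRC number, and it simplifies by charging the same hourly price in both models, which real pricing is unlikely to do.

```python
HOURS_PER_YEAR = 365 * 24


def fixed_cost(peak_servers, hourly_price_pence):
    """Cost of keeping enough servers for the peak running all year."""
    return peak_servers * hourly_price_pence * HOURS_PER_YEAR


def on_demand_cost(baseline_servers, peak_servers, peak_hours, hourly_price_pence):
    """Cost when the extra peak capacity is only paid for during peak hours."""
    off_peak_hours = HOURS_PER_YEAR - peak_hours
    server_hours = baseline_servers * off_peak_hours + peak_servers * peak_hours
    return server_hours * hourly_price_pence


# Hypothetical numbers: 100 servers needed at peak, 40 the rest of the year,
# a peak lasting 31 days, and 10p per server per hour.
fixed = fixed_cost(peak_servers=100, hourly_price_pence=10)
elastic = on_demand_cost(
    baseline_servers=40,
    peak_servers=100,
    peak_hours=31 * 24,
    hourly_price_pence=10,
)

print(fixed)    # 8760000 pence
print(elastic)  # 3950400 pence
```

Under these invented figures, paying by the hour costs less than half as much as running peak capacity all year. The real saving depends entirely on how short and how steep the peak is.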

The Web Operations team at HMRC has designed and built the Tax Platform to make the most of these characteristics. We didn’t want peaks to remain the dates in the diary that our predecessors greeted with great trepidation after months of planning. We are developing a platform which scales rapidly, and in the future automatically, to changes in traffic. Through this approach our peak events are becoming “business as usual”, requiring less human monitoring. In meeting this aspiration we have ensured our primary goal of a good experience for all users, but we have also saved on the traditional support and service management overheads that came with paying people to watch less intelligent systems day and night.

Case study: tax credits renewals

In its relatively young life the platform has already dealt with a few peaks. The most recent ended on 31 July - the deadline for people to renew their tax credits. More than 750,000 people renewed online during the peak, with over 50,000 on the final day. For those of us who work in operations it was our most boring peak to date, which is a good measure of its success!

We have been using real-time and historical analytics to scale our infrastructure with our traffic profile, which has meant we kept our performance high and level throughout. While we may have kept a closer eye on our dashboards than in most months, we relied on our monitoring and alerting and didn’t increase our support costs by having engineers work in the evenings.
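One way to picture scaling from a traffic profile is the sketch below. It is not the Tax Platform's actual rule: the function names, the 30% headroom, the minimum of two instances and the capacity figure are all assumptions made for illustration. It plans for whichever is higher, the live request rate or the historical rate for the same period, and then works out how many instances that load needs.

```python
import math


def planned_load(current_rps, historical_rps):
    """Plan for the higher of the live rate and the historical rate."""
    return max(current_rps, historical_rps)


def instances_needed(load_rps, capacity_rps_per_instance, headroom_percent=30, minimum=2):
    """Instances required to serve the load with some spare headroom.

    capacity_rps_per_instance is a hypothetical measured figure for how many
    requests per second one instance can handle comfortably.
    """
    if capacity_rps_per_instance <= 0:
        raise ValueError("capacity_rps_per_instance must be positive")
    target = load_rps * (100 + headroom_percent) / (100 * capacity_rps_per_instance)
    return max(minimum, math.ceil(target))


# Hypothetical figures: 120 requests/sec right now, 200 requests/sec at this
# point in last year's peak, and 50 requests/sec per instance.
load = planned_load(current_rps=120, historical_rps=200)
print(instances_needed(load, capacity_rps_per_instance=50))  # 6
```

The minimum keeps a quiet service from dropping to a single instance, so that losing one machine does not take the service down.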

In Web Operations we love a good graph, so we thought we would share some of the more interesting ones from the tax credits renewals peak to demonstrate some of the points covered in this post.

Our microservice infrastructure allowed us to increase only the instances of components required for the renewals peak instead of the entire platform. The number of instances was increased by 150% at the beginning of July as traffic started to ramp up. This increased our application-level resilience and ensured that our performance stayed flat throughout the month despite the increasing traffic. Our server-side response time remained flat and always below 35ms throughout the peak. Our page analytics also stayed flat, with a mean page load time of 2 seconds.
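The idea of scaling only the components a peak touches can be sketched as follows. The service names and starting counts are hypothetical, and the sketch reads "increased by 150%" as 2.5 times the baseline, which is one interpretation of the figure above.

```python
def scale_components(instances, components_to_scale, increase_percent):
    """Return new instance counts, scaling only the named components.

    instances maps a component name to its current instance count.
    Counts are rounded up so a scaled component never gets fewer
    instances than the percentage implies.
    """
    scaled = {}
    for name, count in instances.items():
        if name in components_to_scale:
            # Integer ceiling of count * (100 + increase_percent) / 100.
            scaled[name] = -(-count * (100 + increase_percent) // 100)
        else:
            scaled[name] = count
    return scaled


# Hypothetical platform: only the first two components serve renewals.
current = {"renewals-frontend": 4, "auth": 6, "other-service": 4}
result = scale_components(current, {"renewals-frontend", "auth"}, increase_percent=150)
print(result)  # {'renewals-frontend': 10, 'auth': 15, 'other-service': 4}
```

Components outside the renewals journey keep their original counts, which is where the saving over scaling the whole platform comes from.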

[Graph: HTTP traffic profile vs mean response time]
[Graph: Page analytics, average page load time (July)]

Looking at our metrics you can see that as traffic builds throughout the month we don’t see any considerable rise in CPU or memory usage. The platform metrics remain fairly flat as we scale to deal with the increased load.

[Graph: HTTP traffic profile vs CPU and memory usage]

Immediately after 31 July we had an opportunity to review our production infrastructure and made reductions at a server level. Our changes will make significant savings on our monthly infrastructure costs.

We’re looking to blog in more detail about the design of the Tax Platform infrastructure soon, so you can get an idea of how some of the concepts talked about here are implemented. As ever, please do leave a comment or ask a question using the comment feature below.

5 comments

  1. Bryan Hayward

    I like the way peaks for a specific part of the service are handled and can see this working well for upcoming services.

  2. Phil Batchelor

    Cloud Technology: It works! Well done WebOps!

  3. John Wrigley

    Although I won't claim to have understood every word, I recognise this as a genuine good news story - no spin. Congratulations to all involved, and may all your future peaks be boring.

  4. Simon Britten

    Great blog. Would it be worthwhile trying to refocus the graphs, which look interesting but a bit blurred? Is it too late?
    It would also be interesting to hear a bit about how HMRC monitors security with the cloud suppliers and how the reporting mechanism works when there is a breach or security issue.

    Some techy stuff on how a cloud works would be good too. No secrets of course, but the generality of a field full of computers: how many gigawatts of power it needs at peak time, whether it is next to a solar field, etc!?

    These techy details might come under a "Tech" heading for the anoraks?
    Thanks

    • Tim Britten

      We have a number of mechanisms in place to monitor the security of our platform. Although we can't go into too much detail publicly, we align our protective monitoring to the Good Practice Guide 13 (http://gpg13.com/). Our cloud suppliers are picked based on their ability to meet strict protective monitoring requirements and alert us upon incidents 24/7/365.

      The point of using an Infrastructure as a Service (IaaS) supplier is that as a customer you do not have to be an expert in how cloud computing actually works. Details such as the amount of power required and how cloud technologies such as VMware, OpenStack etc. work are abstracted away by your supplier. It is good to understand the impact these technologies have upon the availability and performance of the virtual, but within HMRC we do not need to cultivate expert knowledge in this field. Therefore, for the answers to your questions you would need to contact an IaaS supplier or read more detailed material around different cloud technologies and data centres.
