When you’re running one of Europe’s biggest IT operations, getting to the root cause of a problem isn’t always easy.
I’m Pete and it’s my team that run HMRC’s IT. Our aim is to give our 60,000 HMRC colleagues the best possible IT experience so that, in turn, they can provide a great service to the millions of customers we serve. We take it seriously, and when things sometimes go wrong I’ll admit we can get just as frustrated as our colleagues in the business do.
So when we recently started to see random crashes on our email service for no apparent reason we were determined to get to the bottom of it. This is how we tracked down what was causing the problem.
The disruption didn’t affect everybody; a single application on one server was crashing. But with about 10,000 mailboxes on that server, the disruption was still quite widespread.
The crashes were happening once or twice a day and, although the service was recovering itself quickly by switching to backup copies of the mailboxes on other servers, colleagues were seeing a range of problems until the service stabilised.
Random crashes are one of the hardest problems to diagnose, especially when the service recovers automatically as there often isn’t enough time to capture the cause of the fault.
Although Outlook is used globally, Microsoft hadn’t seen this problem before. We loaded a Microsoft-recommended diagnostic tool onto our servers, ready to catch the next crash. But the tool stopped the service from recovering itself, humph! We then started to see longer periods of disruption and were still none the wiser as to the cause.
We knew we couldn’t keep extending the disruption in the hope of capturing something useful. So we decided to work with our suppliers on a different approach: take the known crash times as the starting point and comb every available source of data for common patterns, focussing on a couple of minutes either side of each crash.
To put this into context, this meant that we were essentially looking for a single event in the email traffic generated by 85,000 mailboxes. It really was like looking for the proverbial needle in a haystack.
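The windowing step described above can be sketched as a simple filter over timestamped log events. Everything below is hypothetical: the timestamps, mailbox names, and event types are made up for illustration, and the real investigation used our suppliers’ own log tooling.

```python
from datetime import datetime, timedelta

# Known crash times (hypothetical examples).
crash_times = [
    datetime(2019, 3, 4, 9, 17),
    datetime(2019, 3, 5, 14, 2),
]

# (timestamp, mailbox, event) tuples, e.g. parsed from server logs.
events = [
    (datetime(2019, 3, 4, 9, 16), "mbx-123", "OPEN_ATTACHMENT"),
    (datetime(2019, 3, 4, 11, 0), "mbx-456", "SEND"),
    (datetime(2019, 3, 5, 14, 1), "mbx-123", "OPEN_ATTACHMENT"),
]

# A couple of minutes either side of each crash.
WINDOW = timedelta(minutes=2)

def events_near_crashes(events, crash_times, window=WINDOW):
    """Keep only events that fall within the window around any crash."""
    return [
        e for e in events
        if any(abs(e[0] - t) <= window for t in crash_times)
    ]

suspects = events_near_crashes(events, crash_times)
```

Shrinking 85,000 mailboxes’ worth of traffic down to a few minutes around each crash is what made the pattern-spotting tractable.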
The problem was proving to be significant and complex so we worked with Microsoft to escalate the investigation to the highest level of Outlook product support. This secured 24/7 support from Microsoft’s specialists who worked with our Messaging Support Team.
We had a breakthrough. Between us and Microsoft we spotted one mailbox that was in use at the time of every crash! We quickly isolated this mailbox to a single server, out of harm’s way, and, with help from the user, we were able to repeat the exact sequence of events that caused the crash. The most likely explanation is that a corruption in the mailbox was crashing the server. We had found the needle!
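The breakthrough boils down to a set intersection: for each crash, collect the mailboxes active around it, then intersect those sets; any mailbox left is active at every crash. A minimal sketch, with made-up mailbox names and timestamps:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=2)

def common_mailboxes(activity, crash_times, window=WINDOW):
    """Intersect the sets of mailboxes active around each crash."""
    per_crash = [
        {mbx for ts, mbx in activity if abs(ts - t) <= window}
        for t in crash_times
    ]
    return set.intersection(*per_crash) if per_crash else set()

# Hypothetical data: one mailbox shows up near both crashes.
crash_times = [
    datetime(2019, 3, 4, 9, 17),
    datetime(2019, 3, 5, 14, 2),
]
activity = [
    (datetime(2019, 3, 4, 9, 16), "mbx-damaged"),
    (datetime(2019, 3, 4, 9, 15), "mbx-aaa"),
    (datetime(2019, 3, 5, 14, 1), "mbx-damaged"),
    (datetime(2019, 3, 5, 14, 3), "mbx-bbb"),
]

culprits = common_mailboxes(activity, crash_times)
```

With enough crashes on record, the intersection shrinks fast, which is why the known crash times were such a powerful starting point.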
Once the issue was resolved we searched all 85,000 mailboxes to make sure no one else had the same problem (no one did!). We have also introduced enhanced alerting so we can spot it quickly if it ever happens again, although we don’t expect it to, because Microsoft has since changed Outlook so that it no longer fails in this way.
Whenever we have a problem with a service, everyone is focused on resolving it as quickly as possible. I’m incredibly proud of my team and the great work they do day in, day out. Hopefully this story gives a little bit of insight into some of the work they do - and how much effort it sometimes takes to get a problem fixed.
Director, IT Development, Test and Operations