Once you’re done patching, take a moment to look at your IT processes, and consider how adding AI & Machine Learning could set you up to deal with the next emergency better.
So, what are you doing about Meltdown and Spectre?
In case you have been living under a rock (or are still recovering from a particularly good New Year’s Eve party), here’s a quick recap — feel free to skip the next couple of paragraphs if you already know all about these issues…
In early January, severe vulnerabilities affecting most popular processor architectures were disclosed. Operating system developers had been notified some time beforehand, and patches were rapidly available for all major systems, and for browsers. Yes, web browsers; unfortunately, one of the attack vectors is JavaScript executing in a user’s browser.
At a very high level, all three vulnerabilities relate to speculative execution. Spectre is actually the name of two separate issues, CVE-2017-5753 and CVE-2017-5715, while Meltdown is prosaically known as CVE-2017-5754. The three are closely connected, to the point that they were discovered independently by as many as four different teams.
Under normal circumstances, speculative execution means that the CPU will “guess” which instructions are likely to be requested next, and execute those ahead of time using idle cycles. If the guess is correct, the system runs faster, because the results are already available. If not, the speculative results are discarded, and the CPU simply runs the correct instructions normally.
The difficulty with this approach, and the source of these vulnerabilities, is making sure that running processes cannot snoop on each other’s data in memory, in particular sensitive user data: passwords, credit card numbers, and so on. Various mechanisms were supposed to keep each process’s data separate, and especially to protect the kernel’s memory. However, speculatively executed instructions can leave traces in the CPU cache even when their results are discarded. By measuring memory access times very precisely, an attacker can work backwards from those traces and read out what should be private data, even, once again, from inside a web browser. Sorry, I still haven’t quite got over that one.
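To make the timing trick a little more concrete, here is a toy model in Python. It is only a sketch: there is no real CPU, cache, or clock involved, and every name in it (`ToyCache`, `speculative_victim`, `recover_byte`, the `FAST` and `SLOW` costs) is invented for illustration. The point is simply that if one memory line is "warm" and the other 255 are "cold", whoever can time the accesses learns which line was touched, and therefore the secret value that selected it.

```python
# Toy model only: no real CPU, cache, or timing is involved.
FAST, SLOW = 1, 100  # pretend access costs for cached vs. uncached lines


class ToyCache:
    """Tracks which of 256 probe lines are currently 'cached'."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def touch(self, line):
        self.lines.add(line)

    def access_time(self, line):
        cost = FAST if line in self.lines else SLOW
        self.lines.add(line)  # reading a line also brings it into the cache
        return cost


def speculative_victim(cache, secret_byte):
    """Stands in for speculatively executed code: its result is thrown
    away, but the probe line selected by the secret stays cached."""
    cache.touch(secret_byte)


def recover_byte(cache):
    """Attacker side: time every probe line and pick the fastest one."""
    timings = [cache.access_time(line) for line in range(256)]
    return timings.index(min(timings))


def leak(secret: bytes) -> bytes:
    """Recover the secret one byte at a time, using timings alone."""
    cache = ToyCache()
    recovered = bytearray()
    for byte in secret:
        cache.flush()                     # start with every line cold
        speculative_victim(cache, byte)   # the secret warms exactly one line
        recovered.append(recover_byte(cache))
    return bytes(recovered)
```

Calling `leak(b"hunter2")` in this model hands back the original bytes, even though `recover_byte` never sees the secret directly. The real attacks are far more involved, but the flush, trigger, and measure rhythm is the same basic idea.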
If you need a more in-depth analogy, Ben Thompson published a great one at Stratechery.
The Spectre Of IT Operations Overload
Okay, so that is where we are: install your OS and browser vendors’ patches, and keep an eye on this issue for your next big hardware refresh. Apart from the usual headache of distributing patches and dealing with the dependencies that come with them, though, what does this have to do with day-to-day IT operations?
Here’s the problem: Nowadays, security vulnerabilities are not just CVEs discussed on dedicated mailing lists by small numbers of specialists. They are media celebrities, with exciting names: before Meltdown and Spectre, we had Rowhammer, GHOST, Shellshock, Sandworm, and of course Heartbleed, the first vuln to really break into the mainstream.
These hitherto obscure infosec issues are now reported in the mainstream news, not just in the tech press. That visibility may be a good thing if it pushes more people to patch their personal systems and avoid being affected, but the downside for ITOps is that, for the next year or so (or until the next big bug), everything that happens will be blamed either on the bug itself, or on its patch or workaround.
This is particularly true for Meltdown and Spectre, since the fixes for these vulnerabilities will reduce or even eliminate the performance gains from speculative execution. It is far from clear how large that impact will be, not least because it varies widely between use cases, but some users are reporting doubling of CPU utilisation.
www.twitter.com/berenguel/status/949608846179397633
This distraction is going to exacerbate the unfavourable signal-to-noise ratio that ITOps are already contending with. It’s hard enough to figure out which alerts are real and how they relate to each other, without being distracted by the suspicion that part of the problem might be due to this family of issues or one of its patches. All of that is on top of the effort and stress involved in getting a critical patch distributed everywhere in a timely manner.
There Is No Quick Fix For IT Operations
Now, I don’t want this to come off as an ambulance-chasing post of the sort we always see after every big breach or disclosure. Nothing could have protected you from this one, unless you are really into retro-computing; as many people jokingly pointed out on Twitter, VAX systems, PDPs, and the like are unaffected. Also, there isn’t really a complete fix yet, and the best advice is simply to keep current with your patches, which you really should be doing anyway.
More generally, though, it should be clear by now that this is not an isolated occurrence. There’s always another patch to roll out, another release to deploy, another change to make. IT Operations is no longer a back-office process that can be meticulously planned out, but an ongoing real-time activity. And that means it has to be done fundamentally differently.
The old approaches that assumed exhaustive planning and documentation no longer hold true. Everything moves too fast for that to work. Instead of manual processes, telephone bridges, and single-digit event/alert ratios, IT Operations in 2018 needs automation everywhere, streamlined collaboration, and small numbers of relevant, actionable alerts sifted automatically from the Big Data event streams that modern infrastructure generates.
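As a rough illustration of what "sifting" means at its very simplest, the sketch below collapses repeated events into a smaller number of alerts. To be clear about the assumptions: this is plain rule-based deduplication, not machine learning and not any particular product’s algorithm, and the `Event` fields, the `group_events` function, and the five-minute window are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some arbitrary start
    host: str
    check: str


def group_events(events, window=300.0):
    """Fold events for the same (host, check) into one alert as long as
    each arrives within `window` seconds of the previous one."""
    alerts = []
    latest = {}  # (host, check) -> most recent alert for that key
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.host, ev.check)
        alert = latest.get(key)
        if alert is not None and ev.timestamp - alert["last_seen"] <= window:
            # Same problem still firing: update the existing alert.
            alert["count"] += 1
            alert["last_seen"] = ev.timestamp
        else:
            # First sighting, or the previous alert has gone quiet.
            alert = {
                "host": ev.host,
                "check": ev.check,
                "first_seen": ev.timestamp,
                "last_seen": ev.timestamp,
                "count": 1,
            }
            latest[key] = alert
            alerts.append(alert)
    return alerts


events = [
    Event(0, "web1", "cpu"),
    Event(30, "db1", "disk"),
    Event(60, "web1", "cpu"),
    Event(120, "web1", "cpu"),
    Event(1000, "web1", "cpu"),
]
alerts = group_events(events)
print(f"{len(events)} events -> {len(alerts)} alerts")
```

Even this naive grouping reduces the number of items a human has to look at. The harder part is relating alerts from different hosts and checks to each other, and a fixed rule like this one cannot do that.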
AI & Machine Learning techniques are the only way to take enough friction out of IT Operations to be able to react nimbly to the next Meltdown or Spectre — or sudden project idea from marketing, new sales campaign, or change of heart from the corner office. The emerging discipline of AIOps is all about embedding the latest algorithmic techniques into ITOps, together with streamlined collaboration between all the different specialist roles that need to be informed or involved.
Once you’re done with this round of patches, take a moment to evaluate your current IT Operations processes, and consider how each fire drill is impacting them. It may be time to complement your existing specialist systems with an AI-driven overlay that can gain you the breathing space needed to deal with new situations without everything being an emergency.