When the Meltdown and Spectre CPU security vulnerabilities were publicly disclosed on Jan. 3, 2018, they set off a flurry of activity among IT users and cloud operators around the world. In a panel moderated by eWEEK at the OpenStack Summit in Vancouver, B.C., on May 24, operators detailed how they dealt with patching for Meltdown and why it was a time-consuming process.
When it comes to OpenStack, no operator in the world is larger than CERN, home of the Large Hadron Collider (LHC) and an OpenStack cloud infrastructure that has approximately 300,000 compute cores. Arne Wiebalck is responsible for the overall operations of CERN’s OpenStack cloud, and when vulnerabilities like Meltdown and Spectre appear, it’s his responsibility to react and deploy the corresponding fixes.
“CERN usually closes for two weeks during the winter break, so this actually was made known when everyone was still off,” he said.
CERN has a dedicated team that is responsible for cyber-security, according to Wiebalck. His operations team coordinated with the security team to understand what needed to be done to mitigate the risks of Meltdown and Spectre.
“What actually happened in the end is that we decided to shut down the whole cloud and patch,” Wiebalck said.
Given the scale of CERN’s OpenStack cloud, shutting down and patching was no trivial undertaking. Wiebalck said his team had to shut down and reboot more than 30,000 virtual machines and notify thousands of CERN cloud users that a shutdown was coming.
“We’ve run this cloud for roughly five years now in production, and I think it was the first time that we had to actually shut down everything,” he said.
CERN didn’t simply switch everything off at the same time, Wiebalck said, but rather carried out the patching, shutdown and reboot process in stages over several days. CERN used an iterative process, initially shutting down approximately 200 hypervisors to see whether they would come back up and whether any errors occurred.
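The staged approach Wiebalck describes, a small first batch followed by larger ones, can be sketched roughly as follows. This is only an illustration and not CERN’s tooling: the function names, the batch sizes and the `patch_and_reboot` and `is_healthy` callbacks are all hypothetical stand-ins for whatever an operator actually uses.

```python
from typing import Callable, List, Tuple


def staged_batches(hosts: List[str], canary_size: int, batch_size: int) -> List[List[str]]:
    """Split hosts into a small first (canary) batch followed by fixed-size batches."""
    if canary_size <= 0 or batch_size <= 0:
        raise ValueError("canary_size and batch_size must be positive")
    batches = [hosts[:canary_size]] if hosts else []
    rest = hosts[canary_size:]
    for start in range(0, len(rest), batch_size):
        batches.append(rest[start:start + batch_size])
    return batches


def rollout(
    hosts: List[str],
    canary_size: int,
    batch_size: int,
    patch_and_reboot: Callable[[str], None],  # hypothetical: patches and reboots one host
    is_healthy: Callable[[str], bool],        # hypothetical: post-reboot health check
) -> Tuple[List[str], List[str]]:
    """Patch batch by batch and stop at the first batch containing an unhealthy host.

    Returns (hosts in fully healthy batches, unhealthy hosts in the batch that stopped the rollout).
    """
    done: List[str] = []
    for batch in staged_batches(hosts, canary_size, batch_size):
        for host in batch:
            patch_and_reboot(host)
        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            return done, failed  # stop so a person can investigate before continuing
        done.extend(batch)
    return done, []
```

Stopping at the first unhealthy batch limits the damage if a patched kernel fails to boot on a particular class of hardware, which is the point of testing on a small group first.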
While CERN, like most large IT shops, relies on automation, Wiebalck said that patching and rebooting for Meltdown and Spectre involved many manual steps that people had to run and monitor.
“It was real humans. We have some tools actually to talk to hundreds of machines, of course, but it was actually me and my colleague basically doing this more or less manually,” Wiebalck said.
OpenStack Infrastructure
Clark Boylan is the project technical leader for the OpenStack infrastructure project and is responsible for running the systems used to build the OpenStack software that runs in clouds around the world. Boylan, like Wiebalck at CERN, had to reboot a large number of systems to patch for Meltdown and Spectre.
Boylan said that the OpenStack infrastructure team divided up the patching work among staff and made use of the Ansible configuration management technology to make sure that patched kernels were in place.
“We still had humans watching carefully to make sure that services were still running in the expected manner when they came back,” Boylan said.
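The kind of check Boylan mentions, confirming that a patched kernel is actually running on each host, might look something like the sketch below. It is written in Python for illustration rather than as an Ansible playbook, it is not the OpenStack infrastructure team’s code, and the host names and the minimum kernel release are placeholders rather than authoritative patched versions. It also assumes Debian/Ubuntu-style release strings such as those reported by `uname -r`.

```python
import re
from typing import Dict, List, Tuple


def kernel_tuple(release: str) -> Tuple[int, ...]:
    """Turn a release string such as '4.4.0-109-generic' into (4, 4, 0, 109)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def unpatched_hosts(running: Dict[str, str], minimum: str) -> List[str]:
    """Return the sorted names of hosts whose running kernel is older than `minimum`."""
    floor = kernel_tuple(minimum)
    return sorted(host for host, release in running.items() if kernel_tuple(release) < floor)


# Placeholder inventory: host name -> output of `uname -r` collected from that host.
running_kernels = {
    "build01": "4.4.0-104-generic",
    "build02": "4.4.0-112-generic",
}
print(unpatched_hosts(running_kernels, "4.4.0-109"))
```

A host that has the new kernel package installed but has not yet been rebooted still reports the old release, which is why the check compares the running kernel rather than the installed package.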
The Meltdown and Spectre patches raised concerns about potential performance degradation, which Boylan said his team monitored. The top priority for the OpenStack infrastructure team, however, was getting the Linux kernel patches deployed as quickly as possible.
Going a step further, Boylan noted that developers of the OpenStack Nova compute project added a feature that gives cloud operators finer control over CPU feature flags, so that they can restrict access to the more dangerous parts of the CPU and mitigate the patches’ impact on performance.
For members of the OpenStack community such as Cisco engineer Dave McCowan, the former project technical leader of the OpenStack Barbican secrets management project, Meltdown and Spectre serve as a good lesson for cloud operators.
“The lesson learned is to plan for any eventualities,” McCowan said. “When you think about architecting a cloud and planning your tools, know that you might need to patch or replace anything in the system from the hardware on up.”
Sean Michael Kerner is a senior editor at eWEEK and InternetNews.com. Follow him on Twitter @TechJournalist.