June 30, 2015, will be one second longer than normal. Almost no one will notice, but for computing systems that are not equipped to deal with it, the “leap second” could cause major problems.
A leap second is the adjustment that the International Earth Rotation and Reference Systems Service (IERS) has applied periodically since 1972 to keep Coordinated Universal Time, which is based on atomic clocks, in step with the Earth’s slowing rotation. World clocks have been adjusted in this fashion 25 times so far; the June 30 insertion will be the 26th.
The last time was in 2012, and it caused computers at many organizations, including LinkedIn, Gawker, Mozilla and Reddit, to experience brief hiccups as the internal clocks on their systems went out of sync with external clocks. The most high-profile organization to get hit was Qantas Airways, which suffered a two-hour disruption of critical systems, including those used for flight check-ins in Australia.
To avoid such issues, Google has implemented what it describes as a “clever” method for accommodating the extra second. The company’s approach is to “smear” away the leap second over a 20-hour window, during which the internal clocks on all of its servers will be slowed by roughly 14 parts per million, Google site reliability engineers Noah Maxwell and Michael Rothwell wrote in a May 21 blog post.
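The arithmetic behind the 14 parts per million figure is easy to check: one second spread across 20 hours (72,000 seconds) works out to about 13.9 millionths of a second per second. The short Python sketch below illustrates that calculation and is not Google’s code. It assumes the slowdown is applied at a constant rate, which the constant figure suggests, and the names `slowdown_ppm` and `smeared_offset` are invented for the example.

```python
LEAP_SECOND = 1.0            # seconds to absorb
SMEAR_WINDOW = 20 * 3600.0   # 20-hour window from the article, in seconds


def slowdown_ppm(window=SMEAR_WINDOW):
    """Rate change, in parts per million, needed to absorb one second over `window`."""
    return LEAP_SECOND / window * 1e6


def smeared_offset(elapsed, window=SMEAR_WINDOW):
    """Portion of the leap second absorbed `elapsed` seconds into the window.

    Assumes a linear (constant-rate) smear; values outside the window are clamped.
    """
    fraction = min(max(elapsed / window, 0.0), 1.0)
    return LEAP_SECOND * fraction


if __name__ == "__main__":
    print(f"slowdown: {slowdown_ppm():.1f} ppm")
    print(f"absorbed at the halfway point: {smeared_offset(10 * 3600):.3f} s")
```

Under these assumptions, a server halfway through the window would have absorbed half a second, and the full second by the end.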
“At the end of the smear window, the entire leap second has been added, and we are back in sync with civil time,” the two engineers noted. The smear method is a little simpler than the approach Google developed in 2011 to address the issue, they said.
Under that earlier approach, which Google called a “leap smear,” the company modified all of the Network Time Protocol (NTP) servers that its internal systems use to keep accurate time. Google added a couple of milliseconds to every update provided by its NTP servers over a time window before the leap second actually occurred, the company said then. “This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day,” a Google engineer said at the time.
Taking care of the extra second is important because any system that depends on time sequencing could experience problems when it encounters one, Maxwell and Rothwell said.
“This problem is accentuated for multi-node distributed systems, because a one-second jump dramatically magnifies time-sync discrepancies between multiple nodes,” they wrote.
As an example, they pointed to two separate events being written to a database with the same timestamp, or even a later event being recorded as having happened before an earlier one because of the leap second. Most software, including that used by Google, isn’t designed to handle the problem, the engineers said in explaining the company’s rationale for absorbing the extra second in tiny, almost imperceptible increments over a 20-hour window.
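The ordering problem the engineers describe can be sketched in a few lines of Python. The example below is a simplified model, not a description of how any particular operating system or Google system behaves. It assumes a “stepped” clock that jumps back one second when the leap second ends and a “smeared” clock that runs slightly slow at a constant rate. The function names and the sample times are made up for illustration.

```python
def stepped_clock(true_elapsed, leap_at):
    """Reading of a clock that jumps back one second once the leap second ends.

    `true_elapsed` is real elapsed time in seconds; the inserted second
    is assumed to span [leap_at, leap_at + 1).
    """
    if true_elapsed >= leap_at + 1.0:
        return true_elapsed - 1.0
    return true_elapsed


def smeared_clock(true_elapsed, window_start, window):
    """Reading of a clock that absorbs the extra second linearly over `window`."""
    fraction = min(max((true_elapsed - window_start) / window, 0.0), 1.0)
    return true_elapsed - fraction


# Two events 0.2 seconds apart, on either side of the end of the leap second.
first_event, second_event = 1000.9, 1001.1

# Stepped clock: the later event receives the earlier timestamp.
print(stepped_clock(first_event, 1000.0), stepped_clock(second_event, 1000.0))

# Smeared clock: timestamps keep increasing, so the order is preserved.
print(smeared_clock(first_event, 0.0, 72000.0), smeared_clock(second_event, 0.0, 72000.0))
```

In this model the smeared clock never runs backward, because it loses only a tiny fraction of a second for every second that passes.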
All Google Compute Engine services have been prepped to handle the leap-second issue, and organizations using the default NTP service or the system clock should not have any problems, the Google engineers said.
Those using an external time service may encounter issues, they added. It is not yet clear how external NTP services will handle the extra second, so it is hard to know how internal and external clocks will sync in those cases.
“If possible, you should avoid using external NTP sources on Compute Engine during the leap event,” Maxwell and Rothwell wrote.