What We Should Learn from July 8, the Great Glitch Day

NEWS ANALYSIS: We are treating symptoms of software and security illnesses rather than determining the underlying causes, software quality expert tells eWEEK.

We get pelted every day with so many news stories about IT failures, hacker attacks, zero-day exploits and inside-job data breaches that we're in danger of becoming numb to them. Too much of anything will do that to you.

But the July 8 crashes that hit a handful of well-known organizations stand alone, even against the Targets, Sonys, IRSs and Home Depots of the world.

We're talking about what's becoming known as the Great Glitch Day, July 8, 2015, when business-critical computer systems at United Airlines, the New York Stock Exchange and the Wall Street Journal all went down for several hours. The outages might have been the work of hackers, but they also could have been the result of problems inside the software, which can be just as dastardly to a business as a hacker.

The Atlantic magazine called it "The Day the Computers Betrayed Us." That publication didn't even count the failure of the 911 system in Seattle.

We're Not Hearing About Many Big Glitches

Hacker attacks and glitches like those noted above remind us clearly that IT is an imperfect science, that errors will happen, and that software is an ever-evolving art form, of sorts.

Software quality expert Lev Lesokhin, Executive Vice-President of Strategy and Analytics at CAST Software, reminded eWEEK about the government's system failure that prevented thousands of foreign visitors from obtaining visas to enter the United States. He was also willing to bet that a lot of people didn't hear about the computer glitch in the radar system that grounded virtually every flight in and out of New Zealand late in June.

Computer snafus are nothing new to United Airlines, Lesokhin said. "As far back as 2006, the airline's reservation system was being knocked off line for hours at a time," he told eWEEK. "Then, a few weeks ago, there was the Royal Bank of Scotland; a glitch in its computer system prevented customers from receiving 600,000 payments for several days. It wasn't the first time RBS had computer problems; in 2012, millions of its customers were unable to access their accounts because of a software failure. RBS was fined £56 million for what its chairman acknowledged were 'unacceptable weaknesses in our systems.'"

Examining all these instances that took place within a relatively short window of time raises the question: Exactly how many others happened, and are happening now, that we will never hear about?

"We really shouldn't be surprised by these continual system outages. If anything, we should be more surprised that they don't happen or aren't reported more often," Lesokhin said. "It's bad now, and unless things change dramatically, they'll only get worse."

Treating Symptoms, Not the Cause

Lesokhin, a former SAP executive and McKinsey & Co. researcher, sees the underlying problem this way: Every time a situation like this arises, there's a mad scramble to fix the immediate problem. Too often, that's the end of the effort until the next snafu erupts.

"It's akin to treating the symptoms of an illness, rather than trying to determine the underlying cause," Lesokhin said.

This is not to suggest that IT organizations don't care or that they don't want their systems to work as perfectly as they can. It's just that too often they are overwhelmed by the immensity of it all, Lesokhin said.

"Consider this: When a company develops customized software to automate specific business processes, their applications can run hundreds of thousands of lines, and even more often, millions of lines of code. Each line of code is written to perform a specific function, and if even one of those lines does not interact with the rest of the system as it's designed to do, it can bring down that entire system, as we have seen time and time again. A routine software update can cause an inconsistency with the existing code, and havoc ensues," Lesokhin said.

"So, while United can blame 'computer automation issues,' and the NYSE can blame a 'computer glitch' or a routine update, it still does not resolve the issue of how to try and prevent this from happening again—and again."

Proactive Security Coming into Play

Companies are adding proactive security to their current armored-car approaches that rely on firewalls, private networks, password authentication and other perimeter defenses. As time goes on, IT hardware and software makers are beginning to build security into products at the protocol and silicon levels, but the progress is slow.

These measures are years later than they should be, but at least they are starting to happen.

Enterprises need to be able to analyze and measure the computer applications that are at the heart of their IT systems for things such as their robustness (how well they'll operate under stress), or whether a given application is optimized to work as efficiently as possible, or whether it is securely architected, Lesokhin said.

"These analyses must be turned into analytics that are front and center in the CIOs' and business executives' dashboards. It is no longer the responsibility of the techies to make sure that systems are well constructed. It has become the responsibility of the business and the IT executives to keep score on the structural quality of their most critical IT applications," Lesokhin said.

It's a tough task and one that is not easily accomplished, but systems exist that can automate the process. For example, the Object Management Group has recently approved a set of global standards proposed by the Consortium for IT Software Quality (CISQ) that would help companies quantify and meet specific goals for software quality.

We Don't Know Amount of Losses from Breaches

As of today, neither the NYSE nor United Airlines has publicly disclosed how much it stands to lose as a result of the outages, but the amount may well be substantial. A trading firm reported a loss of $450 million three years ago in a single day because of a software malfunction. If nothing else, the outages are a serious knock on both organizations' reputations.

"In the meantime, I'll just keep watching the headlines for the next snafu that arises. Because as things currently stand, unfortunately it's just a matter of time," Lesokhin said.

I'm reasonably sure we all will be watching.

Chris J. Preimesberger

Chris J. Preimesberger is Editor-in-Chief of eWEEK and responsible for all the publication's coverage. In his 13 years and more than 4,000 articles at eWEEK, he has distinguished himself in reporting...