IBM's Breakthrough Distributed Computation for Deep Learning Workloads

NEWS ANALYSIS: Why deep learning is a literal ‘killer app’ for computers, and how IBM has figured out how to distribute computing for much faster processing of big-data artificial intelligence workloads.


Off the top, it sounds simple enough: You have one big, fast server processing an artificial-intelligence-related, big data workload. Then the requirements change; much more data needs to be added to the process to get the project done in a reasonable span of time. Logic says that all you need to do is add more horsepower to do the job.

As Dana Carvey used to say in his comedy act when satirizing President George H.W. Bush: “Not gonna do it.”

That’s right: Until today, adding more servers would not have solved the problem. Deep-learning analytics systems up to now have only been able to run on a single server; use cases simply haven’t been scalable by adding more servers, and there are major backend reasons for that.

All that is now history. IBM on Aug. 8 announced that its researchers have changed this by coming up with new distributed deep learning software that has taken quite a while to develop. This is very probably the biggest step forward in artificial intelligence computing in at least the last decade.

Connecting Servers for AI Jobs Sounds Easy, but Isn't

By connecting a group of servers to work in concert on a single problem, IBM Research has reached a milestone in making deep learning much more practical at scale. AI models can now be trained on millions of photos, drawings or even medical images, with faster training and significant gains in image recognition accuracy, as IBM's initial results show.

Also on Aug. 8, IBM released a beta version of its PowerAI software so that cognitive and AI developers can build more accurate AI models that make better predictions. The software will help shorten the time it takes to train AI models from days and weeks to hours.

What exactly makes deep learning so time-consuming to process? First of all, it involves many gigabytes or terabytes of data. Secondly, the software that can comb through all of this information is only now being optimized for workloads of this kind.

One thing a lot of people haven’t yet gotten straight is what sets deep learning apart from machine learning, artificial intelligence and cognitive intelligence.

Deep Learning a Subset of Machine Learning

“Deep learning is considered to be a subset, or a particular method, within this bigger term, which is machine learning,” Sumit Gupta, IBM Cognitive Systems Vice-President of High Performance Computing and Data Analytics, told eWEEK.

“The best example I always give about deep learning is this: When we’re teaching a kid how to recognize dogs and cats, we show them lots of images of dogs, and eventually one day the baby says ‘dog.’ The baby doesn’t look at the fact that the dog has four legs and a tail, or other details about it; the baby is actually perceiving a dog. That’s the big difference between the traditional computer models, where they were sort of ‘if and else’-type models versus perception. Deep learning tries to mimic that by this method called neural networks.”

The problem with deep learning is that it is extremely computationally intensive, Gupta said. The high communication overhead was by far the biggest challenge.

“It just kills computers; it’s the ultimate ‘killer app’ for computers,” Gupta said with a laugh. “We’ve been using GPU (graphics processing unit) accelerators to accelerate deep learning ‘training.’ What we do is give these computer models millions of images, but then we have to train them on computers with powerful GPUs (to record and understand what the images entail).

“Most deep learning frameworks scale to multiple GPUs in a server, but not to multiple servers with GPUs. Specifically, our team wrote software and algorithms that automate and optimize the parallelization of this very large and complex computing task across hundreds of GPU accelerators attached to dozens of servers. This is hard to do!”
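To see why spreading training across servers creates communication overhead, consider a minimal sketch of data-parallel training in plain Python. This is not IBM's DDL code, which the article does not describe in detail; it is a toy illustration under stated assumptions. Each "worker" computes a gradient on its own shard of the data, and the workers then average their gradients before anyone can update the model. On a real cluster that averaging step is a network exchange among all GPUs, repeated at every training step. The function names, the one-parameter model and the data below are all hypothetical.

```python
from typing import List, Sequence


def local_gradient(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of mean squared error for the toy model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: Sequence[float]) -> float:
    """Stand-in for the network step: every worker ends up with the mean gradient.

    On a real cluster this is where the communication cost is paid.
    """
    return sum(values) / len(values)


def data_parallel_step(
    w: float,
    xs: Sequence[float],
    ys: Sequence[float],
    num_workers: int,
    learning_rate: float,
) -> float:
    """One training step with the batch split evenly across num_workers."""
    shard_size = len(xs) // num_workers
    if shard_size * num_workers != len(xs):
        raise ValueError("batch must divide evenly across workers in this sketch")

    # Each worker sees only its own slice of the batch.
    gradients: List[float] = []
    for i in range(num_workers):
        start, stop = i * shard_size, (i + 1) * shard_size
        gradients.append(local_gradient(w, xs[start:stop], ys[start:stop]))

    # Workers must agree on one gradient before updating the shared model.
    return w - learning_rate * all_reduce_mean(gradients)


# Toy data drawn from y = 2x (hypothetical values for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

single_worker = data_parallel_step(0.0, xs, ys, num_workers=1, learning_rate=0.01)
four_workers = data_parallel_step(0.0, xs, ys, num_workers=4, learning_rate=0.01)
```

With equal-sized shards, the four-worker step lands on the same weight as the single-worker step, which is the point of the technique: more hardware shares the arithmetic, but every step now waits on an exchange between machines.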

IBM Found the 'Ideal Scaling'

Gupta said IBM Research posted close to ideal scaling with its new distributed deep learning software that achieved record low communication overhead and 95 percent scaling efficiency on the open source Caffe deep learning framework over 256 GPUs in 64 IBM Power systems.

The previous best scaling, demonstrated by Facebook AI Research, was 89 percent for a training run on Caffe2, at higher communication overhead. Using this software, IBM Research achieved a new record image recognition accuracy of 33.8 percent for a neural network trained on a very large data set (7.5 million images). The previous record, published by Microsoft, was 29.8 percent accuracy.

A technical preview of this IBM Research Distributed Deep Learning code is available today in the IBM PowerAI 4.0 distribution for TensorFlow and Caffe.

IBM demonstrated the scaling of the Distributed Deep Learning software by training a ResNet-101 deep learning model on 7.5 million images from the ImageNet-22K data set, with an image batch size of 5,120. The team used a cluster of 64 IBM Power servers with a total of 256 NVIDIA P100 GPU accelerators, achieving a scaling efficiency of 88 percent, enabled by very low communication overhead.
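What do those efficiency figures mean in practice? The article does not define the term, so the short sketch below assumes the common definition: scaling efficiency is the measured speedup over a single GPU divided by the number of GPUs. Under that assumption, multiplying the GPU count by the efficiency gives the number of "perfectly scaled" GPUs the cluster is equivalent to. The function name is hypothetical; the 256-GPU count and the 95 and 88 percent figures come from IBM's reported results.

```python
def effective_gpus(num_gpus: int, scaling_efficiency: float) -> float:
    """Equivalent number of ideally scaled GPUs.

    Assumes scaling_efficiency = speedup_over_one_gpu / num_gpus,
    a common definition that the article itself does not spell out.
    """
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in the range (0, 1]")
    return num_gpus * scaling_efficiency


NUM_GPUS = 256  # 64 IBM Power servers, as reported

caffe_run = effective_gpus(NUM_GPUS, 0.95)       # 95 percent run on Caffe
resnet101_run = effective_gpus(NUM_GPUS, 0.88)   # 88 percent ResNet-101 run

print(f"95% efficiency: about {caffe_run:.0f} of {NUM_GPUS} GPUs' worth of work")
print(f"88% efficiency: about {resnet101_run:.0f} of {NUM_GPUS} GPUs' worth of work")
```

Read this way, the gap between 95 and 88 percent is the difference between losing roughly 13 GPUs' worth of work to overhead and losing roughly 31.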

Distributed deep learning holds promise to fuel breakthroughs in everything from consumer mobile app experiences to medical imaging diagnostics. But progress in accuracy and the practicality of deploying deep learning at scale is gated by the technical challenges of running massive deep learning-based AI models, with training times measured in days and weeks, Gupta said.

What the Analysts Are Saying

“This is one of the bigger breakthroughs I have seen in a while in all of the deep learning industry announcements over the last six months,” Patrick Moorhead, president and principal analyst of Moor Insights & Strategy, told eWEEK. “The interesting part is that it is from IBM, not one of the web giants like Google, which means it is available to enterprises for on-prem use using OpenPOWER hardware and PowerAI software or even through cloud provider Nimbix.

“What’s most impressive is the near-linear scaling as you add scale-out nodes, between 90 and 95 percent performance. The simple way to look at this is scale-out AI versus traditional, scale-up most everyone uses today. You can add an order of magnitude more performance.” 

Rob Enderle of The Enderle Group told eWEEK that the significance of the IBM announcement is “the fact that you can scale the performance of the Deep Learning operation with hardware. There used to be a hard limit on how many GPUs you could use on a deep learning operation; IBM has effectively removed that limit, allowing a firm to buy down the time it takes to finish an operation with hardware.

“This is a big step, particularly in areas like security and fraud protection, because the length of time it took to train a system was typically measured in days but damages could run to millions in minutes.  With this you should be able to put in place solutions that can more reasonably address some of these massive exposures more timely.”

Charles King of Pund-IT told eWEEK that “the speed improvements IBM achieved (in a visual recognition training run using Caffe, a complex neural network and a very large - 7.5 million images - database) were startling. The previous record holder's (Microsoft) system completed the run in 10 days, achieving 29.8 percent accuracy. IBM's cluster with its new DDL library finished the run in 7 hours, and achieved 33.8 percent accuracy.

“Additionally, IBM's DDL library and APIs are available to anyone who utilizes the company's Power Systems and its PowerAI V4. Plus, along with initially supporting Caffe and TensorFlow AI frameworks, IBM plans to make the library and APIs available for Torch and Chainer,” King said.

“Overall, by substantially eliminating DL training bottlenecks and blowing away current performance leaders, IBM's new DDL library and APIs should make AI projects more compelling and attractive to businesses and other organizations.”
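The run times King cites reduce to simple arithmetic, sketched here in Python. The figures are the ones quoted above (10 days versus 7 hours, 29.8 versus 33.8 percent accuracy); the function and variable names are hypothetical. Note the assumption baked in: this is a ratio of wall-clock times only, and the article does not say the two runs used comparable hardware.

```python
HOURS_PER_DAY = 24


def wall_clock_speedup(baseline_hours: float, new_hours: float) -> float:
    """Ratio of two run times; says nothing about the hardware behind each run."""
    if baseline_hours <= 0 or new_hours <= 0:
        raise ValueError("run times must be positive")
    return baseline_hours / new_hours


previous_record_hours = 10 * HOURS_PER_DAY  # Microsoft run: 10 days
ibm_hours = 7                               # IBM DDL run: 7 hours

speedup = wall_clock_speedup(previous_record_hours, ibm_hours)

# Accuracy difference in percentage points, using the quoted figures.
accuracy_gain_points = 33.8 - 29.8

print(f"Wall-clock speedup: {speedup:.1f}x")
print(f"Accuracy gain: {accuracy_gain_points:.1f} percentage points")
```

That works out to a run roughly 34 times shorter, along with a gain of 4 percentage points in accuracy.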

Chris Preimesberger


Chris Preimesberger is Editor of Features & Analysis at eWEEK, responsible in large part for the publication's coverage areas. In his 12 years and more than 3,900 stories at eWEEK, he has...