eWEEK.com

    Best Practices for Preparing Data Centers for AI, ML and DL

    By EWEEK EDITORS - July 30, 2020

      The intensive demands of artificial intelligence, machine learning and deep learning applications challenge data center performance, reliability and scalability, especially as architects mimic the design of public clouds to simplify the transition to hybrid cloud and on-premises deployments.

      GPU (graphics processing unit) servers are now common, and the ecosystem around GPU computing is rapidly evolving to increase the efficiency and scalability of GPU workloads. Yet there are tricks to maximizing the utilization of costly GPUs while avoiding potential choke points in storage and networking.

      In this edition of eWEEK Data Points, Sven Breuner, field CTO, and Kirill Shoikhet, chief architect, at Excelero, offer nine best practices on preparing data centers for AI, ML and DL. 

      Data Point No. 1: Know your target system performance, ROI and scalability plans.

      Knowing these up front lets them dovetail with data center goals. Most organizations start with a small initial budget and small training data sets, and prepare the infrastructure for seamless and rapid system growth as AI becomes a valuable part of the core business. The chosen hardware and software infrastructure needs to be built for flexible scale-out to avoid disruptive changes with every new growth phase. Close collaboration between data scientists and system administrators is critical for learning the performance requirements and understanding how the infrastructure might need to evolve over time.

      Data Point No. 2: Evaluate clustering multiple GPU systems, either now or for the future. 

      Having multiple GPUs inside a single server enables efficient data sharing and communication inside the system as well as cost-effectiveness, with reference designs presuming future clustered use and supporting up to 16 GPUs inside a single server. A multi-GPU server needs to be prepared to read incoming data at a very high rate to keep the GPUs busy, meaning it needs an ultra-high-speed network connection all the way through to the storage system for the training database. However, at some point a single server will no longer be enough to work through the growing training database in a reasonable time, so building a shared storage infrastructure into the design will make it easier to add GPU servers as AI/ML/DL use expands.
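
To make the "reasonable time" question concrete, the following is a minimal sketch of how one might estimate the time for a single pass over the training data. The function name, its parameters and the example figures are illustrative assumptions, not numbers from the authors, and the sketch assumes shared storage whose read rate scales linearly with the number of GPU servers.

```python
def epoch_hours(dataset_tb: float, per_server_gbytes_per_s: float, servers: int) -> float:
    """Hours needed to read the whole training set once.

    Assumes (hypothetically) that every server sustains the same read rate
    and that the shared storage scales linearly with the server count.
    """
    if servers < 1 or per_server_gbytes_per_s <= 0:
        raise ValueError("need at least one server and a positive read rate")
    dataset_gb = dataset_tb * 1000  # decimal units: 1 TB = 1000 GB
    aggregate_gbytes_per_s = per_server_gbytes_per_s * servers
    return dataset_gb / aggregate_gbytes_per_s / 3600


# Illustrative figures only: a 36 TB training set read at 10 GB/s per server.
one_server = epoch_hours(36, 10, servers=1)
four_servers = epoch_hours(36, 10, servers=4)
```

If the single-server estimate is already longer than the team can tolerate, that is a signal to plan for shared storage and additional GPU servers from the start.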

      Data Point No. 3: Assess chokepoints across the AI workflow phases. 

      The data center infrastructure needs to be able to deal with all phases of the AI workflow at the same time. Having a solid concept for resource scheduling and sharing is critical for cost-effective data centers, so that while one group of data scientists gets new data that needs to be ingested and prepared, others can train on their available data, while elsewhere, previously generated models are used in production. Kubernetes has become a popular solution to this problem, making cloud technology easily available on premises and making hybrid deployments feasible.

      Data Point No. 4: Review strategies for optimizing GPU utilization and performance.

      The computationally intensive nature of many AI/ML/DL applications makes GPU-based servers a common choice. However, while GPUs are efficient at loading data from RAM, training datasets usually far exceed RAM, and the massive number of files involved becomes more of a challenge to ingest. Achieving the optimal balance between the number of GPUs and the available CPU power, memory and network bandwidth (both between GPU servers and to the storage infrastructure) is critical.
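
One simple way to reason about this balance is to treat ingest as a pipeline whose throughput is capped by its slowest stage. The sketch below is an illustration under that assumption; the stage names and rates are hypothetical and would need to be replaced with measured values.

```python
from typing import Dict, Tuple


def pipeline_bottleneck(stage_rates: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its rate in files per second."""
    if not stage_rates:
        raise ValueError("at least one stage is required")
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


def gpu_utilization(stage_rates: Dict[str, float], gpu_stage: str = "gpu_compute") -> float:
    """Fraction of the GPUs' consumption rate that the pipeline can actually feed."""
    _, slowest_rate = pipeline_bottleneck(stage_rates)
    return slowest_rate / stage_rates[gpu_stage]


# Hypothetical per-server rates in files per second.
rates = {
    "storage_read": 40_000,
    "network": 60_000,
    "cpu_preprocess": 30_000,
    "gpu_compute": 50_000,
}
slowest_stage, slowest_rate = pipeline_bottleneck(rates)
utilization = gpu_utilization(rates)
```

With these made-up numbers, CPU preprocessing is the limiting stage and the GPUs would sit idle part of the time, so adding GPUs would not help until that stage is addressed.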

      Data Point No. 5: Support the demands of the training and inference phases. 

      In the classic example of training a system to “see” a cat, computers perform a numbers game: the computer (or rather the GPUs) needs to see lots and lots of cats in all colors. Because this access pattern consists of massively parallel file reads, NVMe flash is well suited to these requirements, providing both ultra-low access latency and a high number of read operations per second. In the inference phase, the challenge is similar, in that object recognition typically happens in real time, another use case where NVMe flash storage provides a latency advantage.
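
The access pattern described here, many small files read concurrently, can be sketched with Python's standard thread pool. This is a minimal illustration rather than a production data loader; the helper names and the worker count are assumptions, and real training frameworks ship their own loaders.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List


def read_file(path: Path) -> bytes:
    """Read one (small) training file fully into memory."""
    return path.read_bytes()


def read_all(paths: Iterable[Path], max_workers: int = 32) -> List[bytes]:
    """Issue many file reads in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, paths))
```

Threads fit this case because each read spends most of its time waiting on storage, so many requests can be in flight at once, which is the kind of load that low-latency flash handles well.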

      Data Point No. 6: Consider parallel file systems and alternatives. 

      Parallel file systems like IBM Spectrum Scale or BeeGFS can help in handling the metadata of a large number of small files efficiently, enabling 3X-4X faster analysis of ML data sets by delivering tens of thousands of small files per second across the network. Given the read-only nature of training data, it’s also possible to avoid the need for a parallel file system altogether by making the data volumes directly available to the GPU servers and sharing them in a coordinated way through a framework like Kubernetes.

      Data Point No. 7: Choose the right networking backbone. 

      AI/ML/DL is usually a new workload, and existing network infrastructure often cannot support the low latency, high bandwidth, high message rate and smart offloads that it requires for complex computations and fast, efficient data delivery. The RDMA-based network transports RoCE (RDMA over Converged Ethernet) and InfiniBand have become the standards for meeting these new demands.

      Data Point No. 8: Consider four storage system levers for price/performance. 

      1) High read throughput combined with low latency, in a form that doesn’t constrain hybrid deployments and can run on either cloud or on-premises resources.

      2) Data protection. The AI/ML/DL storage system is typically significantly faster than others in the data center, so recovering it from backup after a complete failure might take a very long time and disrupt ongoing operations. The read-mostly nature of DL training makes it a good fit for distributed erasure coding where the highest level of fault tolerance is already built into the primary storage system, with a very small difference between raw and usable capacity.
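
The gap between raw and usable capacity can be illustrated with a little arithmetic. The sketch below compares a k+m erasure-coded layout with plain replication; the 8+2 layout is a hypothetical example, not a configuration recommended in the article.

```python
def erasure_usable_fraction(data_shards: int, parity_shards: int) -> float:
    """Usable share of raw capacity for a k+m erasure-coded layout.

    Such a layout tolerates the loss of up to `parity_shards` shards.
    """
    if data_shards < 1 or parity_shards < 0:
        raise ValueError("need at least one data shard and no negative parity")
    return data_shards / (data_shards + parity_shards)


def replication_usable_fraction(copies: int) -> float:
    """Usable share of raw capacity when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1 / copies


# Hypothetical layouts that both survive two simultaneous failures.
ec_8_plus_2 = erasure_usable_fraction(8, 2)        # 80% of raw capacity usable
three_way_copy = replication_usable_fraction(3)    # about 33% usable
```

Both layouts in the example survive two failures, but the erasure-coded one leaves far more of the raw flash capacity available for training data.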

      3) Capacity elasticity to accommodate any size or type of drive, so that as flash media evolve and flash drive characteristics expand, data centers can maximize price/performance at scale, when it matters the most. 

      4) Performance elasticity. Since AI data sets need to grow over time to further improve the accuracy of models, storage infrastructure should achieve a close-to-linear scaling factor, where each incremental storage addition brings the equivalent incremental performance. This allows organizations to start small and grow non-disruptively as business dictates.
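
"Close-to-linear" scaling can be checked by comparing measured throughput after an expansion with what perfectly linear scaling would predict. The following is a rough sketch; the function name and the sample measurements are invented for illustration.

```python
def scaling_efficiency(base_nodes: int, base_throughput: float,
                       nodes: int, throughput: float) -> float:
    """Measured throughput divided by the ideal, perfectly linear throughput.

    A result of 1.0 means perfectly linear scaling; lower values mean each
    added node contributed less than its proportional share.
    """
    if base_nodes < 1 or nodes < 1 or base_throughput <= 0:
        raise ValueError("node counts and base throughput must be positive")
    ideal = base_throughput * nodes / base_nodes
    return throughput / ideal


# Hypothetical measurement: 4 storage nodes deliver 20 GB/s, 8 deliver 38 GB/s.
efficiency = scaling_efficiency(4, 20.0, 8, 38.0)
```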

      Data Point No. 9: Set benchmarks and performance metrics to aid in scalability. 

      For example, for deep learning storage, one metric might be that each GPU processes X files per second (typically thousands or tens of thousands), where each file has an average size of Y KB (from only a few dozen to thousands). Establishing appropriate metrics upfront helps to qualify architectural approaches and solutions from the outset, and to guide follow-on expansions.
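
The X and Y metric translates directly into storage requirements. The sketch below, with hypothetical function names and sample values, converts files per second per GPU and average file size into the aggregate read rate and read operations per second the storage system would have to sustain.

```python
def required_read_gbytes_per_s(files_per_s_per_gpu: float,
                               avg_file_kb: float, gpus: int) -> float:
    """Aggregate read throughput in GB/s (decimal units: 1 KB = 1000 bytes)."""
    bytes_per_s = files_per_s_per_gpu * avg_file_kb * 1000 * gpus
    return bytes_per_s / 1e9


def required_file_reads_per_s(files_per_s_per_gpu: float, gpus: int) -> float:
    """Aggregate number of file reads per second across all GPUs."""
    return files_per_s_per_gpu * gpus


# Hypothetical target: X = 10,000 files/s per GPU, Y = 100 KB, 8 GPUs.
throughput_gb_s = required_read_gbytes_per_s(10_000, 100, 8)   # 8.0 GB/s
file_reads_per_s = required_file_reads_per_s(10_000, 8)        # 80,000 reads/s
```

Candidate storage and network designs can then be qualified against both numbers rather than against bandwidth alone, since small-file workloads are often limited by operations per second.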

      If you have a suggestion for an eWEEK Data Points article, email [email protected].
