
    Best Practices for Preparing Data Centers for AI, ML and DL

    By
    eWEEK EDITORS
    -
    July 30, 2020

      The intensive demands of artificial intelligence, machine learning and deep learning applications challenge data center performance, reliability and scalability, especially as architects mimic the design of public clouds to simplify the transition to hybrid cloud and on-premises deployments. 

      GPU (graphics processing unit) servers are now common, and the ecosystem around GPU computing is rapidly evolving to increase the efficiency and scalability of GPU workloads. Yet there are tricks to maximizing utilization of these more costly GPUs while avoiding potential choke points in storage and networking. 

      In this edition of eWEEK Data Points, Sven Breuner, field CTO, and Kirill Shoikhet, chief architect, at Excelero, offer nine best practices on preparing data centers for AI, ML and DL. 

      Data Point No. 1: Know your target system performance, ROI and scalability plans.

      Target performance, ROI and scalability plans should dovetail with data center goals. Most organizations start with an initially small budget and small training data sets, and prepare the infrastructure for seamless and rapid system growth as AI becomes a valuable part of the core business. The chosen hardware and software infrastructure needs to be built for flexible scale-out to avoid disruptive changes with every new growth phase. Close collaboration between data scientists and system administrators is critical to learn about the performance requirements and get an understanding of how the infrastructure might need to evolve over time.

      Data Point No. 2: Evaluate clustering multiple GPU systems, either now or for the future. 

      Having multiple GPUs inside a single server enables efficient data sharing and communication inside the system as well as cost-effectiveness, with reference designs presuming a future clustered use with support for up to 16 GPUs inside a single server. A multi-GPU server needs to be prepared to read incoming data at a very high rate to keep the GPUs busy, meaning it needs an ultra-high-speed network connection all the way through to the storage system for the training database. However, at some point a single server will no longer be enough to work through the grown training database in reasonable time, so building a shared storage infrastructure into the design will make it easier to add GPU servers as AI/ML/DL use expands.

      Data Point No. 3: Assess chokepoints across the AI workflow phases. 

      The data center infrastructure needs to be able to deal with all phases of the AI workflow at the same time. Having a solid concept for resource scheduling and sharing is critical for cost-effective data centers, so that while one group of data scientists gets new data that needs to be ingested and prepared, others will train on their available data, while elsewhere, previously generated models will be used in production. Kubernetes has become a popular solution to this problem, making cloud technology easily available on premises and making hybrid deployments feasible.
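      A minimal sketch of what such Kubernetes-based sharing of a GPU server can look like is below. The manifest is illustrative, not from the article: the pod name, container image and PVC name are hypothetical, and it assumes the NVIDIA device plugin is installed so that `nvidia.com/gpu` is a schedulable resource.

```yaml
# Illustrative: a training job requesting 2 GPUs, so the scheduler can
# share a multi-GPU server among several data-science teams.
apiVersion: v1
kind: Pod
metadata:
  name: train-job-example          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/team/trainer:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 2        # requires the NVIDIA device plugin
      volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true           # training data is read-mostly
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: training-dataset-pvc   # hypothetical shared-storage claim
```

      Because the training volume is mounted read-only, several such pods can share one dataset claim while the scheduler packs them onto available GPUs.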

      Data Point No. 4: Review strategies for optimizing GPU utilization and performance.

      The computationally intensive nature of many AI/ML/DL applications makes GPU-based servers a common choice. However, while GPUs are efficient at loading data from RAM, training datasets usually far exceed RAM, and the massive number of files involved becomes more of a challenge to ingest. Achieving the optimal balance between the number of GPUs and the available CPU power, memory and network bandwidth, both between GPU servers and to the storage infrastructure, is critical.
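      One common way to keep GPUs from idling on storage is to prefetch batches on CPU threads while the accelerator computes. The sketch below is illustrative only (the article names no code); the `load_batch` and `train_step` functions are stand-ins with simulated delays, not a real training loop.

```python
# Minimal sketch of overlapping data loading with compute so the GPU
# is never idle waiting on storage. Stdlib only; delays are simulated.
import concurrent.futures
import time

def load_batch(i):
    """Stand-in for reading and decoding one batch from storage."""
    time.sleep(0.01)              # simulated I/O latency
    return f"batch-{i}"

def train_step(batch):
    """Stand-in for one GPU training step."""
    time.sleep(0.01)              # simulated compute
    return batch

def train(num_batches, workers=4):
    done = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # Prefetch: submit reads ahead of the compute loop, so I/O for
        # batch i+1 overlaps with compute on batch i.
        futures = [pool.submit(load_batch, i) for i in range(num_batches)]
        for fut in futures:
            done.append(train_step(fut.result()))
    return done

result = train(8)
print(result[-1])   # prints batch-7
```

      The same idea underlies the `num_workers` prefetching found in mainstream data-loader libraries; the balance point is reached when loader threads saturate neither the CPUs nor the network before the GPUs are fully busy.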

      Data Point No. 5: Support the demands of the training and inference phases. 

      In the classic example of training a system to “see” a cat, computers play a numbers game: the computer (or rather its GPUs) needs to see lots and lots of cats in all colors. Because these accesses consist of massively parallel file reads, NVMe flash is an ideal fit, providing both ultra-low access latency and a high number of read operations per second. In the inference phase, the challenge is similar, in that object recognition typically happens in real time, another use case where NVMe flash storage provides a latency advantage. 

      Data Point No. 6: Consider parallel file systems and alternatives. 

      Parallel file systems like IBM Spectrum Scale or BeeGFS can help in handling the metadata of a large number of small files efficiently, enabling 3X-4X faster analysis of ML data sets by delivering tens of thousands of small files per second across the network. Given the read-only nature of training data, it’s also possible to avoid a parallel file system altogether by making the data volumes directly available to the GPU servers and sharing them in a coordinated way through a framework like Kubernetes.

      Data Point No. 7: Choose the right networking backbone. 

      AI/ML/DL is usually a new workload, and existing network infrastructure often cannot support the low latency, high bandwidth, high message rates and smart offloads required for complex computations and fast, efficient data delivery. RDMA-based network transports, such as RoCE (RDMA over Converged Ethernet) and InfiniBand, have become the standards to meet these new demands. 

      Data Point No. 8: Consider four storage system levers for price/performance. 

      1) High read throughput combined with low latency that doesn’t constrain hybrid deployments and can run on either cloud or on-premises resources.

      2) Data protection. The AI/ML/DL storage system is typically significantly faster than others in the data center, so recovering it from backup after a complete failure might take a very long time and disrupt ongoing operations. The read-mostly nature of DL training makes it a good fit for distributed erasure coding where the highest level of fault tolerance is already built into the primary storage system, with a very small difference between raw and usable capacity.
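      The capacity argument for erasure coding can be made concrete with a back-of-envelope comparison; the 8+2 layout below is an illustrative assumption, not a figure from the article.

```python
# Illustrative comparison of usable capacity under distributed erasure
# coding vs. 3-way replication. The k+m layout is an assumed example.

def usable_fraction_ec(k, m):
    """k data chunks + m parity chunks: usable / raw capacity."""
    return k / (k + m)

def usable_fraction_replication(copies):
    """Full replication: usable / raw capacity."""
    return 1 / copies

print(f"8+2 erasure coding: {usable_fraction_ec(8, 2):.0%} usable")       # 80% usable
print(f"3-way replication:  {usable_fraction_replication(3):.0%} usable") # 33% usable
```

      An 8+2 layout tolerates two simultaneous failures while keeping 80% of raw capacity usable, which is the “very small difference between raw and usable capacity” the authors describe.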

      3) Capacity elasticity to accommodate any size or type of drive, so that as flash media evolve and flash drive characteristics expand, data centers can maximize price/performance at scale, when it matters the most. 

      4) Performance elasticity. Since AI data sets need to grow over time to further improve model accuracy, storage infrastructure should achieve a close-to-linear scaling factor, where each incremental storage addition brings the equivalent incremental performance. This allows organizations to start small and grow non-disruptively as business dictates.

      Data Point No. 9: Set benchmarks and performance metrics to aid in scalability. 

      For example, for deep learning storage, one metric might be that each GPU processes X number of files (typically thousands or tens of thousands) per second, where each file has an average size of Y (from only a few dozen to thousands) KB. Establishing appropriate metrics upfront helps to qualify architectural approaches and solutions from the outset, and guide follow-on expansions.
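      The per-GPU metric described above can be rolled up into cluster-level storage targets with simple arithmetic; the GPU count, file rate and file size below are illustrative assumptions.

```python
# Sketch of aggregating the per-GPU benchmark metric (X files/s of
# average size Y KB) into cluster-level storage targets.

def cluster_targets(num_gpus, files_per_sec_per_gpu, avg_file_kb):
    """Return (total file reads/s, total sustained GB/s) for the cluster."""
    total_iops = num_gpus * files_per_sec_per_gpu
    total_gbps = total_iops * avg_file_kb * 1024 / 1e9
    return total_iops, total_gbps

# Example: 32 GPUs, each reading 10,000 files/s of ~64 KB each.
iops, gbps = cluster_targets(num_gpus=32, files_per_sec_per_gpu=10_000, avg_file_kb=64)
print(f"Target: {iops:,} file reads/s, {gbps:.1f} GB/s sustained")
```

      Targets expressed this way translate directly into storage and network sizing, and make it easy to re-check the architecture as GPU counts grow.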

      If you have a suggestion for an eWEEK Data Points article, email cpreimesberger@eweek.com.

      eWEEK EDITORS
      eWeek editors publish top thought leaders and leading experts in emerging technology across a wide variety of Enterprise B2B sectors. Our focus is providing actionable information for today’s technology decision makers.