
Post on 22-May-2020


Page 1: Solution Brief Test Drive ONTAP AI with Trace3 & Flexential · More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain

Solution Brief

Test Drive ONTAP AI with Trace3 & Flexential®

Start Your AI Journey on a Managed, Scalable and Tested Platform

The Challenge

AI is fueling the next industrial revolution with its ability to quickly transform data into business-critical insights. The algorithms behind this transformation have grown in variety, scale, and complexity. Data science communities already face the daunting task of keeping abreast of, and leveraging, the methods best suited to their tasks and organization. In addition, they are expected to help make infrastructure decisions that will keep their organization relevant.

Machine learning and deep learning are arguably among the most performance-intensive workloads running in modern data centers. Although GPUs bear the brunt of the work in these workloads, they must be combined with high-performance, low-latency storage and an internetworking fabric that ensures accelerated movement of large datasets across the infrastructure. Designing an optimized infrastructure for the unique demands of AI can be a daunting challenge for many enterprises.

Building for predictable performance

Designing an infrastructure that delivers performance at scale requires a deep understanding of a variety of AI algorithms and their dependencies on compute, storage, and interconnect. To get the best return on investment, organizations must balance high-performance compute and storage technologies with a high-bandwidth, low-latency interconnect fabric. IT organizations need to select and deploy infrastructure that is tuned for this workload so that data science teams can focus on algorithms and model prototyping instead of system design, integration, and troubleshooting.

Delayed time to insight

Organizations have a business imperative to quickly turn data into insight. However, setting up and fine-tuning AI infrastructure can be time-consuming. Infrastructure-related issues and the resulting delays can impact business outcomes. Without an optimized solution, businesses will find the ROI of their AI investment elusive.

Data center readiness

Modern AI infrastructures pack immense capacity into a small physical footprint. Many private data centers do not have adequate power and cooling to host these high-performance compute resources.

Escalating costs

Organizations need to weigh their performance and efficiency goals against cost considerations and fixed budgets. Setting up an AI infrastructure can be a challenge for organizations with tight budget constraints.

Security and network concerns

Growing data needs increasingly involve private information and analysis. Security, compliance, and strong low-latency network integration between cloud platforms, data centers, and on-premises environments are critical. Compliance with standards such as HIPAA, PCI, ISO 27001, and SOC 1-3 helps ensure that data is secured and connected. Flexential® Professional Services supports these standards for companies that need additional assistance.

Key Benefits

Ready for use on professionally run infrastructure

The ONTAP AI systems are hosted in a Flexential® data center on the FlexAnywhere™-powered data center network and are fine-tuned for performance by Trace3. The Trace3 team will work with the Test Drive customer to deploy Slurm or Kubernetes, both NVIDIA-supported solutions, so that users can run multiple (and multi-node) jobs on this high-performance cluster.

Best-in-class performance

ONTAP AI infrastructure comprises best-of-breed elements: NVIDIA DGX™ systems, NetApp AFF A800 flash storage, and Mellanox Spectrum Ethernet switches.

• NVIDIA® DGX™ Systems deliver groundbreaking AI performance and faster insights through the integration of the world’s fastest data center accelerators – the NVIDIA V100 Tensor Core GPUs, interconnected with NVIDIA NVLink and NVSwitch, with Mellanox ConnectX interfaces for exceptionally fast communication between nodes.


The Solution

NetApp ONTAP® AI is a proven architecture powered by NVIDIA DGX supercomputers and NetApp cloud-connected, all-flash storage, all connected through Mellanox networking technology. This reference architecture for AI infrastructure eliminates design complexity, delivers predictable performance at scale, simplifies deployment, and gives IT infrastructure managers operational peace of mind.

Flexential®, NVIDIA, NetApp, Mellanox, and Trace3 have teamed up to introduce a “Test Drive” program that makes ONTAP AI readily available to customers. The Test Drive platform comprises NVIDIA DGX-1 and DGX-2 systems and NetApp AFF A800 flash storage arrays interconnected by Mellanox SN2700 100GbE switches. Internet connectivity as well as multi-cloud connectivity (e.g. AWS Direct Connect) up to 100GB is provided by Flexential®’s FlexAnywhere™ platform, with integrated DDoS (Distributed Denial of Service) mitigation for IP services.

As part of the Test Drive program, Flexential® colocation data centers will host the ONTAP AI platform. The infrastructure is fine-tuned for performance and professionally run by Trace3. Trace3 will also manage the entire Test Drive process, from initial project scoping to data migration and ongoing management of the environment through project completion.

Getting Started

Starting your AI-based business initiative has never been easier. The Test Drive program initiated by NVIDIA, Flexential®, Mellanox, NetApp, and Trace3 makes ONTAP AI-based deep learning systems readily available for enterprises to use. For more information on the program, reach out to your Sales Manager or email at [email protected].

About NetApp

NetApp is the data authority for hybrid cloud. We provide a full range of hybrid cloud data services that simplify management of applications and data across cloud and on-premises environments to accelerate digital transformation. Together with our partners, we empower global organizations to unleash the full potential of their data to expand customer touchpoints, foster greater innovation and optimize their operations. For more information, visit www.netapp.com. #DataDriven

About NVIDIA

NVIDIA’s (NASDAQ: NVDA) invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots and self-driving cars that can perceive and understand the world. NVIDIA’s DGX Systems are the world’s first supercomputers purpose-built for the unique demands of AI in the enterprise and powered by the world’s fastest data center accelerators. To learn more, visit www.nvidia.com/dgx.

About Mellanox

Mellanox Technologies (NASDAQ: MLNX) is a leading supplier of end-to-end Ethernet and InfiniBand intelligent interconnect solutions and services for servers, storage, and hyper-converged infrastructure. Mellanox intelligent interconnect solutions increase data center efficiency by providing the highest throughput and lowest latency, delivering data faster to applications and unlocking system performance. Mellanox offers a choice of high-performance solutions (network and multi-core processors, network adapters, switches, cables, software and silicon) that accelerate application runtime and maximize business results for a wide range of markets including high-performance computing, enterprise data centers, Web 2.0, cloud, storage, network security, telecom and financial services. More information is available at www.mellanox.com.

About Flexential®

Flexential® offers flexible and essential services that help organizations optimize their journey of IT transformation while simultaneously balancing cost, scalability, compliance and security. The company is committed to building trusted relationships and delivering tailored solutions that suit the individual needs of its customers. Utilizing its people, values and reliable performance, Flexential® is deeply invested in the success of its customers, who trust it to deliver core data center solutions of colocation and connectivity, as well as cloud, managed solutions and professional services. Flexential®’s robust suite of assets spans 21 markets and comprises 40 highly redundant and connectivity-rich data centers. For more information on Flexential®, please visit www.Flexential.com.

About Trace3

Trace3 is a premier provider of consultation services and advanced technical solutions for information management. Founded in 2002, Trace3 empowers organizations to embrace the ever-changing IT landscape through elite engineering and dynamic innovation. With deep roots in the data center, Trace3 offers a broad mix of end-to-end technology services and solutions. These range from artificial intelligence and data insights to cloud computing and security consulting. Trace3 also maintains a Venture Capital (VC) CIO Briefing program, with a sharp focus on emerging technologies, and provides clients with extensive, primary research focused on the latest IT trends. The company continues to expand its footprint with a base of more than 6,000 customers, 800 employees, and 22 office locations across the United States. Trace3, LLC, is privately held by H.I.G. Capital.

Key Benefits continued...

• The NetApp AFF system is the industry’s first end-to-end NVMe solution. NetApp AFF provides the highest possible throughput at the lowest possible latency. It is a state-of-the-art storage system that enables you to meet enterprise storage requirements with industry-leading performance, superior flexibility, cloud integration, and best-in-class data management.

• With 300ns port-to-port latency, Mellanox Spectrum Ethernet switches are the fastest 100GbE switches in the industry. Spectrum switches provide a robust, high-bandwidth data path for RoCE-based GPU-to-GPU and GPU-to-storage communications. Additionally, Spectrum switches support simplified RoCE configuration and built-in advanced network telemetry to reduce mean time to issue resolution.

Fully leverage the power of GPUs

In AI training and modeling, a lot of time can go to data access and writing metadata, and GPUs sit idle while that data I/O occurs. The idle time is even more significant in a large GPU cluster with several servers, each with multiple GPUs.

ONTAP AI’s high throughput and data I/O speeds can help you cut down on GPU idle time by keeping the GPUs engaged more often. ONTAP AI is designed around the AI processing power of NVIDIA GPUs, giving you much higher GPU utilization for training and inference workloads than competitive solutions provide.

Flexible consumption models

Customers have three flexible ways to consume the infrastructure:

• Run their workloads on the ONTAP AI system and purchase/lease the system if they like it

• Purchase the system outright and let Flexential® host it in its state-of-the-art data center

• Optimize capital expenses by leasing the ONTAP AI system and using it as a service