
Optimized Infrastructure for Big Data Analytics on MapR from Hitachi Vantara

Reference Architecture Guide

By David Pascuzzi

August 2018

Feedback

Hitachi Vantara welcomes your feedback. Please share your thoughts by sending an email message to [email protected]. To assist in routing this message, use the paper number in the subject and the title of this white paper in the text.

Revision History

Revision — Changes — Date

MK-SL-082-00 — Initial release — June 29, 2018

MK-SL-082-01 — Added support for Hitachi Advanced Server DS220 — August 23, 2018

Table of Contents

Key Solution Elements
    Hitachi Advanced Server DS120
    Hitachi Advanced Server DS220
    Cisco Switches
    MapR Converged Data Platform
Solution Design
    Server Configuration
    Network Architecture
    Rack Deployment
Validation


Optimized Infrastructure for Big Data Analytics on MapR from Hitachi Vantara
Reference Architecture Guide

Accelerate your deployment by leveraging this reference architecture to help guide you in implementing Hitachi Vantara’s optimized infrastructure for big data analytics on MapR. Reduce the risk of implementing an improper architecture with this guide.

This reference architecture guide shows how to configure our optimized infrastructure for big data analytics on MapR. It describes an example environment for deploying big data infrastructure for advanced analytics. This integrated big data infrastructure uses the following:

Hitachi Advanced Server DS120 — This is a flexible 1U server designed for optimal performance across multiple applications.

Hitachi Advanced Server DS220 — This is a flexible 2U server designed for optimal performance across multiple applications.

MapR Converged Data Platform — The MapR Converged Data Platform enables direct processing of files, tables, and event streams. Unlike “connected” environments that require complex integrations and orchestration, convergence offers a streamlined architecture that enables real-time insights, a consistent security framework across compute engines, higher resource utilization, and reduced administrative overhead. This solution is certified with MapR 6.

Cisco Nexus 3048 — This 48-port 1 GbE switch provides a management network. It is used both as a leaf switch and a spine switch.

Cisco Nexus 93180YC-EX/FX — This 48-port switch provides 10 GbE connectivity for intra-rack networks. It is used as the leaf switch for the data network. Designed with Cisco Cloud Scale technology, it supports highly scalable cloud architectures.

Cisco Nexus 93180LC-EX — This 24-port switch provides 40 GbE connectivity for inter-rack networks. It is used as the spine switch for the data network, and it supports flexible migration options. It is ideal for highly scalable cloud architectures and enterprise data centers.

Note — Testing of this configuration was in a lab environment. Many things affect production environments beyond prediction or duplication in a lab environment. Follow the recommended practice of conducting proof-of-concept testing for acceptable results in a non-production, isolated test environment that otherwise matches your production environment before your production implementation of this solution.

Key Solution Elements

These key solution elements power this big data solution. Use them to create a scale-out configuration that powers MapR Converged Data Platform in an optimized infrastructure for big data analytics.

This solution supports using either the 1U Hitachi Advanced Server DS120 or the 2U Advanced Server DS220.

Hitachi Advanced Server DS120

Optimized for performance, high density, and power efficiency in a dual-processor server, Hitachi Advanced Server DS120 delivers a balance of compute and storage capacity. This rack-mounted server has the flexibility to power a wide range of solutions and applications.


The highly scalable memory supports up to 3 TB of RAM using 24 slots of 2666 MHz DDR4 RDIMM. The DS120 is powered by the Intel Xeon Scalable processor family for complex and demanding workloads. Flexible OCP and PCIe I/O expansion card options are available. This server supports up to 12 small form factor storage devices, up to 4 of which can be NVMe drives.

This 1U server allows you to have a high CPU to storage ratio. This is ideal for balanced and compute-heavy workloads.

Figure 1 shows the front and back of this server.

Figure 1

Hitachi Advanced Server DS220

With a combination of two Intel Xeon Scalable processors and high storage capacity in a 2U rack-space package, Hitachi Advanced Server DS220 delivers the storage and I/O to meet the needs of converged solutions and high-performance applications in the data center.

The Intel Xeon Scalable processor family is optimized to address the growing demands on today’s IT infrastructure. The server provides 24 slots for high-speed DDR4 memory, allowing up to 3 TB of memory per node when 128 GB DIMMs are used. This server supports up to 12 large form factor storage devices and an additional 2 small form factor storage devices.

This 2U server uses large form factor storage to provide dense storage with lower power consumption. This is ideal when the maximum storage per rack is your primary concern. The larger form factor also provides more expansion options.

Figure 2 shows front and back views of the DS220.

Figure 2

Cisco Switches

These switches reduce complexity and cost, as well as enable virtualization and cloud computing to increase business agility.

This solution includes the following Cisco switches to provide Ethernet connectivity:

Cisco Nexus 3048

Cisco Nexus 93180YC-EX/FX

Cisco Nexus 93180LC-EX


This solution uses a leaf-spine network architecture. This network architecture can be replaced with one that matches the rest of your network configuration.

This reference architecture uses Hitachi Advanced Server DS120 for compute-intensive solutions and for very I/O-intensive solutions. Designs that support a storage-intensive solution are available.

MapR Converged Data Platform

The industry’s leading unified data platform, MapR Converged Data Platform, runs analytics and applications simultaneously with speed, scale, and reliability. It converges all data into a data fabric that can store, manage, process, apply, and analyze data as it happens.

Integrating Hadoop, Spark, and Apache Drill with real-time database capabilities, global event streaming, and scalable enterprise storage, MapR Converged Data Platform powers a new generation of big data applications.

MapR delivers enterprise grade security, reliability, and real-time performance while dramatically lowering hardware and operational costs of your most important applications and data. Supporting a variety of open source projects, MapR is committed to using industry-standard APIs.

Figure 3 shows MapR Converged Data Platform integrating with workloads and deployment models.

Figure 3

MapR supports dozens of open source projects. It extends Hadoop with its own custom features. Figure 4 shows the components of the MapR Data Platform and other systems that it integrates with.

MapR-XD — MapR-XD Cloud-Scale Data Store is the industry’s only exabyte scale data store for building intelligent applications with the MapR Converged Data Platform.

MapR-ES — MapR-ES is a publish-and-subscribe messaging system built into the MapR platform. It is compatible with Kafka APIs.

MapR-DB — MapR-DB is a high-performance NoSQL database that lets you run analytics on live data. With it, you can handle multiple use cases and workloads on a single cluster. To provide a seamless interface with your current solutions, it can be accessed with HBase APIs and JSON APIs (see the sketch following this list).


MapR Data Fabric for Kubernetes — The MapR Data Fabric includes a natively integrated Kubernetes volume driver to provide persistent storage volumes for access to any data located on-premises, across clouds, and to the edge. Stateful applications can now be easily deployed in containers for production use cases, machine learning pipelines, and multi-tenant use cases.

MapR Control System — MapR Control System is used to manage, administer and monitor the MapR Hadoop cluster.

Multi-Tenancy — MapR supports multi-tenancy, surpassing YARN’s capabilities.

High Performance — With faster file access and an optimized MapReduce, MapR customers can deploy one-third fewer nodes than other Hadoop distributions require.

High Availability — MapR automatically eliminates single points of failure.

Disaster Recovery — MapR provides disaster recovery with mirror copies of data replicated to other sites and consistent point-in-time snapshots.
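To illustrate the HBase-compatible access path mentioned in the MapR-DB item above, the following is a minimal sketch. It assumes an HBase Thrift gateway is reachable in front of the cluster and uses the happybase Python client; the host name, table path, and column family are hypothetical examples, not part of this reference architecture.

```python
# Minimal sketch: reading and writing a MapR-DB binary table through the
# HBase-compatible API. An HBase Thrift gateway at a hypothetical host
# "edge-node-1" is assumed; the table path "/tables/sensor_readings" and
# column family "metrics" are examples only.
import happybase

connection = happybase.Connection(host="edge-node-1", port=9090)

# MapR-DB binary tables are addressed by their path in the MapR-FS namespace.
table = connection.table("/tables/sensor_readings")

# Write one row, then read it back.
table.put(b"row-0001", {b"metrics:temperature": b"21.4", b"metrics:humidity": b"48"})
print(table.row(b"row-0001"))

connection.close()
```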

Figure 4

MapR provides open source ecosystem projects to handle a variety of big data management tasks. Projects include Apache Storm, Apache Pig, Apache Hive, Apache Mahout, YARN, Apache Sqoop, Apache Flume, and more.

MapR’s open interface allows it to inter-operate with many commercial tools, such as the following:

SAP HANA

Pentaho

Oracle


Solution Design

Use this detailed design of an integrated infrastructure from Hitachi Vantara to implement an optimized infrastructure for big data analytics. It has options for Hitachi Advanced Server DS120 and Advanced Server DS220. However, this reference architecture focuses on a design using Advanced Server DS120. Contact Hitachi Vantara Global Sales for your complete options.

“Server Configuration”

“Network Architecture”

“Rack Deployment”

This design does not limit the maximum number of nodes; the size of each solution depends on your specific deployment. For a large deployment, validate that the network meets your individual requirements.

Server Configuration

This solution uses multiple server nodes, either Hitachi Advanced Server DS120 (1U) or Advanced Server DS220 (2U). The architecture supports using these servers in multiple configurations.

Unlike other Hadoop distributions, MapR does not require special configuration for master nodes. This reference architecture guide describes these nodes:

“MapR Nodes”

“Edge Node”

“Hardware Management Server”

MapR Nodes

These MapR nodes are designed to meet MapR best practices and to provide flexibility in your deployment.

The Intel CPUs used in this solution have six memory channels each; populate all channels to achieve optimal performance.

The recommended options are to use 12 or 24 DIMMs. For optimal performance, use 12 SAS drives, which can be replaced with SSDs.
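As a quick illustration of the arithmetic behind these recommendations, the following sketch works out the DIMMs per memory channel and the resulting node capacity, assuming the 32 GB RDIMMs listed in Table 1.

```python
# Sketch of the DIMM-count arithmetic behind the 12- and 24-DIMM recommendations.
# Values (2 CPUs, 6 channels per CPU, 32 GB RDIMMs) follow Table 1; this is
# illustrative arithmetic only, not a configuration tool.
CPUS_PER_NODE = 2
CHANNELS_PER_CPU = 6
DIMM_SIZE_GB = 32

for dimms in (12, 24):
    per_channel = dimms / (CPUS_PER_NODE * CHANNELS_PER_CPU)
    total_gb = dimms * DIMM_SIZE_GB
    print(f"{dimms} DIMMs -> {per_channel:.0f} per channel, {total_gb} GB per node")

# Output:
#   12 DIMMs -> 1 per channel, 384 GB per node
#   24 DIMMs -> 2 per channel, 768 GB per node
```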

Table 1 shows the standard MapR configuration options.

TABLE 1. MAPR HARDWARE OPTIONS

Model — Hitachi Advanced Server DS120 or Hitachi Advanced Server DS220

CPU — 2 × Intel 4110 processors (8-core, 2.1 GHz), 2 × Intel 6128 processors (6-core, 3.4 GHz), or 2 × Intel 6140 processors (18-core, 2.3 GHz)

Memory Options — 384 GB (12 × 32 GB DDR4 R-DIMM, 2666 MHz) or 768 GB (24 × 32 GB DDR4 R-DIMM, 2666 MHz)

Network Connections — Intel XXV710 10 GbE dual-port SFP28 (LP-MD2); 1 GbE LOM management port

Disk Controllers — LSI 3516 RAID controller

Operating System Disks — Advanced Server DS120: 2 × 128 GB MLC SATADOM; Advanced Server DS220: 2 small form factor storage devices

Storage Disks — 12 storage devices from the following list: 1.0 TB SAS, 1.8 TB SAS, 960 GB SSD, or 1.96 TB SSD

Racks — 42U rack

Number of Servers — Advanced Server DS120: up to 36 servers per rack; Advanced Server DS220: up to 18 servers per rack

For higher system availability, configure the operating system storage as RAID-1. If more storage is needed for the operating system, use the first storage disk.

MapR recommends using 2 TB SAS storage devices or SSD devices. Whether to use SAS drives, SSDs, or a mixture of both depends on your specific use case. Unlike other Hadoop distributions, MapR has its own file system, MapR-FS, which is formatted by the MapR software.

For current system configuration recommendations, see the MapR documentation.

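The following sketch works out the raw capacity per node and per rack implied by Table 1 (12 data drives per node, 36 DS120 or 18 DS220 nodes per rack). The figures are raw capacity only, before MapR-FS formatting and replication.

```python
# Raw capacity arithmetic for the Table 1 drive options: 12 data drives per
# node, 36 DS120 or 18 DS220 nodes per rack. Drive sizes are as listed in
# Table 1; results are raw capacity before formatting and replication.
DRIVES_PER_NODE = 12
NODES_PER_RACK = {"DS120": 36, "DS220": 18}
DRIVE_SIZES_TB = {"1.0 TB SAS": 1.0, "1.8 TB SAS": 1.8, "960 GB SSD": 0.96, "1.96 TB SSD": 1.96}

for drive, size_tb in DRIVE_SIZES_TB.items():
    per_node = DRIVES_PER_NODE * size_tb
    for model, nodes in NODES_PER_RACK.items():
        print(f"{drive}: {per_node:.1f} TB per node, {per_node * nodes:.0f} TB raw per rack of {model}")
```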

The following is the RAID controller configuration for the storage devices:

Use RAID-1 in single-disk RAID sets. This is like JBOD, except that it makes use of the RAID controller features.

The Stripe Size is 1024 KB.

The I/O Policy is cached.

The Read Policy is no read ahead.

The Write Policy is write-through.

The Drive Cache is disabled.

Edge Node

An edge node provides access to the MapR system, typically through the MapR webserver or client tools. These nodes can run any software and come in any configuration.

Hardware Management Server

You can include an optional hardware management server in this architecture. This server allows access to the out-of-band management network.

Table 2 lists the hardware used for this server.

Network Architecture

This architecture can use either two or three logical networks.

For redundancy and performance, MapR recommends having two networks between the nodes rather than one network, with one NIC port on each network. This configuration allows MapR to optimize communication between the systems, using MAPR_SUBNETS to perform traffic management.
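MAPR_SUBNETS is a comma-separated list of CIDR subnets set in the MapR environment. As a minimal illustration of how addresses are matched against it, the following sketch checks a few hypothetical node addresses against an example subnet list using only the Python standard library.

```python
# Minimal sketch: check which node addresses fall inside the subnets listed in
# MAPR_SUBNETS (a comma-separated list of CIDR blocks in the MapR environment).
# The subnet values and node addresses below are hypothetical examples.
import ipaddress

mapr_subnets = "10.10.70.0/24,10.10.80.0/24"           # e.g. the two data networks
networks = [ipaddress.ip_network(s) for s in mapr_subnets.split(",")]

node_addresses = ["10.10.70.11", "10.10.80.11", "192.168.1.11"]  # example NICs on one node

for addr in node_addresses:
    ip = ipaddress.ip_address(addr)
    used = any(ip in net for net in networks)
    print(f"{addr}: {'used by MapR' if used else 'ignored (not in MAPR_SUBNETS)'}")
```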

This document focuses on using two networks.

TABLE 2. HARDWARE MANAGEMENT SERVER

Chassis — Hitachi Advanced Server DS120

CPU — 1 × Intel 4110 processor, 8 cores at 2.1 GHz

Memory — 64 GB using 2 × 32 GB DIMMs

Network Connections — 2 × Intel XXV710 10 GbE, dual port; 1 GbE LOM management port

Disk Controllers — Intel RSTe on the motherboard

Operating System Disks — 2 × 128 GB SATADOM configured as RAID-1

Storage — No other storage is needed


When using two networks, use redundant ports and switches.

Figure 5 shows the network configuration. The management connections between switches are not shown.

Network 1 — Connectivity between nodes

Network 2 — Connectivity between nodes (optional)

Management Network — Out-of-band hardware management

Figure 5

Switches

This architecture requires using the following three types of switches:

Leaf Data Switches — Cisco Nexus 93180YC-EX/FX

Leaf data switches connect all nodes in a rack together. Uplink the leaf switches to the spine data switches.

Each leaf switch serves a physically separate network.

Spine Data Switches — Cisco Nexus 93180LC-EX

These spine data switches interconnect leaf switches from different racks. There is a pair of spine switches for each of the two data networks.

Connect two switches together using an inter-switch link (ISL). This lets both switches act together as a single logical switch. If one switch fails, there still is a path to the hosts.

In multi-rack configurations, connect the leaf data switches to the spine using redundant 100 GbE links.

One set of spine switches supports 24 racks of 36 nodes.


Leaf and Spine Management Switches — Cisco Nexus 3048

These leaf and spine switches connect the management ports of the hardware to the management server. When there is more than one rack, use a spine switch to connect all the management leaf switches together.

Uplink the management network to the in-house management network.

Data Networks

Use the data network for communications between the nodes.

The standard configuration has up to 36 nodes on the data network per rack. This provides an oversubscription ratio of 1.8:1.

Depending on network requirements, the Intel NIC’s speed can be increased to 25 GbE. This increases the oversubscription ratio to 4.5:1.
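The oversubscription figures above can be reproduced with simple arithmetic. The sketch below assumes 2 × 100 GbE of uplink bandwidth from each leaf switch to the spine, which is consistent with the redundant 100 GbE uplinks described earlier and yields the quoted ratios.

```python
# Sketch of the oversubscription arithmetic quoted above. It assumes each leaf
# switch has 2 x 100 GbE of uplink bandwidth to the spine (an assumption that
# reproduces the 1.8:1 and 4.5:1 ratios in the text).
NODES_PER_RACK = 36
UPLINK_GBPS = 2 * 100

for nic_gbps in (10, 25):
    downlink = NODES_PER_RACK * nic_gbps
    ratio = downlink / UPLINK_GBPS
    print(f"{nic_gbps} GbE NICs: {downlink} Gb/s of host bandwidth -> {ratio:.1f}:1 oversubscription")

# One set of spine switches supports 24 racks of 36 nodes.
print(f"maximum nodes behind one spine set: {24 * NODES_PER_RACK}")
```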

Management Network

The management network allows for access to the nodes using the 1 GbE LAN on motherboard (LOM) interface. This network provides out-of-band monitoring and management of the servers. You can uplink this network to the client management network.
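As an example of what out-of-band access over this network can look like, the following sketch queries server power state through the DMTF Redfish API. The BMC address and credentials are hypothetical, and Redfish support on the management controller is assumed for illustration.

```python
# Sketch: query server power state over the out-of-band management network
# using the DMTF Redfish API. The BMC address and credentials are hypothetical,
# and Redfish support on the server's management controller is assumed.
import requests

BMC = "https://10.0.0.101"            # hypothetical BMC address on the management network
AUTH = ("admin", "password")          # hypothetical credentials

resp = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False, timeout=10)
resp.raise_for_status()
for member in resp.json()["Members"]:
    system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False, timeout=10).json()
    print(system.get("Name"), system.get("PowerState"))
```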

Rack Deployment

Every deployment has its own requirements and uses different software combinations. These rack deployment examples show some basic configurations.

There are many services that can be part of a deployment that are not listed in these examples.

Use the MapR Installer to deploy MapR Converged Data Platform. Refer to the MapR Installer documentation for more details.

To design a configuration to meet your needs, contact MapR Sales or Hitachi Vantara Global Sales.

Small Deployment

Figure 6 shows a small single-rack deployment of 10 nodes.


Figure 6

There is a total of 10 nodes:

The 2 nodes labeled Service Node run the following:

Container location databases (CLDB)

Webserver

Resource Manager

The 3 nodes labeled ZooKeeper run Apache ZooKeeper.


All nodes run the following:

Node Manager

NFS Gateway

File server

Drill

It uses top-of-rack switches.
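The service layout above can be captured as a simple planning sketch. The host names below are hypothetical; the role assignments follow the lists above.

```python
# Role layout for the 10-node example above, expressed as a planning sketch.
# Node host names are hypothetical; services follow the lists in the text.
SERVICE_NODE = ["cldb", "webserver", "resourcemanager"]
ZOOKEEPER_NODE = ["zookeeper"]
EVERY_NODE = ["nodemanager", "nfs-gateway", "fileserver", "drill"]

layout = {}
for i in range(1, 11):                      # node01 .. node10
    name = f"node{i:02d}"
    roles = list(EVERY_NODE)
    if i <= 2:
        roles += SERVICE_NODE               # 2 service nodes
    elif i <= 5:
        roles += ZOOKEEPER_NODE             # 3 ZooKeeper nodes
    layout[name] = roles

assert len(layout) == 10
assert sum("zookeeper" in r for r in layout.values()) == 3
for node, roles in layout.items():
    print(node, ", ".join(roles))
```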

Multi Rack

This example design has three racks for 108 nodes and one management server. This is one example; there are many ways to deploy a MapR solution and many different components that can interface with it.

Figure 7 shows this sample system.

The following run on the three nodes labeled Web Server:

Container location databases (CLDB)

Web Server

Resource Manager runs on the three nodes labeled Resource Mgr.

Elasticsearch runs on the node labeled Elasticsearch.

OpenTSDB runs on the five nodes labeled OpenTSDB.

Apache Zookeeper runs on the five nodes labeled Zookeeper.

All nodes run Warden.

The following run on the rest of the nodes labeled General Purpose:

Node Manager

MapR NFS Gateway

File Server

Drill

ToR Switches in each rack

Spine switches in a separate network rack


Figure 7

Validation

There are many different configurations of hardware and software that can be used in this solution. A basic validation of this solution was done in Hitachi Vantara's lab environment.

Testing was performed on a small four-node cluster with six SAS drives per node.

MapR Professional Services provides cluster validation scripts. These are available in a GitHub repository. Find all scripts and instructions in MapRPS/cluster-validation.

The pre-installation scripts do not report a simple pass or fail. Instead, the output is reviewed for anything unexpected, which may or may not indicate a problem.

The test environment used Red Hat Enterprise Linux 7.4, following Red Hat’s recommendation to use chrony instead of ntpd. The scripts report using chrony as an area to investigate.


The following scripts were run:

cluster-audit.sh

disk-test.sh

memory-test.sh

network-test.sh
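As an illustration of how these pre-installation scripts might be run across a small cluster, the following sketch invokes them on each node over SSH and saves the output for review. The host names, script location on the nodes, and passwordless SSH access are assumptions; the actual cluster-validation instructions may differ.

```python
# Sketch: run the pre-installation validation scripts on each node over SSH and
# save the output for review. Host names, the script directory on the nodes,
# and passwordless SSH access are assumptions for illustration only.
import subprocess

NODES = ["node01", "node02", "node03", "node04"]           # hypothetical hosts
SCRIPT_DIR = "/root/cluster-validation/pre-install"        # hypothetical path
SCRIPTS = ["cluster-audit.sh", "disk-test.sh", "memory-test.sh", "network-test.sh"]

for node in NODES:
    for script in SCRIPTS:
        result = subprocess.run(["ssh", node, f"bash {SCRIPT_DIR}/{script}"],
                                capture_output=True, text=True)
        with open(f"{node}-{script}.log", "w") as log:
            log.write(result.stdout)
            log.write(result.stderr)
        print(f"{node}/{script}: exit {result.returncode} (review the log; no strict pass/fail)")
```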

The post-installation scripts verify the behavior of the system. While they are based on benchmark scripts, the system is not set up to measure performance.

The following post-installation scripts were run, and the results were reviewed by MapR:

runRWspeedTest.sh — This test script shows basic I/O performance of MapR storage.

runTeraGenSort.sh — This test script runs the TeraGen test to generate data and the TeraSort test to sort it.
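As a rough illustration of the sequence that runTeraGenSort.sh automates, the following sketch runs TeraGen and then TeraSort through the standard Hadoop MapReduce examples jar. The jar path, row count, and MapR-FS directories are hypothetical; adjust them for your cluster.

```python
# Sketch of the TeraGen/TeraSort sequence that runTeraGenSort.sh automates,
# using the standard Hadoop MapReduce examples jar. The jar path, row count,
# and output directories are hypothetical examples.
import subprocess
import time

EXAMPLES_JAR = "/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"
ROWS = 10_000_000_000                      # 100-byte rows -> roughly 1 TB of data
GEN_DIR = "/benchmarks/teragen"
SORT_DIR = "/benchmarks/terasort"

def run(step, args):
    start = time.time()
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, step] + args, check=True)
    print(f"{step} finished in {time.time() - start:.0f} seconds")

run("teragen", [str(ROWS), GEN_DIR])
run("terasort", [GEN_DIR, SORT_DIR])
```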

The environment was tested with a three-node and a four-node configuration.

The three-node run took 2 hours, 1 minute, and 15 seconds.

The four-node run took 1 hour, 25 minutes, and 48 seconds.

No tuning was done. This test validated that the system works correctly.
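The relative speed-up between the two runs follows directly from the times above:

```python
# Arithmetic for the TeraGenSort run times quoted above: the four-node run is
# roughly 1.4x faster than the three-node run (no tuning applied in either case).
def to_seconds(h, m, s):
    return h * 3600 + m * 60 + s

three_node = to_seconds(2, 1, 15)
four_node = to_seconds(1, 25, 48)
print(f"three nodes: {three_node} s, four nodes: {four_node} s")
print(f"speed-up: {three_node / four_node:.2f}x")
```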

Figure 8 shows the utilization trends for the four-node run. The start and stop times of TeraGenSort are identified by the arrows.


Figure 8


Figure 9 shows a snapshot of the system utilization during the four-node run. MapR compression provides a 37% space savings.

Figure 9


For More Information

Hitachi Vantara Global Services offers experienced storage consultants, proven methodologies and a comprehensive services portfolio to assist you in implementing Hitachi products and solutions in your environment. For more information, see the Services website.

Demonstrations and other resources are available for many Hitachi products. To schedule a live demonstration, contact a sales representative or partner. To view online informational resources, see the Resources website.

Hitachi Academy is your education destination to acquire valuable knowledge and skills on Hitachi products and solutions. Our Hitachi Certified Professional program establishes your credibility and increases your value in the IT marketplace. For more information, see the Hitachi Vantara Training and Certification website.

For more information about Hitachi products and services, contact your sales representative, partner, or visit the Hitachi Vantara website.


Corporate Headquarters
2845 Lafayette Street
Santa Clara, CA 95050-2639 USA
HitachiVantara.com | community.HitachiVantara.com

Contact Information
USA: 1-800-446-0744
Global: 1-858-547-4526
HitachiVantara.com/contact

Hitachi Vantara

© Hitachi Vantara Corporation, 2018. All rights reserved. HITACHI is a trademark or registered trademark of Hitachi, Ltd. All other trademarks, service marks, and company names are properties of their respective owners.

Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by Hitachi Vantara Corporation.

MK-SL-082-01, August 2018