big data benchmarking with rdma solutions

Big Data Benchmarking with RDMA solutions

Oracle Open World 2013

Leading Supplier of End-to-End Interconnect Solutions

Host/Fabric SoftwareICs Switches/GatewaysAdapter Cards Cables

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

Virtual Protocol Interconnect

StorageFront / Back-EndServer / Compute Switch / Gateway

56G IB & FCoIB 56G InfiniBand

10/40/56GbE & FCoE 10/40/56GbE

Fibre Channel

Virtual Protocol Interconnect

A scalable fault-tolerant distributed system for data storage and processing

Hadoop has two main systems• Hadoop Distributed File System: self-healing high-bandwidth clustered storage. • MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data

programming abstraction.

Key values • Flexibility – Store any data, Run any analysis. • Scalability – Start at 1TB/3-nodes grow to petabytes/1000s of nodes. • Economics – Cost per TB at a fraction of traditional options.

Hadoop Framework

HDFS™(Hadoop Distributed File System)

Map Reduce HBase

DISK DISK DISK DISK DISK DISK

Hive Pig

Map Reduce

Three Areas for Accelerations

Data Analytics• Explore inefficiencies in existing analytics frameworks and systems• Accelerate data processing to deliver faster results

Storage• Explore ways to refine dominant file system• Take advantage for direct attached disk to accelerate data access

Distributed Storage• Leverage popular distributed storage systems with Big Data applications• Use existing systems for usage with Big Data frameworks

~88% CPU Utilization

I/O Offload Frees Up CPU for Application Processing U

~53% CPU Utilization

~47% CPU Overhead/Idle

~12% CPU Overhead/Idle

Without RDMA With RDMA and Offload

Plug-in architecture • Open-source, latest GA version 3.1 (6/10/2013)• Google code repository at: https://code.google.com/p/uda-plugin/

Accelerates Map Reduce Jobs• Accelerated merge sort

Efficient Shuffle Provider • Data transfer over RDMA • Supports InfiniBand and Ethernet

Supported Hadoop Distributions• Apache 3.0 – In the main trunk!• Apache 2.0.3 – In the main trunk!• Apache Hadoop 1.0.x ; 1.1.x• Cloudera Distribution Hadoop 3 &4• Hortonworks HDP 1.x• GPHD 1.2

Supported Hardware• ConnectX®-3 VPI• SwitchX-2 based systems

Unstructured Data Accelerator - UDA

Map Reduce HBase

Hive Pig

Map Reduce

Double Map Reduce Performance with UDA

*TeraSort is a popular benchmark used to measure the performance of Hadoop cluster

~50% Disk Access CPU Efficiency 2.5X

**1TB Data Set, 20x dual X5670 (Westmere) Machines, 10x HDD Base; Vanilla GPHD1.2; UDA GPHD1.2+UDA

2X Faster Job Completion! Increase the Value of Data!

HDFS is the Hadoop File System• The underlying File system for HBase and other NoSQL Data Bases

More Drives, Higher Throughput is Needed

SSDs Solutions Must use Higher Throughput• Bounded by 1GbE and 10GbE

HDFS Acceleration; Joint Project With Ohio State University

Map Reduce HBase

Hive Pig

SSDs Become De-Facto standard in HDFS deployment• Read capability is a critical factor for application performance

E-DFSIO, Part of Intel’s HiBench test suite, profiles aggregated throughput on the cluster• 1GbE network impede any performance benefit from SSD deployment

Unlocking the Power SSDs In Hadoop Environment

E-DFSIO, Showing the Power of SSD @ HDFS

OrangeFS as Hadoop Storage Solution

Mellanox VPI Card• MCX354A-FCBT

Mellanox Edge Switches• MSX10xx; MSX60xx

Cloudera Certified – CDH3 and CDH4

E5-26x0 (Sandy Bridge) Machines• Dual Socket• 4+ cores each socket• 32GB+ of DRAM

Disk Drives• At least 5 x 1TB, SAS, 10K RPM

Hadoop Configuration• At least one Name Node + Job Tracker• At least 4 Data Nodes

Installation:• Your selection of Hadoop Distribution or other Big Data solution (Such as Cassandra)

Networking• ConnectX-3 VPI card, FDR, 40GbE and 10GbE• SwitchX based systems: MSX6036F, MSX1036B and MSX1016 • Mellanox’s FDR, 40GbE and 10GbE Cable Solutions

http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Hadoop.pdf

Simple Building Block for Big Data Solution

EMC 1000-Node Analytic Platform Accelerates Industry's Hadoop Development 24 PetaByte of physical storage

• Half of every written word since inception of mankind Mellanox VPI Solutions

Test Drive Your Big Data

2X Faster Hadoop Job Run-TimeHadoop

Acceleration

High Throughput, Low Latency, RDMA Critical for ROI

Thank You

big data benchmarking with rdma solutions

mellanox technologies

data storage

hadoop file system

processing hadoop

apache hadoop

data processing

hadoop storage solution

nosql data

Technology

d5.4 analytic modelling ... - big data benchmarking

a dwarf-based scalable big data benchmarking methodology ·...

automating big data benchmarking and performance analysis...

high-performance big data analytics with rdma over nvm...

using rdma efﬁciently for key-value services · herd 3 1....

accelerating spark with rdma for big data processing: early

benchmarking big-data workflows across european academic

詹剑锋：big databench—benchmarking big data systems

benchmarking datacenter and big data systems

accelerating and benchmarking big data and deep...

virtual rdma devices - openfabrics alliance€¦ ·...

accelerating big data processing with rdma - enhanced...

import rdma: zero-copy networking with rdma and python

laporan big benchmarking sem 5

benchmarking de tecnologias de big data - universidade do...

benchmarking datacenter and big data...

big neuroscience infrastructure · 2017. 3. 1. · • rdma...

big data benchmarking: applications and...

benchmarking hadoop and big data

rdma verbs - rdma consortium