accelerating hadoop* clusters with ssd technologies · apache* – 100% open source intel ®...

17
Accelerating Hadoop* Clusters with SSD Technologies Christian D. Black Datacenter Solutions Architect NSG, Intel Corporation *Other names and brands may be claimed as the property of others.

Upload: others

Post on 20-May-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Accelerating Hadoop* Clusters with SSD Technologies

Christian D. Black Datacenter Solutions Architect

NSG, Intel Corporation

*Other names and brands may be claimed as the property of others.

Page 2: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

"Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as Terasort , HiBench, and Iometer, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing

Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.

Configurations:

Configuration: Terasort* - Intel® Xeon 5600 & E5, 7200 RPM HDD & Intel ® 520 Series SSD, Intel ® 1GbE and 10Gb Ethernet, and open source Apache Hadoop * & Intel ® Distribution for Apache Hadoop*

Configuration: Iometer* - See http://www.intel.com/go/ssd for detailed products specifications and testing configurations.

Configuration: Intel ® Cache Acceleration Software - Intel ® Xeon E5, 7200 RPM HDD & Intel ® SSD DC S3700 Series, Intel ® 10Gb Ethernet, Intel ® Distribution for Apache Hadoop* & Intel ® Cache Acceleration Software

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2013 Intel Corporation. All rights reserved.

Legal Disclaimer

NSG *Other names and brands may be claimed as the property of others.

2

Page 3: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Whoami? Background - Hadoop* Distributions & Ecosystem Fitting Hadoop* into the Landscape Lots of businesses talking… Intel® SSDs for the datacenter Intel® SSDs and Hadoop* Where Intel® SSDs can best speed Hadoop* Main Clusters & Intel® Cache Acceleration Software Your Workload Really Matters Info, Contacts, and Questions

Agenda

NSG *Other names and brands may be claimed as the property of others. *Other names and brands may be claimed as the property of others. 3

Page 4: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Datacenter Solutions Architect, NSG Moved from Intel IT to NSG in Q1 ‘13 Started on an Atari* 400 in 1980

Anyone recall Pilot* & Logo*? 20 years of Enterprise IT Experience

USAF Migration from Novell* to Windows NT* MCSE*, SAP*, Datacenter Architecture, & IT

Research/Pathfinding Driving SSD into Enterprise IT @ Intel® for last 4 years

Hadoop* experience… 15x cluster builds w/5 different distros 18 months supporting research & development internally

whoami?

NSG *Other names and brands may be claimed as the property of others.

4

Page 5: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Just over the peak of the hype cycle at 8 years old! Intel® reliable & long term commitment to ‘Big Data’

Intel® Distribution for Apache Hadoop* software Q4’12 Active contributions to Intel® Distribution for Apache

Hadoop* software and Apache* open source projects Why Hadoop*? Hadoop* invented to download & index the Internet Application & software framework

HDFS (Hadoop* File System) Map-Reduce (Distributed Compute Framework)

Sift and store mountains of data at ¢/GB

Background - Hadoop*

NSG *Other names and brands may be claimed as the property of others. *Other names and brands may be claimed as the property of others. 5

Page 6: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Infrastructure Vendors Apache* – 100% Open Source Intel® Distribution for Apache Hadoop*, Cloudera* (CHD)

Greenplum* (GDH),MapR* (MRH), Hortonworks* (HDH), + 8-10 or more…

Differentiation % of Open Source in

codebase Software, Services, and/or

Support – Revenue Model In ‘Big Data’ there are 100s

of Companies building on… 10-15 Open Source Projects

Distributions & Ecosystem

NSG

Open Source

Commercial Software

# http://mattturck.com/2012/10/15/a-chart-of-the-big-data-ecosystem-take-2/

*Other names and brands may be claimed as the property of others. 6

Page 7: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Hadoop* augments/expands current data systems…

Fitting Hadoop* into the Landscape

NSG

# H

adoo

p Fr

amew

ork

Data Warehouse

Non-Traditional Database

OLTP Systems Analytics &

Business Intelligence SAS, OLAP, etc…

Traditional Database

Data Load - ETL Data Manipulation

#Hadoop architecture diagram and component inclusion in the Intel® Distribution for Apache Hadoop* Software @ hadoop.intel.com. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

*Other names and brands may be claimed as the property of others. 7

Page 8: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Many are… Sharing how they’re setup… Sharing what they’ve learned…

Nobody’s sharing ‘How’ they’ve got value!

Lots of businesses talking…

NSG

1 Source: Hadoop Wiki - Compilation courtesy of V. Saletore – IAG/DCSG Q4’12 Average Hadoop Cluster Size Industry Wide = 16 Hosts (servers)

1 Yahoo*, Facebook*, EBay*

Cluster Size Range 120 to 4500 Hosts

Big Data

1 LinkedIn*, 40+ other Companies

Cluster Size Range

20 to 120 Hosts

Semi-Big Data

1Adobe*, Hulu*, and 100’s of others…

Cluster Size Range

2 to 20 Hosts

Tiny Big Data

*Other names and brands may be claimed as the property of others. 8

Page 9: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Intel® SSD DC S3700 – High Endurance Product

2Sequential Read @ 500 MB/s & Sequential Write @ 450 MB/s

24k Random IOPS - 75k Read - 36k Write @ 45-65µs

2Up to 14PB Endurance (800GB Drive) Intel® SSD DC S3500 – Standard Endurance Product

2Sequential Read @ 500 MB/s & Sequential Write @ 450 MB/s

24k Random IOPS @ 75k Read & 11.5k Write & 50-65µs

2Up to 450TB Endurance (800GB Drive)

Full data path protection – Parity, CRC, memory ECC, LBA tag validation1, and power loss protection (PLI)

Intel® SSDs For The Datacenter

NSG

0

10000

20000

30000

40000

50000

60000

70000

0 500 1000 1500 2000

Tight Predictable IOPS

1 Abbreviations : CRC – Cyclical Redundancy Check, LBA – Logical block Address, PLI – Power Loss Imminent 2 Data based on Intel® SSD DC S3700 and DC 3500 Series data sheets. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as Iometer*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing Configuration: See http://www.intel.com/go/ssd for detailed products specifications and testing configurations.

*Other names and brands may be claimed as the property of others. 9

Page 10: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Hadoop* from the disk perspective 128MB-256MB sequential IO operations Write once, read many, occasional re-balance Perfect for $.04/GB 7k RMP spinning rust @ 130-150MB/Sec Temp intermediate/spillover data creates disk contention

SSDs provide SSD 450-500MB/Sec Sustained Intel internal testing shows ‘pure SSDs’ provide up to 80%1 performance

increase for 1TB Terasort* in Hadoop* $1-$2.35/GB for SSD…

Due to cost per GB… SSDs are perceived as a tough sell with typical enterprise IT financial constraints

Intel® SSDs and Hadoop*

NSG

1 Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as Terasort or HiBench, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing Configuration: Intel ® Xeon 5600 & E5, 7200 RPM HDD & Intel ® 520 Series SSD, Intel ® 1GbE and 10Gb Ethernet, and open source Apache Hadoop * & Intel ® Distribution for Apache Hadoop*

*Other names and brands may be claimed as the property of others. 10

Page 11: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

So where do SSDs make sense?

NSG *Other names and brands may be claimed as the property of others.

11

Page 12: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Hadoop Development Clusters Subset of a company’s ‘Big Data’, typically <20 servers Java*/MapReduce*/Hadoop* developers time critical

Up to 97%1 accelerated development with a combination of current Intel® products over last generation of Intel® products

Balanced node resources with the right mix of Intel® SSD + Intel® Xeon® E5 processor + Intel® 10Gb Ethernet

Intel® SSD DC S3700 Series Frequent Cluster Refresh

Intel® SSD DC S3500 Series Moderate Cluster Refresh

Where SSDs can best speed Hadoop*

NSG

1 Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as Terasort or HiBench, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing Configuration: Intel ® Xeon 5600 & E5, 7200 RPM HDD & Intel ® 520 Series SSD, Intel ® 1GbE and 10Gb Ethernet, and open source Apache Hadoop * & Intel ® Distribution for Apache Hadoop*

*Other names and brands may be claimed as the property of others. 12

Page 13: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Specialized RTQ Clusters Subset of a company’s ‘Big Data’, typically <20 servers

Impala*, Hbase*, or other Real-Time Query Clusters Fast Query Returns Business Critical

Up to 97%1 accelerated development with a combination of current Intel® products, over last generation of products

Balanced node resources with the right mix of Intel® SSD + Intel® Xeon® E5 processor + Intel® 10Gb Ethernet

Intel® SSD DC S3700 Series Frequent Cluster Refresh

Intel® SSD DC S3500 Series Moderate Cluster Refresh

Where SSDs can best speed Hadoop*

NSG

1 Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as Terasort or HiBench, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing Configuration: Intel ® Xeon 5600 & E5, 7200 RPM HDD & Intel ® 520 Series SSD, Intel ® 1GbE and 10Gb Ethernet, and open source Apache Hadoop * & Intel ® Distribution for Apache Hadoop*

*Other names and brands may be claimed as the property of others. 13

Page 14: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Main Hadoop Clusters >20 Servers MapReduce* Intermediate Data (Temp/Intermediate)

+ 1x Intel® SSD DC S3700 Series per host Intermediate Data written to SSD (Hadoop* .xml Setting) Alleviates contention on HDDs by moving temp/intermediate to SSD Product selection based on estimated max generated spillover

Main Hadoop Clusters > 20 Servers Intel® Cache Acceleration Software

+ 1x Intel® SSD DC S3700 Series/host or 1x Intel® SSD 910 Series PCIe*/host Read/Write Caching for all Data Disks

1Accelerates Terasort* jobs up to 42% Product selection based on write and working dataset size

But what about ‘Main’ clusters?

NSG *Other names and brands may be claimed as the property of others.

14

Cache Acceleration Software

1 Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as Terasort or HiBench, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing Configuration: Intel ® Xeon E5, 7200 RPM HDD & Intel ® SSD DC S3700 Series, Intel® 10Gb Ethernet, Intel ® Distribution for Apache Hadoop* & Intel ® Cache Acceleration Software

Page 15: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Know your workload, is it? IO, CPU, Network, or Memory intensive & when? Generates lots or little intermediate data?

1Intel® SSDs provide 450-500MB/Sec Sustained Bandwidth and handle both Sequential and Random IO gracefully Hadoop* relies on bandwidth/throughput 2x400GB (1GB/Sec )better than 1x800GB SSD (500MB/Sec) for

Hadoop* Increasing the IO capabilities of Hadoop* nodes

Unleashing IO can require a rebalance May require more/faster cores and a faster network

Your workload really matters!

Your workload really matters!

NSG

1 Data based on Intel® SSD DC S3700 and DC 3500 Series data sheets. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as Iometer*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Internal Testing Configuration: See http://www.intel.com/go/ssd for detailed products specifications and testing configurations.

*Other names and brands may be claimed as the property of others. 15

Page 16: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Please work with identified NSG BDMs

for Intel® SSD + Hadoop* design wins!

Business Contacts

NSG

Expertise Contact

Datacenter Solutions Architecture, Big Data & HPC

Christian Black, DSA, NSG

Intel® SSD + Intel® CAS SW Marketing Support

Carolyn Hanley, PME, NSG

Intel® Distribution for Apache Hadoop* software

David Collins, IDH Director of Sales, IAG/DSD

*Other names and brands may be claimed as the property of others. 16

Page 17: Accelerating Hadoop* Clusters with SSD Technologies · Apache* – 100% Open Source Intel ® Distribution for Apache Hadoop*, Cloudera* (CHD) Greenplum* (GDH),MapR* (MRH), Hortonworks*

2013 SNIA Analytics and Big Data Summit. © Intel Corporation. All Rights Reserved.

Thanks for attending!

Open Questions…

NSG *Other names and brands may be claimed as the property of others.

17