Why Hadoop is important to Syncsort

Upload: huguk

Post on 11-Nov-2014

DESCRIPTION

An overview of how Syncsort have improved Hadoop in the open source world, and what they provide on top of the standard distributions.

TRANSCRIPT

Page 1: Why Hadoop is important to Syncsort

Why was it so important to us to open the MapReduce framework?

12/11/2013

Syncsort Confidential and Proprietary - do not copy or distribute

Page 2: Why Hadoop is important to Syncsort

Agenda

Who are we?

What did we do?

Why did we do that?

With whom did we do it?

For which results?

Page 4: Why Hadoop is important to Syncsort

Syncsort

• 50% of all mainframes run Syncsort
• 1,500 mainframe customers: most used & trusted 3rd-party mainframe software
• Speed leader for ETL & sort
• A history of innovation
• 25+ issued & pending patents
• Large global customer base
• 15,000+ deployments in 68 countries
• First-to-market, fully integrated approach to Hadoop ETL

For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data!

Our customers are achieving the impossible, every day!

Integrating Big Data… Smarter!

Key Partners

Page 6: Why Hadoop is important to Syncsort

Smart Contributions to Improve Hadoop

Plugin shipping on CDH 4.2 and later

Augmenting critical batch processing capabilities

JIRA   Description
4807   Allow MapOutputBuffer to be pluggable
4808   Allow Reduce-side merge to be pluggable
4809   Make classes required for 2454 public
4812   Create reduce input merger plug-in
4842   Shuffle race can hang reducer
2461   HDFS file name globbing in libhdfs
4482   Backport of 2454 to MapReduce 1 & 1.2

Page 7: Why Hadoop is important to Syncsort

Opening the MapReduce Framework

Mapper -> Output Sorter -> Shuffle -> Input Sorter -> Reducer

• Mapper and Reducer: here to perform functional logic on our engine
• Output Sorter and Input Sorter: here and here to replace MapReduce native sort
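To make the idea of a replaceable sort step concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and not Syncsort's engine: `run_job`, `native_sort`, and `heap_sort` are hypothetical names invented for this illustration. The only point it shows is that when the sorter is a parameter, a different implementation can be dropped in without touching the mapper or the reducer.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def native_sort(pairs):
    """Default sorter: order (key, value) pairs by key."""
    return sorted(pairs, key=itemgetter(0))


def heap_sort(pairs):
    """Stand-in for a replacement sorter plugged in at the same point."""
    # The running index keeps equal keys in arrival order and avoids comparing values.
    heap = [(key, index, value) for index, (key, value) in enumerate(pairs)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        key, _, value = heapq.heappop(heap)
        ordered.append((key, value))
    return ordered


def run_job(records, mapper, reducer, sorter=native_sort):
    """Toy map -> sort -> reduce pipeline with a pluggable sort step."""
    mapped = [pair for record in records for pair in mapper(record)]
    ordered = sorter(mapped)  # the pluggable step
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


lines = ["big data", "big iron"]
default_result = run_job(lines, word_mapper, sum_reducer)
plugged_result = run_job(lines, word_mapper, sum_reducer, sorter=heap_sort)
print(default_result == plugged_result)
```

Both runs produce the same word counts; only the sort implementation differs, which is the property a pluggable sort interface is meant to preserve.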

Page 9: Why Hadoop is important to Syncsort

Syncsort: Just integrating data… faster

Sort, Join, Aggregate, Copy, Merge

+

• A simple DI engine, easy to deploy, operate, and administer
• ETL-like development GUI
• Auto-tuning
• Best patented algorithms
• Fast, fast, faster than any other
• The more data the better

Page 10: Why Hadoop is important to Syncsort

From Data to Big Data

[Figure: timeline from the 60s and 70s through the 80s, 90s, 2000s, and 2010s to "Next?", labeled Mainframe, PC, Internet Revolution, and Mobile & Social Media Revolution. Three rising axes: Volume, Variety, and Velocity. Velocity scale: Quarterly, Monthly, Weekly, Daily, Intra-day, Right / Real-time.]

Page 11: Why Hadoop is important to Syncsort

Smart Architecture

Hadoop Cluster

Hadoop Integration… for Real (No Code Generation. No Compiling. No Bolts. No Nuts!)

• Runs natively within MapReduce
• Small footprint installs on every node
• Open source contributions extend capabilities of MapReduce
  – Pluggable sort
  – Expanded use cases (e.g. “No sort” option)
  – Vertical scalability
  – Design flexibility (MapMapReduceReduce)

Unleash Hadoop’s Potential

No need to worry about this…

Page 13: Why Hadoop is important to Syncsort

Because Mainframe Is Big Data Too!

Cloudera + Syncsort: Smarter Connectivity… Also for Mainframe

Connect
• Read files directly from mainframe
• No software required on mainframe
• Already installed on 50% of mainframes

Translate
• Parse & transform: packed decimal, EBCDIC/ASCII, multi-format
• No coding required

Load & Process
• Load directly to HDFS
• Offload batch data processing
• Find more insights
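For readers unfamiliar with the mainframe formats named on this slide, the sketch below shows by hand what translating them involves. It is an illustration only, not how DMX-h works. The record layout (a 6-byte EBCDIC name followed by a 3-byte packed-decimal amount with two decimal places), the function name `unpack_comp3`, and the choice of code page 037 are all assumptions made for this example.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3) field: two digits per byte, sign in the last nibble."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign_nibble = nibbles.pop()  # 0xD or 0xB means negative; 0xC or 0xF means positive
    if any(digit > 9 for digit in nibbles):
        raise ValueError("invalid packed-decimal digit")
    digits = "".join(str(digit) for digit in nibbles)
    value = Decimal(int(digits)).scaleb(-scale)
    return -value if sign_nibble in (0x0B, 0x0D) else value


# Hypothetical fixed-width record: 6 bytes of EBCDIC text, then 3 bytes of packed decimal.
record = bytes.fromhex("C8C1C4D6D6D7") + bytes.fromhex("12345C")

name = record[:6].decode("cp037")            # EBCDIC (code page 037) to text
amount = unpack_comp3(record[6:], scale=2)   # packed decimal with 2 implied decimals

print(name, amount)
```

The implied decimal point is not stored in the data; on the mainframe it comes from the record definition, which is why `scale` has to be supplied separately here.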

Page 14: Why Hadoop is important to Syncsort

Syncsort DMX-h + Cloudera Manager

[Diagram. Labels: Installation; Management; Monitoring; Support Integration; API; CDH Cluster + ISV software; Cloudera Manager; Syncsort DMX-h; CDH Nodes; DMX-h on every CDH node.]

Page 16: Why Hadoop is important to Syncsort

Test cases

Sort Acceleration
– Terasort
  • Run terasort with DMX-h and without DMX-h in various configurations to compare performance.

ETL
– Use DMX-h to perform several different ETL jobs and compare against equivalent jobs in Pig (Apache Pig version 0.9.2-gphd-1.2.0.0).
  • File Change Data Capture (CDC)
  • Web Log Aggregation

Page 17: Why Hadoop is important to Syncsort

File CDC

[Figure: side-by-side comparison of the File CDC job in Pig, Java, and DMX-h. Labels on the figure: 149 lines of code; 70 lines of code.]
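The slides do not spell out what the File CDC job does, so the following is only a generic sketch of file change data capture: compare an old and a new snapshot of the same keyed file and emit inserts, updates, and deletes. The function name `file_cdc`, the tuple layout, and the sample rows are hypothetical, and this is neither the Pig, Java, nor DMX-h version from the test.

```python
def file_cdc(old_rows, new_rows, key_index=0):
    """Return (change_type, row) pairs describing how new_rows differs from old_rows."""
    old = {row[key_index]: row for row in old_rows}
    new = {row[key_index]: row for row in new_rows}
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("INSERT", new[key]))
        elif key not in new:
            changes.append(("DELETE", old[key]))
        elif old[key] != new[key]:
            changes.append(("UPDATE", new[key]))
    return changes


old_snapshot = [("1", "alice", "NY"), ("2", "bob", "LA"), ("3", "carol", "SF")]
new_snapshot = [("1", "alice", "NY"), ("2", "bob", "TX"), ("4", "dave", "DC")]

changes = file_cdc(old_snapshot, new_snapshot)
for change_type, row in changes:
    print(change_type, row)
```

At cluster scale the same comparison is usually expressed as a join of the two snapshots on the key, which is why sort and join performance dominates this kind of job.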

Page 18: Why Hadoop is important to Syncsort

Web Log Aggregation

[Figure: side-by-side comparison of the Web Log Aggregation job in Pig, Java, and DMX-h. Labels on the figure: 94 lines of code; 48 lines of code.]
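The deck does not say which fields the Web Log Aggregation job groups on. As a rough idea of the shape of such a job, the sketch below assumes Common Log Format input and counts hits and bytes per requested path. The regular expression, the `aggregate` function, and the sample lines are all assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Assumed input format: Common Log Format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def aggregate(lines):
    """Return {path: (hits, total_bytes)}, skipping lines that do not parse."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        size = 0 if match["size"] == "-" else int(match["size"])
        entry = totals[match["path"]]
        entry[0] += 1
        entry[1] += size
    return {path: tuple(counts) for path, counts in totals.items()}


sample = [
    '10.0.0.1 - - [11/Dec/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [11/Dec/2013:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [11/Dec/2013:10:00:02 +0000] "GET /try HTTP/1.1" 304 -',
    'malformed line',
]

report = aggregate(sample)
print(report)
```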

Page 19: Why Hadoop is important to Syncsort

Cluster Configuration – DMX-h Ran on 763 Nodes!

Cluster Specs:
– 763 node cluster
  • 1 node – job tracker
  • 1 node – name node
  • 1 node – secondary name node
  • 760 data and task nodes

Hadoop cluster configuration changes (from defaults):
– 128 MB HDFS block size (file.blocksize)
– 1.5 GB map / 4 GB reduce task JVM memory (mapred.child.java.opts)
– Maximum 22 map tasks and 4 reduce tasks per node (mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum)

Cluster Node Specs:
– 12 cores: dual Intel Westmere (hex-core) CPUs, 2.93 GHz, 12 MB cache
– 48 GB DDR3 RDIMM memory
– 12 x 2 TB 3.5” Seagate 7200 rpm drives
– Disk 0 + Disk 1 are RAID1 (mirrored) for OS
  • 100 MB/sec write
  • 115 MB/sec read
– 10 single-disk JBOD
– Mellanox ConnectX®-3 VPI NIC (supported data rates 40GbE; 10GbE)
– RHEL 6.1 64-bit
– Java 1.6 (jdk.x86_64-2000:1.6.0_29-fcs)
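The per-node settings above translate into cluster-wide totals with simple arithmetic. The short sketch below works them out from the slide's figures. The constant names are illustrative, and two readings are assumed: that the 1.5 GB and 4 GB values are per-task JVM heap limits, and that the 10 JBOD disks are ten of the 2 TB drives.

```python
# Figures from the slide; names are illustrative only.
WORKER_NODES = 760            # data and task nodes
MAP_SLOTS_PER_NODE = 22
REDUCE_SLOTS_PER_NODE = 4
MAP_JVM_GB = 1.5              # assumed to be the per-map-task heap limit
REDUCE_JVM_GB = 4.0           # assumed to be the per-reduce-task heap limit
DATA_DISKS_PER_NODE = 10      # assumed: the JBOD disks are 2 TB each
DISK_TB = 2

total_map_slots = WORKER_NODES * MAP_SLOTS_PER_NODE
total_reduce_slots = WORKER_NODES * REDUCE_SLOTS_PER_NODE

# Upper bound on configured task heap per node if every slot is busy (not measured usage).
max_task_heap_per_node_gb = (
    MAP_SLOTS_PER_NODE * MAP_JVM_GB + REDUCE_SLOTS_PER_NODE * REDUCE_JVM_GB
)

raw_data_disk_tb = WORKER_NODES * DATA_DISKS_PER_NODE * DISK_TB

print(total_map_slots, total_reduce_slots, max_task_heap_per_node_gb, raw_data_disk_tb)
```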

Page 20: Why Hadoop is important to Syncsort

Sort Acceleration - Terasort

Use Case: TERASORT. ETL or Sort Acceleration: Sort Acceleration. Alternative: Native.
"Native/Alt." columns are the Native/Alternative measurements; the DMX-h memory column is DMX-h Physical Memory (GB).

Data Size  Elapsed Time                   Memory (GB)                    CPU Time                          MB/Sec/Node
(GB)       Native/Alt.  DMX-h    Improv.  Native/Alt.  DMX-h    Improv.  Native/Alt.  DMX-h       Improv.  Native/Alt.  DMX-h
512        0:01:47      0:01:45  2%       12,863       12,873   0%       114,297      62,491      45%      6.5          6.6
1,024      0:02:29      0:01:11  52%      14,512       14,522   0%       194,896      98,972      49%      9.3          19.4
1,536      0:04:02      0:01:23  66%      14,684       14,694   0%       287,055      143,759     50%      8.6          25.0
4,096      0:03:31      0:02:29  29%      31,520       31,549   0%       927,379      380,442     59%      26.2         37.0
10,242     0:08:51      0:05:14  41%      47,935       47,951   0%       2,835,927    1,460,101   49%      26.4         44.6
20,484     0:14:55      0:12:28  16%      106,153      105,239  1%       6,112,296    3,696,727   40%      31.0         37.4
102,400    1:12:12      0:51:59  28%      387,262      387,211  0%       30,436,624   16,589,332  45%      32.3         44.9
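The slide does not state how the Elapsed Time Improvement percentages were computed. They are consistent with (native - DMX-h) / native, rounded to a whole percent, and the sketch below checks that reading against the Terasort rows. The helper names `to_seconds` and `improvement_pct` are hypothetical.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def improvement_pct(native: str, dmxh: str) -> int:
    """Elapsed-time reduction relative to the native run, as a whole percent."""
    base = to_seconds(native)
    return round(100 * (base - to_seconds(dmxh)) / base)


# (data size in GB, native elapsed, DMX-h elapsed) from the Terasort results
terasort_rows = [
    (512, "0:01:47", "0:01:45"),
    (1024, "0:02:29", "0:01:11"),
    (1536, "0:04:02", "0:01:23"),
    (4096, "0:03:31", "0:02:29"),
    (10242, "0:08:51", "0:05:14"),
    (20484, "0:14:55", "0:12:28"),
    (102400, "1:12:12", "0:51:59"),
]

computed = [improvement_pct(native, dmxh) for _, native, dmxh in terasort_rows]
print(computed)
```

The same formula applied to the CPU time columns reproduces the CPU improvement figures, for example (114,297 - 62,491) / 114,297 is about 45%.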

Page 21: Why Hadoop is important to Syncsort

File CDC

Use Case: FileCDC. ETL or Sort Acceleration: ETL. Alternative: Pig.
"Native/Alt." columns are the Native/Alternative measurements; the DMX-h memory column is DMX-h Physical Memory (GB).

Data Size  Elapsed Time                   Memory (GB)                    CPU Time                          MB/Sec/Node
(GB)       Native/Alt.  DMX-h    Improv.  Native/Alt.  DMX-h    Improv.  Native/Alt.  DMX-h       Improv.  Native/Alt.  DMX-h
148        0:05:31      0:01:33  72%      79,876       79,559   0%       79,876       79,559      0%       0.6          2.2
450        0:05:11      0:01:58  62%      243,834      182,869  25%      243,834      182,869     25%      1.9          5.3
1,515      0:07:49      0:03:44  52%      845,263      557,226  34%      845,263      557,226     34%      4.4          9.4

Page 22: Why Hadoop is important to Syncsort

Web Log Aggregation

Use Case: WebLogAggregation - Split Size & fixes. Alternative: Pig.
"Native/Alt." columns are the Native/Alternative measurements; the DMX-h memory column is DMX-h Physical Memory (GB).

Data Size  Elapsed Time                   Memory (GB)                    CPU Time                          MB/Sec/Node
(GB)       Native/Alt.  DMX-h    Improv.  Native/Alt.  DMX-h    Improv.  Native/Alt.  DMX-h       Improv.  Native/Alt.  DMX-h
2,067      0:01:12      0:00:58  19%      13,499       7,813    42%      145,972      56,496      61%      40.1         49.8
4,135      0:01:42      0:01:23  19%      18,003       15,579   13%      300,627      152,390     49%      56.1         69.6
10,240     0:05:16      0:02:04  61%      40,773       39,091   4%       807,473      335,537     58%      45.3         115.4
20,480     0:07:54      0:06:58  12%      78,654       78,128   1%       1,339,453    568,107     58%      60.4         68.4

Page 23: Why Hadoop is important to Syncsort

A Smarter Approach… and Quite Possibly The Only Approach!

Test Drive DMX-h: Bridge the Gap Between Big Iron & Big Data!

• Self-contained image
• Use case accelerators for mainframe, Hadoop and more!

Running on CDH

www.syncsort.com/try