understanding apache hadoop - neudesic...• employs contributors to project apache hadoop •...

18
© Copyright 2014, Neudesic. All rights reserved. Understanding Apache Hadoop Presented By: Orion Gebremedhin BI/Big Data Solution Partner , Neudesic LLC. Data Platform VTSP, Microsoft Corp. @OrionGM

Upload: others

Post on 14-Mar-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

Understanding Apache Hadoop

Presented By: Orion GebremedhinBI/Big Data Solution Partner , Neudesic LLC.

Data Platform VTSP, Microsoft Corp.

@OrionGM

Page 2: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

Topics Covered

• Fundamentals of Hadoop

• Major Hadoop Distributions

Page 3: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

3

Big Data = Hadoop?

* A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

Page 4: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

4

The Fundamentals of Hadoop

• Hadoop evolved directly from

commodity scientific supercomputing

clusters developed in the 1990s

• Hadoop consists of:

• MapReduce

• Hadoop Distributed File System

(HDFS)

Page 5: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

5

What’s New?

Page 6: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

6

Characteristics of HDFS

• Very high fault tolerance

• Cannot be updated, but

corrections can be appended

• File blocks are replicated

multiple times

Three Types of Nodes:

1. Name Node (Directory)

2. Backup Node (Checkpoint)

3. Data Node (Actual Data)

Page 7: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

7

Characteristics of MapReduce

• Map Function - Take a task and break it down into small tasks

• Reduce Function - Combine the partial answers and find the combined list

A programing framework for library and runtime (just like .NET)

• When submitting a query, this manages the Task Trackers which trigger the actual Map or Reduce task

Master (Job Tracker)

• The “doers.” Each node in the cluster has a data node and a task tracker

Workers (Task Trackers)

Page 8: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

8

HDFS and MapReduce

The Main Node: runs the Job tracker and the name node controls the files.

Each node runs two processes: Task Tracker and Data Node

Page 9: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

9

Basics of MapReduce

1 bill/ sec

= 400 Seconds

400 bills

Page 10: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

10

Basics of MapReduce

1 bill/ sec

1 bill/ sec

= 200 Seconds

= 200 Seconds

200 Bills

200 BillsTotal =200 seconds

Page 11: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

11

Basics of MapReduce

= 100 Seconds100 bills

100 bills

100 bills

100 bills

= 100 Seconds

= 100 Seconds

= 100 Seconds

Total =100 seconds

Page 12: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

12

Basics of MapReduce

Query

Result

Name Node/Job TrackerQuery

Data Nodes/Task Trackers

Page 13: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

13

MapReduce

• Java

• Write many lines of code

Pig

• Mostly used by Yahoo

• Most used for data processing

• Shares some constructs w/ SQL

• Is more Verbose

• Needs a lot of training for users with limited procedural programming background

• Offers control over the flow of data

Hive

• Mostly used by Facebook for analytic purposes

• Used for analytics

• Relatively easier for developers w/ SQL experience.

• Less control over optimization of data flows compared to Pig

MapReduce, Pig and Hive

• Not as efficient as MapReduce

• Higher productivity for data scientists and developers

Page 14: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

14

Major Versions of Apache Hadoop

Apache Foundation

Page 15: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

15

Hortonworks

• 2011: $23 million funding from Yahoo! and Benchmark Capital- positioned as an independent company

• Horton the Elephant - Horton Hears a Who!

• Employs contributors to project Apache Hadoop

• October 2011 partnered with Microsoft : Azure and Windows Server

• Cloudera founded in October 2008…started the effort to be Microsoft Azure Certified in October 2014.

Page 16: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

16

HDP User Interface

Page 17: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

17

HUE Architecture

Page 18: Understanding Apache Hadoop - Neudesic...• Employs contributors to project Apache Hadoop • October 2011 partnered with Microsoft : Azure and Windows Server • Cloudera founded

© Copyright 2014, Neudesic. All rights reserved.

Questions?Tom Marek

[email protected]

Twitter: @twmarek

Orion Gebremedhin

[email protected]

Twitter: @OrionGM