Everything You Need To Know About Hadoop Now

The Leader in Big Data Consulting

Upload: mammoth-data

Post on 22-Jan-2018

Page 1: Everything You Need To Know About Hadoop Now

The Leader in Big Data Consulting

Page 2: Everything You Need To Know About Hadoop Now

www.mammothdata.com | @mammothdataco

Everything you (freaking) need to know about Hadoop Now

Andrew C. Oliver | @acoliver | #ATO2014

{All Things Open | Raleigh}

Page 3: Everything You Need To Know About Hadoop Now

Andrew C. Oliver, President & Founder

● @acoliver

● Programming since age 8

● Java since ~1997

● Founded POI project (currently hosted at Apache) with Marc Johnson ~2000

○ Former member Jakarta PMC

○ Emeritus member of Apache Software Foundation

● Joined JBoss ~2002

● Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)

● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver

○ I make fanboys cry

Page 4: Everything You Need To Know About Hadoop Now

Open Software Integrators

Founded Nov 2007 by Andrew C. Oliver (me) in Durham, NC

Pivoted from Java/Linux consulting to full on Hadoop/NoSQL this year

We’re Hiring

● mid to senior level (Java/Linux and Database background)

● devopsy type people (Puppet, Chef, Salt, etc., Linux background, database understanding, Ruby/Python/etc.)

● up to 50% travel, salary + bonus, 401k, health, etc.

● preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery

● nice to have: Hadoop, Neo4j, MongoDB, Cassandra, Ruby, at least one Cloud platform

Page 5: Everything You Need To Know About Hadoop Now

Overview

What is Hadoop anyhow?

What is Hadoop Good For?

What isn’t it good for?

How do you get data into Hadoop?

How do you get data out of Hadoop?

How do you process data in Hadoop?

How do you analyze data in Hadoop?

How do you secure Hadoop?

Page 6: Everything You Need To Know About Hadoop Now

But first...

This is an overview talk intended as a roadmap to point you at the most important bits to learn on the way…

It is not comprehensive training…

It is not an in-depth look at any part of Hadoop

It is a rather high level selective overview of the Hadoop ecosystem

Page 7: Everything You Need To Know About Hadoop Now

What Is Hadoop Anyhow?

Page 8: Everything You Need To Know About Hadoop Now

Hadoop is...

A platform for distributed computing

2011: HDFS, Hive

2012: HDFS, YARN, Hive, HBase

2014: HDFS, Hive, YARN, HBase, Spark, Storm, Kafka, Mahout, Sqoop, Oozie...

Page 9: Everything You Need To Know About Hadoop Now

Hadoop is...

HDFS

Distributed Filesystem similar to Gluster, Ceph, etc.

You can use other distributed filesystems in place of HDFS

Blocks are distributed and, by default, replicated across nodes (the default replication factor is 3)

128 MB default block size

RESTful API, CLI tools, third-party tools to “mount” HDFS on Linux (stable), Windows (ymmv), Mac (?)

DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
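To make the block and replication numbers concrete, here is a minimal Python sketch of the arithmetic. It assumes the 128 MB block size from the slide and a replication factor of 3 (the stock HDFS default, an assumption here rather than something your cluster necessarily uses). The function names are made up for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size
REPLICATION = 3                 # assumed replication factor (stock HDFS default)


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks needed to hold a file of file_size bytes."""
    # Integer ceiling division; an empty file needs no blocks.
    return (file_size + block_size - 1) // block_size


def raw_bytes(file_size, replication=REPLICATION):
    """Raw disk used across the cluster: every byte is stored `replication` times."""
    return file_size * replication


one_gb = 1024 ** 3
print(block_count(one_gb), "blocks")              # blocks for a 1 GB file
print(raw_bytes(one_gb) // one_gb, "GB on disk")  # raw storage for that file
```

The last block of a file only occupies as many bytes as it actually holds, which is why `raw_bytes` multiplies the file size and not the block count.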

Page 10: Everything You Need To Know About Hadoop Now

Hadoop is...

YARN

● Yet Another Resource Negotiator

● schedules “work” among nodes, distributes the “processing”

Map Reduce is

● an API

● an algorithm: data is mapped to nodes, the answers are “reduced” to a single answer

Hive is

● HDFS/Hadoop based data warehousing

● SQL, JDBC, ODBC

● Tables map to files on HDFS

● No updates, deletes, transactions (but coming in “Stinger.next”)
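The map-then-reduce idea can be sketched in a few lines of plain Python. This is an in-process toy, not the Hadoop Map-reduce Java API: the function names are invented for illustration, and the "shuffle" step stands in for the grouping the framework does between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single answer."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["Hadoop is HDFS", "Hadoop is YARN"])
print(counts)
```

On a real cluster each mapper would run on the node that holds its slice of the input, and only the grouped pairs would travel over the network to the reducers.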

Page 11: Everything You Need To Know About Hadoop Now

Hadoop is...

HBase

● a column family database

● ACID

● relatively low-latency

And a whole lot more

Page 12: Everything You Need To Know About Hadoop Now

Hadoop is...

An ecosystem of tools for distributed processing and storage of data.

Page 13: Everything You Need To Know About Hadoop Now

What Is Hadoop Good For?

Page 14: Everything You Need To Know About Hadoop Now

What Is Hadoop Good For?

Working with large amounts of data in batch

ETL processing / Data Transformation

Analytics / BI

Integration (Data Lake, Enterprise Data Hub)

Working with streams of data

Events

Log data

Time series or similar data (HBase)

Page 15: Everything You Need To Know About Hadoop Now

What Is Hadoop Bad At?

Quick jobs - i.e. Hive/Map Reduce setup time is measured in seconds to minutes.

Lots of small files (every file and every block costs NameNode memory, and usually its own map task, however small the file is)

General DBMS stuff - HBase is a much more “specific” database than MySQL/etc.

High Availability

WHA???

Knox, Oozie, etc all have shaky support if any for HA Namenodes.
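One way to reason about the small-files problem is to count the objects the NameNode has to keep in memory. The sketch below assumes roughly 150 bytes of heap per file or block object, which is a widely quoted rule of thumb and not a measured figure. The function names are hypothetical.

```python
BYTES_PER_OBJECT = 150          # assumed rule-of-thumb heap cost per namespace object
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count namespace objects: one per file plus one per block."""
    objects = 0
    for size in file_sizes:
        blocks = (size + block_size - 1) // block_size
        objects += 1 + blocks
    return objects


def namenode_heap_bytes(file_sizes):
    """Estimated NameNode heap needed to track the given files."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT


# The same 10 GB of data, stored two different ways.
as_big_files = [1024 ** 3] * 10        # ten files of 1 GB each
as_small_files = [1024 ** 2] * 10_240  # 10,240 files of 1 MB each

print(namenode_heap_bytes(as_big_files), "bytes of heap for big files")
print(namenode_heap_bytes(as_small_files), "bytes of heap for small files")
```

The data volume is identical in both cases; only the object count, and with it the pressure on the NameNode, changes.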

Page 16: Everything You Need To Know About Hadoop Now

How Do You Get Data Into/Out Of Hadoop?

Page 17: Everything You Need To Know About Hadoop Now

How Do You Get Data Into Hadoop?

Sqoop it from an RDBMS

Use JDBC or ODBC and push into Hive from an external DB

Push data into Hive with the RESTful API

Put an extract file onto HDFS with the REST API

process it into Hive directly with a LOAD DATA statement

transform/process it into Hive using Pig

use Java

Message it in there with Kafka, RabbitMQ or similar MQ and custom “spout” for Storm

Use any multitude of APIs that write data into HDFS, HBase, Hive, etc.
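As a sketch of the "put an extract file onto HDFS with the REST API" route, the Python below targets WebHDFS, where a file upload is a two-step PUT: the NameNode answers `op=CREATE` with a redirect, and the bytes go to the DataNode it names. The host name, user, and path are placeholders, and port 50070 is an assumption (the usual NameNode HTTP port on Hadoop 2.x); check your own cluster's settings.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_create_url(namenode, path, user, port=50070, overwrite=False):
    """Build the WebHDFS URL for step 1 of a file upload (op=CREATE)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    return f"http://{namenode}:{port}/webhdfs/v1{quote(path)}?{query}"


def upload(namenode, path, user, data, port=50070):
    """Upload bytes to HDFS via WebHDFS; returns the final HTTP status code."""
    # Step 1: ask the NameNode where to write. It replies with a redirect
    # whose Location header points at a DataNode.
    first = urlsplit(webhdfs_create_url(namenode, path, user, port))
    conn = http.client.HTTPConnection(first.hostname, first.port)
    conn.request("PUT", f"{first.path}?{first.query}")
    response = conn.getresponse()
    location = response.getheader("Location")
    response.read()
    conn.close()

    # Step 2: send the actual bytes to the DataNode from the redirect.
    target = urlsplit(location)
    conn = http.client.HTTPConnection(target.hostname, target.port)
    conn.request("PUT", f"{target.path}?{target.query}", body=data)
    status = conn.getresponse().status
    conn.close()
    return status


print(webhdfs_create_url("nn.example.com", "/staging/orders.csv", "etl"))
```

Calling `upload("nn.example.com", "/staging/orders.csv", "etl", b"...")` would need a reachable cluster with WebHDFS enabled; only the URL builder runs without one.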

Page 18: Everything You Need To Know About Hadoop Now

How Do You Get Data Out Of Hadoop?

Should you be getting it out or should you process it there?

JDBC/ODBC to Hive

HBase can be mounted into Hive

REST APIs for Hive/HDFS

APIs for Kafka, Spark, Storm, etc (subscribe)

DistCp to another HDFS

Mount it with FUSE and use your favorite Linux tool

hadoop fs -cat /path/to/file/on/hdfs | grep stuff > mynewlocalfile

Page 19: Everything You Need To Know About Hadoop Now

How Do You Process Data In Hadoop?

Page 20: Everything You Need To Know About Hadoop Now

How Do You Process Data In Hadoop?

Map-reduce Java API

Hive supports a subset of SQL (soon to be more than a subset)

Pig can munge files on HDFS and can work with Hive

Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data

There are numerous toolkits

Mahout - common machine learning algorithms (many not very parallelizable/etc)

MLlib - Machine learning built on Spark

GraphX - Graph processing built on Spark
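The "micro-batch" idea mentioned above can be illustrated without any cluster: chop a stream of events into small groups and run an ordinary batch computation on each group. This Python sketch batches by count for simplicity, which is an assumption; streaming engines typically cut batches by time interval. It does not use the Storm or Spark APIs, and all names are illustrative.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to batch_size events from any iterable stream."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_by_level(batch):
    """A small batch job: count log events per severity level."""
    counts = {}
    for level, _message in batch:
        counts[level] = counts.get(level, 0) + 1
    return counts


events = [
    ("INFO", "start"),
    ("ERROR", "disk"),
    ("INFO", "ok"),
    ("WARN", "slow"),
    ("ERROR", "net"),
]
results = [count_by_level(batch) for batch in micro_batches(events, 2)]
print(results)
```

Each result is available as soon as its batch closes, which is the trade-off micro-batching makes between per-event latency and batch-style throughput.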

Page 21: Everything You Need To Know About Hadoop Now

How Do You Analyze Data In Hadoop?

Most major BI tools now support Hadoop

● Tableau

● Pentaho

● Datameer

● Your favorite probably here

All that stuff is for l4m3rs, use the command line interface :-)

● hive -e 'select * from sometable'

● pig hdfs://some/dir/myscript.pig

Use RStudio and write some R to predict what sales will be next month (you will be sort of wrong probably)

Use your favorite SQL tool that supports JDBC/ODBC

Use Hue

Page 22: Everything You Need To Know About Hadoop Now

How Do You Secure Hadoop?

Page 23: Everything You Need To Know About Hadoop Now

How Do You Secure Hadoop?

HDFS supports POSIX (that means Linux-style) filesystem security

The most complete security authentication throughout Hadoop is based on Kerberos (yeah I know).

You can do it with just straight LDAP too, but it isn’t integrated.

Knox supplies “perimeter-based security” for (only):

● Hive

● HDFS

● Oozie

● HBase

● HCatalog

Supposedly Argus will save us from all of this!

Page 24: Everything You Need To Know About Hadoop Now

Other Considerations

Page 25: Everything You Need To Know About Hadoop Now

Cacophony

Disaster Recovery: Falcon (alpha quality)

Workflow: Flume

Schedule/trigger/orchestrate those ETL jobs: Oozie

Install, configure, monitor Hadoop: Ambari

Use tables in both Pig and Hive: HCatalog

Page 26: Everything You Need To Know About Hadoop Now

Ambari

Page 27: Everything You Need To Know About Hadoop Now

Hue

Page 28: Everything You Need To Know About Hadoop Now

Hue Editing Oozie

Page 29: Everything You Need To Know About Hadoop Now

Pig Script

REGISTER file:///usr/lib/pig/piggybank.jar;

define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

rows = load '$FILEPATH' using org.apache.pig.piggybank.storage.CSVExcelStorage('\u001a') as (
    a0:chararray,
    a1:chararray,
    a2:chararray,
    a3:chararray,
    a4:chararray,
    a5:chararray,
    a6:chararray,
    a7:chararray,
    a8:chararray,
    a9:chararray
);

row = foreach rows GENERATE
    REPLACE((TRIM($0)),'NULL','') as orderid,
    REPLACE((TRIM($1)),'NULL','') as customerid,
    REPLACE((TRIM($2)),'NULL','') as customername,
    REPLACE((TRIM($3)),'NULL','') as address,
    REPLACE((TRIM($4)),'NULL','') as city,
    REPLACE((TRIM($5)),'NULL','') as state,
    REPLACE((TRIM($6)),'NULL','') as zip,
    REPLACE((TRIM($7)),'NULL','') as status,
    REPLACE((TRIM($8)),'NULL','') as

store row into 'stage.orders' using
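For readers who do not read Pig, here is a rough Python equivalent of the cleaning step in the script: split each record on the `\u001a` separator, trim every field, and strip the literal text `NULL`. It only covers the eight column names visible on the slide, the sample record is made up, and it is a sketch of the transformation, not a replacement for running the job on a cluster.

```python
import csv
import io

DELIMITER = "\x1a"  # the same \u001a separator the Pig script passes to CSVExcelStorage
COLUMNS = ["orderid", "customerid", "customername", "address",
           "city", "state", "zip", "status"]  # names visible on the slide


def clean(value):
    """Mirror REPLACE(TRIM(x), 'NULL', ''): trim, then drop the literal NULL."""
    return value.strip().replace("NULL", "")


def clean_rows(text):
    """Parse delimited text and yield one cleaned dict per record."""
    reader = csv.reader(io.StringIO(text), delimiter=DELIMITER)
    for record in reader:
        yield {name: clean(field) for name, field in zip(COLUMNS, record)}


# Hypothetical sample record, for illustration only.
sample = DELIMITER.join(
    [" 1001 ", "NULL", "Ada Lovelace", "1 Main St", "Durham", "NC", "27701", "open"]
) + "\n"
rows = list(clean_rows(sample))
print(rows[0])
```

The difference in practice is scale: Pig compiles the same logic into jobs that run across the cluster, while this version processes one file on one machine.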

Page 30: Everything You Need To Know About Hadoop Now

Thank you for attending!