Everything You Need to Know About Hadoop Right Now


Upload: all-things-open

Post on 18-Jul-2015

Everything you (freaking) need to know about Hadoop Now
Andrew C. Oliver
@acoliver
#ATO2014
{All Things Open | Raleigh}
{Open Software Integrators} { www.osintegrators.com} {@osintegrators}

Andrew C. Oliver
● Programming since I was about 8
● Java since ~1997
● Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
○ Former member Jakarta PMC
○ Emeritus member of Apache Software Foundation
● Joined JBoss ~2002
● Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
○ I make fanboys cry.

Open Software Integrators
● Founded Nov 2007 by Andrew C. Oliver (me)
○ in Durham, NC
● Pivoted from Java/Linux consulting to full on Hadoop/NoSQL this year
● We’re Hiring
○ mid to senior level (Java/Linux and Database background)
○ devopsy type people (Puppet, Chef, Salt, etc, Linux background, database understanding, Ruby/Python/etc)
○ up to 50% travel, salary + bonus, 401k, health, etc etc
○ preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, JQuery
○ nice to have: Hadoop, Neo4j, MongoDB, Cassandra, Ruby, at least one Cloud platform

Overview
● What is Hadoop anyhow?
● What is Hadoop good for?
● What isn’t it good for?
● How do you get data into Hadoop?
● How do you get data out of Hadoop?
● How do you process data in Hadoop?
● How do you analyze data in Hadoop?
● How do you secure Hadoop?

But first...
● This is an overview talk intended as a roadmap to point you at the most important bits to learn on the way…
● It is not comprehensive training…
● It is not an in-depth look at any part of Hadoop
● It is a rather high-level, selective overview of the Hadoop ecosystem

What is Hadoop Anyhow?

Hadoop is
● A platform for distributed computing
● 2011
○ HDFS
○ Hive
● 2012
○ HDFS
○ YARN
○ Hive
○ HBase
● 2014
○ HDFS
○ Hive
○ YARN
○ HBase
○ Spark
○ Storm
○ Kafka
○ Mahout
○ Sqoop
○ Oozie
○ ...

Hadoop is
● HDFS
○ Distributed filesystem similar to Gluster, Ceph, etc.
○ You can use other distributed filesystems in place of HDFS
○ Blocks are distributed and, by default, replicated on other nodes (the stock replication factor is 3)
○ 128 MB default block size
○ RESTful API, CLI tools, third-party tools to “mount” HDFS on Linux (stable), Windows (ymmv), Mac (?)
● DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
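
To make the block-size and replication bullets concrete, here is a rough back-of-the-envelope sketch in Python. The 128 MB block size comes from the slide; the replication factor of 3 is an assumption (the stock `dfs.replication` default, not stated on the slide), and `hdfs_footprint` is a made-up helper name, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size (from the slide)
REPLICATION = 3                 # assumption: stock dfs.replication default


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Estimate (blocks, block_replicas, raw_bytes) for a single file.

    The last block of a file is not padded, so raw storage is simply
    file size times replication factor.
    """
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    # Integer ceiling division: how many blocks the file is split into.
    blocks = (file_size_bytes + block_size - 1) // block_size
    return blocks, blocks * replication, file_size_bytes * replication


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1024 ** 3)  # a 1 GiB file
    print(blocks, replicas, raw)
```

Under these assumptions a 1 GiB file is split into 8 blocks, stored as 24 block replicas across the cluster.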

Hadoop is
● YARN
○ Yet Another Resource Negotiator
○ schedules “work” among nodes, distributes the “processing”
● MapReduce is
○ an API
○ an algorithm: data is mapped to nodes, and the answers are “reduced” to a single answer
● Hive is
○ HDFS/Hadoop based data warehousing
○ SQL, JDBC, ODBC
○ Tables map to files on HDFS
○ No updates, deletes, transactions (but coming in “Stinger.next”)
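
The MapReduce bullet above can be sketched in a few lines of plain Python. This is only an illustration of the algorithm in a single process, not the Hadoop Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for the example, and the classic word count is used as the job.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single answer."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "big deal"]))
```

On a real cluster the map calls run on the nodes that hold the data blocks and the shuffle moves data over the network; here everything happens in memory, which is the part this sketch deliberately leaves out.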

Hadoop is
● HBase
○ a column family database
○ ACID
○ relatively low-latency
● And a whole lot more

Hadoop is
● An ecosystem of tools for distributed processing and storage of data.

What is Hadoop Good For?

● Working with large amounts of data in batch
○ ETL processing / Data Transformation
○ Analytics / BI
○ Integration (Data Lake, Enterprise Data Hub)
● Working with streams of data
○ Events
■ Log data
● Time series or similar data (HBase)

● What is Hadoop bad at?
○ Quick jobs - i.e. Hive/MapReduce setup time is measured in seconds to minutes.
○ Lots of small files (a tiny file does not fill a whole 128 MB block on disk, but every file and block costs NameNode memory, so millions of small files hurt)
○ General DBMS stuff - HBase is a much more “specific” database than MySQL/etc.
○ High Availability
■ WHA???
● Knox, Oozie, etc all have shaky support if any for HA Namenodes.

How do you get data into/out of Hadoop?

● How do you get data into Hadoop?
○ Sqoop it from an RDBMS
○ Use JDBC or ODBC and push into Hive from an external DB
○ Push data into Hive with the RESTful API
○ Put an extract file onto HDFS with the REST API
■ process it into Hive directly with a LOAD DATA statement
■ transform/process it into Hive using Pig
■ use Java
○ Message it in there with Kafka, RabbitMQ or a similar MQ and a custom “spout” for Storm
○ Use any of a multitude of APIs that write data into HDFS, HBase, Hive, etc.
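
For the "put an extract file onto HDFS with the REST API" route, the sketch below only builds the first request URL for WebHDFS. It assumes WebHDFS is enabled, simple (non-Kerberos) authentication via `user.name`, and the Hadoop 2.x default NameNode HTTP port 50070; the host name, path and user in the usage lines are hypothetical, and `webhdfs_create_url` is a made-up helper name.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, hdfs_path, user, port=50070, overwrite=False):
    """Build the WebHDFS URL for the first step of a file upload.

    WebHDFS uploads are two-step: a PUT to this URL (no body) is answered
    with a 307 redirect to a DataNode, and the file content is then PUT
    to the redirect location.
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute HDFS path")
    query = urlencode({
        "op": "CREATE",
        "user.name": user,                   # simple auth only (assumption)
        "overwrite": str(overwrite).lower(),
    })
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


if __name__ == "__main__":
    # Hypothetical host, path and user, for illustration only.
    print(webhdfs_create_url("namenode.example.com", "/landing/orders.csv", "etl"))
```

Once the file is on HDFS, the LOAD DATA or Pig options from the list take over.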

● How do you get data out of Hadoop?
○ Should you be getting it out or should you process it there?
○ JDBC/ODBC to Hive
○ HBase can be mounted into Hive
○ REST APIs for Hive/HDFS
○ APIs for Kafka, Spark, Storm, etc (subscribe)
○ DistCp to another HDFS
○ Mount it with FUSE and use your favorite Linux tool
○ hadoop fs -cat /path/to/file/on/hdfs | grep stuff > mynewlocalfile
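
The last bullet, the `hadoop fs -cat ... | grep stuff > mynewlocalfile` pipeline, can also be driven from a script. The following is a minimal sketch that assumes the `hadoop` CLI is on the PATH; the helper names are invented, and the filter does a plain substring match rather than grep's regular expressions.

```python
import subprocess


def grep_lines(lines, needle):
    """Yield only the lines that contain needle (the '| grep stuff' part)."""
    for line in lines:
        if needle in line:
            yield line


def hdfs_cat_grep(hdfs_path, needle, local_path):
    """Stream a file out of HDFS and keep matching lines in a local file.

    Assumes the 'hadoop' command is installed and configured for the cluster.
    """
    proc = subprocess.Popen(
        ["hadoop", "fs", "-cat", hdfs_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    with open(local_path, "w") as out:
        out.writelines(grep_lines(proc.stdout, needle))
    if proc.wait() != 0:
        raise RuntimeError(f"hadoop fs -cat failed for {hdfs_path}")
```

Because the output is streamed line by line, the whole file never has to fit in memory, which matters for the file sizes Hadoop is usually chosen for.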

How do you process data in Hadoop?

● MapReduce Java API
● Hive supports SQL (a subset today, soon to be more than a subset)
● Pig can munge files on HDFS and can work with Hive
● Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data
● There are numerous toolkits
○ Mahout - common machine learning algorithms (many not very parallelizable/etc)
○ MLlib - machine learning built on Spark
○ GraphX

How do you analyze data in Hadoop?
● Most major BI tools now support Hadoop
○ Tableau
○ Pentaho
○ Datameer
○ Your favorite probably here
● All that stuff is for l4m3rs, use the command line interface :-)
○ hive -e 'select * from sometable'
○ pig hdfs://some/dir/myscript.pig
● Use RStudio and write some R to predict what sales will be next month (you will be sort of wrong probably)
● Use your favorite SQL tool that supports JDBC/ODBC
● Use Hue

How do you secure Hadoop?

● HDFS supports POSIX-style (that means Linux-style) filesystem permissions
● The most complete authentication throughout Hadoop is based on Kerberos (yeah I know).
● You can do it with just straight LDAP too, but it isn’t integrated.
● Knox supplies “perimeter-based security” for (only):
○ Hive
○ HDFS
○ Oozie
○ HBase
○ HCatalog
● Supposedly Argus will save us from all of this!
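
As a small illustration of what "Linux-style" filesystem security means in practice, the sketch below evaluates a read check the way the owner/group/other model does: the first class that matches the user decides, and later classes are not consulted. It is a simplification that ignores the HDFS superuser and ACLs; `can_read` and the user and group names are hypothetical.

```python
import stat


def can_read(mode, owner, group, user, user_groups):
    """Return True if user may read a file with the given mode bits.

    Owner bits apply if the user owns the file; otherwise group bits apply
    if the user is in the file's group; otherwise the 'other' bits apply.
    """
    if user == owner:
        return bool(mode & stat.S_IRUSR)
    if group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    mode = 0o640  # rw-r-----, as set by something like: hadoop fs -chmod 640 <path>
    print(stat.filemode(stat.S_IFREG | mode))
    print(can_read(mode, "etl", "analysts", "etl", ["etl"]))
    print(can_read(mode, "etl", "analysts", "bob", ["analysts"]))
    print(can_read(mode, "etl", "analysts", "carol", ["marketing"]))
```

Permissions only say what an identity may do; proving who the identity is remains the job of Kerberos, as the next bullet on the slide notes.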

Other Considerations

Cacophony
● Disaster Recovery
○ Falcon (alpha quality)
● Workflow
○ Flume
● Schedule/trigger/orchestrate those ETL jobs
○ Oozie
● Install, configure, monitor Hadoop
○ Ambari
● Use tables in both Pig and Hive
○ HCatalog

Ambari

Hue

Hue editing Oozie

Pig Script

REGISTER file:///usr/lib/pig/piggybank.jar;
define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

-- Load the extract file; '\u001a' is the field delimiter handed to CSVExcelStorage.
rows = load '$FILEPATH' using org.apache.pig.piggybank.storage.CSVExcelStorage('\u001a')
    as (a0:chararray, a1:chararray, a2:chararray, a3:chararray, a4:chararray,
        a5:chararray, a6:chararray, a7:chararray, a8:chararray, a9:chararray);

-- Trim every field and turn the literal string 'NULL' into an empty string.
row = foreach rows GENERATE
    REPLACE((TRIM($0)),'NULL','') as orderid,
    REPLACE((TRIM($1)),'NULL','') as customerid,
    REPLACE((TRIM($2)),'NULL','') as customername,
    REPLACE((TRIM($3)),'NULL','') as address,
    REPLACE((TRIM($4)),'NULL','') as city,
    REPLACE((TRIM($5)),'NULL','') as state,
    REPLACE((TRIM($6)),'NULL','') as zip,
    REPLACE((TRIM($7)),'NULL','') as status,
    REPLACE((TRIM($8)),'NULL','') as col8; -- alias lost in the transcript; col8 is a placeholder

-- Write into the Hive table stage.orders through HCatalog, partitioned by load date.
store row into 'stage.orders' using org.apache.hcatalog.pig.HCatStorer('loaddate=$LOADDATE');
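
The script above expects `$FILEPATH` and `$LOADDATE` to be supplied at launch. One way to do that is Pig's `-param NAME=value` option, sketched here as a small Python helper that builds the command line. The script name, input path and date are hypothetical, `pig_command` is a made-up helper, and `-useHCatalog` is included on the assumption that the installation needs it to put the HCatalog jars on Pig's classpath.

```python
import shlex


def pig_command(script, filepath, loaddate):
    """Build the argv for running a Pig script with its two parameters."""
    return [
        "pig",
        "-useHCatalog",                    # assumption: needed for HCatStorer
        "-param", f"FILEPATH={filepath}",  # fills $FILEPATH in the script
        "-param", f"LOADDATE={loaddate}",  # fills $LOADDATE in the script
        script,
    ]


if __name__ == "__main__":
    # Hypothetical values, printed rather than executed.
    cmd = pig_command("load_orders.pig", "/landing/orders.csv", "2014-10-23")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

A scheduler such as Oozie would normally pass these values in; the helper is only meant to show where the two placeholders get their values.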

Thank you for attending!
