Retrieving Big Data For the non-developer


Page 1: Retrieving big data for the non developer

Retrieving Big Data For the non-developer

Page 2: Retrieving big data for the non developer

Intended Audience

People who do not write code

But don’t want to wait for IT to bring them data

Page 3: Retrieving big data for the non developer

Disclaimer

You will have to write code. Sorry...

Page 4: Retrieving big data for the non developer

Worth Noting

A common objection, “But I’m not a developer”

Coding does not make you a developer any more than patching some drywall makes you a carpenter

Page 5: Retrieving big data for the non developer

Agenda

● The minimum you need to know about Big Data (Hadoop)
  o Specifically, HBase and Pig

● How you can retrieve data in HBase with Pig
  o How to use Python with Pig to make querying easier

Page 6: Retrieving big data for the non developer

One Big Caveat

● We are not talking about analysis

● Analysis is hard

● Learning code and trying to understand an analytical approach is really hard

● Following a straightforward Pig tutorial is better than a boring lecture

Page 7: Retrieving big data for the non developer

Big Data in One Slide (oh boy)

● Today, Big Data == Hadoop

● Hadoop is both a distributed file system (HDFS) and an approach to messing with data on the file system (MapReduce)
  o HBase is a popular database that sits on top of HDFS
  o Pig is a high-level language that makes messing with data on HDFS or in HBase easier

Page 8: Retrieving big data for the non developer

HBase in one slide

● HBase = Hadoop Database, based on Google’s Bigtable

● Column-oriented database – basically one giant table

Page 9: Retrieving big data for the non developer

Pig in one slide

● A data flow language we will use to write queries against HBase

● Pig is not the developer’s solution for retrieving data from HBase, but it works well enough for the BI analyst (and, of course, we aren’t developers)
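
Pig’s data-flow style looks something like this; a minimal sketch run against a made-up local file (people.csv) and made-up fields, just for illustration:

-- load a local CSV, keep only some rows, and print the result
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER people BY age >= 18;
DUMP adults;

Each statement names a new relation built from the previous one, which is what “data flow language” means in practice.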

Page 10: Retrieving big data for the non developer

Pig is Easier... Not Easy

● If you have no coding background, Pig will not be easy

● But it’s the best of a bad set of options right now

● Not hating on SQL-on-Hadoop providers, but with SQL you describe the entire result you want in one statement, which quickly gets complicated; Pig lets you build it up one step at a time, as sketched below
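
For example, a count-per-group that SQL would state in a single GROUP BY query becomes a sequence of small steps in Pig; a sketch with made-up file and field names:

people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, city:chararray);
by_city = GROUP people BY city;                                -- step 1: group the rows
city_counts = FOREACH by_city GENERATE group, COUNT(people);   -- step 2: count each group
DUMP city_counts;

You can DUMP the result of each step before writing the next one, which is friendlier when you are learning.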

Page 11: Retrieving big data for the non developer

Here’s our HBase table
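
A sketch of the kind of table the following examples assume: a row key plus columns in the info column family (row keys and values here are illustrative):

row key    info:first_name    info:last_name
row1       Steve              Buscemi
row2       Willie             Nelson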

Page 12: Retrieving big data for the non developer

Let’s dive in - Load

raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name', '-loadKey true -limit 1')
    AS (id:chararray, first_name:chararray, last_name:chararray);

You have to specify each field and its type in order to load it

Page 13: Retrieving big data for the non developer

Response is as expected

'info:first_name info:last_name' ... AS (first_name:chararray, last_name:chararray);

Will return a first name and a last name as separate fields, e.g., “Steve”, “Buscemi”

Page 14: Retrieving big data for the non developer

If you can write a Vlookup()

=VLOOKUP(C34, Z17:AZ56, 17, FALSE)

You can write a load statement in Pig.

Both are equally esoteric.

Page 15: Retrieving big data for the non developer

But what if we don’t know the fields?

● Suppose we have a column family of friends

● Each record will contain zero to many friends, e.g., friend_0: “John”, friend_1: “Paul”

Page 16: Retrieving big data for the non developer

The number of friends is variable

● There could be thousands of friends per row

● And we cannot specify “friend_5” because there is no guarantee that each record has five friends
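
Concretely, the rows might look something like this (values are illustrative):

row1: info:friend_0 = “John”, info:friend_1 = “Paul”
row2: info:friend_0 = “Steve”, ..., info:friend_999 = “Willie”
row3: (no friend columns at all)

No fixed AS (...) list of column names can describe every row.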

Page 17: Retrieving big data for the non developer

This is common...

● NoSQL databases are known for flexible schemas and flat table structures

● Unfortunately, the way Pig handles this problem utterly sucks...

Page 18: Retrieving big data for the non developer

Loading unknown friends

raw = LOAD 'hbase://SampleTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);

Now we have info:friends_*, which is represented as a “map”

Page 19: Retrieving big data for the non developer

A map is just a collection of key-value pairs

● They look like this: friend_1# ‘Steve’, friend_2# ‘Willie’

● They are very similar to Python dictionaries...

Page 20: Retrieving big data for the non developer

Here’s why they suck

● We can’t iterate over them

● To access a value (in this case a friend’s name), I have to provide the specific key, e.g., friend_5, to get the name of the fifth friend (see the snippet below)
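
In Pig that lookup uses the # (map dereference) operator; a sketch against the raw relation loaded above, assuming the keys look like friend_5:

-- pull out the value stored under the key 'friend_5' (null if that key is missing)
fifth_friends = FOREACH raw GENERATE id, friends#'friend_5';
DUMP fifth_friends;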

Page 21: Retrieving big data for the non developer

But I thought you said we didn’t know the number of friends?

● You are right – Pig expects us to provide the specific value of something unknown

● If only there were some way to iterate over a collection of key-value pairs…

Page 22: Retrieving big data for the non developer

Enter Python

● Pig may not allow you to iterate over a map, but it does allow you to write User-Defined Functions (UDFs) in Python

● In a Python UDF we can read in the map as a Python dict and return key-value pairs

Page 23: Retrieving big data for the non developer

Python UDF for Pig

@outputSchema("values:bag{t:tuple(key, value)}")

def bag_of_tuples(map_dict):

return map_dict.items()

We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie”, and manipulating a Python dict, e.g., {‘Friend_1’: ‘Steve’, ‘Friend_2’: ‘Willie’}

Based on a blog post by Chase Seibert

Page 24: Retrieving big data for the non developer

We can add loops and logic too

@outputSchema("status:chararray")

def get_steve(map_dict):

for key, value in map_dict:

if value == 'Steve':

return "I hate that guy"

else:

return value
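
Calling it from Pig works like any other UDF; a sketch assuming get_steve lives in the same sample_udf.py file registered as my_udf:

register 'sample_udf.py' using jython as my_udf;

statuses = FOREACH raw GENERATE id, my_udf.get_steve(friends);
dump statuses;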

Page 25: Retrieving big data for the non developer

Or if you just want the data in Excel

register 'sample_udf.py' using jython as my_udf;

raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);

clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends));

dump clean_table;
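
If you literally want a file you can open in Excel, you could swap the dump for a STORE that writes comma-delimited output (the output path here is illustrative):

STORE clean_table INTO 'peeps_output' USING PigStorage(',');

The part files written under peeps_output are plain comma-separated text and can be opened or imported directly.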

Page 26: Retrieving big data for the non developer

Final Thought

Make Your Big Data Small

● Prototype your Pig scripts on your local file system (see the sketch below)
  o Download some data to your local machine
  o Start your Pig shell from the command line: pig -x local
  o Load - Transform - Dump
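
A sketch of that local loop, using a made-up sample file; start the shell with pig -x local and then:

-- Load a small local sample instead of the full HBase table
sample = LOAD 'peeps_sample.csv' USING PigStorage(',')
    AS (id:chararray, first_name:chararray, last_name:chararray);

-- Transform: the same kind of statement you will later run against HBase
first_names = FOREACH sample GENERATE id, first_name;

-- Dump to check the result
DUMP first_names;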

Page 27: Retrieving big data for the non developer

Notes

Pig Tutorials
● Excellent video on Pig
● Mortar Data introduction to Pig
● Flatten HBase column with Python

Me
● codingcharlatan.com
● @GusCavanaugh