retrieving big data for the non developer

Retrieving Big Data for the Non-Developer

Intended Audience

People who do not write code

But don’t want to wait for IT to bring them data

Disclaimer

You will have to write code. Sorry...

Worth Noting

A common objection, “But I’m not a developer”

Coding does not make you a developer any more than patching some drywall makes you a carpenter

Agenda

● The minimum you need to know about Big Data (Hadoop)
  o Specifically, HBase and Pig

● How you can retrieve data in HBase with Pig
  o How to use Python with Pig to make querying easier

One Big Caveat

● We are not talking about analysis

● Analysis is hard

● Learning code and trying to understand an analytical approach is really hard

● Following a straightforward Pig tutorial is better than a boring lecture

Big Data in One Slide (oh boy)

● Today, Big Data == Hadoop

● Hadoop is both a distributed file system (HDFS) and an approach to messing with data on the file system (MapReduce)
  o HBase is a popular database that sits on top of HDFS
  o Pig is a high-level language that makes messing with data on HDFS or in HBase easier

HBase in one slide

● HBase = Hadoop Database, based on Google’s Bigtable

● Column-oriented database – basically one giant table

Pig in one slide

● A data flow language we will use to write queries against HBase

● Pig is not the developer’s solution for retrieving data from HBase, but it works well enough for the BI analyst (and, of course, we aren’t developers)

Pig is easier...Not Easy

● If you have no coding background, Pig will not be easy

● But it’s the best of a bad set of options right now

● Not hating on SQL-on-Hadoop providers, but with SQL you tell the computer what you want in one statement, which quickly gets complicated; Pig lets you spell out the steps one at a time

Here’s our HBase table

Let’s dive in - Load

raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name', '-loadKey true -limit 1')
    AS (id:chararray, first_name:chararray, last_name:chararray);

You have to specify each field and its type in order to load it

Response is as expected

'info:first_name info:last_name' ... AS (first_name:chararray, last_name:chararray);

Will return a first name and last name as separate fields, e.g., “Steve”, “Buscemi”

If you can write a Vlookup()

=VLOOKUP(C34, Z17:AZ56, 17, FALSE)

You can write a load statement in Pig.

Both are equally esoteric.

But what if we don’t know the fields?

● Suppose we have a column family of friends

● Each record will contain zero to many friends, e.g., friend_0: “John”, friend_1: “Paul”

The number of friends is variable

● There could be thousands of friends per row

● And we cannot specify “friend_5” because there is no guarantee that each record has five friends

This is common...

● NoSQL databases are known for flexible schemas and flat table structures

● Unfortunately, the way Pig handles this problem utterly sucks...

Loading unknown friends

raw = LOAD 'hbase://SampleTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);

Now we have info:friends_* that is represented as a “map”

A map is just a collection of key-value pairs

● That look like this: friend_1# ‘Steve’, friend_2# ‘Willie’

● They are very similar to Python dictionaries...
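To make the comparison concrete, here is a minimal Python sketch of the friends map written as a dict. The friend names come from the slide above; the variable names are illustrative assumptions, not part of any Pig API.

```python
# A Pig map such as [friend_1#Steve, friend_2#Willie] viewed as a Python dict.
friends = {"friend_1": "Steve", "friend_2": "Willie"}

# Looking up a value requires knowing its key ahead of time.
first_friend = friends["friend_1"]

# This record has no friend_5; .get() returns None instead of raising an error.
fifth_friend = friends.get("friend_5")

print(first_friend)
print(fifth_friend)
```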

Here’s why they suck

● We can’t iterate over them

● To access a value, in this case a friend’s name, I have to provide the specific key, e.g., friend_5, to get the name of the fifth friend

But I thought you said we didn’t know the number of friends?

● You are right – Pig expects us to provide the specific value of something unknown

● If only there were some way to iterate over a collection of key-value pairs…

Enter Python

● Pig may not allow you to iterate over a map, but it does allow you to write User-Defined Functions (UDFs) in Python

● In a Python UDF we can read in a map as a Python dict and return key-value pairs

Python UDF for Pig

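# The schema tells Pig the UDF returns a bag of (key, value) tuples
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    # Pig hands the map to the UDF as a dict; items() gives its (key, value) pairs
    return map_dict.items()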

We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie”, and manipulating it as a Python dict, e.g., {'Friend_1': 'Steve', 'Friend_2': 'Willie'}

Based on blog post by Chase Seibert

http://chase-seibert.github.io/blog/2013/02/10/pig-hbase-flatten-column-family.html
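One way to try the UDF before involving Pig at all is to run it in plain Python. The sketch below assumes a stand-in outputSchema decorator, because the real one is supplied by Pig only when the script runs through Jython; the stand-in and the sample dict are assumptions for local testing and should not go into the file you register with Pig.

```python
# Stand-in for the decorator Pig supplies under Jython (assumption for local runs only).
# It records the schema string on the function and returns the function unchanged.
def outputSchema(schema):
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap

@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    # list() keeps the result a plain list on both Python 2 and Python 3
    return list(map_dict.items())

# Sample map taken from the slide; sorted() only makes the printed order predictable.
sample = {"Friend_1": "Steve", "Friend_2": "Willie"}
pairs = sorted(bag_of_tuples(sample))
print(pairs)
```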

We can add loops and logic too

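@outputSchema("status:chararray")
def get_steve(map_dict):
    # items() yields (key, value) pairs; looping over the dict alone yields only keys
    for key, value in map_dict.items():
        if value == 'Steve':
            return "I hate that guy"
        else:
            # Both branches return, so only the first pair is ever examined
            return value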
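A return inside the loop ends the function on the first pair it sees. If the goal is to look at every friend, the fallback return has to move after the loop. A minimal sketch in plain Python, where the function name has_steve and the fallback string are hypothetical and the decorator is left out so it runs outside Pig:

```python
def has_steve(map_dict):
    # Check every value in the map before deciding what to return.
    for key, value in map_dict.items():
        if value == "Steve":
            return "I hate that guy"
    # Reached only when no friend is named Steve, or the map is empty.
    return "no Steve"

print(has_steve({"friend_1": "Willie", "friend_2": "Steve"}))
print(has_steve({"friend_1": "Willie"}))
```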

Or if you just want the data in Excel

register 'sample_udf.py' using jython as my_udf;

raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);

clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends));

dump clean_table;
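To see what the FOREACH ... FLATTEN step produces, it can help to mimic it on a few made-up rows. The sketch below is plain Python 3, not Pig or Jython; the row ids, friend names, and the flatten_rows helper are all assumptions for illustration. It emits one output row per friend and writes them as comma-separated text that a spreadsheet can open.

```python
import csv
import io

# Rows as they might look after LOAD: (id, friends map). All values are made up.
raw = [
    ("row1", {"friends_0": "John", "friends_1": "Paul"}),
    ("row2", {"friends_0": "Steve"}),
    ("row3", {}),
]

def flatten_rows(rows):
    # Mimics GENERATE id, FLATTEN(bag): one output row per (key, value) pair.
    # A row whose bag is empty produces no output, as FLATTEN does in Pig.
    for row_id, friends in rows:
        for key, value in sorted(friends.items()):
            yield (row_id, key, value)

clean_table = list(flatten_rows(raw))

# Write comma-separated text to an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(clean_table)
print(buffer.getvalue())
```

Note that row3 disappears from the output because it has no friends to flatten.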

Final Thought

Make Your Big Data Small

● Prototype your Pig scripts on your local file system
  o Download some data to your local machine
  o Start your Pig shell from the command line: pig -x local
  o Load - Transform - Dump

Notes

Pig Tutorials

● Excellent video on Pig: https://www.youtube.com/watch?v=a3_Xcmq4z5o

● Mortar Data introduction to Pig: https://help.mortardata.com/technologies/pig/learn_pig

● Flatten HBase column with Python: http://chase-seibert.github.io/blog/2013/02/10/pig-hbase-flatten-column-family.html

Me

● codingcharlatan.com: http://www.codingcharlatan.com/

● @GusCavanaugh
