Retrieving Big Data for the Non-Developer
Intended Audience
People who do not write code
But don’t want to wait for IT to bring them data
Disclaimer
You will have to write code. Sorry...
Worth Noting
A common objection, “But I’m not a developer”
Coding does not make you a developer any more than patching some drywall makes you a carpenter
Agenda
● The minimum you need to know about Big Data (Hadoop)
   o Specifically, HBase and Pig
● How you can retrieve data in HBase with Pig
   o How to use Python with Pig to make querying easier
One Big Caveat
● We are not talking about analysis
● Analysis is hard
● Learning code and trying to understand an analytical approach is really hard
● Following a straightforward Pig tutorial is better than a boring lecture
Big Data in One Slide (oh boy)
● Today, Big Data == Hadoop
● Hadoop is both a distributed file system (HDFS) and an approach to messing with data on the file system (MapReduce)
   o HBase is a popular database that sits on top of HDFS
   o Pig is a high-level language that makes messing with data on HDFS or in HBase easier
HBase in one slide
● HBase = Hadoop Database, based on Google’s Bigtable
● Column-oriented database – basically one giant table
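One way to picture that giant table, as a rough mental model and not how HBase actually stores anything: each row key points to a set of family:qualifier columns, and rows do not have to share the same columns. The row keys, names, and the `columns_for` helper below are invented for illustration.

```python
# Conceptual model only: HBase is not a Python dict, and these rows are made up.
peeps = {
    "row1": {"info:first_name": "Steve", "info:last_name": "Buscemi"},
    "row2": {
        "info:first_name": "Willie",
        "info:last_name": "Example",   # hypothetical value
        "info:friends_0": "John",
        "info:friends_1": "Paul",
    },
}


def columns_for(table, row_key):
    """Return the sorted column names present on one row."""
    return sorted(table[row_key])


# row1 and row2 live in the same table but carry different columns.
row1_columns = columns_for(peeps, "row1")
row2_columns = columns_for(peeps, "row2")
```

The point of the sketch is only that the second row has columns the first row lacks, which is what makes the "unknown fields" problem later in the deck possible.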
Pig in one slide
● A data flow language we will use to write queries against HBase
● Pig is not the developer’s solution for retrieving data from HBase, but it works well enough for the BI analyst (and, of course, we aren’t developers)
Pig is easier...Not Easy
● If you have no coding background, Pig will not be easy
● But it’s the best of a bad set of options right now
● Not hating on SQL-on-Hadoop providers, but with SQL you describe the whole result you want in one statement, which quickly gets complicated; with Pig you build it up one step at a time
Here’s our HBase table
Let’s dive in - Load
raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name', '-loadKey true -limit 1')
    AS (id:chararray, first_name:chararray, last_name:chararray);
You have to specify each field and its type in order to load it
Response is as expected
'info:first_name info:last_name', ... AS (first_name:chararray, last_name:chararray);
Will return a first name and last name as separate fields, e.g., “Steve”, “Buscemi”
If you can write a Vlookup()
=VLOOKUP(C34, Z17:AZ56, 17, FALSE)
You can write a load statement in Pig.
Both are equally esoteric.
But what if we don’t know the fields?
● Suppose we have a column family of friends
● Each record will contain zero to many friends, e.g., friend_0: “John”, friend_1: “Paul”
The number of friends is variable
● There could be thousands of friends per row
● And we cannot specify “friend_5” because there is no guarantee that each record has five friends
This is common...
● NoSQL databases are known for flexible schemas and flat table structures
● Unfortunately, the way Pig handles this problem utterly sucks...
Loading unknown friends
raw = LOAD 'hbase://SampleTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
Now we have info:friends_* that is represented as a “map”
A map is just a collection of key-value pairs
● That look like this: friend_1# ‘Steve’, friend_2# ‘Willie’
● They are very similar to Python dictionaries...
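To make the dictionary comparison concrete, here is a small sketch that turns the key#value text shown above into a Python dict. The `parse_pig_map` helper is hypothetical and deliberately naive: it assumes plain string values with no commas, no "#" in the keys, and no nested maps.

```python
def parse_pig_map(text):
    """Turn 'k1#v1, k2#v2' (optionally wrapped in [ ]) into a dict.

    Illustration only: assumes flat string values with no commas inside them.
    """
    body = text.strip().strip("[]")
    result = {}
    for pair in body.split(","):
        if not pair.strip():
            continue  # skip empty pieces, e.g. when the map itself is empty
        key, value = pair.split("#", 1)
        result[key.strip()] = value.strip()
    return result


friends = parse_pig_map("[friend_1#Steve, friend_2#Willie]")
```

Once the data is a dict, everything you know about dictionaries applies: keys on the left of the "#", values on the right.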
Here’s why they suck
● We can’t iterate over them
● In order to access a value, in this case a friend’s name, I have to provide the specific key value, e.g., friend_5, in order to receive the name of the fifth friend
But I thought you said we didn’t know the number of friends?
● You are right – Pig expects us to provide the specific value of something unknown
● If only there were some way to iterate over a collection of key-value pairs…
Enter Python
● Pig may not allow you to iterate over a map, but it does allow you to write User-Defined Functions (UDFs) in Python
● In a Python UDF we can read in a map as a Python dict and return key-value pairs
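In plain Python, outside of Pig, the difference between looking up a key and iterating looks roughly like this. The dict below is sample data standing in for one row's friends map.

```python
# Sample data standing in for one row's friends map.
friends = {"friend_1": "Steve", "friend_2": "Willie"}

# Lookup by key: we must already know which key to ask for.
# This row has only two friends, so friend_5 is simply not there.
fifth = friends.get("friend_5")

# Iteration: no need to know the keys, or how many there are.
names = [name for _, name in sorted(friends.items())]
```

Here `fifth` ends up as `None` and `names` holds every friend on the row, however many there happen to be.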
Python UDF for Pig
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    return map_dict.items()
We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie” and manipulating a Python dict, e.g. {‘Friend_1’: ‘Steve’, ‘Friend_2’: ‘Willie’}
Based on blog post by Chase Seibert
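If you want to try the UDF on your own machine before involving Pig, a sketch like the one below may help. Pig supplies the `outputSchema` decorator when it runs a Jython script; the stand-in defined here is an assumption that only exists so the file runs under ordinary Python, and it should be left out of the script you actually register with Pig.

```python
# Stand-in for the decorator Pig provides to Jython UDFs (assumption: for local
# testing only; do not include it in the script you register with Pig).
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema  # just remember the schema string
        return func
    return wrap


@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    # list() keeps the result a plain list on Python 3 as well as Jython
    return list(map_dict.items())


# Try it with the same sample map used on the slide.
pairs = sorted(bag_of_tuples({"Friend_1": "Steve", "Friend_2": "Willie"}))
```

Each element of `pairs` is a (key, value) tuple, which is the shape the schema string describes.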
We can add loops and logic too
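@outputSchema("status:chararray")
def get_steve(map_dict):
    # .items() yields (key, value) pairs; looping over the dict alone yields only keys
    for key, value in map_dict.items():
        if value == 'Steve':
            return "I hate that guy"
        else:
            # returns on the first pair, so only one friend is ever inspected
            return value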
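A loop that returns inside both branches stops at the first pair it sees. If the goal is to check every friend on the row, one possible variant is sketched below. The function name `has_steve` is hypothetical, the empty-map guard is an assumption about what Pig might pass for rows without friends, and the Pig decorator is left off so the snippet runs as plain Python.

```python
def has_steve(map_dict):
    """Return True if any friend on this row is named 'Steve'."""
    if not map_dict:
        # Assumption: a row with no friends may arrive as None or an empty map.
        return False
    # any() walks every value instead of stopping at the first pair.
    return any(value == "Steve" for value in map_dict.values())


found = has_steve({"friend_0": "John", "friend_1": "Steve"})
missing = has_steve({"friend_0": "John"})
```

To use something like this from Pig, it would need an output schema declared the same way as the other UDFs.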
Or if you just want the data in Excel
REGISTER 'sample_udf.py' USING jython AS my_udf;
raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends));
dump clean_table;
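What FLATTEN does to the bag of tuples can be imitated in plain Python: one output row per friend, with the id repeated on each. The sketch below uses invented sample rows and a hypothetical `flatten` helper, then writes the result as CSV text, which is a format Excel opens. It imitates the shape of the output and is not what Pig runs internally.

```python
import csv
import io

# Invented sample rows: (id, friends map). The second row has no friends.
rows = [
    ("row1", {"friend_0": "John", "friend_1": "Paul"}),
    ("row2", {}),
]


def flatten(rows):
    """Yield one (id, key, value) row per friend, imitating FLATTEN over a bag."""
    for row_id, friends in rows:
        for key, value in sorted(friends.items()):
            yield (row_id, key, value)


flat = list(flatten(rows))

# Write the flattened rows as CSV text (in memory here; a real file works the same way).
buffer = io.StringIO()
csv.writer(buffer).writerows(flat)
csv_lines = buffer.getvalue().splitlines()
```

Note that the row with no friends produces no output rows at all, which matches how Pig treats FLATTEN on an empty bag.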
Final Thought
Make Your Big Data Small
● Prototype your Pig scripts on your local file system
   o Download some data to your local machine
   o Start your Pig shell from the command line: pig -x local
   o Load - Transform - Dump
Notes
Pig Tutorials
● Excellent video on Pig
● Mortar Data introduction to Pig
● Flatten HBase column with Python
Me
● codingcharlatan.com
● @GusCavanaugh