intro to big data - orlando code camp 2014
DESCRIPTION
Very high-level introduction to Big Data technologies, with an emphasis on how folks can get started easily.
TRANSCRIPT
Dipping Your Toes into the Big Data Pool
Orlando CodeCamp 2014
John Ternent
VP Application Development
TravelClick
About Me
20+ years as a consultant, software engineer, architect, and tech executive.
Mostly data-focused, RDBMS, object database, and big data/NoSQL/analytics/data science.
Presently leading development efforts for TravelClick Channel Management team.
Twitter : @jaternent
Poll : Big Data
How many people are comfortable with the definition?
How many people are “doing” Big Data?
Big Data in the Media
The Four V’s of Big Data:
Volume (Scale)
Variety (Forms)
Velocity (Streaming)
Veracity (Uncertainty)
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
A New Definition
Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.
“It depends on how capital your B and D are in Big Data…”
What is Big Data to you?
The Big Data Ecosystem
Data Sources : • Sqoop • Flume
Data Storage : • HDFS • HBase
Data Manipulation : • Pig • MapReduce
Data Management : • Zookeeper • Avro • Oozie
Data Analysis : • Hive • Mahout • Impala
The Full Hadoop Ecosystem?
Great, but What IS Hadoop?
Implementation of Google MapReduce framework
Distributed processing on commodity hardware
Distributed file system with high failure tolerance
Can support activity directly on top of the distributed file system (MapReduce jobs, Hive queries, Impala, etc.)
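Hadoop’s own API is Java, but the map/shuffle/reduce flow described above can be sketched in a few lines of Python. This is a toy illustration of the three phases, not the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key, as Hadoop does between the phases.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        # Reduce: sum the counts for each word.
        yield (word, sum(count for _, count in group))

lines = ["big data big tools", "big data pool"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'pool': 1, 'tools': 1}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle happens over the network; the point is that neither function needs to see the whole data set at once.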
Candidate Architecture
Data Sources : • Log files • SQL DBs • Text feeds • Search • Structured • Unstructured • Semi-structured
HDFS (replicated across nodes)
Data Manipulation : • MapReduce • Pig • Hive • Impala
Analytic Products : • Search • R/SAS • Mahout • SQL Server • DW/DMart
Example : Log File Processing
xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916
Example : Log File Processing

A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader as (line:chararray);
B = FOREACH A GENERATE FLATTEN((tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int))REGEX_EXTRACT_ALL(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)')) as (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray, req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray, svc_time:int, rec_bytes:int, resp_bytes:int);
B1 = FILTER B BY ts IS NOT NULL;
B2 = FILTER B BY req_url MATCHES '.*[fetch|update].*';
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ \\/(\\S+)[\\?]* \\S+', 1) as req;
C = FOREACH B3 GENERATE forwarded_ip, GetMonth(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as month, GetDay(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as day, GetHour(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as hour, req, result, svc_time;
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) as max, MIN(C.svc_time) as min, COUNT(C) as count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage;
Another Real-World Example
2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}
2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}
Roughly 100 million of these per week; 25MB zipped per server per day (15 servers right now), 750MB uncompressed.
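The event lines above are a syslog-style prefix followed by a JSON payload, so a first parsing pass is straightforward. A minimal sketch in Python (the sample line is abbreviated from the one on the slide; the field names come from the log itself):

```python
import json

# One event line, abbreviated from the slide: timestamp prefix + JSON body.
line = ('2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,'
        '"hotelId":8186,"channelId":9173,"submissionId":1376121011,'
        '"queueName":"expedia-dx","submissionStatus":2}')

def parse_event(raw):
    # Split the syslog-style prefix from the JSON body at the first '{'.
    prefix, _, body = raw.partition('{')
    timestamp = prefix.split()[0]      # the leading ISO-8601 timestamp
    event = json.loads('{' + body)     # the rest is plain JSON
    return timestamp, event

ts, event = parse_event(line)
print(ts, event["hotelId"], event["queueName"])
```

A mapper doing exactly this, emitting (hotelId, 1) or (queueName, submissionStatus) pairs, is all the "schema" these files ever need.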
Pig Example - Pros and Cons
Pros:
Don’t need to ETL into a database; runs straight off the file system
Same development for one file as for 10,000 files
Horizontally scalable
UDFs allow fine-grained control
Flexible
Cons:
Language can be difficult to work with
MapReduce touches ALL the things to get the answer (compare to indexed search)
Unstructured and Semi-Structured Data
Big Data tools can help with the analysis of data that would be more challenging in a relational database:
Twitter feeds (Natural Language Processing)
Social network analysis
Big Data approaches to search are making search tools more accessible and useful than ever:
ElasticSearch
ElasticSearch/Kibana
[Diagram: logs flow through logstash into ElasticSearch as JSON documents over REST, alongside the Hadoop FileSystem, with Kibana for visualization on top]
Analytics with Big Data
Apache Mahout : Machine learning on Hadoop (recommendation, classification, clustering)
RHadoop : R MapReduce implementation on HDFS
Tableau : Visualization on HDFS/Hive
Main point : You don’t have to roll your own for everything; many tools now use HDFS natively.
Return to SQL
Many SQL dialects are being/have been ported to Hadoop
Hive : Create DDL tables on top of HDFS structures

CREATE TABLE apachelog (
  host STRING, identity STRING, user STRING, time STRING,
  request STRING, status STRING, size STRING,
  referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?")
STORED AS TEXTFILE;
SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
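For a sense of what that query computes, here is the same per-host count done in plain Python over a few combined-format lines (an illustration only; Hive’s value is running this across an entire HDFS cluster without loading the data anywhere first):

```python
from collections import Counter

# A few combined-format log lines, same shape as the apachelog table.
log_lines = [
    'xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378',
    'xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718',
    'xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598',
]

# Equivalent of: SELECT host, COUNT(*) FROM apachelog GROUP BY host;
hosts = Counter(line.split()[0] for line in log_lines)
print(hosts)  # Counter({'xxx.16.23.70': 2, 'xxx.16.23.133': 1})
```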
Cloudera Impala : Moves SQL processing onto each distributed node
Written for performance
Distribution and reduction of the query handled by the Impala engine
Big Data Tradeoffs
Time tradeoff – loading/building/indexing vs. runtime
ACID properties – different distribution models may compromise one or more of these properties
Be aware of what tradeoffs you’re making
TANSTAAFL – massive scalability, commodity hardware, but at what price?
Tool sophistication
NoSQL – “Not Only SQL”
Sacrificing ACID properties for different scalability benefits.
Key/Value Store : SimpleDB, Riak, Redis
Column Family Store : Cassandra, HBase
Document Database : CouchDB, MongoDB
Graph Database : Neo4J
General properties:
High horizontal scalability
Fast access
Simple data structures
Caching
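To make the key/value idea concrete, here is a toy sketch of the hash-partitioning that gives these stores their horizontal scalability. The node names and key format are made up for illustration; real stores such as Riak and Cassandra use consistent hashing so nodes can join without reshuffling most of the data:

```python
import hashlib

# Hypothetical cluster: each key hashes to exactly one node,
# so adding nodes spreads both data and load horizontally.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key):
    # Stable hash of the key, mapped onto the node list.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

store = {name: {} for name in NODES}  # one dict standing in for each node

def put(key, value):
    store[node_for(key)][key] = value

def get(key):
    return store[node_for(key)].get(key)

put("hotel:8186", {"channelId": 9173})
print(get("hotel:8186"))  # {'channelId': 9173}
```

Every get/put touches a single node with no cross-node coordination, which is exactly the property that rules out multi-key ACID transactions.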
Getting Started
Play in the sandbox : Hadoop/Hive/Pig local mode or AWS
Randy Zwitch has a great tutorial on this:
http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/
Using Airline data : http://stat-computing.org/dataexpo/2009/the-data.html
Kaggle competitions (data science)
Lots of big data sets available, look for machine learning repositories
Getting Started
Books for Developers
Books for Managers
MOOCs
Unprecedented access to very high-quality online courses, including
Udacity : Data Science Track
Intro to Data Science
Data Wrangling with MongoDB
Intro to Hadoop and MapReduce
Coursera :
Machine Learning course
Data Science Certificate Track (R, Python)
Waikato University : Weka
Bonus Round : Data Science
Outro
We live in exciting times!
Confluence of data, processing power, and algorithmic sophistication.
More data is available to make better decisions more easily than any other time in human history.