
Page 1: Cloudera Data Science Challenge

Cloudera Data Science Challenge

Doug Needham Mark Nichols, P.E.

Page 3: Cloudera Data Science Challenge

Data Science: Why does it matter?

What is the only skill that matters for a data scientist?

“the ability to use the resources available to them to solve a challenge.”

Solving problems, the only skill you need to know

The skill of solving problems.

We both accomplished a lot in tackling this challenge. For some of the problems we did well, for some we could improve.

This challenge shows the ability to solve problems over and above the actual “answers” we sought.

I think too often we seek out people who have one particular skill or another, rather than general problem-solving ability.

Certainly there is a time and a place for expertise with a particular set of skills. But the skill of adaptability is often overlooked.

Think on this the next time you are considering whom you need to help you solve a problem.

Page 4: Cloudera Data Science Challenge

Cloudera Certified Professional: Data Scientist

Intent of CCP:DS
• Demonstrate knowledge in a variety of data science topics
• Demonstrate that knowledge at scale

Requirements
• Pass Cloudera’s Data Science Essentials Exam (DS-200)
• Pass Cloudera’s Data Science Challenge (semi-annual; uses simulated data to solve real problems)
• Change coming in Q2 2015

[Venn diagram: Data Science sits at the intersection of SME Expertise, Math & Statistics Knowledge, and Computer Science Skills, with the pairwise overlaps labeled Traditional Analytics, Machine Learning, and Danger Zone]

Reuters article on the CCP:DS certification

Topics
• Data Acquisition
• Data Evaluation
• Data Transformation
• Machine Learning
• Clustering
• Classification
• Model Selection
• Feature Selection
• Probability
• Visualization
• Optimization
• Collaborative Filtering

Page 5: Cloudera Data Science Challenge

Fall 2014 Data Science Challenge

Timeline: October 21, 2014 to January 21, 2015

Each person sitting for the challenge has to submit individual solutions for each problem.

Problems:

Problem 1: Smartfly – Predict the probability of a flight being delayed.

Problem 2: Almost Famous – Statistical analysis of web log data.

Problem 3: Winklr – Who should follow whom.

Page 6: Cloudera Data Science Challenge

Multiple Ways to Solve a Problem

100,000 ft overview of each solution and the tools used, by problem (Mark vs. Doug):

Smartfly (ML – binary classification)
Mark: Hive to explore the data; Python & MapReduce to format and clean the input; Spark MLlib for the model.
Doug: Data Science at the Command Line (scripts, counts, summaries, “pseudo MapReduce”); R for plotting; Spark MLlib for predictions.

Almost Famous (spam filter & statistical analysis)
Mark: Python to explore the data; Python to filter and answer the questions.
Doug: Data Science at the Command Line (scripts, counts, summaries, “pseudo MapReduce”); SciPy for particular functions.

Winklr (social network analysis)
Mark: Hive and the command line to explore the data; Mahout, Spark, the command line, and Python to develop a hybrid recommender.
Doug: Gephi for analysis of subgraphs; Python to format the data; Spark GraphX for the solution; shell scripts to get the data into the required format.

Page 7: Cloudera Data Science Challenge

Smartfly – Problem Summary

Motivation
The client is an online travel service that provides timely travel information to its customers.
Their product team has come up with the idea of using flight data to predict whether a flight will be delayed and using that information to respond proactively.

Given
7,374,365 records of historic flight data covering 279 airports and 17 airlines
566,376 records of scheduled flight data

Requirement
Rank all scheduled flights in descending order of probability of delay

Page 8: Cloudera Data Science Challenge

Smartfly – Raw Data (Starting Point)

Historic and scheduled data was provided in CSV format with the following fields in each row:

1 – Unique Flight ID (int)
2 – Year (int)
3 – Month (int)
4 – Day of Month (int)
5 – Day of Week (int)
6 – Scheduled Departure (HHMM)
7 – Scheduled Arrival (HHMM)
8 – Airline (string)
9 – Flight Number (int)
10 – Tail Number (string)
11 – Plane Model (string)
12 – Seat Configuration (string)
13 – Departure Delay in Minutes (int)
14 – Origin Airport (string)
15 – Destination Airport (string)
16 – Distance Travelled in Miles (int)
17 – Taxi In Time in Minutes (int)
18 – Taxi Out Time in Minutes (int)
19 – Cancelled (Boolean)
20 – Cancellation Code (string)

Page 9: Cloudera Data Science Challenge

Machine Learning Algorithms for Binary Predictions (Potential Paths)

http://spark.apache.org/docs/1.2.0/mllib-guide.html

Page 10: Cloudera Data Science Challenge

Model Evaluation Criteria

Set the evaluation criteria prior to running any models, similar to setting the null and alternative hypotheses prior to conducting an experiment.

Selected criterion: Area Under the Receiver Operating Characteristic Curve (auROC)
• Compares different models
• Independent of cutoff
• No cutoff assumptions required

Page 11: Cloudera Data Science Challenge

Model Evaluation Criteria

Area Under Receiver Operating Characteristic Curve (auROC)

Weighted Confusion Matrix
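The weighted confusion matrix mentioned above simply applies different costs to the four outcome counts of a binary classifier. A minimal Python illustration follows; the weights are made-up assumptions, since the slides do not give the actual values.

```python
# Illustrative only: score a binary classifier with a weighted confusion matrix.
# The weights below are assumptions; the presentation does not specify real ones.
def weighted_confusion_score(tp, fp, fn, tn,
                             w_tp=1.0, w_fp=-0.5, w_fn=-1.0, w_tn=0.25):
    """Return a single weighted score from the four confusion-matrix counts."""
    return w_tp * tp + w_fp * fp + w_fn * fn + w_tn * tn

# Example: 80 true positives, 20 false positives, 10 false negatives, 890 true negatives
print(weighted_confusion_score(tp=80, fp=20, fn=10, tn=890))
```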

Page 12: Cloudera Data Science Challenge

Data Exploration

Used Hive primarily

SELECT MAX(distance), MIN(distance)
FROM sfhist

Determined range of values for each field

Looked at delays by airline, airport, plane model…

Are there mismatches in the data (e.g., Cancelled = 0 but a valid cancellation code is present)?

Page 13: Cloudera Data Science Challenge

Input Data Manipulation

Format to input for ML algorithm (LIBSVM format) using Python and Map Reduce Created dictionaries of airports, airlines, plane models, seat

configurations & holidays

LIBSVM – efficient sparse matrix

0 10:1 13:1 46:1 51:1 52:1 67:1 77:1 82:1 106:1 674:1 804:1 3225:1

1 9:1 42:1 45:1 54:1 54:1 75:1 77:1 84:1 291:1 458:1 801:1 3891:1

Deal with errors & omissions in data

Validate Manual calculation at the head/tail/changes

Verify the correct number of records

Response0 = no delay1 = delayed

Features1-12: Month13-43: Day …1K-7K: Tailnumber7001+: Holidays
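To make the formatting step concrete, here is a minimal Python sketch of turning one historic-flight CSV row into a LIBSVM line. The dictionary arguments, the feature offsets, and the “delay > 0 means delayed” labeling rule are illustrative assumptions; the actual mapper used for the challenge differed in its details.

```python
# Minimal sketch: convert one historic-flight CSV row (fields as listed on the
# raw-data slide) into a LIBSVM line. Offsets and dictionaries are illustrative.

def row_to_libsvm(row, airline_idx, airport_idx, holidays):
    fields = row.rstrip("\n").split(",")
    month = int(fields[2])           # field 3: Month
    day_of_month = int(fields[3])    # field 4: Day of Month
    delay = int(fields[12])          # field 13: Departure Delay in Minutes

    label = 1 if delay > 0 else 0    # assumed rule: any positive delay counts as delayed

    features = {}
    features[month] = 1                              # positions 1-12: month
    features[12 + day_of_month] = 1                  # positions 13-43: day of month
    features[44 + airline_idx[fields[7]]] = 1        # one-hot airline (assumed offset)
    features[100 + airport_idx[fields[13]]] = 1      # one-hot origin airport (assumed offset)
    if (month, day_of_month) in holidays:
        features[7001] = 1                           # 7001+: holiday indicator

    # LIBSVM line: "<label> <index>:<value> ..." with indices in ascending order
    return "%d %s" % (label, " ".join("%d:1" % k for k in sorted(features)))
```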

Page 14: Cloudera Data Science Challenge

Train the Model

Split the historic data into training and testing subsets Split randomly

Split based on time

Run the model in Spark Load the formatted input

Set model parameters

Run the model (train the SVM or Logistic Regression Model)
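A minimal PySpark (MLlib) sketch of this step, assuming an HDFS path of my own invention and the parameter values from the best run on the next slide; it is not the exact code that was submitted.

```python
# Sketch: load LIBSVM data, split it, and train binary classifiers in Spark MLlib.
# The file path is an assumption; parameters mirror the best run on the results slide.
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithSGD, SVMWithSGD

sc = SparkContext(appName="smartfly-train")

data = MLUtils.loadLibSVMFile(sc, "hdfs:///smartfly/historic.libsvm")  # assumed path
train, test = data.randomSplit([0.8, 0.2], seed=42)                    # random split
train.cache()

# Logistic regression with L2 regularization (2000 iterations, step = 0.0001)
lr_model = LogisticRegressionWithSGD.train(train, iterations=2000,
                                           step=0.0001, regType="l2")

# Alternative: a linear SVM with default settings
svm_model = SVMWithSGD.train(train, iterations=100)
```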

Page 15: Cloudera Data Science Challenge

Test the Model

Use the model to predict delays in the test data and compare against the actual outcomes to determine the auROC (Spark).

Repeat using a range of iteration counts, model types (SVM / logistic regression), step sizes, and regularization techniques (L1/L2).

Results
Worst: auROC = 0.51 – an SVM using only flight times and default optimization settings
Best: auROC = 0.68 – logistic regression with L2 regularization (2000 iterations, step = 0.0001) and categorical inputs for month, day, weekday, time of day (6-hour blocks), departing airport, arrival airport, airline, seat configuration, flight number (type of flight), and holidays

This represents a 36% improvement over random selection.

Predict delays for the scheduled flights for submission.
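The auROC computation itself is available in Spark MLlib; a short sketch follows, reusing the hypothetical lr_model and test RDD from the training sketch above.

```python
# Sketch: score the held-out test set and compute the auROC in Spark MLlib.
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# clearThreshold() makes predict() return raw scores rather than 0/1 labels,
# which is what a cutoff-independent metric like auROC needs.
lr_model.clearThreshold()
score_and_labels = test.map(lambda p: (float(lr_model.predict(p.features)), p.label))

metrics = BinaryClassificationMetrics(score_and_labels)
print("auROC = %.3f" % metrics.areaUnderROC)
```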

Page 16: Cloudera Data Science Challenge

Smartfly Review

Ability to use all of the data
• Unable to run an SVM / logistic regression model in R with ~6 million rows
• Spark completed the final model in ~10 minutes

Can be used for any binary decision process
• Issue a loan or not
• Purchase a stock or not

For other ML algorithms, the basic process remains the same
• Linear regression – predict a value
• Clustering – segment your data for reporting
• Collaborative filtering – recommend products to customers…

Page 17: Cloudera Data Science Challenge

Winklr – Problem Summary

Who should follow whom?

Winklr is a curiously popular social network for fans of the sitcom Happy Days. Users can post photos, write messages, and most importantly, follow each other’s posts and content. This helps users keep up with new content from their favorite users on the site.

Basically, Winklr is a site set up similarly to Twitter. We want to provide recommendations on whom to follow. We know that some people have “clicked” on another user (I interpret this as a “favorite” or a “retweet”).

Page 18: Cloudera Data Science Challenge

Make sense of this:

Page 19: Cloudera Data Science Challenge

My solution

Type of problem: Graph Analysis

Create a master graph.

Run PageRank to identify centrality.

Create many small graphs for individual users.

Mask the master graph and the PageRank graph.

Multiply together the centrality, the number of in-degrees of each candidate vertex to be followed, and the inverse of the path length from this particular user to that candidate vertex.

This code runs in about 60 hours using Spark GraphX.

Code: Problem3.sh, and AnalyzeGraph.scala
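The scoring idea can be illustrated with a small, single-machine networkx sketch. This is not the Spark GraphX code referenced above (Problem3.sh and AnalyzeGraph.scala); the function name and the exact combination of terms are a simplified reading of the description, under assumed toy data.

```python
# Illustrative only: recommend users to follow by combining PageRank centrality,
# in-degree, and the inverse of the path length from the user to each candidate.
import networkx as nx

def recommend(graph, user, top_n=5):
    pagerank = nx.pagerank(graph)                                   # centrality of every vertex
    distances = nx.single_source_shortest_path_length(graph, user)  # reachable candidates
    already_followed = set(graph.successors(user))

    scores = {}
    for candidate, dist in distances.items():
        if candidate == user or candidate in already_followed:
            continue
        # centrality x in-degree x inverse path length, per the description above
        scores[candidate] = pagerank[candidate] * graph.in_degree(candidate) * (1.0 / dist)

    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Tiny usage example on a toy follow graph
G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("e", "c")])
print(recommend(G, "a"))
```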

Page 20: Cloudera Data Science Challenge

Doug’s Problem Solving approach

This is the approach I took, and may or may not be useful for others to apply.

Analysis. I started with some basic numbers and just browsed through the data with the “Data Science at the Command Line” toolkit. This is very handy for getting a feel for things.

Based on some general understanding this analysis provided, create a “pipeline”

Generally the data has to be transformed to a usable structure for the particular method of solving the problem.

Do some basics with the problem solving method, Stats, ML, Graph, etc…

Get some data back out of that tool, then format output to specification.

Iterate.

I did this for problem 1, moved to problem 2, then finally problem 3. Then went back to 1, back to 2, back to 3.

This method allowed me to give myself some “space” and actually look at each problem with fresh eyes on more than one occasion.

Breaking each problem down into the basics of input, process, and output gave me “working” code for each problem very quickly; then, through tuning, analysis, research, and some time to think about the problem, I was able to arrive at each unique solution.

It also allowed me to refactor the code, having given each problem time to “rest”.

Very much like a painting, broad strokes first, details emerge as the painting progresses.

Another benefit is that once I get the data all the way through the pipeline, it becomes obvious where the performance bottlenecks are.

This method does take a bit of time.

Page 21: Cloudera Data Science Challenge

Graph Analysis

As Graphs get really large it becomes difficult to visualize them.

However, I was able to “subset” the master graph based on the recommendation output of my process.

I was expecting to see one big clump of nodes tightly connected. This would be the “Target” to follow.

I was also expecting to see two smaller clumps of nodes, loosely connected to the larger clump. These are the “followers”; as we recommend that they follow the more popular node, they become more closely connected to that user.

Here is the output from Gephi that shows whether the code worked or not.

Page 22: Cloudera Data Science Challenge

This is what I expected to see

Page 23: Cloudera Data Science Challenge

Looks good, except I was wrong.

The challenge is looking for those “Likely” to follow someone.

So this part called for something a little different than what I coded.

It appears they were looking for the neighbors of the people that were already being followed.

This is a much less complicated problem than I actually solved.

I look forward to seeing what Data Science Challenge 4 will look like.

Page 24: Cloudera Data Science Challenge

Where to go from here?

Spark.

Scala.

Learn these topics.

Teach these topics.

Especially for folks planning on sitting for Data Science challenge 4: Learn Scala. Learn Spark.

Oh, and keep studying about Graphs…

For an example of what not to do: Doug's github link

Recent change – this is apparently the final Data Science Challenge. Future CCP:DS certifications will be based on a testing format.