A Primer for Your Next Data Science Proof of Concept on the Cloud

Upload: alton-alexander

Post on 22-Jan-2018


Category: Data & Analytics



TRANSCRIPT

Page 1: A Primer for Your Next Data Science Proof of Concept on the Cloud

A Primer for Your Next Data Science Proof of Concept on the Cloud

Page 2: A Primer for Your Next Data Science Proof of Concept on the Cloud

Your Presenters

Alton Alexander

About Me : Data scientist. PhD dropout with a love for solving real-world business problems. Experience delivering solutions with big data, machine learning, and statistics for the marketing, manufacturing, and finance industries.

Affiliations : Front Analytics Consulting

Connect : altonalexander

[email protected]

Matt Davies

About Me : “Big Data” architect / engineer with clients in retail, healthcare, e-commerce, insurance, and government. Primarily focused on operationalizing complex data-driven solutions.

Affiliations : Xpert Data Solutions

Connect : 4mattdavies

[email protected]

Site : http://xds.io

Page 3: A Primer for Your Next Data Science Proof of Concept on the Cloud

Agenda

● Scoping the problem and solution
● Discuss pros / cons of starting with a cloud solution
● Establishing realistic expectations, budgets, constraints, etc.
● Hands-on demo
● Q&A

Page 4: A Primer for Your Next Data Science Proof of Concept on the Cloud
Page 5: A Primer for Your Next Data Science Proof of Concept on the Cloud

Scoping the Problem and Solution

● What are you trying to solve?
● What data might be helpful in answering the question(s)?
● Are there specific techniques which are known to work well?
● Do you need to use BI tools and/or export data?
● What are your timelines?
● What are your resources?

Page 6: A Primer for Your Next Data Science Proof of Concept on the Cloud

Competing on Analytics

Product
● Questions: Who are your potential customers? What do they want? Brand loyalty? What’s next?
● e.g. Customer segmentation / profiles, market analysis, channel attribution, keywords

Customer
● Questions: What motivates customers? Which channels work best? What else do they need? What is a “customer”?
● e.g. Engagement, enrichment, churn, conversion, offers, sentiment, profiling

Operations
● Questions: How long will X function? How much product waste? Will Y be cheaper?
● e.g. Yield optimization, failure rates, futures

Page 7: A Primer for Your Next Data Science Proof of Concept on the Cloud

[Architecture diagram: an “Analytic” pipeline running in a datacenter and/or on AWS, with stages data collectors → bulk store → batch process → live store → api service → ui. Items marked * are “stupid-simple-n-scale”. Groupings below follow the slide's reading order.]

data collectors
● queues & oltp: *SQS, redis or couch, mongodb, rdbms
● olap: mongo, Hbase, Thrift/Protobuf/AVRO
● sockets style: *messagepack based, netty, kafka, *kinesis
● apps: *elastic beanstalk, ec2/vm + load balance
● emitters: *messagepack

bulk store
● *s3, ebs, *HDFS, cassandra, columnar

batch process
● on file system: M/R based (pig, hive), graph based
● off file system: anything language

live store
● diy json: mongodb, *BaaSes, postgres
● column: Hbase/Impala, *cassandra
● graph: cypher, gremlin
● search: elasticsearch

api service
● cloud/dc apps: ec2/vm + load balance, *elastic beanstalk
● sql-ish: Phoenix, Cassandra
● graph: Gremlin, Cypher
● search: elasticsearch

ui
● d3: nvd3.js (simple), d3.js (complex), dc.js (dimensions)

putting the long-A in OLAP

Page 8: A Primer for Your Next Data Science Proof of Concept on the Cloud

Pros / Cons of Starting With a Cloud Solution

Cons:

● Data sensitivity
● Less control
● Unfamiliarity with terminology and/or design
● On-prem world very different than cloud: terms, risk factors, skillsets
● Data movement can be difficult
● Cloud “tax”

Pros:

● Elasticity
● Scalability
● Speed of implementation
● Focus on business problem
● Can easily create multiple instances for tests
● Less management
● Strong security
● No network or datacenter barriers
● Strong industry adoption

Page 9: A Primer for Your Next Data Science Proof of Concept on the Cloud

Pros / Cons of Starting With a Cloud Solution

General Challenges

● POC -> Production can be difficult
● Security is widely misunderstood
● Skillsets: When to hire, develop, consult, outsource

Page 10: A Primer for Your Next Data Science Proof of Concept on the Cloud

Establishing Realistic Expectations, Budgets, Constraints, Etc...

Use Case : Clear objective with identified stakeholders.

Sufficient Time : Discovery is such a large part of these projects that estimates of the form “this legacy project took X hrs, so it will take Y in big data” are not reliable.

Iterate : As with all software projects, it is usually better to iterate than to have large waterfall deployments.

Review : What went well, what failed, and where is our technical debt?

Page 11: A Primer for Your Next Data Science Proof of Concept on the Cloud

Establishing Realistic Expectations, Budgets, Constraints, Etc...

Budgets

● Know how the cloud provider makes money
● Start lean
● “Leave No Trace”

Constraints

● Time
● Complexity
● Resources
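The budget points above ("Start lean", "Leave No Trace") come down to simple arithmetic: compute cost is roughly nodes × hours × hourly rate, so a cluster that is never torn down keeps billing. The following is a minimal Python sketch under that assumption. The function name, the node counts, and the rate are hypothetical placeholders, not actual AWS prices, and storage and data transfer charges are ignored.

```python
def cluster_cost(nodes: int, hours: float, hourly_rate_per_node: float) -> float:
    """Estimate compute cost as nodes x hours x rate (ignores storage and transfer)."""
    if nodes < 0 or hours < 0 or hourly_rate_per_node < 0:
        raise ValueError("inputs must be non-negative")
    return nodes * hours * hourly_rate_per_node


# Hypothetical rate for illustration only; look up your provider's real pricing.
HYPOTHETICAL_RATE = 0.25  # dollars per node-hour

# A short proof-of-concept run versus the same cluster forgotten for a week.
demo_run = cluster_cost(nodes=4, hours=3, hourly_rate_per_node=HYPOTHETICAL_RATE)
left_running = cluster_cost(nodes=4, hours=24 * 7, hourly_rate_per_node=HYPOTHETICAL_RATE)

print(f"3-hour demo: ${demo_run:.2f}")
print(f"forgotten for a week: ${left_running:.2f}")
```

Running the same numbers with real pricing before launching anything gives a rough ceiling for the proof of concept and makes the cost of skipping cleanup concrete.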

Page 12: A Primer for Your Next Data Science Proof of Concept on the Cloud

Establishing Realistic Expectations, Budgets, Constraints, Etc...

● Sleep on it
● Are you a solution in search of a need?
● Use the scientific method
● Involve yourself in the community
● Hire a consultant

Page 13: A Primer for Your Next Data Science Proof of Concept on the Cloud

A Quick Walkthrough on AWS

Page 15: A Primer for Your Next Data Science Proof of Concept on the Cloud

EMR Overview

Page 16: A Primer for Your Next Data Science Proof of Concept on the Cloud

Create your bucket

Page 17: A Primer for Your Next Data Science Proof of Concept on the Cloud

Configure and Launch your cluster

Open the AWS Web Console

Page 18: A Primer for Your Next Data Science Proof of Concept on the Cloud

Connect to the master and Monitor Cluster

Be sure to configure your security group settings and use a private key to log in:

ssh -i ~/hackathon.pem [email protected]

Let’s set up an ssh tunnel so we can see what is happening on the cluster

● Hadoop, Ganglia, and other applications publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local webserver (http://localhost:port) and are not published on the Internet.

ssh -i ~/hackathon.pem -ND 8157 [email protected]

● http://ec2-52-91-26-92.compute-1.amazonaws.com

Page 19: A Primer for Your Next Data Science Proof of Concept on the Cloud

Configure Hive to query JSON

Set up the Hive table to query the underlying JSON files -- (see notes)

/* ---[ A tool to automate creation of Hive JSON schemas ]--- */

One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large, complex, highly nested JSON document is quite tedious. I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema.
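To make the idea of generating a Hive schema from a JSON document concrete, here is a minimal Python sketch. It is not the linked hive-json-schema tool and does not reproduce its behavior; the function names and the sample record are hypothetical. It assumes one sample record that is a JSON object, homogeneous arrays, and a fallback to `string` for nulls and empty arrays.

```python
import json


def hive_type(value) -> str:
    """Map a parsed JSON value to a Hive column type (simplified)."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # Assume homogeneous arrays; fall back to string for empty ones.
        return f"array<{hive_type(value[0])}>" if value else "array<string>"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    # Strings and nulls both fall back to string.
    return "string"


def hive_columns(json_text: str) -> str:
    """Return a column list for a CREATE TABLE statement from one sample record."""
    record = json.loads(json_text)
    return ",\n".join(f"  {name} {hive_type(value)}" for name, value in record.items())


# Hypothetical sample record, for illustration only.
sample = '{"id": 7, "author": {"name": "a", "tags": ["x"]}, "score": 1.5, "active": true}'
print(hive_columns(sample))
```

The printed column list can be pasted into a `CREATE TABLE` statement. A single sample record may miss optional fields, so for real data it is safer to infer from several records and review the result by hand.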

How to get data in and out.

Page 20: A Primer for Your Next Data Science Proof of Concept on the Cloud

Bootstrap our cluster

Now we can bootstrap our cluster to load additional libraries and functions on all the nodes.

We are going to bootstrap with Python, NLP, and the Stanford library so we can pick out keywords in each record.

Page 21: A Primer for Your Next Data Science Proof of Concept on the Cloud

Map Reduce Step

How to write, test, and configure a map reduce step
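One way to write and test a map reduce step before paying for cluster time is to keep the mapper and reducer as plain functions and simulate the shuffle locally. The sketch below is an assumption-laden illustration, not the step used in the demo: whitespace tokenization stands in for real keyword extraction, and all names are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit (word, 1) for every lowercase token in each input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Sum counts per key; expects pairs sorted by key, as after the shuffle."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in one process for testing."""
    return dict(reducer(sorted(mapper(lines))))


# Hypothetical sample records, for illustration only.
records = ["cloud data science", "data science on the cloud"]
counts = run_local(records)
print(counts)
```

On a cluster, the same two functions could be wrapped in scripts that read records from standard input and write tab-separated key/value lines, which is the contract Hadoop Streaming expects. Keeping the logic in plain functions means it can be unit tested on a laptop first.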

Page 22: A Primer for Your Next Data Science Proof of Concept on the Cloud

Retrieve and Analyze the Results

Use JDBC and R to read the results directly from RStudio.

Plot results

Page 23: A Primer for Your Next Data Science Proof of Concept on the Cloud

Q&A