A Primer for Your Next Data Science Proof of Concept on the Cloud

Your Presenters

Alton Alexander

About Me : Data scientist. PhD dropout with a love for solving real world business problems. Experience delivering solutions with big data, machine learning, and statistics for marketing, manufacturing, and finance industries.

Affiliations : Front Analytics Consulting

Connect : altonalexander

Matt Davies

About Me : “Big Data” architect / engineer with clients in retail, healthcare, e-commerce, insurance, and government. Primarily focused on operationalizing complex data-driven solutions.

Affiliations : Xpert Data Solutions

Connect : 4mattdavies

Site : http://xds.io

Agenda

● Scoping the problem and solution
● Discuss pros / cons of starting with a cloud solution
● Establishing realistic expectations, budgets, constraints, etc.
● Hands-on demo
● Q&A

Scoping the Problem and Solution

● What are you trying to solve?
● What data might be helpful in answering the question(s)?
● Are there specific techniques which are known to work well?
● Do you need to use BI tools and/or export data?
● What are your timelines?
● What are your resources?

Competing on Analytics

Product
● Who are your potential customers?
● What do they want?
● Brand loyalty?
● What’s next?
e.g. Customer segmentation / profiles, Market analysis, Channel attribution, Keywords

Customer
● What motivates customers?
● Which channels work best?
● What else do they need?
● What is a “customer”?
e.g. Engagement, Enrichment, Churn, Conversion, Offers, Sentiment, Profiling

Operations
● How long will X function?
● How much product waste?
● Will Y be cheaper?
e.g. Yield optimization, Failure rates, Futures

Stages: data collectors, bulk store, batch process, live store, API service, UI

Datacenter and/or AWS, “Analytic”

queues & oltp: *SQS, redis or couch, mongodb, rdbms
olap: mongo, Hbase
Thrift/Protobuf/AVRO
sockets style: *messagepack based, netty
kafka, *kinesis
apps: *elastic beanstalk, ec2/vm + load balance

emitters: *messagepack

*s3, ebs, *HDFS, cassandra, columnar

on file system: M/R based (pig, hive), Graph based
off file system: anything language

diy json: mongodb, *BaaSes, postgres
column: Hbase/Impala, *cassandra
graph: cypher, gremlin
search: elasticsearch

* stupid-simple-n-scale

cloud/dc apps: ec2/vm + load balance, *elastic beanstalk
sql-ish: Phoenix, Cassandra
graph: Gremlin, Cypher
search: elasticsearch

d3: nvd3.js - simple, d3.js - complex, dc.js - dimensions

putting the long-A in OLAP

Pros / Cons of Starting With a Cloud Solution

Cons:

● Data sensitivity
● Less control
● Unfamiliarity with terminology and/or design
● The on-prem world is very different from the cloud: terms, risk factors, skillsets
● Data movement can be difficult
● Cloud “tax”

Pros:

● Elasticity
● Scalability
● Speed of implementation
● Focus on business problem
● Can easily create multiple instances for tests
● Less management
● Strong security
● No network or datacenter barriers
● Strong industry adoption

Pros / Cons of Starting With a Cloud Solution: General Challenges

● POC -> Production can be difficult
● Security is widely misunderstood
● Skillsets: When to hire, develop, consult, outsource

Establishing Realistic Expectations, Budgets, Constraints, Etc...

Use Case : Clear objective with identified stakeholders.

Sufficient Time : Discovery is such a large part of these projects that estimates of the form “this legacy project took X hours, so it will take Y in big data” are not reliable.

Iterate : Like all software projects, it is usually better to iterate than to have large waterfall deployments.

Review : What went well, what failed, and where is our technical debt?


Budgets

● Know how the cloud provider makes money
● Start lean
● “Leave No Trace”

Constraints

● Time
● Complexity
● Resources


● Sleep on it
● Are you a solution in search of a need?
● Use the scientific method
● Involve yourself in the community
● Hire a consultant

A Quick Walkthrough on AWS

POC Example: Multiple product offers to distinct products

http://dbs.uni-leipzig.de/file/parallel_er_with_dedoop.pdf
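The POC example above, mapping multiple product offers to distinct products, is an entity resolution problem. A minimal single-machine sketch of the idea follows; it is not the method from the linked paper. The function names, the sample titles in the usage note, and the 0.8 threshold are all assumptions for illustration, and plain string similarity stands in for whatever matcher a real POC would use.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(title):
    """Lowercase a title and replace punctuation with spaces."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized offer titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_offers(offers, threshold=0.8):
    """Group offers whose titles look alike (threshold is an assumed value)."""
    parent = list(range(len(offers)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; fine for a demo, quadratic in the number of offers.
    for i, j in combinations(range(len(offers)), 2):
        if similarity(offers[i], offers[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, offer in enumerate(offers):
        groups.setdefault(find(i), []).append(offer)
    return list(groups.values())
```

Calling `group_offers(["Apple iPhone 6 16GB", "apple iphone 6 - 16gb", "Samsung Galaxy S5"])` puts the two iPhone offers in one group and the Samsung offer in another. Comparing all pairs does not scale, which is one reason to move this kind of job onto a cluster.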





EMR Overview

Create your bucket

Configure and Launch your cluster
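The slides configure the cluster through the web console. As a rough sketch of the same step in code, the helper below builds keyword arguments shaped for boto3's EMR `run_job_flow` call. The helper name, release label, instance types, node count, and log path are assumptions, not values from the talk; the key name `hackathon` matches the `.pem` file used in the ssh commands.

```python
def build_cluster_config(name, bucket, key_name, core_nodes=2):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    Every concrete value below is an assumption for illustration.
    """
    return {
        "Name": name,
        "LogUri": "s3://{}/logs/".format(bucket),  # assumed log location
        "ReleaseLabel": "emr-5.36.0",              # assumed release
        "Applications": [
            {"Name": "Hadoop"},
            {"Name": "Hive"},
            {"Name": "Ganglia"},
        ],
        "Instances": {
            "MasterInstanceType": "m4.large",      # assumed instance size
            "SlaveInstanceType": "m4.large",
            "InstanceCount": core_nodes + 1,       # core nodes plus one master
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up for the demo
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",      # default EMR roles
        "ServiceRole": "EMR_DefaultRole",
    }


# To launch (requires boto3 and AWS credentials):
#   import boto3
#   config = build_cluster_config("poc", "my-bucket", "hackathon")
#   boto3.client("emr").run_job_flow(**config)
```

Keeping the configuration in a function makes it easy to start lean with a small node count and to recreate or tear down identical clusters for tests.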

Open the AWS Web Console

Connect to the master and Monitor Cluster

Be sure to configure your security group settings and use a private key to log in.

ssh -i ~/hackathon.pem [email protected]

Let’s set up an ssh tunnel so we can see what is happening on the cluster

● Hadoop, Ganglia, and other applications publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local webserver (http://localhost:port) and are not published on the Internet.

ssh -i ~/hackathon.pem -ND 8157 [email protected]

● http://ec2-52-91-26-92.compute-1.amazonaws.com


Configure Hive to query JSON

Set up the hive table to query the underlying json files -- (see notes)

/* ---[ A tool to automate creation of Hive JSON schemas ]--- */

One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large complex, highly nested JSON document is quite tedious.

I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema.
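To show what generating a schema from a JSON document involves, here is a simplified sketch. It is not the hive-json-schema tool linked above; the function names are hypothetical, and it assumes arrays are homogeneous and that nulls can be treated as strings.

```python
import json


def hive_type(value):
    """Map a parsed JSON value to a Hive type string."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        fields = ",".join(
            "{}:{}".format(key, hive_type(val)) for key, val in value.items()
        )
        return "struct<{}>".format(fields)
    if isinstance(value, list):
        # Assumption: all elements share the type of the first one.
        inner = hive_type(value[0]) if value else "string"
        return "array<{}>".format(inner)
    return "string"  # strings, and nulls as a fallback


def hive_columns(document):
    """Return a column list for one sample JSON record."""
    record = json.loads(document)
    return ",\n".join(
        "  {} {}".format(key, hive_type(val)) for key, val in record.items()
    )
```

The returned column list could be placed inside a `CREATE EXTERNAL TABLE` statement that uses the JSON SerDe. A single sample record may not cover every field, so a real tool would need to merge types across many records.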

How to get data in and out.


Bootstrap our cluster

Now we can bootstrap our cluster to load additional libraries and functions on all the nodes.

We are going to bootstrap with Python, NLP tooling, and the Stanford library so we can pick out keywords in each record.

Map Reduce Step

How to write, test and configure a map reduce step
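One way to write and test such a step before paying for cluster time is to express the mapper and reducer as plain Python functions over lines of text, in the style of Hadoop Streaming, and run them locally. The sketch below counts keywords per record; the stopword list and function names are assumptions, and a simple stopword filter stands in for the NLP keyword extraction mentioned in the talk.

```python
from itertools import groupby

# Hypothetical stopword list; a real step would use an NLP library instead.
STOPWORDS = {"the", "a", "an", "and", "of", "for", "with", "in"}


def mapper(lines):
    """Emit 'keyword<TAB>1' for every non-stopword token."""
    for line in lines:
        for token in line.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word and word not in STOPWORDS:
                yield "{}\t1".format(word)


def reducer(lines):
    """Sum counts per keyword; input must be sorted by key, as Hadoop provides."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "{}\t{}".format(word, sum(int(count) for _, count in group))


def run_locally(lines):
    """Simulate map -> shuffle/sort -> reduce on one machine."""
    return list(reducer(sorted(mapper(lines))))
```

For example, `run_locally(["The red TV", "red tv, and a blue TV"])` yields one tab-separated line per keyword with its count. On the cluster, the same two functions would read from standard input and write to standard output in separate mapper and reducer scripts.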

Retrieve and Analyze the Results

Use JDBC and R to read the results directly from RStudio.

Plot results
