A Primer for Your Next Data Science Proof of Concept on the Cloud
Your Presenters
Alton Alexander
About Me : Data scientist. PhD dropout with a love for solving real world business problems. Experience delivering solutions with big data, machine learning, and statistics for marketing, manufacturing, and finance industries.
Affiliations : Front Analytics Consulting
Connect : altonalexander
Matt Davies
About Me : “Big Data” architect / engineer with clients in retail, healthcare, e-commerce, insurance, and government. Primarily focused on operationalizing complex data-driven solutions.
Affiliations : Xpert Data Solutions
Connect : 4mattdavies
Site : http://xds.io
Agenda
● Scoping the problem and solution
● Discuss pros / cons of starting with a cloud solution
● Establishing realistic expectations, budgets, constraints, etc…
● Hands on demo
● Q&A
Scoping the Problem and Solution
● What are you trying to solve?
● What data might be helpful in answering the question(s)?
● Are there specific techniques which are known to work well?
● Do you need to use BI tools and/or export data?
● What are your timelines?
● What are your resources?
Competing on Analytics
Product
● Who are your potential customers?
● What do they want?
● Brand loyalty?
● What’s next?
● e.g. customer segmentation / profiles, market analysis, channel attribution, keywords
Customer
● What motivates customers?
● Which channels work best?
● What else do they need?
● What is a “customer”?
● e.g. engagement, enrichment, churn, conversion, offers, sentiment, profiling
Operations
● How long will X function?
● How much product waste?
● Will Y be cheaper?
● e.g. yield optimization, failure rates, futures
Putting the long-A in OLAP
[Architecture diagram: data collectors, bulk store, batch process, live store, API service, UI. Datacenter and/or AWS “Analytic”.]
*stupid-simple-n-scale
● Data collectors
queues & OLTP: *SQS, redis or couch, mongodb, rdbms
OLAP: mongo, HBase
Thrift / Protobuf / Avro
sockets style: *messagepack based, netty, kafka, *kinesis
apps: *elastic beanstalk, ec2/vm + load balance
emitters: *messagepack
● Bulk store
*s3, ebs, *HDFS, cassandra, columnar
● Batch process
on file system: M/R based (pig, hive), graph based
off file system: any language
● Live store
DIY JSON: mongodb, *BaaSes, postgres
column: HBase/Impala, *cassandra
graph: cypher, gremlin
search: elasticsearch
● API service
cloud/dc apps: ec2/vm + load balance, *elastic beanstalk
sql-ish: Phoenix, Cassandra
graph: Gremlin, Cypher
search: elasticsearch
● UI
d3: nvd3.js - simple, d3.js - complex, dc.js - dimensions
Pros / Cons of Starting With a Cloud Solution
Cons:
● Data sensitivity
● Less control
● Unfamiliarity with terminology and/or design
● On prem world very different than cloud. Terms, risk factors, skillsets
● Data movement can be difficult
● Cloud “tax”
Pros:
● Elasticity
● Scalability
● Speed of implementation
● Focus on business problem
● Can easily create multiple instances for tests
● Less management
● Strong security
● No Network, Datacenter barriers
● Strong industry adoption
Pros / Cons of Starting With a Cloud Solution
General Challenges
● POC -> Production can be difficult
● Security is widely misunderstood
● Skillsets: When to hire, develop, consult, outsource
Establishing Realistic Expectations, Budgets, Constraints, Etc...
Use Case : Clear objective with identified stakeholders.
Sufficient Time : Discovery is such a large part of these projects that projections like “We put this legacy project out in X hrs, so that will translate to Y in big data” are not reliable.
Iterate : Like all software projects, it is usually better to iterate than to have large waterfall deployments.
Review : What went well, what failed, and where is our technical debt?
Establishing Realistic Expectations, Budgets, Constraints, Etc...
Budgets
● Know how the cloud provider makes money
● Start lean
● “Leave No Trace”
Constraints
● Time
● Complexity
● Resources
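To make the budget points concrete, here is a minimal back-of-the-envelope sketch in Python. The function name and every price in it are hypothetical placeholders, not real provider rates; the point is only that compute is typically billed per node per hour, so a cluster that is not torn down keeps costing money.

```python
def estimate_cluster_cost(node_count, hourly_rate_usd, hours,
                          storage_gb=0.0, storage_rate_usd_per_gb_month=0.0):
    """Rough POC cost: compute (nodes x rate x hours) plus one month of storage.

    All rates are inputs on purpose: look them up on your provider's
    pricing page rather than trusting any number hard-coded here.
    """
    compute = node_count * hourly_rate_usd * hours
    storage = storage_gb * storage_rate_usd_per_gb_month
    return round(compute + storage, 2)


if __name__ == "__main__":
    # Hypothetical rate of $0.25 per node-hour, for illustration only.
    lean = estimate_cluster_cost(node_count=3, hourly_rate_usd=0.25, hours=8)
    forgotten = estimate_cluster_cost(node_count=3, hourly_rate_usd=0.25, hours=24 * 30)
    print(f"8-hour POC session: ${lean:.2f}")
    print(f"Same cluster left running for 30 days: ${forgotten:.2f}")
```

Comparing the two figures is the argument for starting lean and "leaving no trace": terminate clusters and clean up storage when a session ends.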
Establishing Realistic Expectations, Budgets, Constraints, Etc...
● Sleep on it
● Are you a solution in search of a need?
● Use the scientific method
● Involve yourself in the community
● Hire a consultant
A Quick Walkthrough on AWS
POC Example: Multiple product offers to distinct products
http://dbs.uni-leipzig.de/file/parallel_er_with_dedoop.pdf
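To give a feel for what matching product offers to distinct products involves, here is a small single-machine sketch in Python. It is not the method from the linked paper; it assumes a very simple approach (group offers by a blocking key, then compare titles within each group by Jaccard similarity of their words). The blocking rule, the threshold, and the sample offers are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def tokens(title):
    """Lowercased set of whitespace-separated words in a title."""
    return set(title.lower().split())


def jaccard(title_a, title_b):
    """Share of words the two titles have in common (0.0 to 1.0)."""
    a, b = tokens(title_a), tokens(title_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def blocking_key(title):
    """Hypothetical blocking rule: the first word, often the brand."""
    parts = title.lower().split()
    return parts[0] if parts else ""


def match_offers(offers, threshold=0.5):
    """Return sorted pairs of offer ids whose titles look like the same product.

    Only offers sharing a blocking key are compared, which avoids
    comparing every offer with every other offer.
    """
    blocks = defaultdict(list)
    for offer_id, title in offers:
        blocks[blocking_key(title)].append((offer_id, title))

    matches = []
    for block in blocks.values():
        for (id_a, title_a), (id_b, title_b) in combinations(block, 2):
            if jaccard(title_a, title_b) >= threshold:
                matches.append((id_a, id_b))
    return sorted(matches)


if __name__ == "__main__":
    sample_offers = [
        ("a1", "Canon EOS 70D DSLR Camera Body"),
        ("b7", "canon eos 70d camera body only"),
        ("c3", "Canon PowerShot SX50 Camera"),
        ("d9", "Nikon D5300 DSLR Camera"),
    ]
    print(match_offers(sample_offers))
```

Blocking is also what makes the problem a natural fit for map reduce: the blocking key can serve as the map output key, so each reducer only compares offers within one block.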
EMR Overview
Create your bucket
Configure and Launch your cluster
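The demo configures the cluster through the web console. For a scripted equivalent, the sketch below builds a request dictionary in the shape accepted by boto3's EMR `run_job_flow` call. The cluster name, bucket, instance type, release label, and application list are assumptions chosen for illustration; the default role names assume the standard EMR roles already exist in your account.

```python
def build_cluster_request(name, bucket, key_name, instance_count=3,
                          instance_type="m5.xlarge",
                          release_label="emr-5.36.0",
                          bootstrap_script=None):
    """Build keyword arguments for boto3's EMR run_job_flow.

    Instance type and release label defaults are illustrative
    assumptions, not recommendations.
    """
    request = {
        "Name": name,
        "LogUri": f"s3://{bucket}/logs/",
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Ganglia"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_name,
            # Keep the cluster up after steps finish so we can ssh in.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }
    if bootstrap_script:
        # Optional script in S3 that runs on every node at startup.
        request["BootstrapActions"] = [{
            "Name": "install-libraries",
            "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
        }]
    return request


if __name__ == "__main__":
    request = build_cluster_request(
        name="poc-cluster",              # hypothetical cluster name
        bucket="my-poc-bucket",          # hypothetical bucket name
        key_name="hackathon",            # key pair matching ~/hackathon.pem
    )
    print(request["LogUri"])
    # To actually launch (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("emr").run_job_flow(**request)
```

Because `KeepJobFlowAliveWhenNoSteps` is set, a cluster launched this way runs (and bills) until it is terminated explicitly.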
Open the AWS Web Console
Connect to the master and Monitor Cluster
Be sure to configure your security group settings and use a private key to log in.
ssh -i ~/hackathon.pem [email protected]
Let’s set up an ssh tunnel so we can see what is happening on the cluster
● Hadoop, Ganglia, and other applications publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local webserver (http://localhost:port) and are not published on the Internet.
ssh -i ~/hackathon.pem -ND 8157 [email protected]
● http://ec2-52-91-26-92.compute-1.amazonaws.com
Configure Hive to query JSON
Set up the Hive table to query the underlying JSON files (see notes)
/* ---[ A tool to automate creation of Hive JSON schemas ]--- */
One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large, complex, highly nested JSON document is quite tedious.
I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema
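To show what generating a schema from a JSON document means, here is a minimal Python sketch. It is not the tool linked above and makes simplifying assumptions: it looks at one sample document, takes the type of an array from its first element, and treats nulls as strings. The table name, S3 location, and sample record are hypothetical; the SerDe class name is the one used by the openx JSON SerDe.

```python
import json


def hive_type(value):
    """Map a parsed JSON value to a Hive column type (simplified)."""
    if isinstance(value, bool):        # check bool before int: True is an int in Python
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        fields = ",".join(f"{key}:{hive_type(val)}" for key, val in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        element = hive_type(value[0]) if value else "string"
        return f"array<{element}>"
    return "string"                    # strings and nulls


def hive_columns(json_text):
    """Column definitions for the top-level keys of one sample JSON object."""
    document = json.loads(json_text)
    return [f"{name} {hive_type(value)}" for name, value in document.items()]


def create_table_ddl(table, columns, location):
    """Assemble an external-table statement that uses the openx JSON SerDe."""
    column_block = ",\n  ".join(columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {column_block}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    sample = ('{"id": 7, "title": "Canon EOS", "price": 499.99, '
              '"tags": ["camera", "dslr"], "seller": {"name": "acme", "rating": 4}}')
    # Table name and S3 path are placeholders.
    print(create_table_ddl("offers", hive_columns(sample), "s3://my-poc-bucket/offers/"))
```

A real schema generator also has to merge types across many documents and quote reserved words, which this sketch skips.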
How to get data in and out.
Bootstrap our cluster
Now we can bootstrap our cluster to load additional libraries and functions on all the nodes.
We are going to bootstrap with Python, NLP, and the Stanford library so we can pick out keywords in each record.
Map Reduce Step
How to write, test and configure a map reduce step
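One way to write and test such a step before paying for cluster time is Hadoop Streaming style Python: a mapper that emits tab-separated key/value lines and a reducer that sums values for each key, run locally over a few sample records. The sketch below is an assumption-heavy illustration: it uses a plain stopword filter instead of the Stanford library, and the `title` field name and the stopword list are hypothetical.

```python
import json
import re
from itertools import groupby

# Tiny illustrative stopword list; a real job would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "with"}


def mapper(lines):
    """Emit 'keyword<TAB>1' for each non-stopword in each record's title."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue                    # skip malformed records
        if not isinstance(record, dict):
            continue
        text = str(record.get("title", ""))   # hypothetical field name
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOPWORDS:
                yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per keyword; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for keyword, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{keyword}\t{total}"


def run_locally(lines):
    """Simulate map, shuffle/sort, and reduce on one machine."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample_records = [
        '{"title": "Canon EOS camera"}',
        '{"title": "The Canon camera bag"}',
        "not json",
    ]
    for output_line in run_locally(sample_records):
        print(output_line)
```

On the cluster, the same two functions would live in separate scripts that read `sys.stdin` and print to standard output, with Hadoop doing the sort between them.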
Retrieve and Analyze the Results
Use JDBC and R to read the results directly from RStudio.
Plot results
Q&A