big data analytics - best of the worst : anti-patterns & antidotes

Big Data Analytics - The Best of the Worst

Krishna Sankar

@ksankar, https://www.linkedin.com/in/ksankar

About Me

o Data Scientist
  • Decision Data Science & Product Data Science [Data Science Folk Knowledge http://goo.gl/O4svPx]
  • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
  • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
  • Written Books (Web 2.0, Wireless, Java, …)
  • Standards, some work in AI
  • Guest Lecturer at Naval PG School, …
o Studying MS-CFRM (Computational Finance/Risk management) UWA
o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15 [https://goo.gl/7SBKTC]
o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : “Machine Learning with Spark”, Packt Publishing
o Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com

Background – Top 5

http://tcapp2.publishpath.com/rabbithole
http://conservationmagazine.org/wordpress/wp-content/uploads/2013/05/context-matters.jpg

1) Data Science

The art of building a model with known knowns
Which when let loose, works with unknown unknowns

Donald Rumsfeld is an armchair Data Scientist !

http://smartorg.com/2013/07/valuepoint19/

Figure: matrix of knowns and unknowns (axes: The World – Knowns / Unknowns; You – Unknown / Known)

o Others know, you don’t
o Model Evolution/DevOps to capture this
o Capture in Models
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware of, but not with certainty
o Stochastic processes, Probabilities

o Known Knowns
  • There are things we know that we know
o Known Unknowns
  • That is to say, there are things that we now know we don't know
o But there are also Unknown Unknowns
  • There are things we do not know we don't know

Goal of Big Data is Analytics

2) The pipeline is the context

Stages: Collect, Store, Transform, Reason, Model, Deploy, Explore, Visualize, Recommend, Predict
Groupings: Data Management, Data Science

o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times

o Volume
o Velocity
o Streaming Data

o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure

o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured

o Flexible & Selectable
  § Data Subsets
  § Attribute sets

o Refine model with
  § Extended Data subsets
  § Engineered Attribute sets
o Validation run across a larger data set

o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics

o Performance
o Scalability
o Refresh Latency
o In-memory Analytics

o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics

¤ Bytes to Business a.k.a. Build the full stack

¤ Find Relevant Data For Business

¤ Connect the Dots

3) Mind Your “I”s, “C”s & “V”s

Volume, Velocity, Variety
Context, Connectedness
Intelligence, Interface, Inference

o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality

CURATED SIGNALS > APPLIED INTELLIGENCE > STRATIFIED INFERENCE

4) Model Evolution & Concept Drift

Figure (axes: Complexity, Value)

o Dynamic dashboards: Multi-dimensional pivots w/ customization
o Selectable algorithms on data subsets: “Cluster Customer for 5 thanksgiving seasons”
o Learning Models: Automatic Feature Selection & hyper parameter optimizations as it gets more data
o Dynamic Models – Model Selection based on context
o Automated Analytics - Let Data tell story
o Feature Learning, AI, Deep Learning

Concept Drift: Validate Model assumptions + hyper parameters + features in the current context – after they are in production

Ref: Prof. Josh Bloom, Keynote: A Systems View of Machine Learning, #pydata Seattle’15
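One way to check features "in the current context" after a model is in production is to compare each feature's production distribution with its training distribution. The sketch below uses the Population Stability Index; the function name `psi`, the bin count and the sample data are illustrative assumptions, not something the talk prescribes.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so production values outside the training range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values: training sample vs. a shifted production sample.
train = [i / 100 for i in range(1000)]
production = [x + 5 for x in train]
drift_score = psi(train, production)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a shift worth investigating; that threshold is a convention, not a guarantee, and should be tuned per feature.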

5) The Sense & Sensibility of a DataScientist DevOps

o Analytics in the lab = Investigative
  • Interactive, Iterative, Explorative
  • Output is usually decision data science
o Analytics in the factory = Operational
  • Automated, systemic, transparent & explainable
  • Output is embedded intelligence
  • Embedded in customer facing decision systems

Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

There is a chasm between Model/Reason and Deploy

6) Data is your product, regardless of what you sell

o Data is the lens through which you see the business and feel the pulse
o Collect the right data through “Thoughtful Data Design”
o Give Data Back in a Powerful Way
o But don’t confuse or overwhelm the users
  • The users have to feel safe
  • The users have to feel they are in control
o Never try to launch a complicated data product on a fixed schedule
o Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments
  • Customer segmentation & stratification is not just for retail !

Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

“There are no routine statistical questions, only questionable statistical routines” -- David Cox

Ref: Gabriele Corno, Natural History Museum in #London .. by George Thalassinos

Big Data Analytics - The Best of the Worst

Data Swamp

Blue Pill
o Typical case of “ungoverned data stores addressing a limited data science audience”
o The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure.
o Now everyone starts putting their data into this “lake”.
o After a few months, the disks are full; Hadoop is replicating 3 copies; even some bytes are falling off the floor from the wires – but no one has any clue about what data is in there, its consistency or its semantic coherence

Red Pill - Data Curation
o Data Curation
  • A consistent published schema
o Data Quality & Data Lineage, “descriptive metadata and an underlying mechanism to maintain it”, are all part of the data curation layer …
o Semantic consistency across diverse multi-structured multi-temporal transactions requires a level of data curation & discipline
o Design for the right “Data Gravity” & “Data Mass” as Van Lindberg mentioned, yesterday, in his keynote
  • Not Data Molasses !

https://www.linkedin.com/pulse/data-lakes-udls-vs-analytics-platforms-gargi-adhav
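A "consistent published schema" with lineage metadata can be enforced at the point where data lands in the lake. The sketch below is a minimal illustration in plain Python; `ORDERS_SCHEMA`, the `orders.v1` label, the field names and the `_lineage` tag are all hypothetical, not part of any specific product.

```python
from datetime import date

# Hypothetical published schema for one curated dataset: field name -> required type.
ORDERS_SCHEMA = {"order_id": int, "customer_id": int, "order_date": date, "amount": float}

def validate(record, schema):
    """Return a list of schema violations for one record (empty list = conforms)."""
    problems = [f"missing field: {name}" for name in schema if name not in record]
    problems += [
        f"{name}: expected {expected.__name__}, got {type(record[name]).__name__}"
        for name, expected in schema.items()
        if name in record and not isinstance(record[name], expected)
    ]
    problems += [f"unpublished field: {name}" for name in record if name not in schema]
    return problems

def curate(records, schema, source):
    """Split records into accepted (tagged with lineage) and rejected (with reasons)."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record, schema)
        if problems:
            rejected.append((record, problems))
        else:
            # Lineage travels with the record, so later consumers know where it came from.
            accepted.append({**record, "_lineage": {"source": source, "schema": "orders.v1"}})
    return accepted, rejected
```

Rejected records keep their reasons, so the swamp's "no one has any clue what is in there" problem becomes a reviewable queue instead of silent accumulation.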

Big Data To Nowhere

Blue Pill
o IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business facing apps.
o A conversation goes like this …
  • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
  • IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !
  • Business : … (unprintable)

Red Pill - Full Stack MVP (see next slide)
o Build the full stack i.e. bits to business …
o Build incremental Decision Data Science & Product Data Science layers, as appropriate …
o The following conversation is a lot better …
  • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
  • IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that Males between 34-36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !
  • Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?
  • IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.
  • IT : With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …

Figure: end-to-end analytics stack

o Pipeline stages: Collect, Store, Transform; Reason, Model; Model, Explore; Report, Visualize; Recommend, Predict
o ML Engine: NumPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala
o R/Python
o Compositional Analysis
o Data Hub: Curated Data
  • Storage : HDFS, Parquet
  • Compute : Hadoop MR, Spark
o Landing Zone
o Dashboards, APIs
o Reporting Hub
o Analytics Hub
o ETL
o In-Memory Hub, Real-Time, Kafka …

Workloads across Reporting Hub, Analytics Hub and Hadoop MR:
o Long-Running Complex Jobs - Yearly pivots, Multi-dimensional Exact Uniques ✔ ✔
o Real-time ad-hoc pivots, Approx Uniques (HLL) ✔
o Fast Response with Aggregated data Subsets ✔

https://www.linkedin.com/pulse/why-how-make-mvp-analytics-ruoyu-bao

Build The E2E Analytics MVP Stack
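The stack separates exact uniques (long-running jobs) from approximate uniques via HLL (real-time pivots). The following is a simplified HyperLogLog sketch to show why the approximate path is cheap: it keeps only a small fixed array of registers however many items arrive. The class, the choice of SHA-1 as the hash and the precision `p=10` are illustrative assumptions; production systems use tuned library implementations.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog: approximate distinct count in 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        idx = h >> (64 - self.p)                       # top p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)             # small-range (linear counting) correction
        return raw

# Hypothetical usage: estimate unique visitors from a stream with repeats.
hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the estimate
estimate = hll.count()
```

With 1,024 registers the typical relative error is a few percent, which is the trade the slide points at: exact multi-dimensional uniques go to the batch layer, approximate ones can be served interactively.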

A Data Too Far

Blue Pill
o You might get a few .gz files, a few .csv files and of course, parquet files, in multiple systems
o Some will have IDs, some names, some aggregated by week, some aggregated by day and others pure transactional.
o The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …

Red Pill - Data Curation
o “..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”
o Data Pipelines (e.g. Kafka) with in-line processing to ensure correctness, semantic and temporal congruence & integrity

Ref: Jay Kreps, Announcing Confluent
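To make the "IDs vs. names, daily vs. weekly" mismatch concrete, here is a minimal sketch of bringing two such feeds onto one key and one time grain. The row layouts, the `id_by_name` lookup and the function name are hypothetical; the point is only that both the semantic key (customer) and the temporal grain (ISO week) must be reconciled before the data can be combined.

```python
from collections import defaultdict
from datetime import date

def weekly_totals(daily_rows, weekly_rows, id_by_name):
    """Combine a daily feed keyed by ID with a weekly feed keyed by name."""
    totals = defaultdict(float)
    for row in daily_rows:
        # Roll each day up to its ISO (year, week) so both feeds share a grain.
        year, week = row["day"].isocalendar()[:2]
        totals[(row["customer_id"], year, week)] += row["amount"]
    unmatched = []
    for row in weekly_rows:
        customer_id = id_by_name.get(row["customer_name"])
        if customer_id is None:
            unmatched.append(row)  # surface, rather than silently drop, rows we cannot map
            continue
        totals[(customer_id, row["year"], row["week"])] += row["amount"]
    return dict(totals), unmatched

# Hypothetical sample feeds.
daily = [
    {"customer_id": 1, "day": date(2015, 7, 20), "amount": 10.0},
    {"customer_id": 1, "day": date(2015, 7, 22), "amount": 5.0},
]
weekly = [
    {"customer_name": "Acme", "year": 2015, "week": 30, "amount": 20.0},
    {"customer_name": "Unknown Co", "year": 2015, "week": 30, "amount": 1.0},
]
totals, unmatched = weekly_totals(daily, weekly, {"Acme": 1})
```

In a pipeline, the same normalization would run in-line as records flow through, so downstream consumers only ever see the congruent form.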

Where is the Tofu ?

Blue Pill
o It is very simple to produce “reasonable” recommendations
o But extremely difficult to improve them to become “great”
o And, there is a huge difference in business value between reasonable & great …

Red Pill - Data Curation
o The Antidote : The insights and the algorithms should be relevant and scalable …
o There is a huge gap between Model-Reason and Deploy …
o Statistical Significance need not mean business significance
o Don't confuse the statistical significance of an experiment with the magnitude of the result, even though the word "significance" is often used for both – Peter Norvig

Ref: Xavier Amatriain when he talked about the Netflix Prize
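The gap between statistical and business significance shows up readily with large samples. The sketch below runs a standard two-proportion z-test on made-up A/B numbers (one million users per arm, 10.0% vs. 10.2% conversion); the function name and the figures are illustrative assumptions.

```python
import math

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B vs. A."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return p_b - p_a, z, p_value

# Hypothetical experiment: a very large sample and a very small effect.
lift, z, p_value = two_proportion_z_test(100_000, 1_000_000, 102_000, 1_000_000)
```

Here the p-value is far below the usual 0.05 cut-off, yet the lift is only 0.2 percentage points. Whether that magnitude pays for the change is a business question the test itself cannot answer.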

"Knowledge is a process of piling up facts; wisdom lies in their simplification."

- Martin Fischer

Analytics - miscues

o Don’t Torture the Data

Down the rabbit hole art by frostyshadows, http://frostyshadows.deviantart.com/art/Down-the-Rabbit-Hole-358090601

Design Principles

1. Start with needs*
2. Do less
3. Design with data
4. Do the hard work to make it simple
5. Iterate. Then iterate again.
6. Build for inclusion
7. Understand context
8. Build digital services, not websites
9. Be consistent, not uniform
10. Make things open: it makes things better

https://www.gov.uk/design-principles

Data Alone is not enough

o Data alone is not enough
  • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich expressive dataset
o Data Scientists are not Data Alchemists
  • Don’t expect Analytic Gold from a pack of data lead

A few useful things to know about machine learning - by Pedro Domingos, http://dl.acm.org/citation.cfm?id=2347755
https://www.flickr.com/photos/bionerd/3123155390

More Data Beats a Cleverer Algorithm

o More Data Beats a Cleverer Algorithm
  • Or conversely select algorithms that improve with data
  • Don’t optimize prematurely without getting more data
o Learn many models, not Just One
  • Ensembles ! – Change the hypothesis space
  • Netflix prize
  • E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755
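"Learn many models, not just one" can be illustrated with bagging, the simplest of the three ensemble styles named above. This sketch trains one-threshold decision stumps on bootstrap resamples of a toy 1-D dataset and combines them by majority vote; the helper names, the toy data and the choice of stumps are assumptions for illustration only.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold t with the fewest errors, predicting 1 when x > t."""
    xs = sorted({x for x, _ in sample})
    # Candidate thresholds: below all points, between neighbours, above all points.
    candidates = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in sample))

def fit_bagged(data, n_models=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]

def predict(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data: label 0 for x in 0..9, label 1 for x in 10..19.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
model = fit_bagged(data)
```

Each stump sees a slightly different sample and so lands on a slightly different threshold; the vote averages out that variance. Boosting and stacking change how the individual models are trained and combined, but share the idea of widening the hypothesis space.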

In short …

o Build Full stack, iteratively building capabilities
o Identify the ‘Right’ Business Problems
o Create Valuable Data Perspectives
o Frame problems & bring analytics together with non-quantitative information to build compelling stories
o Embed Inference & Intelligence in products

https://www.linkedin.com/pulse/article/20141108013125-1290064-winning-at-analytics-takes-more-than-technology
http://www.kdnuggets.com/2014/09/hiring-data-scientist-what-to-look-for.html

Ogilvy & Mather Advertising : Morning view from the Ogilvy & Mather NY office, nicknamed the Chocolate Factory #TravelTuesday

Thank You
