big data analytics - best of the worst : anti-patterns & antidotes
Big Data Analytics - The Best of the Worst
Krishna Sankar
@ksankar
https://www.linkedin.com/in/ksankar
About Me
o Data Scientist
  • Decision Data Science & Product Data Science [Data Science Folk Knowledge http://goo.gl/O4svPx]
  • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
  • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech)
  • Written Books (Web 2.0, Wireless, Java, …)
  • Standards, some work in AI
  • Guest Lecturer at Naval PG School, …
o Studying MS-CFRM (Computational Finance/Risk Management), University of WA
o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15 [https://goo.gl/7SBKTC]
o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : “Machine Learning with Spark”, Packt Publishing
o Volunteer as Robotics Judge at First Lego League World Competitions
o @ksankar, doubleclix.wordpress.com
Background – Top 5
http://tcapp2.publishpath.com/rabbithole
http://conservationmagazine.org/wordpress/wp-content/uploads/2013/05/context-matters.jpg
1) Data Science
The art of building a model with known knowns, which, when let loose, works with unknown unknowns
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
Axes : The World (Knowns / Unknowns) vs. You (Known / Unknown)
o Known Knowns
  • There are things we know that we know
  • Capture in Models
o Known Unknowns
  • That is to say, there are things that we now know we don't know
  • Potential facts, outcomes we are aware of, but not with certainty
  • Stochastic processes, Probabilities
o Unknown Unknowns
  • But there are also things we do not know we don't know
  • Facts, outcomes or scenarios we have not encountered, nor considered
  • “Black swans”, outliers, long tails of probability distributions
  • Lack of experience, imagination
o Unknown Knowns
  • Others know, you don’t
  • Model Evolution/DevOps to capture this
Goal of Big Data is Analytics
2) The pipeline is the context
Stages : Collect, Store, Transform, Reason, Model, Deploy, Explore, Visualize, Recommend, Predict
Layers : Data Management, Data Science

o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times

o Volume
o Velocity
o Streaming Data

o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure

o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured

o Flexible & Selectable
  § Data Subsets
  § Attribute sets

o Refine model with
  § Extended Data subsets
  § Engineered Attribute sets
o Validation run across a larger data set

o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics

o Performance
o Scalability
o Refresh Latency
o In-memory Analytics

o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics

¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
3) Mind Your “I”s, “C”s & “V”s
“V”s : Volume, Velocity, Variety
“C”s : Context, Connectedness
“I”s : Intelligence, Interface, Inference
o Three Amigos
o Interface = Cognition
o Intelligence = Compute (CPU) & Computational (GPU)
o Infer Significance & Causality
CURATED SIGNALS > APPLIED INTELLIGENCE > STRATIFIED INFERENCE
4) Model Evolution & Concept Drift
Axes : Complexity vs. Value
o Dynamic dashboards : Multi-dimensional pivots w/ customization
o Selectable algorithms on data subsets : “Cluster Customer for 5 thanksgiving seasons”
o Learning Models : Automatic Feature Selection & hyper parameter optimizations as it gets more data
o Dynamic Models : Model Selection based on context
o Automated Analytics : Let Data tell story
o Feature Learning, AI, Deep Learning
Concept Drift
o Validate Model assumptions + hyper parameters + features in the current context – after they are in production
Ref: Prof. Josh Bloom, Keynote: A Systems View of Machine Learning, #pydata Seattle’15
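One way to make "validate in the current context" concrete is to compare the distribution a feature had at training time with what the model sees in production. The sketch below uses the Population Stability Index (PSI) as the drift score; this is a swapped-in, commonly used technique, not one named in the talk, and the sample data, bin count and the function name psi are all illustrative assumptions.

```python
import math
import random


def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so production values outside the training range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up data: one production sample from the training regime, one that has shifted.
rng = random.Random(42)
training = [rng.gauss(50, 10) for _ in range(5000)]
same_regime = [rng.gauss(50, 10) for _ in range(5000)]
drifted = [rng.gauss(65, 10) for _ in range(5000)]

psi_stable = psi(training, same_regime)
psi_drifted = psi(training, drifted)
print(f"stable: {psi_stable:.4f}  drifted: {psi_drifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a shift worth investigating; the thresholds are conventions, and a real check would cover model outputs and hyper parameter assumptions as well as input features.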
5) The Sense & Sensibility of a DataScientist DevOps
o Analytics in the lab = Investigative
  • Interactive, Iterative, Explorative
  • Output is usually decision data science
o Analytics in the factory = Operational
  • Automated, systemic, transparent & explainable
  • Output is embedded intelligence
  • Embedded in customer facing decision systems
Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
There is a chasm between Model/Reason and Deploy
6) Data is your product, regardless of what you sell
o Data is the lens through which you see the business and feel the pulse
o Collect the right data through “Thoughtful Data Design”
o Give Data Back in a Powerful Way
o But don’t confuse or overwhelm the users
  • The users have to feel safe
  • The users have to feel they are in control
o Never try to launch a complicated data product on a fixed schedule
o Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments
  • Customer segmentation & stratification is not just for retail !
Josh Wills - From the labs to the factory, https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
“There are no routine statistical questions, only questionable statistical routines” -- David Cox
Ref: Gabriele Corno
Natural History Museum in #London .. by George Thalassinos
Big Data Analytics - The Best of the Worst
Data Swamp
Blue Pill
o Typical case of “ungoverned data stores addressing a limited data science audience”
o The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure.
o Now everyone starts putting their data into this “lake”.
o After a few months, the disks are full; Hadoop is replicating 3 copies; some bytes are even falling on the floor from the wires – but no one has any clue what data is in there, or about its consistency and semantic coherence
Red Pill - Data Curation
o Data Curation
  • A consistent published schema
o Data Quality & Data Lineage, “descriptive metadata and an underlying mechanism to maintain it”, are all part of the data curation layer …
o Semantic consistency across diverse multi-structured, multi-temporal transactions requires a level of data curation & discipline
o Design for the right “Data Gravity” & “Data Mass”, as Van Lindberg mentioned yesterday in his keynote
  • Not Data Molasses !
https://www.linkedin.com/pulse/data-lakes-udls-vs-analytics-platforms-gargi-adhav
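A "consistent published schema" becomes enforceable once records are checked against it before they land in the lake. The following is a minimal sketch of that idea, not a method from the talk: the schema, its field names (customer_id, order_date, amount_usd, channel) and the validate helper are all hypothetical.

```python
from datetime import date

# Hypothetical published schema: field name -> (expected type, required?)
PUBLISHED_SCHEMA = {
    "customer_id": (str, True),
    "order_date": (date, True),
    "amount_usd": (float, True),
    "channel": (str, False),
}


def validate(record, schema=PUBLISHED_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, (ftype, required) in schema.items():
        if field not in record or record[field] is None:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], ftype):
            found = type(record[field]).__name__
            problems.append(f"{field}: expected {ftype.__name__}, got {found}")
    # Fields nobody published are exactly how a lake turns into a swamp.
    for field in record:
        if field not in schema:
            problems.append(f"unpublished field: {field}")
    return problems


good = {"customer_id": "C-17", "order_date": date(2015, 11, 26), "amount_usd": 42.5}
bad = {"customer_id": 17, "amount_usd": "42.5", "cust_nm": "Ann"}

good_problems = validate(good)
bad_problems = validate(bad)
for problem in bad_problems:
    print(problem)
```

Rejected records, together with their problem lists, are one possible source of the descriptive metadata that the curation layer has to maintain.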
Big Data To Nowhere
Blue Pill
o IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But no relevant business-facing apps.
o A conversation goes like this …
  • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
  • IT : We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI !
  • Business : … (unprintable)
Red Pill - Full Stack MVP (see next slide)
o Build the full stack, i.e. bits to business …
o Build incremental Decision Data Science & Product Data Science layers, as appropriate …
o The following conversation is a lot better …
  • Business : I heard that we have a big data infrastructure, cool. When can I show a demo to our customers ?
  • IT : Actually we don’t have all the data. But from the transaction logs and customer data, we can infer that males between 34 and 36 buy a lot of stuff from us between 11:00 PM & 2:00 AM !
  • Business : That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are ?
  • IT : Graph is no problem. We have a shiny app with the dynamic model over the web logs.
  • IT : With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …
[Diagram : E2E analytics MVP stack]
o Pipeline stages : Collect, Store, Transform; Report, Visualize; Recommend, Predict; Reason, Model; Model, Explore
o ML Engine : NumPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala
o R/Python : Compositional Analysis
o Data Hub : Curated Data
  • Storage : HDFS, Parquet
  • Compute : Hadoop MR, Spark
o Landing Zone
o ETL
o Real-Time : Kafka …
o Reporting Hub, Analytics Hub, In-Memory Hub
o Dashboards, APIs
Workloads (columns : Reporting Hub, Analytics Hub, Hadoop MR)
o Long-Running Complex Jobs - Yearly pivots, Multi-dimensional Exact Uniques ✔ ✔
o Real-time ad-hoc pivots, Approx Uniques (HLL) ✔
o Fast Response with Aggregated data Subsets ✔
https://www.linkedin.com/pulse/why-how-make-mvp-analytics-ruoyu-bao
Build The E2E Analytics MVP Stack
A Data Too Far
Blue Pill
o You might get a few .gz files, a few .csv files and, of course, parquet files, in multiple systems
o Some will have IDs, some names; some are aggregated by week, some by day, and others are purely transactional.
o The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …
Red Pill - Data Curation
o “..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”
o Data Pipelines (e.g. Kafka) with in-line processing to ensure correctness, semantic and temporal congruence & integrity
Ref: Jay Kreps, Announcing Confluent
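The mismatch described in this slide (IDs vs. names, daily vs. weekly grain) can be illustrated with a small sketch. Everything here is made up for illustration: the two sources, the name_to_id mapping and the choice of ISO week as the common grain are assumptions, and a real pipeline would do this in-line rather than in a script.

```python
from collections import defaultdict
from datetime import date

# Hypothetical inputs: one source is transactional by ID and day,
# the other is pre-aggregated by name and ISO week.
daily_by_id = [
    ("C1", date(2015, 11, 23), 20.0),
    ("C1", date(2015, 11, 26), 35.0),
    ("C2", date(2015, 11, 27), 10.0),
]
weekly_visits_by_name = [
    ("Ann", (2015, 48), 4),
    ("Bob", (2015, 48), 1),
]
name_to_id = {"Ann": "C1", "Bob": "C2"}  # the curated key mapping


def iso_week(d):
    """Map a calendar date to its (ISO year, ISO week) pair."""
    year, week, _ = d.isocalendar()
    return (year, week)


# Roll the fine-grained source up to the coarser common grain (ID, ISO week).
spend = defaultdict(float)
for cust_id, day, amount in daily_by_id:
    spend[(cust_id, iso_week(day))] += amount

# Re-key the name-based source by ID, then join on (ID, week).
combined = {}
for name, week, visits in weekly_visits_by_name:
    key = (name_to_id[name], week)
    combined[key] = {"visits": visits, "spend": spend.get(key, 0.0)}

print(combined)
```

The join itself is trivial; the curated pieces (the key mapping and an agreed temporal grain) are what make it possible, which is the point of the red pill.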
Where is the Tofu ?
Blue Pill
o It is very simple to produce “reasonable” recommendations
o But extremely difficult to improve them to become “great”
o And, there is a huge difference in business value between reasonable Data Set & great …
Red Pill - Data Curation
o The Antidote : The insights and the algorithms should be relevant and scalable …
o There is a huge gap between Model-Reason and Deploy …
o Statistical significance need not mean business significance
o Don't confuse the statistical significance of an experiment with the magnitude of the result, even though the word "significance" is often used for both – Peter Norvig
Ref: Xavier Amatriain when he talked about the Netflix Prize
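The gap between statistical and business significance shows up readily with large samples. The sketch below runs a standard two-proportion z-test on an invented experiment (5 million users per arm, 10.0% vs. 10.1% conversion); the numbers and the function name two_proportion_z are assumptions for illustration only.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical experiment: 500,000 vs. 505,000 conversions out of 5,000,000 each.
lift, z, p_value = two_proportion_z(500_000, 5_000_000, 505_000, 5_000_000)
print(f"absolute lift: {lift:.4%}  z: {z:.2f}  p-value: {p_value:.2e}")
```

The p-value is far below any conventional threshold, yet the magnitude is a tenth of a percentage point. Whether that lift pays for the change is a business question the test cannot answer.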
"Knowledge is a process of piling up facts; wisdom lies in their simplification."
- Martin Fischer
Down the rabbit hole art by frostyshadows
http://frostyshadows.deviantart.com/art/Down-the-Rabbit-Hole-358090601
Design Principles
1. Start with needs*
2. Do less
3. Design with data
4. Do the hard work to make it simple
5. Iterate. Then iterate again.
6. Build for inclusion
7. Understand context
8. Build digital services, not websites
9. Be consistent, not uniform
10. Make things open: it makes things better
https://www.gov.uk/design-principles
Data Alone is not enough
o Data alone is not enough
  • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich expressive dataset
o Data Scientists are not Data Alchemists
  • Don’t expect Analytic Gold from a pack of data lead
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
https://www.flickr.com/photos/bionerd/3123155390
More Data Beats a Cleverer Algorithm
o More Data Beats a Cleverer Algorithm
  • Or conversely select algorithms that improve with data
  • Don’t optimize prematurely without getting more data
o Learn many models, not Just One
  • Ensembles ! – Change the hypothesis space
  • Netflix prize
  • E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755
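Of the ensemble methods named on this slide, bagging is the simplest to sketch: train the same weak learner on bootstrap resamples and let the models vote. The toy example below is an illustration under stated assumptions, not code from the talk; the data set, the one-threshold "stump" learner and the helper names (fit_stump, bagged_stumps, predict) are all hypothetical.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold t with the fewest errors for the rule: predict 1 when x > t."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in points}):
        err = sum((x > t) != bool(y) for x, y in points)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_stumps(points, n_models=25, seed=0):
    """Train one stump per bootstrap resample (same size, drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]


def predict(thresholds, x):
    """Majority vote across the stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data (made up): the label is 1 when x > 5, with two deliberately noisy points.
data = [(x, int(x > 5)) for x in range(11)]
data[2] = (2, 1)
data[8] = (8, 0)

ensemble = bagged_stumps(data)
print(sorted(ensemble))
print([predict(ensemble, x) for x in range(11)])
```

Each stump sees a slightly different sample, so a noisy point can pull an individual threshold around; the vote averages that variation out. Boosting and stacking combine models differently and are not shown here.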
In short …
o Build Full stack, iteratively building capabilities
o Identify the ‘Right’ Business Problems
o Create Valuable Data Perspectives
o Frame problems & bring analytics together with non-quantitative information to build compelling stories
o Embed Inference & Intelligence in products
https://www.linkedin.com/pulse/article/20141108013125-1290064-winning-at-analytics-takes-more-than-technology
http://www.kdnuggets.com/2014/09/hiring-data-scientist-what-to-look-for.html