big data : bits of history, words of advice
TRANSCRIPT
Big Data : Bits of History, Words of Advice
Venu Vasudevan
GLSEC Big Data Meetup
Big Data Past
• Big
• Fast
• intelligent media
• IoT
• satellites
Big Data : Behavioral
• the ‘V’ view of Big Data challenges
• number of V’s up for debate
Big Data : Architectural
untidy data firehose → clean analytics
• fast & good vs slower & much better
• Lambda architecture
• Lake architecture
• Stream architecture
This Talk
• Behavioral View
• Technology Solution Stack
• ‘Middleware’ (benefit of hindsight)
• governance culture (gap)
• data economics
• ownership food fights
3 data points
• Big
• Fast
• intelligent media
• IoT
• satellites
Iridium
• mobile routers (10K mph), fixed people
• no repeated patterns
• satellites N-S movement
• earth E-W movement
• regular topology, irregular exceptions
• solar flares
• military satellite presence
Fast Data Problem
• cellular frequency allocation (graph coloring problem)
• frequent fast recalculations (fast routers + semi-fast earth)
• transmit / no-transmit (solar flares, military satellite presence)
• moving ‘seam’
• + ‘France’
(diagram: seam irregularities; broadcast = +$$$ vs broadcast = -$$$ (lawsuit))
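The frequency-allocation core named above is a classic graph coloring problem. Below is a minimal, hypothetical sketch (beam names and the greedy heuristic are illustrative, not Iridium's actual solver, which also handled no-transmit constraints and fast recalculation): beams that can interfere share an edge and must receive different frequencies.

```python
# Hypothetical sketch: greedy graph coloring of an interference graph.
# Beams that can interfere share an edge and must get different
# frequencies ("colors"). The real system layered extra constraints
# (no-transmit zones, recalculation speed) on top of this core.

def greedy_frequency_assignment(interference):
    """interference: dict mapping beam -> set of conflicting beams (symmetric)."""
    assignment = {}
    # Color higher-degree beams first (a common greedy heuristic).
    for beam in sorted(interference, key=lambda b: len(interference[b]), reverse=True):
        taken = {assignment[n] for n in interference[beam] if n in assignment}
        freq = 0
        while freq in taken:   # lowest frequency not used by a neighbor
            freq += 1
        assignment[beam] = freq
    return assignment

# Toy example: a triangle of mutually interfering beams needs 3 frequencies.
beams = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},          # D only conflicts with C
}
assignment = greedy_frequency_assignment(beams)
```

Greedy coloring is fast enough to rerun frequently, which matters when both the routers and the ‘seam’ are moving.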
Fast Data Problem
• quest for (OO)DB technology to address ‘France’ as make-or-break use case
• query expressive power
• complex constraint satisfaction
• query handling throughput
• 3-4 month benchmarking effort
• France solved ‘out-of-band’ (legally)
(diagram: seam; broadcast = +$$$ vs broadcast = -$$$ (lawsuit))

don’t overfit your architecture to an extreme requirement,
unless it’s from an extreme (paying) user
Big Data Problem
• systems management
• manage 66 ‘nodes’
• nodes moving at 10K mph
• ‘seam’ moving at 20K mph
• sounds harder than trivial, but not too hard
‘Pre’ Lambda Solution
• Dumb edge | smart core approach
• 15K events/sec/satellite → 1M events/sec
• Fast & Approximate - FMEA: ‘compiled’ lookup table for failure modes
• Slow & Precise - model-based reasoning on satellite models
• Simple, straightforward & wrong.
‘Pre’ Lambda architecture:
untidy satellite firehose (1M events/sec) → FMEA (real-time expert system) + Model-Based Reasoning → actionable insights
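The fast/slow split above can be sketched in a few lines. All event and failure-mode names here are invented for illustration; the fast path is a precompiled FMEA lookup table (approximate, constant-time), and anything it cannot classify falls through to a slow "model-based" diagnosis stub.

```python
# Minimal sketch of the pre-Lambda fast/slow split, with invented names.

FMEA_TABLE = {                      # hypothetical 'compiled' failure modes
    ("battery", "undervolt"): "switch to backup power bus",
    ("thruster", "overtemp"): "throttle thruster duty cycle",
}

def slow_model_based_diagnosis(event):
    # Stand-in for reasoning over a full satellite model (minutes, not ms).
    subsystem, symptom = event
    return f"schedule deep diagnosis for {subsystem}/{symptom}"

def handle(event):
    action = FMEA_TABLE.get(event)                   # fast & approximate
    if action is None:
        action = slow_model_based_diagnosis(event)   # slow & precise
    return action
```

The same shape recurs in Lambda architectures: a speed layer answers immediately from precomputed state, while a batch layer recomputes the precise answer.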
Yet, an architecture that is ‘rinsed and repeated’ over the years.
why does dumb edge smart cloud endure?
• edges are expensive ($2B)
• when edges go wrong (break / blow up / collide), they make news headlines
• nobody messes with an ‘edge’ once it works
• clouds don’t make for good news headlines
• thus, implementing an end-to-end architecture causes culture clashes
edge: $$$$$, T-30 yrs (“over my dead body”) | cloud: $, T-0 (“iterate & refine”)
an almost repeat (Industrial IoT)
• edges are messy & domain specific
• creating them means dealing with culture clashes
• but .. an ounce of edge is worth a pound of cloud
Things to consider
• Problem statement. What’s your ‘France’?
• colorful sub-problem, strategy overfit
• Architecture. small fixes to the IT/OT gap can go a long way toward a simpler problem
• Technology Choices. best practices & the risk of ‘rewardless risk’
• right: make average programmers productive with new tech
• frequent: turn great programmers into average
Big Data to Deep Metadata
streaming video (TV) ~ 1 petabyte/day
timescales: second | minute | hour | day/week | epochal
• detect & replace ads
• create playlists by player, play, sentiment
• identify minor characters with rabid fan following
• rejuvenate old content, derive new content
• ‘chapterize’ by player, play, sentiment
Platform Triage Challenge: new product, new market
• one core technology, many markets
• platform triaging challenge: what drives the platform?
• highest (but uncertain) $ potential?
• ‘extreme’ requirement?
• sparsest competition?
• use case outlier is your biggest customer
deep metadata technology → SaaS data platform → Advertising | Search | Video concept maps
ad replacement use case
• speed
• few days (on-demand content)
• few seconds (real-time rebroadcast with new ads)
• precision
• low - best effort, for low-cost international content for niche audiences
• high - frame level for expensive content, e.g. sports / $10M/episode programming
• errors
• 90% accuracy - ok for long-tail content
• ‘five nines’ for premium content
(chart: ad replacement opportunity space; axes precision, accuracy, speed; largest customer marked)
occam’s razor works (again)
• build to simplicity
• loose coupling between data engineering & equipment engineering
• modularize complexity
• ‘differentiate your product’ changes
• ‘necessary evil’ changes
Architecture: data-only approach → + 1st party integration (dynamically configure ad splicers) → + 3rd party knobs (dynamically refresh CDN)
but, what if ..
• Data is untidy
• Interpretation is subjective/cultural
• Automation is aspirational but quixotic
human-powered analytics
• some analytics tasks are too ‘slippery’ for machines
• data hard to characterize
• uneven video quality of ‘old’ archives
• untidy
• insights are subjective
• need for human augmentation
• humans generate ‘training’ sets to bootstrap machine learning
• humans completely take over some tasks
machines vs humans
• crowdsourcing & human-powered computing
• has been the ‘next big thing’ for a while
• checkered history: uneven output, fraud, uneven throughput

Machines  | Humans
fast      | slow
brittle   | malleable
objective | subjective
clear     | nuanced
machines vs humans
• much of that has changed
• Amazon Mechanical Turk: 500K active users
• the ‘human machine’ can return substantial jobs in under 30 mins
• quantifiable as a machine for many media tasks - latency, quality, error rate, throughput
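One common way to make the ‘human machine’ quantifiable, sketched with made-up labels: send each task to several independent workers and aggregate by majority vote, so the pooled error rate drops below any single worker’s. This is a generic redundancy scheme, not a specific Mechanical Turk API.

```python
# Sketch of redundancy + majority vote for crowdsourced labels.
from collections import Counter

def majority_vote(answers):
    """answers: list of labels from independent workers for one task."""
    return Counter(answers).most_common(1)[0][0]

# Three noisy workers on the same ad-boundary question:
label = majority_vote(["ad", "ad", "content"])   # one outvoted error
```

With k workers of individual accuracy p > 0.5, majority accuracy rises with k, which is what lets latency, quality, and error rate be measured and tuned like machine parameters.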
Hybrid Architecture
Things to consider
• Beware ‘France’ in other forms:
• customer with loudest voice & ‘holy grail’ hairball
• Dealing with data quality & variability
• crowdsourcing has come a long way as credible ‘engine’
• If big data is the answer, what is the question? (have a strong opinion, held weakly)
• decision rationalization
• process automation
• human ‘power tool’ (e.g. compelling visualization) vs imperfect automation
startup data jiu-jitsu
• How to create a data-driven strategy before the data shows up?
• rationalize future SaaS revenue models
• justify product decisions in a data-driven manner
• how ‘intelligent’ can lighting control be with 50-100K users?
• how do people use dimmers (continuous or quantized)? UX implications
need data for product ↔ need product for data
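The dimmer question above is answerable with a very small analysis once even a modest event log exists. A hypothetical sketch (the 25% grid and ±2 tolerance are assumptions to tune per product): if most logged levels sit near a coarse grid, users are effectively quantizing a continuous control, which argues for discrete presets in the UX.

```python
# Hypothetical check: do dimmer readings cluster on a coarse grid?

def quantized_fraction(levels, step=25, tol=2):
    """Fraction of dimmer readings (0-100) within tol of a multiple of step."""
    on_grid = sum(1 for v in levels if min(v % step, step - v % step) <= tol)
    return on_grid / len(levels)

readings = [0, 24, 50, 51, 76, 100, 33, 49]   # made-up event log
score = quantized_fraction(readings)           # → 0.875
```

A score near 1.0 suggests quantized use; a flat spread of levels suggests genuinely continuous dimming.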
data set dilemma
• standard sources (e.g. Kaggle & UCI) insufficient
• few ‘physical world’ datasets
• expensive to collect
• may be specialized (vendor-specific)
• dataset proxies for IoT actuation may not work
• energy utilization != switch usage
big data, small start
• physical world data likely to be smaller (1-10 homes, few months)
• setup costs limit size of public datasets
• e.g. UMass Smart* light switch dataset
big data, small start
• consider data ‘augmentation’
• standard practice in AI (deep learning): horizontal flips, random crops …
• under-used in the data space
• may need some thought on perturbation models for your domain
(figure: real vs synthesized examples)
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
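Carrying the augmentation idea to the physical-world data discussed above, a domain-flavored sketch (not from the talk; the ±5 minute jitter is an assumed perturbation model to validate per domain): multiply a small switch-usage dataset by perturbing event timestamps, the time-series analogue of flips and crops in image augmentation.

```python
# Sketch: augment one day of switch events by jittering timestamps.
import random

def augment_day(event_minutes, jitter=5, copies=3, seed=0):
    """event_minutes: minutes-past-midnight of switch events for one day."""
    rng = random.Random(seed)
    out = []
    for _ in range(copies):
        out.append(sorted(
            min(1439, max(0, t + rng.randint(-jitter, jitter)))  # clamp to a day
            for t in event_minutes
        ))
    return out

day = [7 * 60, 18 * 60 + 30, 23 * 60]   # toy log: 7:00, 18:30, 23:00
synthetic = augment_day(day)            # 3 perturbed copies of the day
```

The hard part, as the slide notes, is choosing perturbations that preserve the behavior being modeled (e.g. jittering times is plausible; swapping morning and evening events is not).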
In short ..
• big data success - equal parts tech & non-tech
• solving the right problem, not just solving the problem right
• revisit problem, and what success means