big data : bits of history, words of advice
TRANSCRIPT
Big Data : Bits of History, Words of Advice
Venu Vasudevan
GLSEC Big Data Meetup
Big Data Past
• Big
• Fast
• intelligent media
• IoT
• satellites
Big Data : Behavioral
• the ‘V’ view of Big Data challenges
• number of V’s up for debate
Big Data : Architectural
untidy data firehose → clean analytics
• fast & good vs slower & much better
• Lambda architecture
• Lake architecture
• Stream architecture
This Talk
• Behavioral View
• Technology Solution Stack
• ‘Middleware’ (benefit of hindsight)
• governance culture (gap)
• data economics
• ownership food fights
3 data points
• Big
• Fast
• intelligent media
• IoT
• satellites
Iridium
• mobile routers (10K mph), fixed people
• no repeated patterns
• satellites N-S movement
• earth E-W movement
• regular topology, irregular exceptions
• solar flares
• military satellite presence
Fast Data Problem
• cellular frequency allocation (graph coloring problem)
• frequent fast recalculations (fast routers + semi-fast earth)
• transmit / no-transmit (solar flares, military satellite presence)
• moving ‘seam’
• + ‘France’
(diagram: seam irregularities; broadcast = +$$$ vs broadcast = -$$$ (lawsuit))
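The frequency-allocation core named above is a classic graph coloring problem. Below is a minimal, hypothetical sketch (beam names and the greedy heuristic are illustrative, not Iridium's actual solver, which also handled no-transmit constraints and fast recalculation): beams that can interfere share an edge and must receive different frequencies.

```python
# Hypothetical sketch: greedy graph coloring of an interference graph.
# Beams that can interfere share an edge and must get different
# frequencies ("colors"). The real system layered extra constraints
# (no-transmit zones, recalculation speed) on top of this core.

def greedy_frequency_assignment(interference):
    """interference: dict mapping beam -> set of conflicting beams (symmetric)."""
    assignment = {}
    # Color higher-degree beams first (a common greedy heuristic).
    for beam in sorted(interference, key=lambda b: len(interference[b]), reverse=True):
        taken = {assignment[n] for n in interference[beam] if n in assignment}
        freq = 0
        while freq in taken:   # lowest frequency not used by a neighbor
            freq += 1
        assignment[beam] = freq
    return assignment

# Toy example: a triangle of mutually interfering beams needs 3 frequencies.
beams = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},          # D only conflicts with C
}
assignment = greedy_frequency_assignment(beams)
```

Greedy coloring is fast enough to rerun frequently, which matters when both the routers and the ‘seam’ are moving.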
Fast Data Problem
• quest for (OO)DB technology to address ‘France’ as make-or-break use case
• query expressive power
• complex constraint satisfaction
• query handling throughput
• 3-4 month benchmarking effort
• France solved ‘out-of-band’ (legally)
(diagram: seam; broadcast = +$$$ vs broadcast = -$$$ (lawsuit))

don’t overfit your architecture to an extreme requirement,
unless it’s from an extreme (paying) user
Big Data Problem
• systems management
• manage 66 ‘nodes’
• nodes moving at 10K mph
• ‘seam’ moving at 20K mph
• sounds harder than trivial, but not too hard
‘Pre’ Lambda Solution
• Dumb edge | smart core approach
• 15K events/sec/satellite → 1M events/sec
• Fast & Approximate - FMEA: ‘compiled’ lookup table for failure modes
• Slow & Precise - model-based reasoning on satellite models
• Simple, straightforward & wrong.
‘Pre’ Lambda architecture:
untidy satellite firehose (1M events/sec) → FMEA (real-time expert system) + Model-Based Reasoning → actionable insights
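The fast/slow split above can be sketched in a few lines. All event and failure-mode names here are invented for illustration; the fast path is a precompiled FMEA lookup table (approximate, constant-time), and anything it cannot classify falls through to a slow "model-based" diagnosis stub.

```python
# Minimal sketch of the pre-Lambda fast/slow split, with invented names.

FMEA_TABLE = {                      # hypothetical 'compiled' failure modes
    ("battery", "undervolt"): "switch to backup power bus",
    ("thruster", "overtemp"): "throttle thruster duty cycle",
}

def slow_model_based_diagnosis(event):
    # Stand-in for reasoning over a full satellite model (minutes, not ms).
    subsystem, symptom = event
    return f"schedule deep diagnosis for {subsystem}/{symptom}"

def handle(event):
    action = FMEA_TABLE.get(event)                   # fast & approximate
    if action is None:
        action = slow_model_based_diagnosis(event)   # slow & precise
    return action
```

The same shape recurs in Lambda architectures: a speed layer answers immediately from precomputed state, while a batch layer recomputes the precise answer.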
Yet, an architecture that is ‘rinsed and repeated’ over the years.
why does dumb edge smart cloud endure?
• edges are expensive ($2B)
• when edges go wrong (break / blow up / collide), they make news headlines
• nobody messes with an ‘edge’ once it works
• clouds don’t make for good news headlines
• thus, implementing an end-to-end architecture causes culture clashes
edge: $$$$$, T-30 yrs (“over my dead body”) | cloud: $, T-0 (“iterate & refine”)
an almost repeat (Industrial IoT)
• edges are messy & domain specific
• creating them means dealing with culture clashes
• but .. an ounce of edge is worth a pound of cloud
Things to consider
• Problem statement. What’s your ‘France’?
• colorful sub-problem, strategy overfit
• Architecture. small fixes to the IT/OT gap can go a long way toward a simpler problem
• Technology Choices. best practices & the risk of ‘rewardless risk’
• right: make average programmers productive with new tech
• frequent: turn great programmers into average
Big Data to Deep Metadata
streaming video (TV) ~ 1 petabyte/day
timescales: second | minute | hour | day/week | epochal
• detect & replace ads
• create playlists by player, play, sentiment
• identify minor characters with rabid fan following
• rejuvenate old content, derive new content
• ‘chapterize’ by player, play, sentiment
Platform Triage Challenge: new product, new market
• one core technology, many markets
• platform triaging challenge: what drives the platform?
• highest (but uncertain) $ potential?
• ‘extreme’ requirement?
• sparsest competition?
• use case outlier is your biggest customer
deep metadata technology → SaaS data platform → Advertising | Search | Video concept maps
ad replacement use case
• speed
• few days (on-demand content)
• few seconds (real-time rebroadcast with new ads)
• precision
• low - best effort, for low-cost international content for niche audiences
• high - frame level for expensive content, e.g. sports / $10M/episode programming
• errors
• 90% accuracy - ok for long-tail content
• ‘five nines’ for premium content
(chart: ad replacement opportunity space; axes precision, accuracy, speed; largest customer marked)
occam’s razor works (again)
• build to simplicity
• loose coupling between data engineering & equipment engineering
• modularize complexity
• ‘differentiate your product’ changes
• ‘necessary evil’ changes
Architecture: data-only approach → + 1st party integration (dynamically configure ad splicers) → + 3rd party knobs (dynamically refresh CDN)
but, what if ..
• Data is untidy
• Interpretation is subjective/cultural
• Automation is aspirational but quixotic
human-powered analytics
• some analytics tasks are too ‘slippery’ for machines
• data hard to characterize
• uneven video quality of ‘old’ archives
• untidy
• insights are subjective
• need for human augmentation
• humans generate ‘training’ sets to bootstrap machine learning
• humans completely take over some tasks
machines vs humans
• crowdsourcing & human-powered computing
• has been the ‘next big thing’ for a while
• checkered history: uneven output, fraud, uneven throughput

Machines  | Humans
fast      | slow
brittle   | malleable
objective | subjective
clear     | nuanced
machines vs humans
• much of that has changed
• Amazon Mechanical Turk: 500K active users
• the ‘human machine’ can return substantial jobs in under 30 mins
• quantifiable as a machine for many media tasks - latency, quality, error rate, throughput
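One common way to make the ‘human machine’ quantifiable, sketched with made-up labels: send each task to several independent workers and aggregate by majority vote, so the pooled error rate drops below any single worker’s. This is a generic redundancy scheme, not a specific Mechanical Turk API.

```python
# Sketch of redundancy + majority vote for crowdsourced labels.
from collections import Counter

def majority_vote(answers):
    """answers: list of labels from independent workers for one task."""
    return Counter(answers).most_common(1)[0][0]

# Three noisy workers on the same ad-boundary question:
label = majority_vote(["ad", "ad", "content"])   # one outvoted error
```

With k workers of individual accuracy p > 0.5, majority accuracy rises with k, which is what lets latency, quality, and error rate be measured and tuned like machine parameters.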
Hybrid Architecture
Things to consider
• Beware ‘France’ in other forms:
• customer with loudest voice & ‘holy grail’ hairball
• Dealing with data quality & variability
• crowdsourcing has come a long way as credible ‘engine’
• If big data is the answer, what is the question? (have a strong opinion, held weakly)
• decision rationalization
• process automation
• human ‘power tool’ (e.g. compelling visualization) vs imperfect automation
startup data jiu-jitsu
• How to create a data-driven strategy before the data shows up?
• rationalize future SaaS revenue models
• justify product decisions in a data-driven manner
• how ‘intelligent’ can lighting control be with 50-100K users?
• how do people use dimmers (continuous or quantized)? UX implications
need data for product ↔ need product for data
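The dimmer question above is answerable with a very small analysis once even a modest event log exists. A hypothetical sketch (the 25% grid and ±2 tolerance are assumptions to tune per product): if most logged levels sit near a coarse grid, users are effectively quantizing a continuous control, which argues for discrete presets in the UX.

```python
# Hypothetical check: do dimmer readings cluster on a coarse grid?

def quantized_fraction(levels, step=25, tol=2):
    """Fraction of dimmer readings (0-100) within tol of a multiple of step."""
    on_grid = sum(1 for v in levels if min(v % step, step - v % step) <= tol)
    return on_grid / len(levels)

readings = [0, 24, 50, 51, 76, 100, 33, 49]   # made-up event log
score = quantized_fraction(readings)           # → 0.875
```

A score near 1.0 suggests quantized use; a flat spread of levels suggests genuinely continuous dimming.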
data set dilemma
• standard sources (e.g. Kaggle & UCI) insufficient
• few ‘physical world’ datasets
• expensive to collect
• may be specialized (vendor-specific)
• dataset proxies for IoT actuation may not work
• energy utilization != switch usage
big data, small start
• physical world data likely to be smaller (1-10 homes, few months)
• setup costs limit size of public datasets
• e.g. UMass Smart* light switch dataset
big data, small start
• consider data ‘augmentation’
• standard practice in AI (deep learning): horizontal flips, random crops …
• under-used in the data space
• may need some thought on perturbation models for your domain
(figure: real vs synthesized examples)
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
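Carrying the augmentation idea to the physical-world data discussed above, a domain-flavored sketch (not from the talk; the ±5 minute jitter is an assumed perturbation model to validate per domain): multiply a small switch-usage dataset by perturbing event timestamps, the time-series analogue of flips and crops in image augmentation.

```python
# Sketch: augment one day of switch events by jittering timestamps.
import random

def augment_day(event_minutes, jitter=5, copies=3, seed=0):
    """event_minutes: minutes-past-midnight of switch events for one day."""
    rng = random.Random(seed)
    out = []
    for _ in range(copies):
        out.append(sorted(
            min(1439, max(0, t + rng.randint(-jitter, jitter)))  # clamp to a day
            for t in event_minutes
        ))
    return out

day = [7 * 60, 18 * 60 + 30, 23 * 60]   # toy log: 7:00, 18:30, 23:00
synthetic = augment_day(day)            # 3 perturbed copies of the day
```

The hard part, as the slide notes, is choosing perturbations that preserve the behavior being modeled (e.g. jittering times is plausible; swapping morning and evening events is not).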
In short ..
• big data success - equal parts tech & non-tech
• solving the right problem, not just solving the problem right
• revisit problem, and what success means