big data and analytics - sveriges riksbankarchive.riksbank.se/documents/forskning/konferenser... ·...
TRANSCRIPT
Daniel Gillblad, [email protected] Institute of Computer Science
BIG DATA AND ANALYTICS
www.sics.se
OUR RESEARCH: SURVIVING THE DATA FLOOD
www.sics.se
SOME QUESTIONS AROUND BIG DATA ANALYTICS1. What is Big Data?
2. What is different this time?
3. Why is it important?
4. What does the future look like?
www.sics.se
BIG DATA – 3V, AND MORE?
www.sics.se
BIG DATA IS PARALLELIZATION
Read on 10 000 machines:10 000 times faster
www.sics.se
BIG DATA IS SCALABILITYON COMMODITY HARDWARE
LOTS OF COMMODITY HARDWARE……BIG DATA IS FAULT TOLERANCE
www.sics.se
BIG DATA IS PLATFORMS AND MIDDLEWARE
www.sics.se
FIRST, THERE WAS MAP REDUCE
• Distribute data over commodity hardware (HDFS etc.) in data center
• Map and distribute computation to computers storing the data
• Choose fix, batch-oriented form of calculation: Map -> Reduce
www.sics.se
MAP REDUCE, EXAMPLE• Word count: “read -> map -> reduce -> write”
...public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {String line = value.toString();StringTokenizer tokenizer = new StringTokenizer(line);while (tokenizer.hasMoreTokens()) {word.set(tokenizer.nextToken());output.collect(word, one);
}}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)throws IOException {
int sum = 0;while (values.hasNext()) {sum += values.next().get();
}output.collect(key, new IntWritable(sum));
}}...
www.sics.se
NEXT-GEN MAP REDUCE
• Handle real-time, interactive and iterative tasks: Necessary for machine learning etc.
• Resilient distributed data sets – in-memory caching etc.
1. Transformations: Map, filter, join…2. Actions: Count, collect, save…• Faster, easier, more powerful
file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
www.sics.se
SPARK, REAL-WORLD EXAMPLE
• Incomprehensible example:logs.map(log => (log.data, log)).groupByKey()
.filter(_._2.size >= minimumValue)
.map(xLogsPair => new Aggregate(Pair(xLogsPair._1, logsExpanded(xLogsPair._2))))
• Defines both what/how and hints on how to parallelize
www.sics.se
IS THAT IT?
• Still not that easy to write – special kind of developers
• Fantastic productivity in the right hands• Developer support limited• Parallelization is difficult• Pretty difficult to write advanced analytics
www.sics.se
WHO USES THIS?
Growth of Big Data Analytics
• Big Data Analytics: gaining value from data – Web analytics, fraud detection, system
management, networking monitoring, business dashboard, …
2 Need to enable more users to perform data analytics http://www.delphianalytics.net
www.sics.se
SCALABILITY IS OFTEN NO LONGER THE MAIN ISSUE –ANALYTICS, EFFICIENCY AND PRODUCTIVITY ARE
www.sics.se
ACCESSIBLE BIG DATA TOOLS?
• Higher5level7abstractions7(RDDs75>7Dataframes etc.)
• R7and7Python7interfaces• Libraries:
• Spark7MLib,7GraphX• FlinkML,7Gelly• …
• Higher7level7languages• Imperative7Big7Data7processing
• …
www.sics.se
RAW PERFORMANCE MATTERS
https://databricks.com/blog/2015/04/28/project5tungsten5 bringing5spark5clos er5to5bare5meta l.html http://bid2.berkeley.edu/bid5data5project/
www.sics.se
DATA AT REST…
www.sics.se
… TO DATA IN MOTION
www.sics.se
FROM LAMBDA ARCHITECTURE…!
!
! !! !8!
*Figure*2:*Lambda*architecture*as*described*by*MAPR2.*
!
!Figure*3:*STREAMLINE*Flink*Architecture.*
!
! Lambda*Architecture! Flink*Streamline*Architecture!
Developer*Productivity!
• Implementations!in!multiple!systems!and!programming!languages!
• Orchestrating!multiple!system!architectures!for!efficient!usage!
• Implementation!of!data!analysis!programs!purely!in!a!single!system!
• No!system!boundaries!need!to!be!crossed,!only!one!system!needs!to!be!tuned!
Data*Transfer**Between*Layers!
• Data!transfer!happens!externally!through!the!serving!layer!with!corresponding!system!latency!overhead!
• Expert!knowledge!of!all!layers!needed!for!optimal!performance!
• Data!transfer!(between!batch!and!streaming)!happens!internally!to!the!system!
• Automatically!optimized!for!maximal!performance!
Code*Base*Maintenance!
• Upgrades!and!version!changes!independently!in!multiple!systems!
• Possible!version!conflicts!across!frameworks!on!the!user7side!
• Updates!are!solely!dependent!on!Flink!and!changes!are!made!in!a!single!API!
• Versioning!related!problems!handled!by!Flink!developers!
Debuggability! • Debugging!across!multiple!systems! • Out!of!the!box!debugging!tools!for!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!2!https://www.mapr.com/sites/default/files/otherpageimages/lambda7architecture727800.jpg!
https://www.mapr.com/sites/default/files/otherpageimages/lambda5architecture525800.jpg
…TO UNIFIED ARCHITECTURE
www.sics.se
WHAT IS DIFFERENT THIS TIME?
• Haven’t we seen this before? AI, Data Mining etc.?• Things are a bit different this time around:
1. We have the computational power and middleware (cloud, Big Data middleware)
2. We have the data (Human generated, IoT)3. We have the algorithms
www.sics.se
THE UNREASONABLE EFFECTIVENESS OF DATA• In a wide array of academic fields, the ability to effectively process data is
superseding other more classical modes of research.
“More data trumps better algorithms”*
*“The7Unreasonable7Effectiveness7of7Data”7[Halevey et7al709]
www.sics.se
EXAMPLE: POPULATION SCALE GENOMICS
[Image7source:7Patterson,7Fighting7the7Big7C7with the7Big7D,72014]
www.sics.se
EXAMPLE: POPULATION SCALE GENOMICS
[Image7source:7Patterson,7Fighting7the7Big7C7with the7Big7D,72014]
www.sics.se
Data set
Graph View
maternal
ardent
shia
deciduous
badly
guerrilla
benedictine
armoured
aerobic
franciscan
geometric
weighs
slavic
favorablearchaeological
bacterial
middleweight
southbound
wbc
fluorescent
interstellar
flyweight
turkic
stricter
extermination
featherweightbehavioural
westbound
eastbound
sta
sodium
seriously
kappa
sigma
taoist
buddhistcalibre
mechanized
welterweight
lcd
rabbinicashkenazi
psychic
impressionist
staunch
didnt
weighing
tenor
sizeable
inflicting
liver
topographic
wba
influenza
incredibly
prescription
tarot
tremendous
czar
berber
gravely
presidentialgubernatorial
pour
fiber
immense
amorphous
stem
bassist
abdel
tau
mythological
mount
ramesses
ghulam
pounder extremely
steam
risc
unfavorable
colder
thtre
projective
casimir
dietary
acidic
aluminum
philanthropic
dont
quarterback
wnba
semitic
terrible
beech
calcium
impendingconiferous
telepathic
inorganic
domesticated
authoritarian
exceptionally
volkswagen
liter
exceedingly
haji
kharkiv
hasidic
suikoden
anterior
cfm
escuela
feral
somervilleasp
corinthian
medford
freight
womens
quantitative
universitt
humid
keyboardist
multilateral
williamstown
mistakenly
antiaircraft
motocross
daytime
fcc
feminine
washing
greatly
antisubmarine
parque
zimbabwean
maulana
scriptstyle
runic
northeastern aris
teammate
neighbouring
sheikh
ammonium
bitterly
greenish
temperate
dsl
considerable
bowed
kedah
hilly
islamist
plum
microbial
atlanta
religiously
illyrian
luxurious
holyoke
viral
histoire
guitarist
extravagant
aeronautical
stuffed
infantry
magnesium
horrible
friedrichtorpedo
aryan
westport
malden
comparatively
abdominal
differentiable
fax
pashtun
nomadicalpha
khalid
thracian
wrestlemania
hash
fairbanks
evergreen
intestinal
frontman
bullion
diocesan
smythe
eocene
kamehameha
kitchener
mahayana
cosmic
cali
intentionally
calgary
vigorously
drastically
electrical
accidentally
erroneously
cincinnati
hindu
gsm
ib
frac
totally
mohammad
spruce
significantly
saturday
toledo
gg
positive
coloured
uv
medley
pius
violin
georg
safavid
articulated
devastating
huntington
fatality
epsilon
translink cylindrical
drier
monochrome
denver
bluish
turbocharged
armoredauf
developmental
vaughan
cdot
slam
ptolemaic
phi
wheel
polynesian
extratropical
ncaa
hazardous
sill
bedouin
kodiak
dans
arabian
torsion
fuzzy
seville
nottingham
plated
navajo
internationale
farmington
hennepinoak
coconut
warmer
nap
wheeled
grilled
madman
pahlavi
electromagnetic
redwood
horizontal
ultima
varying
transverse
undisclosed
csa
ideological
xia
romanesque
gnral
southeastern
zen
conflicting
amalgam
meter
gov
milling
neurological
walnut
astoria
grizzly
palais
intracellular
thurston
pi
gallic
hardwood
electronically
stout
num
sudbury
deteriorating
refresh
ftp
fertile
lankan
wichita
kilograms
jute
palepixar
fir
inner
poplar
gottfried
titanium
fda
kannadanyu
mips
continually
sitka
celestial
geo
sindh
helsinki
kidney
syed
alicante
dhaka
schematic
edwardian
algebraic
olive
pleistocene
aziz
usb
cw
synchronous
sep
fukuoka
snowy
ankara
sesame
persian
placer
confucian
acute
tai
wooster
pir
kcb
malankara
doric
adolescent
physiological
dreamworks
cistercian
couldnt
charlestown
intergovernmental
northwestern
mountainous
unilateral
hee
grassland
keynote
downward
forza
hari
gartermanganese
cauchy telugu
pusher
cyrillic
museo
respiratory
stunning
silvery
quincy
mystical
reluctantly
firestone
wirral
gnostic
lugano
herding
conifer
enid
phenomenal
poorest
oricon
syntacticrotting
mulberry
batu
drainage
tian
lethbridge
baritone
stylistic
anthropomorphic
lookup
exponential
perak
valencia
strict
dowager
dependencies
tanning
ste
defensive
falsely
sima
proxy
escorting
imam
bolivia
parser
dialog
homosexual
mlaga
sassanid
socially
melee
cigarette
salim
reformist
ultraviolet
phylogenetic
mohammed
fig
bengali
canine
anarcho
defendingenriched
vegetable
uthman
caterpillars
zoological
annual
realist
outlying
alfonso
abdullahmayan
tunisian
expressly
broadband
serotonin
supernatural
lambda
carnivorous
indigenous
caltech
warmly
bahadur
malm
lunar
unorthodox
dreamwave
marietta
thatched appleton
hasansquash
lifeguard
rightarrow
madre
psi
copper
benign
beta
han
rue
superficially
philosophical
sectarian
topological
gare
profoundlyarchie
aspx
shoe
sher
governmental
figurative
ccd
flammable
pinball
hind
intercity
default
ahmad
sanchounderwent
ctv
middletown
palazzo
hickory
parliamentary
rhythm
vegetarian
zeta
elm
cleveland
billion hilton
analogue
dense
fil
kart
sigismund
cavalry
gamma
albrecht
gravitational
yusuf
quan
orient
uncivil
sixty
flu
guyed
speciality
graz
vastly
monstrous
pulmonary
islamabad
computational
differing
caller
bodied
fbi
willow
varies
roasted
enormous
sturm
pentium
gorges
bahia
super
tinker
remotely
souvenir
geographic
ismail
markham
protestant
cherry
sandstone
inline
regenerative
sunflower
uniformly
mangrove
modernist
oslo
intangible
optimal
serra
watercolor
shaykh
byu
birch
bhp
scandinavian
ethical
wooded
vapor
surrealist
talmudic
answering
reconnaissance
walled
disastrous
satirical
huey
wadysaw
senegalese
centric
hiroshima
excessively
esoteric
seventy
indianapolis
ninth
officially
biblical
sexual
gujarati
banco
christoph
littleton
nigerian
tambon
quadruple
jat
multidisciplinary
relatively
kurdish
dull
bleach
eligibility
incoming
poultry
halifax
litchfield
es
kbe
watertown
mobile
wrongly
mallorca
hyun
paternal
sunni
severely
guerilla
anaerobic
augustinian
geometrical
weighed
germanic
favourablearcheological
fungal
northbound
incandescent
interplanetary
stringent
refugee
behavioral
santa
potassium
jain
caliber
bantamweight
motorized
crt
rabbinical
sephardic
expressionist
outspoken
doesnt
sizable
inflict
topographical
ibf
extraordinarily
illicit
greeting
tsar
theta
mayoral
fibre
crystalline
germ
drummer
abdul
mythical
mt
antiochus
rifled
superhuman
diesel
powerpc
maison
hyperbolic
nutritional
alkaline
aluminium
charitable
qb
nba
pine
imminent
organic
allegorical
totalitarian
vw
litre
lviv
messianic
posterior
stray
passenger
mens
waltham
qualitative
verlag
eucalyptus
bilateral
shaolin
bmx
primetime
masculine
asw
universidad
bangladeshi
panathinaikos
shaikh
fiercely
brownish
arid
adsl
neighboring
plucked
leftist
racially
luxury
lavish
biomedical
parachute
horrific
inflatable
stamford
thoracic
recursive
dramatically
anchorage
commemorative
suffragan
icc
miocene
guelph
theravada
medelln
alto
deliberately
mechanical
inadvertently
adverse
cdma
baccalaureate
sqrt
utterly
muhammad
fatally
substantially
weekday
columbus
qc
epa
negative
colored
freestyle
cello
johann
decker
catastrophic
rockaway
unemployment
spherical
boulder
rugged
aspirated
ausevolutionary
completely
robbie
itf
umayyad
mega
tropical
invitational
radioactive
stillwater
sous
sinai
snack
programmable
welded
cherokee
parc
langle
bloomfield
euclidmaple
cooler
unc
fried
grenadier
itc
vertical
bowel
clement
longitudinal
unspecified
bayesian
doctrinal
zhou
gothic southwestern
tibetan
edmonton
metre
edu
cnc
autoimmune
cedar
hillsboro
hershey
machine
extracellular
mandy
balkan
placebo
flattened
prod
worsening
toxic
dns
smallpox
alluvial
cuyahoga
flour
celincorrectly
outer
marathi
constantly
heavenly
zinc
balochistan
turku
chittagong
feynman
victorian
scsi
ayatollah
upn
numerical
asynchronous
jul
nara
misty
zmir
tonkin
coal
chronic
ho
hazrat
syriac
ionic
teenage
biochemical
cant
balfour
ding
montanes
ipv
upward
davao
amar
venomous
lithium
nucleotidemalayalam
bladed
phoenician
castel
spectacular
springfield
magical
willingly
goodyear
heretical
fribourg
blooded
castor
poorer
thrice
grammatical
napolon
decaying
karim
geyser
guang
soprano
morphological
periodic
penang
cdiz
zane
consort
hydrographic
bunk
brandywine
sainte
offensive
xiao
escorted
janeiro
oppressive
cardboard
heterosexual
seleucid
environmentally
cigar
infrared
mohamed
montreux
umar
gastrointestinal
meiji
valle
rooster
ska
reigning
depleted
larvae
landscaped
yearly
moroccan
explicitly
wireless
dopamine
herbivorous
enthusiastically
nader
stockholm
meteorite
unconventional
jesuit
cmg
fawcett
retractable
milwaukee
tennis
faire
pear
picket
cordillera
substantialcypress
malignant
vaguelymetaphysical
escalating
hausdorff
negatively
orc
slr
colorless
fore
aboutus
piazza
undergoing
sidi
millimeter
cbc
sabha
flamenco
vegan
trillion
hyatt
analog
lush
gael
harness
erich
electrostatic
ali
pci
sarcastic
specialty
linz
renal
analytical
divergent
arboreal
dea
vary
roast
hugemusik
earthen
hula
stringer
antique
geographical
rennes
pentecostal
peach
limestone
roller
mustard
evenly
postmodern
trondheim
tangible
thyroid
optimum
estdio
ozark
prophet
empress
demonic
horsepower
moral
forested
atmospheric
johor
recon
affine
pelvic
humorous
juliette
ghanaian
aerospace
overly
occult
eighty
valparaiso
regulatory
eighth
formallyphilipp
thoma
jutland
muban
triple
rajput
holistic
fairly
albanian
ranma
mango
psychotic
aboriginal
outgoing
dairy
vila
ja
kcmg
cellular
gcb
majorca
eun
Double-click on empty area to zoom in, shift-double-click to zoom out. Click and drag on empty area to pan. Double-click on node to highlight immediate neighours.
SICS, Swedish Institute of Computer Science, 2014
www.sics.se
BIG DATA – A BIG MISTAKE?
• “Four oversimplified articles of faith:”1. Data analysis produces uncannily accurate results2. Every single data point can be captured, making old
statistical sampling techniques obsolete3. It is passé to fret about what causes what, because
statistical correlation tells us what we need to know4. “The End of Theory”: With enough data, the numbers speak
for themselves• Practitioners do not necessarily believe this, and
neither should youFrom Big Data: are we making a big mistake?, Tim Harford, Financial Times March 28 2014
www.sics.se
BIG DATA TO BIG VALUE?
Big Data Big Value
www.sics.se
BIG DATA TO BIG VALUE?
Big Data Big Value
Real-world Data Model Conclusion
Question
www.sics.se
BDA: WORKFLOW NOT FUNDAMENTALLY DIFFERENT
Acquisition
Cleaning
RepresentationCase-based
Statistical Logical Kernel-based
Validation
Deployment
www.sics.se
EVERYTHING IS ABOUT MACHINE LEARNING ANYWAY…
A Few Useful Things to Know About Machine LearningPedro DomingosCACM 55:10http://dl.acm.org/citation.cfm?id=2347755
www.sics.se
LET’S START WITH THE OBVIOUS
It is a key component of digital services It will help us understand the world better
Lindh, J et al.: PLoS ONE 7(11), 2012.
www.sics.se
…BUT IT IS REALLY, REALLY DIFFICULT
Sample bias, data veracity Model complexity
…+ user and analyst bias…
www.sics.se
SCALABLE, PRIVACY-PRESERVING MOBILITY ANALYTICSFROM NETWORK DATA
A1 B1
C1 D1 E1
F1
G1
A2
A3 A4 A5 A6
A7 A8
B2
B3 B4 B5
B6
C2 C3
C4 C5
D2
D3
D4
E2 E3
E4F2 F3
F4 F5 F6
F7 F8
G2
G4G3 G5
G6 G7 G8
H2 H3
H1
Trajectory 1
Trajectory 2
Trajectory 3
A3, A4, A5, C1, D1, D2, B4, B1 F2, F3, F6, G1, C5, D3, E2, E1, B5, B2 F7, F5, F6, G1, G2, D4, E2, E3, B6
Urban planning Traffic management
Crisis management
Consumerapplications
Networkmanagement
www.sics.se
AN UNSOLVABLE PROBLEM?• Moving datasets can be difficult
• Large and sensitive (business value, integrity issues, etc.)• Can allow for re-identification in anonymized data
• Combining several datasets outside their collection entity is key to many (public) applications, but…
Centralized datasets: Technically feasible Politically hardFederated datasets: Technically difficult Politically solvable
www.sics.se
BIG DATA: KEY COMPONENT IN AUTOMATION
www.sics.se
MASSIVE LEARNING SYSTEMS
www.sics.se
LARGER AND LARGER REPRESENTATIONS
ConvolutionPoolingSoftmaxOther
www.sics.se
UNIFICATION OF MACHINE LEARNING?
Alternating Direction Method of MultipliersS. Boyd, N. Parikh, et al.http://stanford.edu/~boyd/papers/admm_distr_stats.html
www.sics.se
MOVING TO PROBABILISTIC APPROXIMATIONS
https://highlyscalable.wordpress.com/2012/05/01/probabilistic5structures5web5analytics5data5mining/
www.sics.se
ANALYTICS EVERYWHERE
www.sics.se
ANALYTICS AS A SERVICE
www.sics.se
WHERE ARE WE GOING?① Advanced analytics is moving towards large-scale Machine
Learning② Computation and storage platforms need to adapt and develop
LearningMachines@SICSAnalytics and system development on real data and use cases
www.sics.se
THE DATA DRIVEN SYSTEMS STACK, SICS
Hops PaaS (http://hops.io)
HDFS
Compute andresource management
Storage and streams
Data processingengines
Algorithms andframeworks
Analytics
Mesos
Flink
Graph algorithms
HBase Kafka
Yarn
Spark Storm
Scalable /On-line
Structural / DeepLearning
Diagnosis / classification
Anomaly detection Clustering
SICS ContributionsSicsthSense
ICN, SDN
SDN Monitoring
Text, Social Media
Autonomous RANNetworking
Data collection
www.sics.se
Network analytics
Vehicles &Transport
eHealth
Hops PaaS (http://hops.io)
HDFS
Compute andresource management
Storage and streams
Data processingengines
Algorithms andframeworks
Analytics
Mesos
Flink
Graph algorithms
HBase Kafka
Yarn
Spark Storm
Scalable /On-line
Structural / DeepLearning
Diagnosis / classification
Anomaly detection Clustering
SICS ContributionsSicsthSense
ICN, SDN
SDN Monitoring
Text, Social Media
Autonomous RANNetworking
Data collection
DATA DRIVEN SYSTEMS, APPLICATION EXAMPLES
www.sics.se
EXPERIENCE FROM APPLICATIONS• Big Data can be made small – consider the complete application
• Big Data can become small very quickly
• Beware of sample bias
• Distribute models, not data
• Distributed solutions, Statistical Machine Learning, and Bayesian statistics can help
www.sics.se
DISCUSSING APPLICATIONS: IT IS NOT OBVIOUS WHAT IS DIFFICULT
www.sics.se
WHAT ABOUT INTEGRITY?
• Not all Big Data data is integrity sensitive!• Media, measurements, science, …
• How data is (or potentially is) used is everything• Surveillance and data driven services are very different• It all comes down to trust• Transparency is key• Laws and regulations? How do we manage sales,
sharing?
www.sics.se
THE END OF COMPUTER SCIENCE
Computer science Data science
www.sics.se
…BUT JUST REMEMBER…
We’re not there yet – things are moving fast!
www.sics.se