Transcript
Page 1: Don't follow the followers

Copyright  Third  Nature,  Inc.  

Following Google
Or: Don’t Follow the Followers, Follow the Leaders
Or: The problem probably isn’t the database, the problem is probably you

May,  2014  

Mark  Madsen  www.ThirdNature.net  @markmadsen  

Page 2: Don't follow the followers

A Quick Intro

I may or may not be qualified to make the outlandish statements in this presentation. I have, however, made almost every mistake mentioned, and one learns by making mistakes.

You might say that’s my singular skill.

Page 3: Don't follow the followers

A BRIEF HISTORY OF DATA STORAGE AND RETRIEVAL

History isn’t taught in most university science curricula (probably because it’s a rabbit hole)

Page 4: Don't follow the followers

Databases: the problem statements over time

“Information has become a form of garbage, not only incapable of answering the most fundamental human questions but barely useful in providing coherent direction to the solution of even mundane problems.” – Neil Postman, 1985

“We have reason to fear that the multitude of books which grows every day in a prodigious fashion will make the following centuries fall into a state as barbarous as that of the centuries that followed the fall of the Roman Empire.” – Adrien Baillet, 1685

“…so many books that we do not even have time to read the titles.” – Anton Francesco Doni, 1550

Page 5: Don't follow the followers

The origin of information management problems

For ~5000 years we used counters of various types, eventually developing writing to cope with civilization’s needs.

Writing is more efficient than counters you can lose.

Sumerian bulla envelope with tokens. The beta period.

Page 6: Don't follow the followers

Information Technology v1.0: Clay Tech, ~3000 bce

The first information explosion

Page 7: Don't follow the followers

That explosion led to the first metadata

“tags”

Small piles in baskets are easy to tag and search

Page 8: Don't follow the followers

Metadata v1.0: tablets about tablets

When there are enough of these lying around you need to work on organization of the collection by categorizations, aka “taxonomy”, “schema”

Like working out what tables are in a database, or what files are stored in HDFS.

Babylonian library catalog ~2000 bce

Page 9: Don't follow the followers

Metadata v1.1: tablets about what’s in tablets

When literacy rates are higher and people need to communicate more effectively, you need to invent mechanisms to cope, like dictionaries.

Now we’re worried about what’s inside the documents, not where they are placed.

Synonym list, Ashurbanipal, ~900 bce

Page 10: Don't follow the followers

Clay Tech has some familiar limitations

Page 11: Don't follow the followers

Information Management v2.0 Paper Tech*

Lighter, denser, faster storage media

Page 12: Don't follow the followers

More information = need for new metadata techniques: content tagging, author catalogs

The first real library ~300 BC

Page 13: Don't follow the followers

Discovery of one tradeoff between clay and paper…

Recorded information creates permanence and instability

Page 14: Don't follow the followers

You can’t have discontinuous reading* until you have a random access technology.

*Indexing and encyclopedias are hard in linear scrolls. Hello ISAM

Page 15: Don't follow the followers

Paper Tech v2.1: increased storage density, smaller form factor, durability, high res RGB graphics

Page 16: Don't follow the followers

Paper Tech v2.2

The change in printing over time accelerates.

Block printing replaced by movable metal type.

The job of production is faster and cheaper.

Commoditization changes the landscape over the next 200 years.

The printed becomes more important than the printer.

Page 17: Don't follow the followers

The Elizabethan Era

Production: printing presses

Data management tech:
▪ Perfect copies
▪ Topical catalogs
▪ Font standardization
▪ Taxonomy ascends

Information explosion:
▪ 8M books in 1500
▪ 200M by 1600
▪ Commoditization
▪ Overload

Page 18: Don't follow the followers

Better embedded metadata: title page, colophon, ToC

Page 19: Don't follow the followers

The  Georgian  Era:  The  Explosion  of  Natural  Philosophy  

Sharing knowledge in a larger community required common language, structure

Page 20: Don't follow the followers

Linnaeus

Taxonomic classification
▪ Top down orientation
▪ Static structure
▪ Descriptive rather than explanatory

Page 21: Don't follow the followers

Buffon

Faceted classification
▪ Bottom up orientation
▪ Flexible structure
▪ Explanatory, descriptive

Page 22: Don't follow the followers

SQL vs. NoSQL

Think about why that happened

Page 23: Don't follow the followers

The Victorian Era

The powered printing information explosion:
▪ Card catalogs, cross-referencing, random access metadata
▪ Universal classification
▪ Extended information management debates
▪ Trading effort and flexibility for storage and retrieval
▪ Stereotyping

Page 24: Don't follow the followers

Melvil Dewey

Dewey Decimal System: taxonomic classification
▪ Top down orientation
▪ Static structure
▪ Descriptive rather than explanatory

Page 25: Don't follow the followers

Charles Ammi Cutter

Cutter Expansive Classification System (~1882): faceted classification
▪ Bottom up orientation
▪ More flexible structure
▪ Explanatory, descriptive

Page 26: Don't follow the followers

SQL vs. NoSQL

Page 27: Don't follow the followers

So why did Linnaeus and Dewey win?

Good enough wins the day

Pragmatism

It wasn’t solving the problem you thought it was.

Page 28: Don't follow the followers

Summarizing

Thousands of years of thought have been put into principles of organization and use. The abstract patterns are the same, only the implementation changed.
▪ Clay: tablets about tablets, tablets about what’s in tablets, 100X increase in data density over counting tech
▪ Scrolls: scrolls about scrolls, scrolls about what’s in scrolls, prepended/appended navigation, >100X increase in density
▪ Books: books about books, books about what’s in books, embedded internal navigation, >1000X increase in density
▪ Digitized data: similar, far denser, and different because it isn’t locked into physical forms

Page 29: Don't follow the followers

History is always the same

Every technology is a trade:
▪ Top down vs. bottom up
▪ Authority vs. anarchy
▪ Bureaucracy vs. autonomy
▪ Control vs. creativity
▪ Hierarchy vs. network
▪ Dynamic vs. static
▪ Power vs. ease
▪ Work up front vs. postponed

In every choice, something is lost and something is gained.

Page 30: Don't follow the followers

What lessons does this history teach us?

1. Information requires organizing principles.
2. Differences in scale require different principles.
3. There are multiple levels of information architecture and principles of organization.
4. At a key point in the adoption cycle, emphasis shifts from collection and management of information to dissemination and use.

First we record, then we use and share.

Like transaction processing, query & analysis.

Page 31: Don't follow the followers

Information management through human history always follows the same pattern

New technology development

creates

New methods to cope

creates

New information scale and availability

creates…

Page 32: Don't follow the followers

Big Data

Page 33: Don't follow the followers

DEALING WITH BIG: SOME SCALING HISTORY

“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.” - Henry Petroski

Page 34: Don't follow the followers

Why doesn’t your database scale?

Page 35: Don't follow the followers

Hipster bullshit

I can’t get MySQL to scale

therefore

Relational databases don’t scale

therefore

We must use NoSQL* for query too

*including Hadoop and related

Page 36: Don't follow the followers

Generation and collection always come first

Until there’s enough information, general problems of organization and information architecture don’t appear.

Page 37: Don't follow the followers

The early days: client/server as the starting point

We had transaction processing against the DB all on the same machine. Then on two separate machines.

Application

Page 38: Don't follow the followers

Scaling client/server

We added app servers and pooled connections.

Diagram: clients, app server

Page 39: Don't follow the followers

Scaling client/server

Then threw money at the problem in the form of hardware (made the database bigger).

Diagram: clients, app server

Page 40: Don't follow the followers

Web apps were a huge increase in concurrency

Architecture changed to reflect new stateless model. We had scalability and availability problems.

Diagram: webapps, load balancers, services

Page 41: Don't follow the followers

Increasing traffic?

Keep adding hardware, make the DB bigger. Limits reached, performance, scalability and availability problems.

Diagram: webapps, load balancers, services

Page 42: Don't follow the followers

Increasing traffic?

Read-only replicas will save the day! Still have scalability and availability problems.

And now operational overhead and problems.

Diagram: webapps, load balancers, services

Page 43: Don't follow the followers

Increasing traffic?

Sharding seems a fine thing. But it’s one letter from… Scaling and perf better, overhead and operational complexity high and worsening.

Diagram: webapps, load balancers, services
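To make the sharding tradeoff concrete, here is a minimal sketch of hash-based shard routing. It is not from the slides: the function names and the choice of a SHA-256 hash with modulo placement are assumptions for illustration only. It shows why the operational complexity grows: changing the shard count changes where keys live.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def route(keys, shard_count):
    """Group keys by the shard that would own them."""
    placement = {index: [] for index in range(shard_count)}
    for key in keys:
        placement[shard_for(key, shard_count)].append(key)
    return placement


def moved_keys(keys, old_count, new_count):
    """Keys whose owning shard changes when the shard count changes."""
    return [k for k in keys if shard_for(k, old_count) != shard_for(k, new_count)]
```

Calling moved_keys(keys, 4, 5) on a real key set typically reports that most keys must be relocated, which is the kind of overhead the slide is pointing at.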

Page 44: Don't follow the followers

Increasing traffic?

Let’s cache data at the service tier! Performance better, overhead and operational complexity higher.

Diagram: webapps, load balancers, services, caches
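A small sketch of what caching at the service tier implies, assuming a simple read-through cache. The class and method names are hypothetical and not from the slides. The point is the extra work: reads get faster, but someone now has to handle stale entries and invalidation.

```python
class ServiceCache:
    """Read-through cache in front of a slower lookup (hypothetical sketch)."""

    def __init__(self, load_from_db):
        self._load = load_from_db  # callable that fetches a value by key
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the cached value, loading it from the database on a miss."""
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._load(key)
        self._entries[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry; the application must call this after writes."""
        self._entries.pop(key, None)
```

If the underlying row changes and invalidate is never called, get keeps returning the old value. That bookkeeping is part of the "operational complexity higher" on this slide.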

Page 45: Don't follow the followers

What are the problems now?

1. More hardware, more things to break
2. More management and administration
3. More software complexity
4. Increasing distance for data to travel = latency
5. Data administration difficult to impossible

Page 46: Don't follow the followers

Problem solved?

Distributed NoSQL DB (handles cache, load balance, data distribution). Similar performance, simpler scaling, reduced operational problems, simpler application architecture. Finished!

Diagram: webapps, load balancers, services

Page 47: Don't follow the followers

TANSTAAFL

When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time.

Technologies are not perfect replacements for one another. Often not better, only different.

Page 48: Don't follow the followers

Not finished: remember the cycle of history…

The biggest hole in the prior section on scaling is that we scaled OLTP, what about OLAP?

Queries <> transactions.

Page 49: Don't follow the followers

Solving query problems

Aggregate or low selectivity queries were a problem early on, when people wanted to use the data.

Every report or query is a program.

Diagram: OLTP, report apps

Page 50: Don't follow the followers

Increasing data volume

Make it faster by throwing money at hardware (sound familiar?)

Diagram: OLTP, report apps

Page 51: Don't follow the followers

Increasing data volume

Replicas: split the workload and tune the systems based on their workload.

Diagram: OLTP, report apps

Page 52: Don't follow the followers

Increasing data volume breaks the old model

Devise a new architecture. Reschematize the database, eliminate cyclic joins, selective denormalization, query generators. But it takes bulk processing to reschematize the data.

Diagram: OLTP sources, ETL, query tools

Page 53: Don't follow the followers

Increasing data volume

Improve response time with caching in the query tools, and by using MOLAP tools that map into cache or memory. Like mainframe reporting. Or…

Diagram: OLTP sources, ETL, query tools with caches

Page 54: Don't follow the followers

Increasing data volume

Parallel processing for ETL. Distributed query databases for fine grained high volume parallelism.

Diagram: OLTP sources, parallel ETL, query tools with caches

Page 55: Don't follow the followers

The architecture looks familiar

Two workloads, two not dissimilar architectures:
▪ Load-balanced front ends
▪ Distributed caching layers
▪ Scalable distributed parallel databases

But the nature of the OLTP and OLAP workloads is very different. Forcing them into one platform is almost impossible for data architecture reasons and particularly at scale*

Page 56: Don't follow the followers

DATA PERSISTENCE AND STORES

Why would digital data be any different than clay or scrolls or books?

Page 57: Don't follow the followers

“Big data is unprecedented.” - Anyone involved with big data in even the most barely perceptible way

Page 58: Don't follow the followers

There’s a difference between having no past and actively rejecting it.

Page 59: Don't follow the followers

Timeline of database technology by decade:

1960s: MultiValue, Hierarchical: PICK, IMS, IDS, ADABAS
1970s: Relational: CODASYL, System R (SEQUEL), SQL/DS, INGRES (QUEL), Mimer, Oracle
1980s: RDBMS, SQL standard: DB2, Teradata, Informix, Sybase, Postgres
1990s: OODBMS, ORDBMS: Versant, Objectivity, Gemstone, Informix*, Oracle*
2000s: MPP Query, NoSQL: Netezza, Paraccel, Vertica, MongoDB, CouchBase, Riak, Cassandra
2010s: NewSQL: SciDB, MonetDB, NuoDB, CitusDB, Spanner, F1

Page 60: Don't follow the followers

A history of databases in No-tation

1970: NoSQL = We have no SQL
1980: NoSQL = Know SQL
2000: NoSQL = No SQL!
2005: NoSQL = Not only SQL
2013: NoSQL = No, SQL!

(R)DB(MS)

Page 61: Don't follow the followers

Tradeoffs?

“Query optimization is not rocket science. When you flunk out of query optimization, we make you go build rockets.”

Page 62: Don't follow the followers

Joins. Wait, there’s more than one?

Page 63: Don't follow the followers

In NoSQL Land, Optimizer is You!

Page 64: Don't follow the followers

Tradeoffs: In NoSQL, the DBMS is You Too

Services provided:
▪ Standard API/query layer*
▪ Transaction / consistency
▪ Query optimization
▪ Data navigation, joins
▪ Data access
▪ Storage management

Diagram: SQL database and NoSQL database side by side, each drawn as Application over Database

Anything not done by the DB becomes a developer’s task.
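As an illustration of "anything not done by the DB becomes a developer’s task", the sketch below computes the same per-customer totals twice: once by handing a join to a SQL engine (SQLite here, chosen only because it ships with Python) and once with a hand-written hash join in application code. The tables, names, and data are made up for the example.

```python
import sqlite3

# Hypothetical sample data: (id, name) and (order_id, customer_id, amount).
customers = [(1, "Ann"), (2, "Bob")]
orders = [(10, 1, 25.0), (11, 1, 5.0), (12, 2, 7.5)]


def sql_totals():
    """Let the database plan and run the join and the aggregation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    rows = conn.execute(
        "SELECT c.name, SUM(o.amount) FROM customer c "
        "JOIN orders o ON o.customer_id = c.id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()
    conn.close()
    return rows


def handwritten_totals():
    """Do the same work in the application: you pick the join algorithm."""
    names = dict(customers)  # build side of a hash join, keyed by customer id
    totals = {}
    for _order_id, customer_id, amount in orders:  # probe side
        name = names[customer_id]
        totals[name] = totals.get(name, 0.0) + amount
    return sorted(totals.items())
```

Both functions return the same rows. In the second one the developer has chosen the join order, the join method, and the aggregation strategy, which is the optimizer’s job in the first.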

Page 65: Don't follow the followers

The relational database is the franchise technology for storing and retrieving data, but…

1. Global, static schema model
2. No rich typing system
3. No management of natural ordering in data
4. Many are not a good fit for network parallel computing, aka cloud
5. Limited API in atomic SQL statement syntax & simple result set return
6. Poor developer support (in languages, in IDEs, in processes)

Relational: a good conceptual model, but a prematurely standardized implementation

Page 66: Don't follow the followers


What  I  did  not  list:  Scalability  and  performance  

Page 67: Don't follow the followers

BIGNESS AND SCALABILITY

Page 68: Don't follow the followers

Technology Capability and Data Volume

Source: Noumenal, Inc.

Page 69: Don't follow the followers

You can make a database emulate a KVS

If you map the shared event fields to fixed columns and an event type, then use a varchar or clob payload column, you can store arbitrary events in a database and query them via SQL and views (or column functions), and do it all in a single table.

Arbitrary data parsed at query time using native DB features like regex or UDF

Data common to all events in the database

Type code used to differentiate event payload formats.
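The single-table layout described above can be sketched as follows. This is an illustration under assumptions, not the schema from the talk: SQLite stands in for the database, and the table, column, and function names (event, payload, payload_value) are hypothetical. The payload is parsed at query time by a user-defined function.

```python
import re
import sqlite3


def payload_value(payload, key):
    """Hypothetical UDF: return the value stored under key in a 'k=v&k=v' payload."""
    if payload is None:
        return None
    match = re.search(r"(?:^|&)" + re.escape(key) + r"=([^&]*)", payload)
    return match.group(1) if match else None


def build_event_store():
    """Create a single-table event store in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("payload_value", 2, payload_value)
    conn.execute(
        "CREATE TABLE event ("
        " event_ts TEXT, user_id INTEGER,"  # data common to all events
        " event_type TEXT,"                 # type code for the payload format
        " payload TEXT)"                    # arbitrary data, parsed at query time
    )
    conn.executemany(
        "INSERT INTO event VALUES (?, ?, ?, ?)",
        [
            ("2011-10-18", 1234, "search", "language=en&source=hp&itm=i1,i2"),
            ("2011-10-18", 1234, "view", "itm=i7&price=9.99"),
        ],
    )
    return conn


def search_items(conn):
    """Query one event type and unpack a payload field with the UDF."""
    return conn.execute(
        "SELECT user_id, payload_value(payload, 'itm') FROM event "
        "WHERE event_type = 'search'"
    ).fetchall()
```

The same SELECT could be wrapped in a view per event type so that each payload format looks like an ordinary table to query tools.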

Page 70: Don't follow the followers

Data Platforms

Analyze & Report | Discover & Explore

Diagram labels:
▪ Platforms: Data Warehouse; Data Warehouse + Behavioral; Singularity
▪ System classes: Enterprise-class System; Low End Enterprise-class System; Commodity Hardware System
▪ Uses: Production Data Warehousing; Large Concurrent User-base; Contextual-Complex Analytics; Deep, Seasonal, Consumable Data Sets; Structure the Unstructured; Detect Patterns
▪ Concurrency: 500+ concurrent users; 150+ concurrent users; 5-10 concurrent users
▪ Data sizes: 6+PB, 40+PB, 20+PB

Thanks to eBay for these case slides.

Page 71: Don't follow the followers

How to use a database for semi-structured data

First the user-defined function (UDF)

Input row:

Start_dt    Guid  Sess_id  Page_id  Soj
2011-10-18  1234  1        15       Language=en&source=hp&itm=i1,i2,i3,i4,i5

/* NVL is the UDF on this slide: it returns the value stored under 'itm' in soj */
SELECT start_dt, guid, sess_id, page_id,
       NVL(e.soj, 'itm') AS item_list
FROM event e
WHERE e.start_dt = '2011-10-18'
  AND e.page_id = 3286 /* Search Results */

Result row:

Start_dt    Guid  Sess_id  Page_id  Item_list
2011-10-18  1234  1        15       i1,i2,i3,i4,i5

Thanks to eBay for these case slides.

Page 72: Don't follow the followers

How to use a database for semi-structured data

Then the table function (standard ANSI SQL)

WITH event (start_dt, item_list) AS (<previous SQL>)
SELECT
    start_dt,
    item_id, /* Individual Item */
    count(*)
FROM TABLE ( /* Normalize comma delimited list */
    normalize_list(start_dt, item_list, ',')
    RETURNS (start_dt, idx, item_id)
)
GROUP BY 1, 2
ORDER BY 3 DESC

Start_dt    Item_id  Count(*)
2011-10-18  i1       555
2011-10-18  i2       444
2011-10-18  i3       333
2011-10-18  i4       222
2011-10-18  i5       111

*syntax simplified

Thanks to eBay for these case slides.
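For readers without a database that supports table functions, here is a rough sketch of what the normalize step and the GROUP BY appear to do, written in plain Python. The function normalize_list mirrors the name used on the slide, but its behavior here is an assumption inferred from the slide's input and output; item_counts and the sample data are hypothetical.

```python
from collections import Counter


def normalize_list(start_dt, item_list, delimiter=","):
    """Yield one (start_dt, idx, item_id) row per element of a delimited list."""
    for idx, item_id in enumerate(item_list.split(delimiter), start=1):
        yield (start_dt, idx, item_id)


def item_counts(events):
    """Count items per day, like GROUP BY 1, 2 followed by ORDER BY 3 DESC."""
    counts = Counter()
    for start_dt, item_list in events:
        for dt, _idx, item_id in normalize_list(start_dt, item_list):
            counts[(dt, item_id)] += 1
    rows = [(dt, item_id, n) for (dt, item_id), n in counts.items()]
    # Sort by count descending; break ties by date and item for stable output.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))
```

Each input event is a (start_dt, item_list) pair, the same shape as the WITH clause on the slide.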

Page 73: Don't follow the followers

Pricing and performance: Hadoop is a storage and processing play, not a database play*

Source: Venturebeat

With big data systems, the cost of storing data is an order of magnitude lower than with databases today (but not the cost or ability to query it back out).

Processing data at scale is at least an order of magnitude cheaper too.

Page 74: Don't follow the followers

Bigness: most people do not need special technology

Chart axes: Number of people; Bigness of data

The distribution of data size is about normal, yet these guys set the tone of the market today.

Page 75: Don't follow the followers

Analytics: This is really raw data under storage

Chart axes: Number of jobs; Bigness of data

Microsoft study of 174,000 analytic jobs in their cluster: median size ???

Page 76: Don't follow the followers

Working data for analytics most often not big

Chart axes: Number of jobs; Smallness of data

14 GB

Page 77: Don't follow the followers

What makes data “big”?

Very large amounts
Hierarchical structures
Nested structures
Encoded values
Non-standard (for a database) types
Time series
Deep structure
Human authored text

“big” is better off being defined as “complex” or “hard to manage”

Page 78: Don't follow the followers

Web tracking data has a nested structure

USER_ID: 301212631165031
SESSION_ID: 590387153892659
VISIT_DATE: 1/10/2010 0:00
SESSION_START_DATE: 1:41:44 AM
PAGE_VIEW_DATE: 1/10/2010 9:59
DESTINATION_URL: https://www.phisherking.com/gifts/store/LogonForm?mmc=link-src-email-_-m100109-_-44IOJ1-_-shop&langId=-1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME: Direct
REFERRAL_URL: -
PAGE_ID: PROD_24259_CARD
REL_PRODUCTS: PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME: VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID: SHOP-BY-HOLIDAY VALENTINES DAY
IP_ADDRESS: 67.189.110.179
BROWSER_OS_NAME: MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)

“unstructured” data embedded in the logged message: complex strings

…but it’s easily unpacked into tuples
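A minimal sketch of that unpacking, using Python's standard urllib.parse on a URL shaped like the DESTINATION_URL field. The function name unpack_url and the example URL are made up for illustration; the shape of the output rows (name, value) is an assumption about what "tuples" means here.

```python
from urllib.parse import parse_qsl, urlsplit


def unpack_url(url):
    """Split a logged URL into (name, value) tuples: host, path, then query fields."""
    parts = urlsplit(url)
    rows = [("host", parts.netloc), ("path", parts.path)]
    # keep_blank_values keeps parameters that were logged with an empty value
    rows.extend(parse_qsl(parts.query, keep_blank_values=True))
    return rows
```

Each tuple can then be loaded as a row keyed by the event, which turns the "complex string" into something a query can filter and group on.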

Page 79: Don't follow the followers

All of these things are “unstructured” data

Common Names
▪ Aspirin, Acetylsalicylic acid, Excedrin

Structural formulas
▪ Commonly used to communicate between chemists

Systematic nomenclatures:
▪ Mass formula: C9H8O4
▪ SMILES: OC(=O)C1=C(C=CC=C1)OC(=O)C
▪ InChI: 1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
▪ IUPAC: pyrido[1",2":1',2']imidazo[4',5':5,6]pyrazino[2,3-b]phenazine

They all refer to the same thing.

If you process them in a database, how do you store them?

Page 80: Don't follow the followers

Unstructured data isn’t really unstructured. The problem is that this data is unmodeled. The real challenge is complexity.

We usually mean text, not data structures…

Page 81: Don't follow the followers

Patterns emerge from lots of event data

Patterns emerge from the underlying structure of the entire dataset. The patterns are more interesting than sums and counts of the events.

Web paths: clicks in a session as network node traversal.

Email: traffic analysis producing a network

The event stream is a source for analysis, generating another set of data that is the source for different analysis.
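As a sketch of the web-path idea, the code below turns a stream of click events into a weighted edge list, which is the derived dataset a network analysis would start from. The event shape (session_id, sequence_no, page_id) and the function name path_edges are assumptions for illustration.

```python
from collections import Counter, defaultdict


def path_edges(clicks):
    """Count page-to-page transitions; clicks are (session_id, sequence_no, page_id)."""
    sessions = defaultdict(list)
    for session_id, sequence_no, page_id in clicks:
        sessions[session_id].append((sequence_no, page_id))

    edges = Counter()
    for steps in sessions.values():
        # Order each session's clicks, then pair every page with the next one.
        pages = [page_id for _seq, page_id in sorted(steps)]
        for source, target in zip(pages, pages[1:]):
            edges[(source, target)] += 1
    return edges
```

The result maps (from_page, to_page) to a traversal count. That edge list is a second dataset generated from the event stream, ready for path or graph analysis.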

Page 82: Don't follow the followers

BIGNESS AND DATA COMPUTATIONAL WORKLOADS

Page 83: Don't follow the followers

Not finished: remember the cycle of history…

The biggest hole in the prior sections is that we scaled OLTP and OLAP but what about analytics?

Queries <> transactions <> computations

Page 84: Don't follow the followers

The three way workload break

1. Operational: OLTP systems
2. Analytic: OLAP systems
3. Scientific: Computational systems

Unit of focus:

1. Transaction
2. Query
3. Computation

Different problems require different platforms

Page 85: Don't follow the followers

The holy grail of databases under current market hype

We’re talking mostly about computation over data when we talk about “big data” and analytics.

The goal is combining data storage, retrieval and analysis into one system, a potential mismatch for both relational and nosql.

Page 86: Don't follow the followers

A Simple Division of the Analytic Problem Space

Axes: Computation (little to lots) by Data volume (little to lots)

▪ Big analytics, little data: Specialized computing, modeling problems: supercomputing, GPUs
▪ Big analytics, big data: Complex math over large data volumes requires non-relational shared nothing architectures
▪ Little analytics, little data: The entry point; SAS, SMP databases, even OLAP cubes can work
▪ Little analytics, big data: The BI/DW space, for the most part, done in databases mostly

Page 87: Don't follow the followers

No technical solution fits all three axes

Axes of “Big”:
▪ Computation: Parallel DSLs, HPC, GPU, MapReduce
▪ Concurrency: Parallel DBs, nosql stores*
▪ Volume: Parallel DBs, MapReduce, BSP

Some work along two axes, but most have a primary focus on one core component

Page 88: Don't follow the followers

Data infrastructure is a platform
▪ Any data – structures, forms
▪ Any latency – in motion, at rest
▪ Any process – query, algorithm, engine, transformation
▪ Any access – SQL, API, queue, file movement

Page 89: Don't follow the followers

Hadoop & NoSQL Adoption

Some people can’t resist getting the next new thing because it’s new.

Many organizations are like this, promoting a solution and hunting for the problem that matches it.

Better to ask “What is the problem for which this technology is the answer?”

Page 90: Don't follow the followers

NoSQL Will “Fail”

Unless some mathematically derived data model is developed, and a query language using it is created.

Otherwise there’s no standard interfacing model, no interoperability, no chance for a tool ecosystem to evolve over top of the platform – because there are no uniform platform boundaries.

“One logical interface, many physical implementations” is a key reason why SQL won the database wars. This creates an ecosystem.

Page 91: Don't follow the followers

Maybe…

These aren’t the databases we’re looking for.

We need one that speaks pig latin.

Page 92: Don't follow the followers

What’s wrong with BASE?

Page 93: Don't follow the followers

Google F1: Another Evolution

Distributed SQL database
ACID compliance, 2PC and row-level locking (!)
Transparent data distribution
Synchronous replication across data centers
Table interleaving (hierarchies)
Queryable protobufs
MapReduce access to underlying data
Average user-facing latency of ~200ms with small deviation

Page 94: Don't follow the followers

The big data revolution, more of an evolution

Be pragmatic, not dogmatic

Page 95: Don't follow the followers

Conclusion

Page 96: Don't follow the followers

References (things worth reading on the way home)

A relational model for large shared data banks, Communications of the ACM, June, 1970, http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

Column-Oriented Database Systems, Stavros Harizopoulos, Daniel Abadi, Peter Boncz, VLDB 2009 Tutorial, http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf

Nobody ever got fired for using Hadoop on a cluster, 1st International Workshop on Hot Topics in Cloud Data Processing, April 10, 2012, Bern, Switzerland.

A co-Relational Model of Data for Large Shared Data Banks, ACM Queue, 2012, http://queue.acm.org/detail.cfm?id=1961297

A query language for multidimensional arrays: design, implementation and optimization techniques, SIGMOD, 1996

Probabilistically Bounded Staleness for Practical Partial Quorums, Proceedings of the VLDB Endowment, Vol. 5, No. 8, http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf

“Amorphous Data-parallelism in Irregular Algorithms”, Keshav Pingali et al

MapReduce: Simplified Data Processing on Large Clusters, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf

Dremel: Interactive Analysis of Web-Scale Datasets, Proceedings of the VLDB Endowment, Vol. 3, No. 1, 2010, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf

Spanner: Google’s Globally-Distributed Database, SIGMOD, May, 2012, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/es//archive/spanner-osdi2012.pdf

F1: A Distributed SQL Database That Scales, Proceedings of the VLDB Endowment, Vol. 6, No. 11, 2013, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf

Page 97: Don't follow the followers

CC Image Attributions

Thanks to the people who supplied the creative commons licensed images used in this presentation:

shady_puppy_sales.jpg - http://www.flickr.com/photos/brizzlebornandbred/5001120150

cuneiform_proto_3000bc.jpg - http://www.flickr.com/photos/takomabibelot/3124619443/

cuneiform_undo.jpg - http://www.flickr.com/photos/charlestilford/2552654321/

scroll_kerouac.jpg - http://www.flickr.com/photos/ari/93966538/

House on fire - http://flickr.com/photos/oldonliner/1485881035/

Manuscripts on shelf - http://flickr.com/photos/peterkaminski/1688635/

manuscript_illum.jpg - http://www.flickr.com/photos/diorama_sky/2975796332/

manuscript_page.jpg - http://www.flickr.com/photos/calliope/306564541/

subway dc metro - http://flickr.com/photos/musaeum/509899161/

Page 98: Don't follow the followers

About the Presenter

Mark Madsen is president of Third Nature, a research and consulting firm focused on building the infrastructure for analytics, evidence-based management, business intelligence and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor at Forbes Online and Information Management. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net

Page 99: Don't follow the followers

About Third Nature

Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, analytics and performance management. If your question is related to BI, analytics, information strategy and data then you’re at the right place.

Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.

