graham mossman - sql and high performance computing on hadoop

27
© 2014 EXASOL AG SQL and high performance compu3ng on Hadoop Graham Mossman, Senior Solu;on Engineer, EXASOL

Upload: huguk

Post on 26-Dec-2014

186 views

Category:

Technology


3 download

DESCRIPTION

As usage of Hadoop has broadened, alternative frameworks are emerging which offer more high performance computing based on SQL and ‘SQL-like’ technologies. In this talk, Graham considers how Hadoop and SQL alternatives compare in terms of use cases, best practice, functionality and performance. He goes on to consider future possible market outcomes with regard to Hadoop and SQL architectures and integration options.

TRANSCRIPT

Page 1: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

1  

SQL  and  high  performance  compu3ng  on  Hadoop  

Graham  Mossman,  Senior  Solu;on  Engineer,  EXASOL  

Page 2: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

2  

I  Love  My  Lawnmower  ...  

...  because  it  cuts  my  grass  well  

Page 3: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

3  

But  ...  

...  it‘s  quite  a  struggle  cuBng  my  hedge  

Page 4: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

4  

And  ...  

...  it  isn‘t  good  at  making  apple  sauce  

Page 5: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

5  

And  don‘t  even  thinking  about...  

...  using  it  to  cut  hair  

Page 6: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

6  

Hadoop  today  is  …  

§  S;ll  Open  Source  !  §  Began  with  HDFS  and  Map/Reduce  §  Now  comprises  a  number  of  addi;onal  technologies  

§  File  systems    §  (e.g.  Tachyon)  

§  Cluster  Managers    §  (e.g.  YARN  +  Mesos)  

§  Execu;on  Engines    §  (e.g.  Tez,  Spark  etc.)  

§  Analy;cal  Layer  and  Applica;ons  §   (e.g.  Hive,  Pig,  various  SQL  on  Hadoop)  

 

Page 7: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

7  

Hadoop  With  Everything  

§  Hadoop  was  invented  to  more  easily  distribute  the  Nutch  and  Lucene  applica;ons  across  a  cluster  of  machines.  §  Map/Reduce  –  distributed  processing  §  HDFS  –  distributed  file  system  

§  Began  to  be  used  for  ….  just  about  everything.  §  But  not  all  processing  tasks  are  like  indexing  the  Internet  §  Hadoop  started  to  acract  cri;cism  

§  But  usually  when  it  was  being  used  for  something  it  wasn’t  designed  for  

Page 8: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

8  

Definitely  NOT  jobs  for  Hadoop  

 §  Word  processing  

 §  Payroll  system  

§  Anything  on  a  single  computer  

§  Anything  with  “small”  data  

Page 9: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

9  

Analy3cal  Queries  

 §  “GROUP  BY“  logic  

§  i.e.  not  concerned  with  individual  data  items  

§  Analy;cal  Func;ons  §  MAX,  MEDIAN,  MIN,  SUM,  COUNT,  STANDARD  DEVIATION  …  

§  Table  joins,  nested  sub-­‐queries    Usually  short-­‐running,  ad-­‐hoc  and  submiced  many  at  a  ;me.  

Page 10: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

10  

Map/Reduce  and  HDFS  :  the  wrong  tools  for  Analy3cs  ?  

§  Queries  tend  to  be  short  :  fault  tolerance  is  less  important  §  If  chance  of  failure  in  a  5  hour  batch  is  1  in  300  § Chance  of  failure  in  a  5  second  query  is  1  in  1,000,000  

§  Queries  tend  to  be  short  :  start-­‐up  ;me  is  significant  §  a  20  second  start-­‐up  ;me  is  NOT  OK  on  a  5  second  query  

§  A  number  of  projects  started  to  address  these  issues  §  e.g.  “Hot  containers”  in  Hive  on  Tez  to  reduce  start-­‐up  ;me  

Page 11: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

11  

Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation  

Map/Reduce:  the  wrong  language  for  Analy3cs  ?  

Stage 0: Map-Shuffle-ReduceMapper(row) { fields = row.split("\t") emit(fields[0], fields[1]);}Reducer(key, values) { sum = 0; for (value in values) { sum += value; } emit(key, sum);}

Stage 1: Map-ShuffleMapper(row) { ... emit(page_views, page_name);}... shuffleStage 2: Localdata = open("stage1.out")for (i in 0 to 10) { print(data.getNext())}

Page 12: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

12  

Equivalent  in  SQL  

SELECT page_name, SUM(page_views) views

FROM wikistats GROUP BY page_nameORDER BY views DESC LIMIT 10;

Page 13: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

13  

The  SQL  language  

§  Portable  § Well-­‐defined  standards  exist  §  No  detailed  knowledge  of  the  plaporm  required  

§  e.g.    you  don’t  need  to  manage  memory  §  SQL  is  assumed  by  a  lot  of  repor;ng  tools  § Widely  used  and  understood  even  by  non-­‐technical  people  

 

Page 14: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

14  

I‘m  not  saying  that  SQL  is  perfect    

•  Try writing the simple Hadoop “Word Count” example in pure SQL

•  Or try to “sessionise” weblog data

•  Or anything with data that is not structured•  “Which part of STRUCTURED Query Language

don’t you understand …?!”

•  All I’m saying is that is an excellent language for analytical queries.

Page 15: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

15  

Hadoop  could  handle  SQL  (via  Hive),    but  historically  …  

§  High  Latency  §  Restricted  SQL  op;ons  §  All  but  simple  table  joins  were  difficult  §  Licle  support  for  compression  &  indexing  § Merv  Adrian  (Gartner  Research  -­‐  2014)    

§ “What  is  remarkable  is  that  Hadoop  does  SQL.  Just  don’t  expect  it  to  do  it  well”  

§  Result  :  EVERYTHING  looked  good  compared  to  Hive    

Page 16: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

16  

Everyone  s3ll  likes  to  compare  themselves  to  Hive  

Page 17: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

17  

EXASOL  being  no  excep3on  !  

Page 18: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

18  

Hive  con3nues  to  be  improved  …  §  Completed  §  Views  (HIVE-­‐1143)  

§  Par;;oned  Views  (HIVE-­‐1941)  

§  Storage  Handlers  (HIVE-­‐705)  

§  HBase  Integra;on  §  HBase  Bulk  Load  

§  Locking  (HIVE-­‐1293)  

§  Indexes  (HIVE-­‐417)  

§  Bitmap  Indexes  (HIVE-­‐1803)  

§  Filter  Pushdown  (HIVE-­‐279)  §  Table-­‐level  Sta;s;cs  (HIVE-­‐1361)  

§  Dynamic  Par;;ons  

§  Binary  Data  Type  (HIVE-­‐2380)  

§  Decimal  Precision  and  Scale  Support  §  HCatalog  

§  HiveServer2  (HIVE-­‐2935)  

§  Column  Sta;s;cs  in  Hive  (HIVE-­‐1362)  

§  List  Bucke;ng  (HIVE-­‐3026)  

§  Group  By  With  Rollup  (HIVE-­‐2397)  §  Enhanced  Aggrega;on,  Cube,  Grouping  

and  Rollup  (HIVE-­‐3433)  

§  Op;mizing  Skewed  Joins  (HIVE-­‐3086)  

§  Correla;on  Op;mizer  (HIVE-­‐2206)  

§  Hive  on  Tez  (HIVE-­‐4660)  §  Vectorized  Query  

Execu;on  (HIVE-­‐4160)  

§  In  Progress  §  Atomic  Insert/Update/Delete  

(HIVE-­‐5317)  

§  Transac;on  Manager  (HIVE-­‐5843)  

§  Cost  Based  Op;mizer  in  Hive  (HIVE-­‐5775)  

§  Proposed  §  Spa;al  Queries  

§  Theta  Join  (HIVE-­‐556)  

§  JDBC  Storage  Handler  

§  MapJoin  Op;miza;on  §  Proposal  to  standardize  and  expand  

Authoriza;on  in  Hive  

§  Dependent  Tables  (HIVE-­‐3466)  

§  AccessServer  

§  Type  Qualifiers  in  Hive  §  MapJoin  &  Par;;on  Pruning  

(HIVE-­‐5119)  

§  SQL  Standard  based  secure  authoriza;on  (HIVE-­‐5837)  

§  Updatable  Views  (HIVE-­‐1143)  §  Hive  on  Spark  (HIVE-­‐7292)  

 

Page 19: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

19  

The  dream  data  architecture  for  analy3cs  …  

§ Based  on  the  SQL  language  § but  leverages  Hadoop’s  extreme  scalability    §  and  Hadoop’s  fault  tolerance    § while  not  compromising  on  speed.  

Could  it  please  also  have  some  maturity  ?    And  be  easy  to  use  ?    

Page 20: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

20  

The  current  reality  

§  SQL  on  SQL,  which  is  arguably  § Less  scalable  § Less  fault  tolerant  § Less  good  with  unstructured  data  

§  SQL  on  Hadoop,  which  is  arguably  § Less  mature  § Less  easy  to  use  § Slower    

Page 21: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

21  

Choices  for  SQL  and  Hadoop  

§  SQL  AND  HADOOP  §  A  Connector  

§  HADOOP  ON  SQL  §  User  Defined  Func;ons  

§  SQL  ON  HADOOP  §  Something  like  Hive,  but  becer  

 

Page 22: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

22  

Op3on  1  –  SQL  AND  HADOOP  

Run  SQL-­‐on-­‐SQL  and  Hadoop-­‐on-­‐Hadoop  and  use  a  connector  to  join  the  two  systems    

Pros  §  Minimal  impact  (SQL  and  Hadoop  worlds  can  func;on  as  before)  §  Easier  to  implement    

Cons  §  Network  !  §  Challenge  of  op;mising  across  two  technologies  

Page 23: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

23  

Op3on  2  –  HADOOP  ON  SQL  

§  Bring  Map/Reduce  into  the  Parallel  database    §  For  example  using  Java  User  Defined  Func;ons        

select  my_java_map_func1on(words)  a_word,    count(*)  word_count  from  DOCUMENTS  group  by  1  

 §  Doesn’t  benefit  from  Hadoop’s  storage  advantages  

Page 24: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

24  

Op3on  3  -­‐    SQL  ON  HADOOP  

Build  a  rela;onal  database  on  Hadoop  storage  §  Impala  (Cloudera)  §  S;nger  (Hortonworks)  §  Presto  (Facebook)  §  SparkSQL  (UC  Berkeley)  §  HAWQ  (Pivotal)  §  BigSQL  (IBM)  §  Apache  Phoenix  (for  HBase)  §  Apache  Tajo  §  Apache  Drill  §  etc  etc  etc  ….  AND  DON‘T  FORGET  HIVE  !  

 

Page 25: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

25  

Four  possible  market  outcomes…  

§  Hadoop  and  SQL  databases  are  on  a  collision  course  –  only  one  will  survive    §  No  sign  of  that  so  far  

§  They  are  complementary  –  both  will  survive    §  Probably  -­‐  the  challenge  is  how  to  make  them  work  together  

§  They  will  merge  and  become  one    §  Some  indica;ons  this  is  already  star;ng  to  happen  

§  Something  even  more  amazing  will  come  along  and  replace  them  both  §  Some;mes  this  happens  –  Spark  ?  

Page 26: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

26  

My  Personal  Opinionated  Opinion  

Becer  to  use  a  tool  that  has  been  made  for  the  job  A  purpose-­‐built  tool  will  always  beat  one  made  originally  for  another  purpose.  

Page 27: Graham Mossman - SQL and high performance computing on Hadoop

©  2014  EXASOL  AG  

27  

Ques3ons  ?  

My  contact  details  :    Email  :  [email protected]    Twicer  :  @EXADude