Introduction to Cloudera Impala

1 Cloudera Impala Charm City Linux, March 2014 Alex Moundalexis [email protected] @technmsg


DESCRIPTION

Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale. As presented to Charm City Linux on March 25th 2014. http://www.meetup.com/CharmCityLinux/events/168288632/

TRANSCRIPT

Page 1: Introduction to Cloudera Impala

Cloudera Impala
Charm City Linux, March 2014
Alex Moundalexis
[email protected]
@technmsg

Page 2: Introduction to Cloudera Impala

Thirty Seconds About Alex

• Solutions Architect
  • aka consultant
  • government
  • infrastructure
• former coder of Perl
• former administrator
• likes shiny objects

Page 3: Introduction to Cloudera Impala

What Does Cloudera Do?

• product
  • distribution of Hadoop components, Apache licensed
  • enterprise tooling
• support
• training
• services (aka consulting)
• community

Page 4: Introduction to Cloudera Impala

Disclaimer

• Cloudera builds software
  • most donated to Apache
  • some closed-source
• Cloudera “products” I reference are open source
  • Apache Licensed
  • source code is on GitHub
    • https://github.com/cloudera

Page 5: Introduction to Cloudera Impala

What This Talk Isn’t About

• deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
  • depends heavily on data and workload
• coding
  • unless you count XML or CSV or SQL
• algorithms

Page 6: Introduction to Cloudera Impala

The Apache Hadoop Ecosystem

Quick and dirty, for context.

Page 7: Introduction to Cloudera Impala

Why “Ecosystem?”

• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

Page 8: Introduction to Cloudera Impala

HDFS

• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html

Page 9: Introduction to Cloudera Impala

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [OSCON ’07]

Page 10: Introduction to Cloudera Impala

MapReduce (MR)

• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html

Page 11: Introduction to Cloudera Impala

Under the Covers

Page 12: Introduction to Cloudera Impala

You specify map() and reduce() functions.

The framework does the rest.
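To make that split of responsibilities concrete, here is a minimal in-process sketch of a word count. It is not Hadoop's API: `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical names, and the "framework" here is a single Python function rather than a distributed system.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in map_fn(record):
            groups[key].append(value)      # shuffle: collect values per key
    # reduce phase: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# The two functions a user would supply for a word count.
def map_words(line):
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    return sum(counts)


lines = ["the framework does the rest", "the rest is up to you"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts["the"])  # 3
```

In a real cluster the map and reduce calls would run on many machines and the grouping step would move data over the network; only the two user-supplied functions would look roughly like this.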

Page 13: Introduction to Cloudera Impala

Apache Hive

• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
  • a “SQL-like” language
• Eases analysis using MapReduce
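As a rough illustration of what "compiles down to MR" means, the sketch below shows how a simple aggregate query could be expressed as a map step and a reduce step. This is a conceptual model only, not Hive's actual planner output; the rows and function names are hypothetical.

```python
from collections import defaultdict

# Query being illustrated (HiveQL-style):
#   SELECT state, COUNT(*) FROM visits GROUP BY state;

# Hypothetical rows standing in for a table stored as files in HDFS.
visits = [
    {"user": "u1", "state": "Maryland"},
    {"user": "u2", "state": "Texas"},
    {"user": "u3", "state": "Maryland"},
    {"user": "u4", "state": "Maryland"},
]


def map_row(row):
    """Map step: emit the GROUP BY key with a count of 1."""
    yield row["state"], 1


def reduce_group(state, ones):
    """Reduce step: COUNT(*) is the sum of the 1s emitted for one key."""
    return state, sum(ones)


def group_by_count(rows):
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            shuffled[key].append(value)
    return dict(reduce_group(key, values) for key, values in shuffled.items())


counts_by_state = group_by_count(visits)
print(counts_by_state)
```

The point of Hive is that the analyst writes only the one-line query; the map and reduce steps are generated for them.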

Page 14: Introduction to Cloudera Impala

Apache Hive Metastore

• Maps HDFS files to DB-like resources
  • Databases
  • Tables
  • Column/field names, data types
  • Roles/users
  • InputFormat/OutputFormat
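One way to picture the metastore is as a catalog that records, for each table, where its files live and how to read them. The sketch below is a toy model under that assumption, not the metastore's real schema or API; the `TableEntry` class, the `register` and `describe` helpers, and the HDFS path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    database: str
    name: str
    location: str        # directory of files backing the table (hypothetical path)
    columns: list        # (column name, data type) pairs
    input_format: str    # class used to read the files
    output_format: str   # class used to write the files


metastore = {}  # toy catalog keyed by (database, table)


def register(entry):
    metastore[(entry.database, entry.name)] = entry


def describe(database, table):
    """Return the column definitions recorded for a table."""
    return list(metastore[(database, table)].columns)


register(TableEntry(
    database="default",
    name="people",
    location="hdfs://namenode/user/hive/warehouse/people",
    columns=[("name", "string"), ("state", "string"),
             ("employer", "string"), ("year", "int")],
    input_format="org.apache.hadoop.mapred.TextInputFormat",
    output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
))

print(describe("default", "people"))
```

The files themselves stay in HDFS untouched; the catalog only holds the description that lets a query engine treat them as a table.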

Page 15: Introduction to Cloudera Impala

But wait…

WHY DO WE NEED THIS?

Page 16: Introduction to Cloudera Impala


Page 17: Introduction to Cloudera Impala

Super Shady SQL Supplement

I am not a SQL wizard by any means…

Page 18: Introduction to Cloudera Impala

A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

>

Page 19: Introduction to Cloudera Impala

Interacting with Relational Data

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT * FROM people;

Page 21: Introduction to Cloudera Impala

Requesting Specific Fields

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT name, state FROM people;

Page 23: Introduction to Cloudera Impala

Requesting Specific Rows

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT name, state FROM people WHERE year < 2012;
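The three queries above can be tried without a cluster. The sketch below loads the slide's `people` table into an in-memory SQLite database using Python's standard `sqlite3` module. SQLite stands in for Impala or Hive here purely as an assumption of convenience, and the `ORDER BY` in the last query is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only two of the four columns.
name_state = conn.execute("SELECT name, state FROM people").fetchall()

# Only the rows that satisfy the WHERE condition.
before_2012 = conn.execute(
    "SELECT name, state FROM people WHERE year < 2012 ORDER BY name"
).fetchall()

print(before_2012)  # [('Joey', 'Maryland'), ('Paris', 'Maryland')]
```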

Page 25: Introduction to Cloudera Impala

Two Simple Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

>

Page 26: Introduction to Cloudera Impala

Joining Two Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner

Page 29: Introduction to Cloudera Impala

Joining Two Tables

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

Page 30: Introduction to Cloudera Impala

Varying Implementation of JOIN

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
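The join can be reproduced the same way with Python's `sqlite3` module. This is a sketch under two assumptions: SQLite stands in for the engine used in the talk, and the blank pet names for Sean and Paris are stored as NULL. The "?" cells on the slide presumably stand for those missing values, which different engines and clients may display differently. The `ORDER BY` is added only for a predictable row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER);
    CREATE TABLE pets (owner TEXT, species TEXT, name TEXT);

    INSERT INTO people VALUES
        ('Alex',  'Maryland', 'Cloudera', 2013),
        ('Joey',  'Maryland', 'Cloudera', 2011),
        ('Sean',  'Texas',    'Cloudera', 2013),
        ('Paris', 'Maryland', 'AOL',      2011);

    -- Assumption: the empty name cells on the slide are NULLs.
    INSERT INTO pets VALUES
        ('Alex',  'Cactus',  'Marvin'),
        ('Joey',  'Cat',     'Brain'),
        ('Sean',  'None',    NULL),
        ('Paris', 'Unknown', NULL);
    """
)

join_sql = """
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
    ORDER BY people.name
"""
joined = conn.execute(join_sql).fetchall()

# How a missing value is shown is up to the client; here NULL becomes "?".
rendered = [tuple("?" if value is None else value for value in row) for row in joined]

for row in rendered:
    print(row)
```

In Python the missing pets come back as `None`; the rendering step is where a client decides whether to show them as blank, "NULL", or "?".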

Page 31: Introduction to Cloudera Impala

Cloudera Impala

Familiar interface, but more powerful.

Page 32: Introduction to Cloudera Impala

Cloudera Impala

• Interactive query on Hadoop
  • think seconds, not minutes
• Nearly ANSI-92 standard SQL
  • compatible with HiveQL
• Native MPP query engine
  • built for low-latency queries

Page 33: Introduction to Cloudera Impala

Cloudera Impala – Design Choices

• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce

Page 34: Introduction to Cloudera Impala

Cloudera Impala – Architecture

• Impala Daemon
  • runs on every node
  • handles client requests
  • handles query planning & execution
• State Store Daemon
  • provides name service
  • metadata distribution
  • used for finding data

Page 35: Introduction to Cloudera Impala

Impala Query Execution

[Diagram: a SQL app connects over ODBC to a cluster of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore. Labeled flow: SQL request.]

1) Request arrives via ODBC/JDBC/HUE/Shell

Page 36: Introduction to Cloudera Impala

Impala Query Execution

[Diagram: a SQL app connects over ODBC to a cluster of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]

2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data

Page 37: Introduction to Cloudera Impala

Impala Query Execution

[Diagram: a SQL app connects over ODBC to a cluster of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore. Labeled flow: Query results.]

4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
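The five steps follow a scatter/gather pattern, which can be sketched in a few lines. This is a toy, single-process simulation, not Impala's planner or wire protocol: the node names, the data placement in `BLOCKS`, and the functions `plan`, `execute_fragment`, and `coordinate` are all hypothetical. It answers a count-per-state query by letting each "node" aggregate only its local rows.

```python
from collections import Counter

# Hypothetical data placement: which node stores which rows of the table.
BLOCKS = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan(nodes):
    """Step 2: turn the request into one plan fragment per node holding data."""
    return [{"node": node, "op": "count_by_state"} for node in nodes]


def execute_fragment(fragment):
    """Step 3: an executor scans only the rows stored on its own node."""
    local_rows = BLOCKS[fragment["node"]]
    return Counter(state for _name, state in local_rows)


def coordinate():
    """Steps 4-5: merge the partial results and hand the answer to the client."""
    total = Counter()
    for fragment in plan(sorted(BLOCKS)):
        total.update(execute_fragment(fragment))  # Counter.update adds counts
    return dict(total)


result = coordinate()
print(result)
```

In the real system the fragments run in parallel on separate daemons and partial results are streamed over the network; the sketch only shows why running each fragment next to its data keeps the amount of data moved small.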

Page 38: Introduction to Cloudera Impala

Cloudera Impala – Results

• Allows for fast iteration/discovery
• How much faster?
  • 3-4x faster on I/O bound workloads
  • up to 45x faster on multi-MR queries
  • up to 90x faster on in-memory cache

Page 39: Introduction to Cloudera Impala

Demo

Hold onto something, folks.

Page 40: Introduction to Cloudera Impala

What’s Next?

• Download Hadoop!
  • CDH available at www.cloudera.com
  • Already done that? Contribute…
• Cloudera provides pre-loaded VMs
  • http://tiny.cloudera.com/quickstartvm
• Clone our repos!
  • https://github.com/cloudera

Page 41: Introduction to Cloudera Impala

Special thanks: PARIS

Page 42: Introduction to Cloudera Impala

Questions?

Preferably related to the talk… or not.

Page 43: Introduction to Cloudera Impala

Thank You!

Alex Moundalexis
[email protected]
@technmsg

We’re hiring, kids! Well, not kids.