big data

RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD. Introduction to Big Data - Survival Guide! Luan Cestari, February 28, 2014

Upload: luan-cestari

Post on 13-Jun-2015


Page 1: Big data

Introduction to Big Data - Survival Guide!

Luan Cestari, February 28, 2014

Page 2: Big data

Please, let me ask ...

● Who has already tested a product or project related to Big Data?

● Who works with Big Data?

Page 3: Big data

What we are going to see here

● Demystifying the term "Big Data" and beyond!

● What people claim Big Data to be

● What is the relationship between Big Data and databases

● Some facts about database history

● Why are there so many databases available?

● How to glue all this stuff together?

● Some well-known Hadoop ecosystem tools that cover a very wide range of Big Data issues

Page 4: Big data

Why Big Data is important

● Many companies are already dealing with Big Data using open source tools

● There is demand for people to work with those tools as developers and analysts

● You can also work on integration between those systems, on improving an already existing tool, or on building the next Big Data tool

Page 5: Big data

Why Big Data is important

● When a company uses Big Data tools, its environment can grow very fast and become complex:

● Many different clusters (due to tenancy, geographic location, or different versions)

● Different technologies for closely related purposes (also due to different team skills or use cases)

● Many software integrations, layers to separate the different concerns, and refactoring due to the fast pace

Page 6: Big data

Cool ... but what is Big Data after all?

● Just tons of information isn't enough; it also needs to have:

● Variety

● Velocity

● Value

● And Volume

Page 7: Big data

More about Volume: How big can it be?

● What is the size of the daily batch job at Facebook? 100 GB, 10,000 GB, or 100,000 GB?

● Answer: 104,857,600 gigabytes (100 petabytes, counting 1,024 GB per TB and 1,024 TB per PB) of user logs
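The figure in the answer is easier to read in larger units. A minimal sketch of the arithmetic, assuming binary multiples (1,024 GB per TB and 1,024 TB per PB); the function name is only illustrative:

```python
def gigabytes_to_petabytes(gigabytes: int) -> float:
    """Convert gigabytes to petabytes, assuming 1,024 GB per TB and 1,024 TB per PB."""
    return gigabytes / 1024 / 1024


daily_volume_gb = 104_857_600  # figure quoted on the slide
print(gigabytes_to_petabytes(daily_volume_gb))  # 100.0
```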

Page 8: Big data

More about Variety: Where does the data come from?

● Customer generated Content

● M2M

● Sensors

● B2B

● B2C

● Social Network

● And other devices: mobile phones, set-top boxes, security cameras

Page 9: Big data

More about Value

● Value comes from processing the data in a reasonable period of time, so that you can forecast something. For that you will need data scientists to do:

● Analysis (finding correlations using statistics, signal processing, machine learning, personas, etc.) using different kinds of tools (SQL, search engines, stream processing)

Page 10: Big data

More about Value

● Value comes from processing the data in a reasonable period of time, so that you can forecast something. For that you will need data scientists who can:

● Find correlations using statistical or predictive analytics, signal processing, machine learning, natural language processing, BI, visualization, etc., using different kinds of tools (SQL, search engines, stream processing)

Page 11: Big data

More about Value

● So the value is in the insights generated, which may help you build a better product, make better decisions, or gain a competitive advantage over your competitors

● Open source also helps deliver that value in a cost-effective way, instead of buying tons of expensive tools

Page 12: Big data

... and the Velocity

● This is a very interesting point, because different analyses may require different response times:

● A traffic system may need a streaming system to analyze and predict the current traffic and suggest better routes across the city

● The same traffic system may need to process several weeks of data to get a good prediction of the average traffic on a road, so that could be an offline batch job
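The two traffic bullets describe the same data answering two questions at different speeds. A minimal sketch, assuming hypothetical speed readings as (road, km/h) pairs; the names `RollingAverage` and `batch_average` are illustrative and not taken from any tool in the talk:

```python
from collections import defaultdict, deque


class RollingAverage:
    """Streaming side: keeps only the last `window` readings per road."""

    def __init__(self, window: int) -> None:
        self.readings = defaultdict(lambda: deque(maxlen=window))

    def update(self, road: str, speed_kmh: float) -> float:
        """Record one reading and return the current rolling average for that road."""
        recent = self.readings[road]
        recent.append(speed_kmh)
        return sum(recent) / len(recent)


def batch_average(history):
    """Batch side: scans the full history once and averages per road."""
    totals = defaultdict(lambda: [0.0, 0])
    for road, speed_kmh in history:
        totals[road][0] += speed_kmh
        totals[road][1] += 1
    return {road: total / count for road, (total, count) in totals.items()}


history = [("A", 60.0), ("A", 20.0), ("B", 50.0), ("A", 10.0)]

stream = RollingAverage(window=2)
for road, speed in history:
    latest = stream.update(road, speed)  # an answer is available after every event

print(latest)                  # rolling average for road A over its last 2 readings
print(batch_average(history))  # long-term average per road
```

The streaming side answers immediately but only sees a short window; the batch side sees everything but only after a full scan.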

Page 13: Big data

... and the Velocity

● The main point is that there is no silver bullet for this; different storage systems may be required for the different services you aim to provide

Page 14: Big data

SQL History

● Hierarchical databases in the 1960s

● Then relational databases in the 1980s, which until a couple of years ago were the only solution used in most enterprises

● Big companies used to buy expensive, specialized data warehouse (DW) database systems to analyze their data

Page 15: Big data


... and now

Page 16: Big data


... and now

Page 17: Big data

Again the reason for that

● For example, web analytics at Facebook:

● +1 billion users

● +240 billion photos

● +1 trillion connections

● 22% of references of the Internet

● Harvard Business Review:

● A change from a DW to a Big Data system made a 96-hour job run in just 4 hours

● In 2012, 2.5 exabytes of data were created each day

Page 18: Big data


We need to avoid the Golden hammer/Silver Bullet Anti-pattern

Page 19: Big data

The Hadoop ecosystem saves the day

● Open source projects that help you deal with Big Data

● No need for vertical scaling (big machines); you can use a cluster of commodity machines and achieve even better results

● Parallel processing

● Fault-tolerant jobs

● Redundant and distributed data (to survive disk failures and to avoid moving data around)

● Less complex programming model

● It has low-level native libraries for high performance
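The "less complex programming model" presumably refers to MapReduce: you write a map function and a reduce function, and the framework takes care of distribution, shuffling, and retries. A single-process sketch of the idea in plain Python; this is not the Hadoop API, and the function names are illustrative:

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a cluster can run many of them in parallel and simply rerun any that fail.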

Page 21: Big data

The Hadoop ecosystem saves the day

● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =(

● Well, there is no silver bullet; we need more tools

● This is why Hadoop is not alone: there are many different projects that integrate with it

● Several big companies offer Hadoop and other projects together as one big product, and they help the community. I will talk a little more about the Hortonworks and Cloudera project sets, as they are very well known, and about how they integrate. Find more at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
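One reason small files hurt is that the HDFS NameNode keeps metadata for every file and block in memory. A rough estimate, assuming the commonly quoted rule of thumb of about 150 bytes per file or block object and a 128 MB block size (both are assumptions, not figures from the talk):

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not an exact figure
BLOCK_SIZE = 128 * 1024 ** 2    # assumed HDFS block size in bytes


def namenode_bytes(file_count: int, file_size: int) -> int:
    """Estimate NameNode memory: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file) * BYTES_PER_OBJECT


one_tib = 1024 ** 4
small = namenode_bytes(file_count=one_tib // (1024 ** 2), file_size=1024 ** 2)  # 1 MiB files
large = namenode_bytes(file_count=one_tib // (1024 ** 3), file_size=1024 ** 3)  # 1 GiB files
print(small, large)
```

Under these assumptions, storing the same terabyte as 1 MiB files costs the NameNode a couple of hundred times more memory than storing it as 1 GiB files.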

Page 22: Big data

The Hadoop ecosystem saves the day

● Cloudera: CDH

Page 23: Big data

The Hadoop ecosystem saves the day

● Cloudera:

● How to create this whole stack with minimum effort: Cloudera Manager

Page 24: Big data

The Hadoop ecosystem saves the day

● Hortonworks: HDP

Page 25: Big data

The Hadoop ecosystem saves the day

● Hortonworks:

● They use Ambari to manage the cluster, as Cloudera Manager does

● They also have Tez to speed up workloads

Page 26: Big data

The Hadoop ecosystem saves the day

● And more tools:

● You may use Apache Mesos or Hadoop 2 YARN to better manage and share your services (for example, across tenants or clouds)

● Apache Bigtop, Fuse-DFS, Apache Crunch, Apache Whirr, Apache Hama, Apache Giraph, Open MPI, Cascading (and its extensions), Weave, and more

Page 27: Big data

The Hadoop ecosystem saves the day

● There are more tools for specific cases, such as low latency with the Spark ecosystem

Page 28: Big data

The Hadoop ecosystem saves the day

● But you can also use other tools for low latency, such as Twitter Storm, Yahoo S4, LinkedIn Samza (or Kafka), Amazon Kinesis, and Google MillWheel

Page 29: Big data

The integration with other systems will be complex

● An overview:

Page 30: Big data

A different approach: Lambda Architecture

● An idea from the Twitter team (such as Nathan Marz) about how to deal with Big Data systems
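In the Lambda Architecture, a batch layer periodically recomputes views from the full master dataset, a speed layer covers the events that arrived since the last batch run, and queries merge both. A toy sketch with page-view counts; the event data and function names are hypothetical:

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from all events seen up to the last batch run."""
    return Counter(page for page, _ in master_dataset)


def speed_view(recent_events):
    """Speed layer: incrementally count only events newer than the last batch run."""
    view = Counter()
    for page, _ in recent_events:
        view[page] += 1
    return view


def query(page, batch, speed):
    """Serving side: merge the batch view and the real-time view at query time."""
    return batch[page] + speed[page]


# Events are (page, timestamp) pairs; the batch run covered timestamps up to 3.
events = [("home", 1), ("docs", 2), ("home", 3), ("home", 4), ("docs", 5)]
last_batch_timestamp = 3

batch = batch_view([e for e in events if e[1] <= last_batch_timestamp])
speed = speed_view([e for e in events if e[1] > last_batch_timestamp])

print(query("home", batch, speed))  # 3
print(query("docs", batch, speed))  # 2
```

When the next batch run finishes, its view absorbs the recent events and the speed view for that period can be discarded.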

Page 31: Big data


Questions?


Notes (slide 2): Scalable, Portable, On-demand, Resource Management, Measurable


Notes (slide 3): The difference in http://www.slideshare.net/CAinc/cloud-expo-session-from-virtualization-to-cloud-computing-building-an-effective-pragmatic-reliable-cloud


Notes (slide 22): Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.


Notes (slide 24): Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
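Because a workflow is a DAG, a scheduler can always find an order in which every action runs after the actions it depends on. A sketch using Python's standard `graphlib` module (Python 3.9+); the action names are made up, and real Oozie workflows are defined in XML rather than Python:

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "export_report": {"aggregate"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)
```

If the dependencies contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the graph has to be acyclic.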


Notes (slide 26): Apache Whirr is a set of libraries for running cloud services.

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Open MPI is an open source implementation of MPI (the Message Passing Interface), a standardized API typically used for parallel and/or distributed computing.
