big data
TRANSCRIPT
RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD1
Introduction to Big Data - Survival Guide!
Luan Cestari
February 28, 2014
Please, let me ask ...
● Who has already tested a product or project related to Big Data?
● Who works with Big Data?
What are we going to see here
● Demystifying the term "Big Data" and beyond!
● What people claim Big Data to be
● What is the relationship between Big Data and databases
● Some facts about database history
● Why are there so many databases available?
● How to glue all this stuff together?
● Some well-known Hadoop ecosystem tools that cover a very wide range of Big Data issues
Why Big Data is important
● Many companies are already dealing with Big Data using Open Source tools
● There is demand for people to work with those tools as developers and analysts
● You can also work on integration between those systems, improve an already existing tool, or build the next Big Data tool
Why Big Data is important
● When a company uses Big Data tools, its environment can grow very fast and become complex:
● Many different clusters (due to tenants, geographic location or different versions)
● Different technologies for closely related purposes (also due to different team skills or use cases)
● Many software integrations, layers to segregate the different aspects, and refactoring due to the fast pace
Cool ... but what is Big Data after all?
● Just tons of information isn't enough; it also needs to have:
● Variety
● Velocity
● Value
● And Volume
More about Volume: How big can it be?
● What is the size of a daily batch job at Facebook? 100 GB, 10,000 GB, or 100,000 GB?
● Answer: 104,857,600 gigabytes of user logs
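To put that answer in perspective, the figure can be converted to a larger unit. This is only a minimal sketch, assuming binary units (1 PB = 1024 × 1024 GB), which is what the round factor in the number suggests; the function name is illustrative and not from the talk.

```python
# Assumption: binary units, so 1 PB = 1024 TB and 1 TB = 1024 GB.
GB_PER_PB = 1024 * 1024


def gb_to_pb(gigabytes: int) -> float:
    """Convert gigabytes to petabytes using 1024-based units."""
    return gigabytes / GB_PER_PB


daily_log_gb = 104_857_600  # figure quoted on the slide
print(gb_to_pb(daily_log_gb))  # 100.0
```

Under that assumption, the daily batch job is about 100 PB, a thousand times larger than the biggest option offered in the question.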
More about Variety: Where does the data come from?
● Customer-generated content
● M2M
● Sensors
● B2B
● B2C
● Social networks
● And other devices: mobile phones, set-top boxes, security cameras
More about Value
● Value comes from processing the data in a reasonable period of time, so you can forecast something. For that you will need some data scientists, who can work as:
● Analysts (finding correlations using statistics, signal processing, machine learning, personas, etc.) using different kinds of tools (SQL, search engines, stream processing)
More about Value
● Value comes from processing the data in a reasonable period of time, so you can forecast something. For that you will need some data scientists, who can:
● Find correlations using statistical or predictive analytics, signal processing, machine learning, natural language processing, BI, visualization, etc., using different kinds of tools (SQL, search engines, stream processing)
More about Value
● So the value is in the insights generated, which may help you build a better product, make better decisions, or gain a competitive advantage over your competitors
● Open Source also helps deliver that value in a cost-effective way, instead of buying tons of expensive tools
... and the Velocity
● This is a very interesting point, because different analyses may have different time requirements:
● A traffic system may need a streaming system to analyze and predict the current traffic and suggest better routes through the city
● The same traffic system may need to process several weeks of data to get a good prediction of the average traffic on a road, so that could be an offline batch job
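The contrast between the two traffic workloads can be sketched in a few lines. This is only an illustration under stated assumptions: the speed readings are made up, and the names `batch_average` and `StreamingAverage` are hypothetical, not part of any tool mentioned in the talk.

```python
from collections import deque
from statistics import mean


def batch_average(speeds):
    """Offline batch: process the whole stored history in one pass."""
    return mean(speeds)


class StreamingAverage:
    """Streaming: refresh the estimate on every new reading, keeping only a small window."""

    def __init__(self, window: int):
        self.readings = deque(maxlen=window)  # old readings fall out automatically

    def update(self, speed: float) -> float:
        self.readings.append(speed)
        return mean(self.readings)


history = [60, 55, 30, 20, 25, 50]  # made-up speeds in km/h

# Batch answer: one number for the whole period.
print(batch_average(history))

# Streaming answer: an up-to-date number after each reading.
stream = StreamingAverage(window=3)
for speed in history:
    current = stream.update(speed)
print(current)
```

The batch function needs all the data before it can answer, while the streaming class answers immediately but only sees the most recent readings. That trade-off is why the same system may need both kinds of processing.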
... and the Velocity
● The main point is that there is no silver bullet for this; different storage systems may be required depending on the services you aim to provide
SQL History
● Hierarchical databases in the 1960s
● Then relational databases in the 1980s, which until a couple of years ago were the only solution used in most enterprises
● Big companies used to buy expensive, specialized data warehouse (DW) database systems to analyze their data
... and now
Again the reason for that
● For example, Web analysis at Facebook:
● +1 billion users
● +240 billion photos
● +1 trillion connections
● 22% of references of the Internet
● Harvard Business Review:
● A change from a DW to a Big Data system made a 96-hour job run in just 4 hours
● In 2012, 2.5 exabytes of data were created each day
We need to avoid the Golden Hammer/Silver Bullet anti-pattern
Hadoop ecosystem saves the day
● Open Source projects that help you deal with Big Data
● No need for vertical scaling (big machines); you can use a cluster of commodity machines and achieve even better results
● Parallel processing
● Fault-tolerant jobs
● Redundant and distributed data (to survive disk failures and to avoid moving data around)
● Less complex programming model
● It has low-level native libraries for high performance
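In Hadoop, the programming model behind these points is MapReduce: independent map tasks run over separate chunks of data, and their partial results are then merged. The sketch below imitates that shape in plain Python. It is not Hadoop's API: the threads only stand in for cluster nodes, and the function names are illustrative assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_phase(chunk: str) -> Counter:
    """Map: count words in one chunk, independently of all other chunks."""
    return Counter(chunk.lower().split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce: merge two partial word counts into one."""
    return left + right


def word_count(chunks: list[str]) -> Counter:
    # Each chunk stands in for a data block stored on a different machine.
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(map_phase, chunks))
    return reduce(reduce_phase, partial_counts, Counter())


chunks = ["big data is big", "data about data"]
counts = word_count(chunks)
print(counts["data"], counts["big"])  # 3 2
```

Because each map task touches only its own chunk, the work can run where the data already lives, and a failed task can be rerun without affecting the others.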
Hadoop ecosystem saves the day
● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =(
● Well, there is no silver bullet; we need more tools
● That is why Hadoop is not alone: there are many different projects that integrate with it
● Several big companies offer Hadoop and other projects bundled as one big product, and they help the community. I will talk a little more about the Hortonworks and Cloudera project sets, as they are very well known, and about how they integrate. Find more at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
Hadoop ecosystem saves the day
● Cloudera: CDH
Hadoop ecosystem saves the day
● Cloudera:
● How to create this whole stack with minimum effort: Cloudera Manager
Hadoop ecosystem saves the day
● Hortonworks: HDP
Hadoop ecosystem saves the day
● Hortonworks:
● They use Ambari to manage the cluster, like Cloudera Manager does
● They also have Tez to speed up workloads
Hadoop ecosystem saves the day
● And more tools:
● You may use Apache Mesos or Hadoop 2 YARN to better manage and share your services (for example, across tenants or in the cloud)
● Apache Bigtop, Fuse-DFS, Apache Crunch, Apache Whirr, Apache Hama, Apache Giraph, Open MPI, Cascading (and its extensions), Weave, and more
Hadoop ecosystem saves the day
● There are more tools for specific cases, like low latency with the Spark ecosystem
Hadoop ecosystem saves the day
● But you can also use other tools for low latency, such as Twitter Storm, Yahoo S4, LinkedIn Samza (or Kafka), Amazon Kinesis, Google MillWheel
The integration with other systems will be complex
● An overview:
A different approach: Lambda Architecture
● An idea from the Twitter team (such as Nathan Marz) about how to deal with Big Data systems
Questions?
Notes (Please, let me ask ...): Scalable, Portable, On-demand, Resource Management, Measurable
Notes (What are we going to see here): The difference in http://www.slideshare.net/CAinc/cloud-expo-session-from-virtualization-to-cloud-computing-building-an-effective-pragmatic-reliable-cloud
Notes (Cloudera: CDH): Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Notes (Hortonworks: HDP):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
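To see why a workflow is modelled as a DAG, here is a small sketch that derives a valid execution order from the dependencies between actions. It uses Python's standard `graphlib` module, not Oozie itself (Oozie workflows are defined in XML), and the action names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "export_report": {"aggregate"},
}

# A DAG always has at least one order in which every action runs after its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['import_logs', 'clean_logs', 'aggregate', 'export_report']
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is the situation the "acyclic" requirement rules out.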
Notes (And more tools):
Apache Whirr is a set of libraries for running cloud services.
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Open MPI is an open-source implementation of MPI (Message Passing Interface), a standardized API typically used for parallel and/or distributed computing.