VIII Session - SQL Bahia, 03/03/2018
Big Data with Hadoop: Impala, Hive, and Spark
Diógenes Pires
Connect with PASS
Sign up for a free membership today at: pass.org
#sqlpass
Internet Live Stats
http://www.internetlivestats.com/
Introduction
Big Data is not Bitcoin
Sources for Big Data
• Data warehouse
• RDBMS
• Web server log files
• Social media content
• Business reports
• Texts of consumer emails to the company
• Macroeconomic indicators
• Satisfaction surveys
• IoT
• CRM
• …
Definitions
Business intelligence (BI) is an umbrella term that includes the applications, infrastructure
and tools, and best practices that enable access to and analysis of information to improve
and optimize decisions and performance.
Big data is high-volume, high-velocity and/or high-variety information assets that demand
cost-effective, innovative forms of information processing that enable enhanced insight,
decision making, and process automation.
Business analytics is comprised of solutions used to build analysis models and simulations
to create scenarios, understand realities and predict future states. Business analytics
includes data mining, predictive analytics, applied analytics and statistics, and is delivered as
an application suitable for a business user.
Gartner
Other Concepts
• Cognitive Computing
• Data Discovery
• Data Lake
• Data Science
• Machine Learning
• Self-service BI
• Fast Data
Landscape
Google File System (GFS or GoogleFS)
Google File System (GFS or GoogleFS) is a proprietary distributed file
system developed by Google to provide efficient, reliable access to data
using large clusters of commodity hardware. A new version of Google File
System code named Colossus was released in 2010.
Wikipedia
• 2003 GFS
• 2004 MapReduce
• 2006 Bigtable
Apache Hadoop
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
Apache Hadoop.
Apache Hadoop
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop
modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
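To make the HDFS bullet more concrete, here is a minimal Python sketch of how a distributed file system can split a file into fixed-size blocks and keep several copies of each block on different nodes. It is not HDFS code: the functions `split_into_blocks` and `place_replicas`, the node names, and the round-robin placement are hypothetical, chosen only for illustration. The 128 MB block size and the replication factor of 3 mirror common HDFS defaults.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # a common HDFS default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy for illustration only; real HDFS
    placement also takes rack topology into account.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    layout = place_replicas(len(sizes), ["node1", "node2", "node3", "node4"])
    for block_id, size in enumerate(sizes):
        print(block_id, size, layout[block_id])
```

Because every block lives on several nodes, a job can read different blocks in parallel and still find a copy if one node fails, which is roughly where the high-throughput access comes from.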
Other Hadoop Projects
Hadoop Architecture
Processing
https://entendendoti.blogspot.com.br/2011/05/tipos-de-processamento.html
Types of Processing
• Batch Processing: information is collected or received, stored, and then
processed together at a later time.
• Online Processing: information is processed as it is registered, so the
results are always up to date.
• Real Time Processing: information is processed the moment it is registered,
and the result immediately triggers subsequent processing.
Examples: autopilot, GPS.
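The difference between the batch and online styles can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the event format (customer, amount), the function `batch_totals`, and the class `OnlineTotals` are hypothetical and do not come from any Hadoop component.

```python
from collections import defaultdict


def batch_totals(stored_events):
    """Batch style: events were collected first, then processed in one pass."""
    totals = defaultdict(int)
    for customer, amount in stored_events:
        totals[customer] += amount
    return dict(totals)


class OnlineTotals:
    """Online style: each event updates the result as soon as it is registered."""

    def __init__(self):
        self.totals = defaultdict(int)

    def register(self, customer, amount):
        self.totals[customer] += amount
        return self.totals[customer]  # up-to-date total right after the event


if __name__ == "__main__":
    events = [("ana", 10), ("bob", 5), ("ana", 7)]

    # Batch: nothing is known until the whole stored set has been processed.
    print(batch_totals(events))

    # Online: the running total is available after every single event.
    online = OnlineTotals()
    for customer, amount in events:
        print(customer, online.register(customer, amount))
```

Both styles end with the same totals; what changes is when the answer becomes available.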
Batch Processing
Example
MapReduce
A programming paradigm that allows for massive scalability across
hundreds or thousands of servers in a Hadoop cluster.
IBM.
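The paradigm is easier to see with the classic word-count example. The sketch below runs the map, shuffle, and reduce phases in a single Python process; it does not use the Hadoop API, and the function names are hypothetical. On a cluster, the map and reduce calls would be spread across many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

Each map call looks at one line only and each reduce call at one key only, so neither needs to see the rest of the data. That independence is what lets the work scale out.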
MapReduce
Apache Hive
The Apache Hive™ data warehouse software facilitates reading, writing,
and managing large datasets residing in distributed storage using SQL.
Structure can be projected onto data already in storage. A command line
tool and JDBC driver are provided to connect users to Hive.
hive.apache.org
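"Structure can be projected onto data already in storage" is often called schema-on-read. The Python sketch below illustrates the idea only and does not use Hive: the raw log text, the `SCHEMA` list, and the helper functions are hypothetical. The raw data carries no column names or types; the schema is applied when the data is read.

```python
import csv
import io

# Raw data "already in storage": plain text with no column names or types.
RAW_LOG = "2018-03-03,/home,200\n2018-03-03,/login,500\n2018-03-04,/home,200\n"

# Hypothetical schema, applied only at read time.
SCHEMA = [("day", str), ("path", str), ("status", int)]


def read_with_schema(raw_text, schema):
    """Yield one dict per row, casting each field with the schema's type."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def count_by(rows, column):
    """Rough equivalent of SELECT column, COUNT(*) ... GROUP BY column."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


if __name__ == "__main__":
    rows = read_with_schema(RAW_LOG, SCHEMA)
    print(count_by(rows, "status"))
```

The stored text never changes; a different schema could be projected onto the same file for another analysis.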
Apache Hive Architecture
Online Processing
Example
Cloudera Impala
Cloudera Impala provides fast, interactive SQL queries directly on your
Apache Hadoop data stored in HDFS or HBase. In addition to using the
same unified storage platform, Impala also uses the same metadata, SQL
syntax (Hive SQL), ODBC driver, and user interface (Cloudera Impala
query UI in Hue) as Apache Hive.
Cloudera.
Impala Architecture
Real Time Processing
Example
Apache Spark
Spark is a fast and general processing engine compatible with Hadoop
data. It can run in Hadoop clusters through YARN or Spark's standalone
mode, and it can process data in HDFS, HBase, Cassandra, Hive, and
any Hadoop InputFormat. It is designed to perform both batch processing
(similar to MapReduce) and new workloads like streaming, interactive
queries, and machine learning.
spark.apache.org
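One way to picture a single engine serving both batch and streaming workloads is to apply the same logic either to a whole dataset or to a sequence of small batches cut from a stream. The plain-Python sketch below illustrates that idea only; it does not use the Spark API, and `count_errors`, `micro_batches`, and the record format are hypothetical.

```python
from itertools import islice


def count_errors(records):
    """Shared logic: count records whose status is 500 or higher."""
    return sum(1 for record in records if record["status"] >= 500)


def micro_batches(stream, batch_size):
    """Cut a (possibly endless) iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    records = [{"status": s} for s in (200, 500, 200, 503, 404)]

    # Batch workload: one pass over all the data.
    print(count_errors(records))

    # Streaming workload: the same function applied to each small batch.
    for batch in micro_batches(records, 2):
        print(count_errors(batch))
```

Reusing `count_errors` in both modes is the point of the sketch: the processing logic is written once and only the way data is fed to it changes.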
Apache Spark