hadoop education v1
TRANSCRIPT
-
8/13/2019 Hadoop Education v1
1/32
Hadoop Education
DBA Team
4/14/2011
-
Jan. 16, 2009
SHI BI Landscape
-
SHC Hadoop Landscape
-
SHC Hadoop
Hadoop is a software framework for processing large amounts of data scattered across multiple
commodity nodes (servers). The base Hadoop environment will contain a Distributed File System
(HDFS) and a Parallel Programming (MapReduce) piece. Additional projects may be added to the
Hadoop software framework.
Hadoop is not a replacement for an RDBMS (Relational Database Management System).
-
SHC Hadoop Projects Overview
-
HADOOP CORE
-
HDFS
HDFS (Hadoop Distributed File System) is a distributed, fault-tolerant file system designed to be deployed on
low-cost commodity hardware. HDFS provides high-throughput access to large amounts of application data.
HDFS does not require expensive, fast disk drives with RAID (Redundant Array of Independent
Disks) to provide high throughput and fault tolerance.
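One way to picture how a file system can tolerate failures without RAID is to split each file into blocks and keep several copies of every block on different nodes. The sketch below illustrates that idea only. The block size, the replication factor of 3, the node names and the round-robin placement are all assumptions made for the example; real HDFS placement is rack-aware and more involved.

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block; the last block may be smaller."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [
        block
        for block, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    ]


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)  # sizes in MB
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
    print(blocks)
    print(placement)
    print(surviving_blocks(placement, failed_node="n2"))
```

Because every block lives on more than one node, losing a single commodity server leaves all blocks readable, which is the kind of fault tolerance the slide describes.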
-
MapReduce
[Diagram: input splits 0, 1 and 2 each feed a map task; map output is sorted, copied to the reduce tasks and merged; the reduce tasks write output part 0 and part 1.]
MapReduce is a programming model and software framework for writing applications
that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
MapReduce is not a replacement for an RDBMS (Relational Database Management
System) or SQL (Structured Query Language).
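The programming model can be illustrated with a word count that runs in a single process. This is a minimal sketch of the map, sort/merge and reduce steps, not Hadoop's API: the function names and the sample splits are invented for the example, and a real job would run the map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map over every split, shuffle the output, then reduce each key."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


if __name__ == "__main__":
    splits = ["hadoop stores data", "hadoop processes data", "data data"]
    print(word_count(splits))
```

Each split is mapped independently of the others, which is what allows the framework to spread the work across many nodes.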
-
HADOOP PROJECTS & SUBPROJECTS
-
AVRO
Avro is a data serialization system. It provides a schema-based, compact binary format for
exchanging records, which may carry binary payloads (such as .zip, graphics or other binary files) as well as text, in a consistent manner across a distributed (Hadoop)
environment.
-
FLUME
Flume (Log Flume) is a horizontally scalable data aggregation
tool, which can support different levels of compression, batching and reliability for each unique data flow.
-
HBase
HBase is a NoSQL, multi-dimensional, distributed, highly available data store made up of rows and
column families, which can support billions of rows and millions of columns.
HBase is not a SQL database and thus has no concept of joins, data types, SQL or even a
query engine.
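The row and column-family layout can be pictured as nested maps: row key, then column (family:qualifier), then timestamped versions of a value. The class below is a toy model of that layout, assuming invented names (`TinyTable`, `put`, `get`); it is not the HBase client API. Values are stored as raw bytes to reflect the absence of data types.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a table made of rows and column families."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell; the family must already exist."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError("unknown column family: " + family)
        self.rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the value with the highest timestamp, or None if absent."""
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


if __name__ == "__main__":
    table = TinyTable(["info"])
    table.put("row1", "info:name", b"first", timestamp=1)
    table.put("row1", "info:name", b"second", timestamp=2)
    print(table.get("row1", "info:name"))
```

Lookups go by row key and column only; anything resembling a join or a typed query would have to be written by the application.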
-
Hive
Hive is a data warehouse environment built on top of Hadoop. It gives SQL
programmers and MapReduce programmers a common SQL-like query language called QL,
which can be extended with custom mapper and reducer plug-ins. It is best used for batch jobs over large, immutable sets of data.
Hive is not designed for online transaction processing (OLTP) and does not offer real-time queries
or row-level updates.
-
HUE
Hue (Hadoop User Experience) is a unified web-based UI for interacting with Hadoop. Hue
provides an interface to submit jobs, watch running jobs, browse the file system, and interact with
Hive. Additional UI applications can be built to be used with Hue, thus providing a single access point into Hadoop.
-
LUCENE/SOLR
www.yonik.com
Lucene and Solr are two projects that merged into one in March 2010. Lucene is a Java-based indexing and search library that also provides spellchecking, hit highlighting and
advanced analysis/tokenization capabilities. Solr is a high-performance enterprise search server with XML/HTTP and JSON/Python/Ruby APIs, hit
highlighting, faceted search, caching, replication, distributed search, database integration, and web admin and
search interfaces.
-
PIG
Apache Pig is a platform for exploring large datasets with a scripting language called Pig Latin. It provides
the ability to search terabytes of data with a few commands. Pig programs run in a distributed environment on a cluster (programs are compiled into MapReduce jobs and
executed using Hadoop).
-
OOZIE (Yahoo)
http://yahoo.github.com/oozie/design.html
Oozie is a workflow and coordination server tool for managing jobs
in a distributed (Hadoop) environment. Oozie job execution can be driven on a time and/or data availability basis.
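The "time and/or data availability" rule can be expressed as a small readiness check. The sketch below only illustrates the decision; it does not use Oozie's API or configuration format, and the function name, parameters and HDFS paths are hypothetical.

```python
from datetime import datetime


def ready_to_run(now, scheduled_time=None, required_inputs=(),
                 available_inputs=frozenset()):
    """Return True when every configured trigger condition is satisfied.

    A job may be gated on time only, on data only, or on both.
    """
    time_ok = scheduled_time is None or now >= scheduled_time
    data_ok = all(path in available_inputs for path in required_inputs)
    return time_ok and data_ok


if __name__ == "__main__":
    now = datetime(2011, 4, 14, 2, 0)
    nightly = datetime(2011, 4, 14, 1, 0)
    inputs = ["/data/in/part-0"]  # hypothetical input path
    print(ready_to_run(now, nightly, inputs, {"/data/in/part-0"}))
    print(ready_to_run(now, nightly, inputs, set()))
```

Leaving `scheduled_time` unset gives a purely data-driven trigger, and leaving `required_inputs` empty gives a purely time-driven one.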
-
SQOOP
[Diagram: SQOOP sits between the RDBMS and HADOOP and produces generated record datatype definitions.]
Sqoop (SQL-to-Hadoop) is a database import tool which provides the capability to easily
copy tables or entire databases between SQL databases (RDBMS) and Hadoop files in HDFS (Hadoop Distributed File System).
-
ZOOKEEPER
ZooKeeper enables highly reliable distributed coordination by providing a centralized
service for maintaining configuration information, naming, distributed synchronization,
and group services for distributed (Hadoop) applications.
-
NON-HADOOP PROJECTS
-
GANGLIA
[Diagram: within the cluster, each node (Node #1, Node #2, Node #3, ...) runs gmond; one node runs gmetad, which is connected to RRD storage and to a browser client.]
Ganglia is a scalable distributed monitoring system used to monitor clusters and grids. It
provides the ability to drill down through standard or custom textual and graphical views
at a single node or at a cluster level.
-
NAGIOS
Nagios is an open source monitoring, alerting, response, reporting, maintenance, and
capacity planning tool for servers and networks. Nagios can be set up to monitor critical
infrastructure, such as network protocols, applications, services, servers and network
components. It is very flexible: custom Nagios plugins can be created and
shared via the open community to enhance Nagios's features.
-
INFOBRIGHT
Infobright is a columnar, MySQL-compatible analytic database.
-
JASPERSOFT
Jaspersoft is an open source set of BI (Business Intelligence) and ETL (Extract, Transform
and Load) tools, which incorporates R (a project for statistical computing) and
supports Hadoop/Hive.
-
R
R is a language and environment for statistical computing (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, etc.) and graphics.
-
HADOOP HARDWARE
-
Hadoop Nodes
Production Cluster
DL380 - Master Nodes
  8 x 2.8 GHz Intel, 60 GB RAM
  4 x 146 GB 10k SAS (RAID)
  6 x Gb NICs, Mgmt Onboard
  Redundant Power Supplies
R415 - Worker/Data Nodes
  12 x 2.6 GHz AMD, 32 GB RAM
  4 x 2 TB SATA (JBOD)
  4 x Gb NICs, Mgmt Onboard
  Single Power Supply
R515 - Access Nodes
  12 x 2.6 GHz AMD, 64 GB RAM
  12 x 2 TB SATA (RAID)
  4 x 10Gb NICs, Mgmt Onboard
  Redundant Power Supplies
Backup Cluster
R710 - Master Nodes
  8 x 2.6 GHz Intel, 48 GB RAM
  4 x 300 GB 15k SAS (RAID)
  8 x Gb NICs, Mgmt Onboard
  Redundant Power Supplies
R310 - Worker/Data Nodes
  4 x 2.4 GHz Intel, 8 GB RAM
  4 x 2 TB SATA (JBOD)
  2 x Gb NICs, No Mgmt
  Single Power Supply
R515 - Access Nodes
  12 x 2.6 GHz AMD, 64 GB RAM
  12 x 2 TB SATA (RAID)
  4 x 10Gb NICs, Mgmt Onboard
  Redundant Power Supplies
Integration/UAT Cluster
R710 - Master Nodes
  8 x 2.6 GHz Intel, 48 GB RAM
  4 x 300 GB 15k SAS (RAID)
  8 x Gb NICs, Mgmt Onboard
  Redundant Power Supplies
R415 - Worker/Data Nodes
  12 x 2.6 GHz AMD, 32 GB RAM
  4 x 2 TB SATA (JBOD)
  4 x Gb NICs, Mgmt Onboard
  Single Power Supply
R515 - Access Nodes
  12 x 2.6 GHz AMD, 64 GB RAM
  12 x 2 TB SATA (RAID)
  4 x 10Gb NICs, Mgmt Onboard
  Redundant Power Supplies
[Diagrams: HP ProLiant DL380 G7 server front panel; network connections labeled Primary Network (1GbE), Secondary Network (1GbE) and Management Network.]
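The worker/data node disk figures above translate into storage capacity with simple arithmetic. The sketch below works through it for one worker node (4 x 2 TB SATA, JBOD). The replication factor of 3 and the node count of 10 are assumptions for illustration; the slides do not state either, and the calculation ignores space reserved for the operating system and temporary data.

```python
def node_raw_tb(disks, tb_per_disk):
    """Raw disk capacity of one node in TB."""
    return disks * tb_per_disk


def usable_tb(raw_tb, replication=3):
    """Capacity left for unique data when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return raw_tb / replication


def cluster_usable_tb(nodes, disks, tb_per_disk, replication=3):
    """Usable capacity of a cluster of identical worker nodes in TB."""
    return usable_tb(nodes * node_raw_tb(disks, tb_per_disk), replication)


if __name__ == "__main__":
    raw = node_raw_tb(disks=4, tb_per_disk=2)
    print(raw)                       # raw TB per worker node
    print(round(usable_tb(raw), 2))  # TB of unique data per worker node
    # Hypothetical cluster of 10 worker nodes:
    print(round(cluster_usable_tb(10, 4, 2), 2))
```

With JBOD disks there is no RAID overhead on the worker nodes, so redundancy is paid for through replication rather than at the disk controller.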
-
Hadoop - Production
-
Hadoop Integration/UAT and Backup