

Big Data – Apache Hadoop Administrator Training

Objective

This training aims to give participants a comprehensive understanding of all the steps necessary to operate and maintain a Hadoop cluster, from installation and configuration through load balancing and tuning. Participants will carry out a complete installation of a Hadoop cluster, learn the basic and advanced concepts of MapReduce, and study best practices for Apache Hadoop development as experienced by the developers and architects of core Apache Hadoop. Through hands-on exercises, participants will cover the following topics:

1. The internals of MapReduce and HDFS, and how to build a Hadoop architecture.

2. Proper cluster configuration and deployment to integrate with the systems and hardware in the data centre.

3. How to load data into the cluster from dynamically generated files using Flume and from an RDBMS using Sqoop (see the Sqoop and Flume sketch after this list).

4. Configuring the FairScheduler to provide service-level agreements for multiple users of a cluster (see the FairScheduler sketch after this list).

5. Discussing Kerberos-based security for your cluster.

6. Best practices for preparing and maintaining Apache Hadoop in production.

7. Troubleshooting, diagnosing, tuning and solving Hadoop issues.
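
Below is a brief, hedged sketch of the data-loading topic from item 3: a Sqoop import from a relational database and a minimal Flume agent that tails a growing log file into HDFS. The host names, database, table, paths and agent name are placeholders, and exact flags can vary between Sqoop and Flume releases.

    # Import one table from MySQL into HDFS with Sqoop
    # (JDBC URL, credentials, table and target path are placeholders)
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username dbuser -P \
      --table orders \
      --target-dir /data/sales/orders

    # Write a minimal Flume agent definition, then start the agent.
    # The exec source tails a log file; the HDFS sink writes events out.
    cat > agent1.properties <<'EOF'
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app/app.log
    agent1.sources.src1.channels = ch1
    agent1.channels.ch1.type = memory
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
    agent1.sinks.sink1.channel = ch1
    EOF
    flume-ng agent --name agent1 --conf-file agent1.properties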
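
For item 4, the sketch below assumes the YARN (Hadoop 2) FairScheduler; the queue names, weights and minimum resources are invented examples of how per-team shares can approximate a service-level agreement.

    # Property to add inside <configuration> in yarn-site.xml:
    #   yarn.resourcemanager.scheduler.class =
    #     org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

    # Allocation file: give the "etl" queue twice the share of "adhoc"
    # plus a guaranteed minimum (queue names and numbers are examples)
    cat > $HADOOP_CONF_DIR/fair-scheduler.xml <<'EOF'
    <?xml version="1.0"?>
    <allocations>
      <queue name="etl">
        <weight>2.0</weight>
        <minResources>10000 mb,10vcores</minResources>
      </queue>
      <queue name="adhoc">
        <weight>1.0</weight>
      </queue>
    </allocations>
    EOF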

Note: The course consists of 20% theoretical discussion and 80% hands-on work.

Audience & Pre-Requisites

This course is designed for Systems Administrators and IT Managers who have basic Linux experience. No prior knowledge of Apache Hadoop is required.

Duration: 30 hours

Course Outline

• Introduction

• The Case for Apache Hadoop

o A Brief History of Hadoop


o Core Hadoop Components

o Fundamental Concepts

• The Hadoop Distributed File System

o HDFS Features

o HDFS Design Assumptions

o Overview of HDFS Architecture

• MapReduce and YARN

o What Is MapReduce?

o Features of MapReduce

o Basic MapReduce Concepts

o Architectural Overview

o Hands-On Exercise
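
As a concrete, hedged illustration of the MapReduce flow outlined above, the Hadoop Streaming sketch below implements word count with plain shell commands: the mapper splits lines into words, the shuffle sorts them, and the reducer counts adjacent duplicates. The input and output paths are placeholders, and the streaming jar location varies by distribution.

    # Mapper: squeeze runs of whitespace into newlines (one word per line)
    cat > wc_mapper.sh <<'EOF'
    #!/bin/bash
    tr -s '[:space:]' '\n'
    EOF
    # Reducer: count adjacent duplicates (shuffle has already sorted keys)
    cat > wc_reducer.sh <<'EOF'
    #!/bin/bash
    uniq -c
    EOF
    chmod +x wc_mapper.sh wc_reducer.sh

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -files wc_mapper.sh,wc_reducer.sh \
      -input /data/books \
      -output /data/wordcount \
      -mapper wc_mapper.sh \
      -reducer wc_reducer.sh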

• An Overview of the Hadoop Ecosystem

o What is the Hadoop Ecosystem?

o Analysis Tools

o Data Storage and Retrieval Tools

• Overview of Cloudera Distributions of Hadoop

o What is CDH?

• Overview of Hortonworks Distributions of Hadoop

• Planning your Hadoop Cluster

o General Planning Considerations

o Choosing the Right Hardware

o Network Considerations

• Gen1 – Pseudo and 4-Node Cluster – Vanilla Hadoop

o Installation

o Configuration

o Performance Aspects

• Installing a 4-Node Cluster with NN, SNN and JT in EC2

• Hadoop Installation

o Deployment Types

o Installing Hadoop

o Basic Configuration Parameters

o Hands-On Exercise
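
To make the installation topic above concrete, here is a hedged sketch of bringing up a pseudo-distributed Hadoop 2 deployment once the packages are in place; the property values are the usual single-node settings, not the only possibility.

    # Minimal single-node settings (add inside <configuration>):
    #   core-site.xml : fs.defaultFS    = hdfs://localhost:8020
    #   hdfs-site.xml : dfs.replication = 1

    hdfs namenode -format   # one-time format; destroys existing metadata
    start-dfs.sh            # start NameNode, SecondaryNameNode, DataNode
    start-yarn.sh           # start ResourceManager and NodeManager
    jps                     # confirm the daemons are running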


• Advanced Configuration

o Advanced Parameters

o Configuring Rack Awareness
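
One common way to configure the rack awareness covered above is a topology script; the sketch below assumes the Hadoop 2 property name net.topology.script.file.name, and the IP-to-rack mapping is a made-up example.

    # core-site.xml property (inside <configuration>):
    #   net.topology.script.file.name = /etc/hadoop/conf/rack-topology.sh

    cat > /etc/hadoop/conf/rack-topology.sh <<'EOF'
    #!/bin/bash
    # Print a rack path for every host/IP Hadoop passes as an argument
    for host in "$@"; do
      case $host in
        10.1.1.*) echo -n "/rack1 " ;;
        10.1.2.*) echo -n "/rack2 " ;;
        *)        echo -n "/default-rack " ;;
      esac
    done
    EOF
    chmod +x /etc/hadoop/conf/rack-topology.sh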

• Hadoop Security

o Why Hadoop Security Is Important

o Hadoop’s Security System Concepts

o What Kerberos Is and How it Works
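
As a small companion to the security topics above, the sketch below shows the core-site.xml switches that turn on Kerberos authentication, plus the ticket a user then needs; the realm and principal are placeholders, and a full rollout involves much more (a KDC, per-daemon principals and keytabs).

    # core-site.xml properties (inside <configuration>):
    #   hadoop.security.authentication = kerberos
    #   hadoop.security.authorization  = true

    # With security on, users must hold a ticket before using HDFS
    kinit alice@EXAMPLE.COM      # EXAMPLE.COM is a placeholder realm
    hdfs dfs -ls /user/alice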

• Gen2 – Pseudo Cluster – Vanilla Hadoop

o Installation of Hadoop

o Hadoop 2 Configuration

o Hadoop Federation Capability

• Configuring HA in Gen2

• Configuring Federation in Gen2
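
A hedged sketch of the Gen2 HA configuration: the hdfs-site.xml properties below define a two-NameNode nameservice backed by a Quorum Journal; the logical name "mycluster" and all host names are placeholders. Federation extends the same mechanism by listing several independent nameservices in dfs.nameservices.

    # Add inside <configuration> in hdfs-site.xml:
    #   dfs.nameservices                       = mycluster
    #   dfs.ha.namenodes.mycluster             = nn1,nn2
    #   dfs.namenode.rpc-address.mycluster.nn1 = nn1.example.com:8020
    #   dfs.namenode.rpc-address.mycluster.nn2 = nn2.example.com:8020
    #   dfs.namenode.shared.edits.dir          =
    #     qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster
    #   dfs.client.failover.proxy.provider.mycluster =
    #     org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

    # Once HA is up, check which NameNode is active
    hdfs haadmin -getServiceState nn1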

• Managing and Scheduling Jobs

o Managing Running Jobs

o Hands-On Exercise

o The Capacity Scheduler
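
To ground the job-management and Capacity Scheduler topics above, a hedged sketch using YARN-era commands; the application ID and the queue split are examples only.

    # Inspect and stop running jobs
    yarn application -list
    yarn application -kill application_1700000000000_0001   # ID is an example

    # capacity-scheduler.xml: divide the root queue between two queues,
    # then reload queues without restarting the ResourceManager
    #   yarn.scheduler.capacity.root.queues        = prod,dev
    #   yarn.scheduler.capacity.root.prod.capacity = 70
    #   yarn.scheduler.capacity.root.dev.capacity  = 30
    yarn rmadmin -refreshQueues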

• Cluster Maintenance

o Checking HDFS Status

o Hands-On Exercise

o Copying Data Between Clusters

o Adding and Removing Cluster Nodes [Node Maintenance]

o Rebalancing the Cluster

o Hands-On Exercise

o NameNode Metadata Backup

o Cluster Upgrading

o User Management

o Quota Management
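
The sketch below gathers the day-to-day commands behind the maintenance topics above (Hadoop 2 syntax; every path, host and limit is a placeholder).

    # Health and status
    hdfs dfsadmin -report            # DataNode capacity and liveness
    hdfs fsck / -files -blocks       # block-level integrity check

    # Copy data between clusters
    hadoop distcp hdfs://nn-a.example.com:8020/data hdfs://nn-b.example.com:8020/data

    # Decommission a node: list it in the file named by dfs.hosts.exclude,
    # then make the NameNode re-read that file
    hdfs dfsadmin -refreshNodes

    # Rebalance DataNodes until usage is within 10% of the cluster mean
    hdfs balancer -threshold 10

    # NameNode metadata backup: force a checkpoint, then fetch the fsimage
    hdfs dfsadmin -safemode enter
    hdfs dfsadmin -saveNamespace
    hdfs dfsadmin -safemode leave
    hdfs dfsadmin -fetchImage /backup/namenode/

    # Quotas: cap a home directory at one million names and 1 TB of space
    hdfs dfsadmin -setQuota 1000000 /user/alice
    hdfs dfsadmin -setSpaceQuota 1t /user/alice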

• Cluster Monitoring and Troubleshooting

o General System Monitoring

o Managing Hadoop’s Log Files

o Using the NameNode and JobTracker Web UIs

o Hands-On Exercise

o Cluster Monitoring with Ganglia

o Common Troubleshooting Issues

o Benchmarking Your Cluster
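
A typical way to benchmark a cluster, as in the last topic above, is the TeraSort suite shipped with the MapReduce examples; the 10 GB size and the paths below are placeholders, and the jar location varies by distribution.

    EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar

    hadoop jar $EXAMPLES teragen 100000000 /bench/teragen   # 100M rows ≈ 10 GB
    hadoop jar $EXAMPLES terasort /bench/teragen /bench/terasort
    hadoop jar $EXAMPLES teravalidate /bench/terasort /bench/teravalidate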


• Installing and Managing Other Hadoop Projects

o Hive

o Pig

o Sqoop

• Working with Apache Ambari

o Installation of a 4-Node Cluster

o WebHDFS

o Security in Ambari

o Adding new host via Ambari

o Configuring Capacity Scheduler

o Mounting HDFS

o HDFS Snapshots
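
Finally, a hedged sketch of the WebHDFS, HDFS-mounting and snapshot topics above; host names and paths are placeholders, and 50070 is the default Hadoop 2 NameNode web port.

    # WebHDFS: list a directory over HTTP
    curl "http://nn1.example.com:50070/webhdfs/v1/user/alice?op=LISTSTATUS"

    # Mount HDFS via the NFSv3 gateway (assumes the gateway is running)
    mount -t nfs -o vers=3,proto=tcp,nolock nn1.example.com:/ /mnt/hdfs

    # Snapshots: enable on a directory, take one, restore a file from it
    hdfs dfsadmin -allowSnapshot /user/alice
    hdfs dfs -createSnapshot /user/alice before-upgrade
    hdfs dfs -cp /user/alice/.snapshot/before-upgrade/report.csv /user/alice/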