hadoop_realtime_processing_evenkat

Hadoop Real Time Processing Systems

Objective

Apache Storm is a free and open source distributed real-time computation system.

Storm makes it easy to reliably process unbounded streams of data, doing for

real-time processing what Hadoop did for batch processing. The main purpose of

this course is to provide the knowledge and skills for real-time analytics of a wide variety

of streamed data.

Apache Spark is an open-source data analytics cluster computing framework.

Spark is not tied to the two-stage MapReduce paradigm, and promises

performance up to 100 times faster than Hadoop MapReduce for certain

applications. Spark provides primitives for in-memory cluster computing that

allow user programs to load data into a cluster’s memory and query it repeatedly,

making it well suited to machine learning algorithms.

The participants will start by learning the what and why of Storm and how Storm

is used in real-time analytics. After that they will install Storm on their

systems and work with Spouts and Bolts. They will then be introduced to Spark,

a successor to MapReduce, using Scala. The participants will learn:

1. Hadoop Gen 2 Installation

2. Introduction to YARN and its working

3. Understand where to use Storm for real-time analytics

4. Set up an Apache Storm cluster on your system

5. Learn Storm Technology Stack and Groupings

6. Implement Spouts and Bolts

7. Work on multiple Real World Projects using Storm

8. Concepts and features of RDD

9. Transformations and Actions

10. Working of Spark in a Cluster

Note: The course will be 40% theoretical discussion and 60% hands-on work.

Duration: 30 hours

Audience

This course is designed for:

1. Anyone who wants to architect a project using Spark.

2. ETL or Data Warehousing Developers looking for an alternative approach to data analysis and storage.

3. Data Engineers.

Pre-Requisites

1. Basic knowledge of Java.

2. Basic understanding of Hadoop and its working.

Course Outline

1. Hadoop & YARN Overview

• Anatomy of Hadoop Cluster, Installing and Configuring Plain Hadoop

• What is Big Data Analytics

• Batch vs. Real-Time

• Limitations of Hadoop

• Storm for Real-Time Analytics

2. Storm Basics

• Installation of Storm

• Components of Storm

• Properties of Storm

3. Storm Technology Stack and Groupings

• Storm Running Modes

• Creating First Storm Topology

• Topologies in Storm
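
As a rough illustration of what a grouping decides, the sketch below shows two routing ideas in plain Java: a fields-style grouping, where tuples with the same key always land on the same task, and a shuffle-style grouping, where each tuple goes to a randomly chosen task. The class and method names are hypothetical, and the hash used here is only an assumption for illustration; it is not the org.apache.storm API or Storm's actual hashing.

```java
import java.util.List;
import java.util.Random;

// Illustrative only: hypothetical names, not part of the Storm API.
class GroupingSketch {

    /** Fields-grouping idea: equal keys always map to the same task index. */
    static int fieldsGrouping(String key, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numTasks);
    }

    /** Shuffle-grouping idea: each tuple goes to a randomly chosen task. */
    static int shuffleGrouping(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (String word : List.of("storm", "spark", "storm")) {
            System.out.printf("%s -> fields task %d, shuffle task %d%n",
                    word, fieldsGrouping(word, 4), shuffleGrouping(random, 4));
        }
    }
}
```

Both occurrences of "storm" get the same fields task, which is what makes per-key work such as counting possible on a single task.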

4. Spouts and Bolts

• Reliable vs Unreliable Messages

• Getting Data

• Bolt Lifecycle

• Bolt Structure

• Reliable vs Unreliable Bolts
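
The reliable vs unreliable distinction above can be pictured with a toy model in plain Java. A reliable source keeps each emitted message until the downstream step acknowledges it and replays the message on failure; an unreliable source simply loses it. Everything here (`ToySpout`, `run`, the fail-once rule) is a hypothetical stand-in chosen for illustration and is not the org.apache.storm API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;

// Toy model of the ack/fail idea; hypothetical classes, not the Storm API.
class ReliableSpoutSketch {

    /** A source that tracks every emitted message until it is acked or failed. */
    static final class ToySpout {
        private final Queue<String> source;
        private final Map<Integer, String> pending = new HashMap<>();
        private final boolean reliable;
        private int nextId = 0;

        ToySpout(List<String> messages, boolean reliable) {
            this.source = new ArrayDeque<>(messages);
            this.reliable = reliable;
        }

        boolean hasNext() { return !source.isEmpty(); }

        /** Emit the next message and remember it under a fresh message id. */
        int emit() {
            int id = nextId++;
            pending.put(id, source.remove());
            return id;
        }

        String payload(int id) { return pending.get(id); }

        /** Fully processed: forget the message. */
        void ack(int id) { pending.remove(id); }

        /** Processing failed: replay the message only if the spout is reliable. */
        void fail(int id) {
            String message = pending.remove(id);
            if (reliable) {
                source.add(message);
            }
        }
    }

    /** Runs an upper-casing "bolt" that fails once on the given word. */
    static List<String> run(List<String> messages, String failOnceOn, boolean reliable) {
        ToySpout spout = new ToySpout(messages, reliable);
        List<String> output = new ArrayList<>();
        boolean alreadyFailed = false;
        while (spout.hasNext()) {
            int id = spout.emit();
            String word = spout.payload(id);
            if (word.equals(failOnceOn) && !alreadyFailed) {
                alreadyFailed = true;
                spout.fail(id);
            } else {
                output.add(word.toUpperCase(Locale.ROOT));
                spout.ack(id);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c"), "b", true));  // [A, C, B]
        System.out.println(run(List.of("a", "b", "c"), "b", false)); // [A, C]
    }
}
```

In the reliable run the failed message is replayed and arrives late; in the unreliable run it is dropped.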

5. Spark Basics

• Batch Analytics

• Real Time Analytics Options

• Streaming Data – Storm

• In Memory Data – Spark

• Modes of Spark

6. Spark Installation

• Spark Installation

• Overview of Spark on a cluster

• Spark Standalone Cluster.

7. Working with RDD

• RDDs

• Transformations in RDD

• Actions in RDD

• Loading Data in RDD

• Saving Data through RDD

• Key-Value Pair RDD

• MapReduce and Pair RDD Operations

• Scala and Hadoop Integration (hands-on)
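
To give a feel for the transformation/action split and the key-value style of the topics above, here is a minimal sketch that uses `java.util.stream` as a stand-in for RDDs. This is an assumption made so the example runs on a plain JDK; it does not use the Spark API, and the class and method names are hypothetical. Intermediate stream operations play the role of lazy transformations, and the terminal operation plays the role of an action.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Stand-in sketch: Java streams imitate the RDD style; this is not Spark code.
class RddStyleSketch {

    /** Word count in the "split into words, then reduce by key" shape. */
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // "Transformations": describe the work, nothing runs yet.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                // "Action": triggers the pipeline and aggregates per key.
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    /** Returns {calls before the terminal op, calls after it, sum of doubled values}. */
    static int[] lazinessDemo(List<Integer> data) {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> doubled = data.stream().map(x -> {
            calls.incrementAndGet();
            return x * 2;
        });
        int before = calls.get();                             // map has not run yet
        int sum = doubled.mapToInt(Integer::intValue).sum();  // terminal op runs it
        return new int[] {before, calls.get(), sum};
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("storm spark", "spark streaming")));
        // {spark=2, storm=1, streaming=1}
        System.out.println(Arrays.toString(lazinessDemo(List.of(1, 2, 3))));
        // [0, 3, 12]
    }
}
```

Unlike a cached RDD, a Java stream can be consumed only once, so the analogy covers laziness and per-key aggregation but not repeated in-memory querying.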

8. Spark integration with Hive

9. Spark Streaming
