Course Details
BIG DATA HADOOP & SPARK DEVELOPMENT
www.acadgild.com | 90360 10796
Brief About the Course
Hadoop is considered one of the most effective data platforms for companies working with Big Data and is an integral part of storing, handling, and retrieving enormous amounts of data in a variety of applications. In this course you will learn the Hadoop architecture in depth, along with the key components of the Hadoop ecosystem: Hive, HBase, Sqoop, Flume, and Pig.
Who should take this course
Any graduate aiming to build a career in Big Data can take this course. It is especially beneficial for:
• Software Developers and Architects
• Professionals with an analytics and data management profile
• Business Intelligence Professionals
• Project Managers
• Data Scientists
• Professionals with a Business Intelligence, ETL, and data warehousing background
• Professionals from a testing and mainframes background
SYLLABUS
Solving the Big Data Problem & Hadoop Framework
• Why is Data So Important?
• Pre-requisite – Data Scale
• What is Big Data?
• Big Bank: Big Challenge
• Common Problems
• 3 Vs Of Big Data
• Defining Big Data
• Sources Of Data Flood
• Exploding Data Problem
• Redefining The Challenges Of Big Data
• Possible Solutions: Scaling Up Vs. Scaling Out
• Challenges Of Scaling Out
• Solution For Data Explosion: Hadoop
• Hadoop: Introduction
• Hadoop In Layman's Terms
• Hadoop Ecosystem
• Evolutionary Features Of Hadoop
• Hadoop Timeline
• Why Learn Big Data Technologies?
• Who Is Using Big Data?
• HDFS: Introduction
• Design Of HDFS
• HDFS Blocks
• Components Of Hadoop 1.X
• NameNode And Hadoop Cluster
• Arrangement Of Racks
• Arrangement Of Machines And Racks
• Local FS And HDFS
Day 1 2 Hours
HDFS
• NameNode
• Checkpointing
• Replica Placement
• Benefits: Replica Placement And Rack Awareness
• URI
• URL And URN
• HDFS Commands
• Problems With HDFS In Hadoop 1.X
• HDFS Federation (Included In Hadoop 2.X)
• HDFS Federation
• High Availability
• Anatomy Of File Read From HDFS
• Data Read Steps
• Important Java Classes To Write Data To HDFS
• Anatomy Of File Write To HDFS
• Writing File To HDFS: Steps
Day 2 2 Hours
Exploring MapReduce
• Building Principles
• Introduction To MapReduce
• MR Demo
• Pseudo Code
• Mapper Class
• Reducer Class
• Driver Code
• InputSplit
• InputSplit And Data Blocks – Difference
• Why Is The Block Size 128 MB?
• RecordReader
• InputFormat
• Default InputFormat: TextInputFormat
• InputFormat
• OutputFormat
• Using A Different OutputFormat
• Important Points
• Partitioner
• Using Partitioner
• Map Only Job
• Flow Of Operations In MapReduce
Day 3 2 Hours
Schedulers in YARN & Introduction to Pig
• Serialization In MapReduce
• Custom Writable In MapReduce
• Custom WritableComparable In MapReduce
• Schedulers In YARN
• FIFO Scheduler
• Capacity Scheduler
• Fair Scheduler
• Differences Between Hadoop 1.X And Hadoop 2.X
• Introduction to Apache Pig
• Why Pig?
• Apache Pig Architecture
• Simple Data Types
• Complex Data Types
Day 4 2 Hours
Exploring Pig
• Sample Execution
• Pig Operators demo
• Parameter Substitution
• Macros
• Anatomy Of Reduce-Side-Join
• Job Optimizations In Pig
• UDFs In Pig
• Execution Of XML And CSV Files In Pig
Day 5 2 Hours
Hive Introduction
• Introduction
• Hive DDL
• Demo: Databases.Ddl
• Demo: Tables.Ddl
• Hive Views
• Demo: Views.Ddl
• Architecture
• Primary Data Types
• Data Load
• Demo: ImportExport.Dml
• Demo: HiveQueries.Dml
• Demo: Explain.Hql
• Table Types
• Demo: ExternalTable.Ddl
• Complex Data Types
• Demo: Working With Complex Datatypes
• Hive Variables
• Demo: Working With Hive Variables
• Hive Variables And Execution Customisation
Day 6 2 Hours
Hive Operations
Day 7 2 Hours
• Working With Arrays
• Sort By And Order By
• Distribute By And Cluster By
• Partitioning
• Static And Dynamic Partitioning
• Bucketing Vs Partitioning
• Joins And Types
• Bucket-Map Join
• Sort-Merge-Bucket-Map Join
• Left Semi Join
• Demo: Join Optimisations
• Input Formats In Hive
• Sequence Files In Hive
• RC File In Hive
• File Formats In Hive
• ORC Files In Hive
• Inline Index In ORC Files
• ORC File Configurations In Hive
Advanced Hive
Day 8 2 Hours
• SerDe In Hive
• Demo: CSVSerDe
• JSONSerDe
• RegexSerDe
• Analytic And Windowing In Hive
• Demo: Analytics.Hql
• HCatalog In Hive
• Demo: Using_HCatalog
• Accessing Hive With JDBC
• Demo: HiveQueries.Java
• HiveServer2 And Beeline
• Demo: Beeline
• UDF In Hive
• Demo: ToUpper.Java And Working_with_UDF
• Optimizations In Hive
• Demo: Optimizations
HBase
• Challenges With Traditional RDBMS
• Features Of NoSQL Databases
• NoSQL Database Types
• CAP Theorem
• What Is HBase Regions
• HBase HMaster ZooKeeper
• HBase First Read
• HBase Meta Table
• Region Split
• Apache HBase Architecture Benefits
• HBase Vs. RDBMS
• Shell Commands
Day 9 2 Hours
Oozie and Sqoop
• Introduction To Oozie
• Oozie Architecture
• Oozie Workflow Nodes
• Oozie Server
• Oozie Workflow
• Sqoop Architecture
• Sqoop Features
Day 10 2 Hours
Sqoop contd. & Apache Flume
• Sqoop Hands On
• Flume: Introduction
• Flume Architecture
• Example Description
• Transactions
• Batching
• Partitioning
• Exec Source
• Spooling Directory Source
• File Channel
• Memory Channel
• Logger Sink
• HDFS Sink
Day 11 2 Hours
Project - 1 & Introduction to Scala - Session I
• Project Discussion
• Introduction to Functional Programming Languages and Scala
• Functional vs OOP
• Variable
• Functions
• Using if and while to define logic
• Loops in Scala
• Collections in Scala
Day 12 2 Hours
Scala - Session II
• Object Oriented Programming
• Classes and Objects
• Traits in Scala
• Constructors in Scala
• Method Overloading
• Implicit parameter usage
Day 13 2 Hours
Scala - Session III
• Inheritance - OOP
• Override modifier
• Polymorphism
• Invoking superclass methods
• Final members
• Traits in detail
Day 14 2 Hours
Scala - Session IV
• Control Structures in detail
• Exception Handling
• Coding without break and continue
• Coding the functional way
• Case classes in Scala
• Implicit conversions
• Parameter in depth
Day 15 2 Hours
Introduction to Apache Spark
• Introduction to Apache Spark
• Map Reduce Limitations
• RDDs
• SparkContext, SQLContext and HiveContext
Day 16 2 Hours
RDDs in Spark
• Programming with RDDs
• Creating RDDs from text files
• Transformations and Actions
• How Spark execution works
• RDD APIs: filter, flatMap, fold, foreach, glom, groupBy, map, reduceByKey, zip, persist, unpersist
• Read/Write from storage
• RDD Examples
Day 17 2 Hours
RDDs contd. & Introduction to DataFrames
• RDD APIs: aggregate, cartesian, checkpoint, coalesce, repartition, cogroup, collectAsMap, combineByKey, count and countApprox functions
• More RDD Examples
• Schema: StructType, StructField, DataType
• DataFrame APIs and examples
Day 18 2 Hours
Spark SQL
• Create temporary tables
• SparkSQL
• Parquet vs Avro
• Examples and problem solving on real data using RDDs and converting the same to DataFrames
Day 19 2 Hours
Spark Streaming
• Demo: Spark Streaming Example
Day 20 2 Hours
MLlib and GraphX
• Spark MLlib
• GraphX
Day 21 2 Hours
Deploying a Spark application
• Create a Spark project
• SBT / Maven
• How Maven repositories work
Day 22 2 Hours
Project Demo From Hadoop
• Demo: Music data analysis using Hadoop
Day 23 2 Hours
Project II
• Final project discussion
Day 24 2 Hours