
Part 7: Big Data Components on Docker, Volume 1
• Learn to implement Big Data Hadoop components on Docker
• Apache services such as Spark, Zookeeper, Kafka, and Solr

Why would any organization want to store data?
• The present and the future belong to those who hold onto their data and work with it to improve their current operations and innovate to generate newer products and opportunities.
• Data, and the creative use of it, is at the heart of organizations such as Google, Facebook, Netflix, Amazon, and Yahoo!.
• The same applies to any other organization that uses a database, for predictive analytics and reporting.
• These companies have proven that data, along with powerful analysis, helps in building fantastic and powerful products.

What is Big Data?
• Organizations now want to use this data to gain insight that helps them understand existing problems, seize new opportunities, and be more profitable.
• The study and analysis of these vast volumes of data has given birth to the term big data.

Distributed Computing / Clusters
• Several companies have been working to solve this problem and have come out with a few commercial offerings that leverage the power of distributed computing.
• In this solution, multiple computers work together as a cluster to store and process large volumes of data in parallel, thus making the analysis of large volumes of data possible.
• Google, the Internet search engine giant, ran into issues when the data acquired by crawling the Web grew to such large volumes that it was becoming increasingly impossible to process.
• They had to find a way to solve this problem, and this led to the creation of the Google File System (GFS) and MapReduce.

What is Apache Hadoop?
• Apache Hadoop is a widely used open source distributed computing framework that is employed to efficiently process large volumes of data using large clusters of cheap or commodity computers.
• Apache Hadoop is a framework written in Java that:
• Is used for distributed storage and processing of large volumes of data, runs on top of a cluster, and can scale from a single computer to thousands of computers
• Stores and processes data on every worker node (the nodes on the cluster that are responsible for the storage and processing of data) and handles hardware failures efficiently, providing high availability (see the HDFS sketch below)
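As a minimal sketch of this distributed storage layer, the following HDFS shell commands copy a local file into the cluster and read it back; the directory and file names are hypothetical, and a running Hadoop installation (such as the CDH container used later in this deck) is assumed.

# create a directory in HDFS (hypothetical path)
hdfs dfs -mkdir -p /user/demo
# copy a local file into HDFS, where its blocks are replicated across worker nodes
hdfs dfs -put sample.txt /user/demo/
# list and read the file back from the cluster
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/sample.txt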

MapReduce & YARN (Yet Another Resource Negotiator)
• Hadoop uses the MapReduce programming model to process data.
• Most of the Apache Hadoop clusters in production run Apache Hadoop 1.x (MRv1, MapReduce Version 1).
• The new version of Apache Hadoop, 2.x (MRv2, MapReduce Version 2), also referred to as Yet Another Resource Negotiator (YARN), is being actively adopted by many organizations.

YARN vs MapReduce
• YARN is a general-purpose, distributed application management framework for processing data in Hadoop clusters.
• YARN was built to solve the following two important problems:
• Support for large clusters (4,000 nodes or more)
• Ability to run applications other than MapReduce, for example Apache Giraph, while still allowing MapReduce to make use of data already stored in HDFS (a sample job submission follows this list)
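As a quick illustration of running a job on YARN, the sketch below submits the stock wordcount example that ships with Hadoop; the jar path matches typical CDH layouts but may differ on other distributions, and the input and output paths are hypothetical.

# stage some input in HDFS (hypothetical paths)
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put sample.txt /user/demo/input/
# submit the bundled wordcount example as a YARN (MRv2) application
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/demo/input /user/demo/output
# inspect the result
hdfs dfs -cat /user/demo/output/part-r-00000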

Scenario from a Customer
• The customer needs a HANA DB running on 500 GB, and the Apache components running on the remaining 500 GB.
• The Apache components will sync with the HANA DB to produce the customer's desired results.
• The database team will handle further requests from the customer.
• The Apache components needed are Spark, Zookeeper, Kafka, and Solr.

Apache Spark: Why is Spark a better choice for HANA?
• Apache Spark is a data processing engine for large data sets.
• Apache Spark is much faster (up to 100 times faster in memory) than Apache Hadoop MapReduce.
• Cluster mode: Spark applications run as independent processes coordinated by the SparkContext object in the driver program, which is the main program.
• SparkContext: connects to several types of cluster managers to allocate resources to Spark applications.
• Supported cluster managers include the Standalone cluster manager, Mesos, and YARN.
• Apache Spark is designed to access data from varied data sources including HDFS, Apache HBase, and NoSQL databases such as Apache Cassandra and MongoDB.
• We will run an Apache Spark Master in cluster mode using the YARN cluster manager in a Docker container.

Setting the Environment
• Setting the Environment
• Running the Docker Container for CDH
• Running an Apache Spark Job in yarn-cluster Mode
• Running an Apache Spark Job in yarn-client Mode
• Running the Apache Spark Shell

Apache Spark
• sudo docker pull svds/cdh
• sudo docker run -p 8088 -d --name cdh svds/cdh
• docker exec -it cdh bash
• spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
• spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
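After submitting a job, it is worth confirming that YARN actually ran it. A minimal sketch, assuming the command is run inside the cdh container started above; the ResourceManager web UI also listens on port 8088, which the docker run command mapped.

# list completed YARN applications and their final status
yarn application -list -appStates FINISHED
# in yarn-client mode, SparkPi also prints its result ("Pi is roughly 3.14...") directly to the console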

Running the Apache Spark Shell (this is what developers do)
• spark-shell --master yarn-client
• In the shell, define and run a simple Scala object:
object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}
HelloWorld.main(null)

Summary: Apache Spark
• Ran Apache Spark applications on a YARN cluster in a Docker container using the spark-submit command.
• Ran the example application in yarn-cluster and yarn-client modes.
• Ran a HelloWorld Scala script in a Spark shell.

Apache Solr / Environment Set-up
• Apache Solr is an open source search platform built on Apache Lucene, a text search engine library.
• Apache Solr is scalable and reliable and provides indexing and querying services.
• Cloudera Search is based on Apache Solr.
• Setting the Environment
• Starting the Docker Container for the Apache Solr Server
• Starting an Interactive Shell
• Logging in to the Solr Admin Console
• Creating a Core Admin Index
• Loading Sample Data
• Querying Apache Solr in the Solr Admin Console
• Querying Apache Solr using a REST API Client
• Deleting Data
• Listing Logs
• Stopping the Apache Solr Server

Starting the Docker Container for the Apache Solr Server
• docker pull solr
• docker run -p 8983:8983 -d --name solr_on_docker solr
• docker logs -f <containerID>
• docker exec -it --user=solr solr_on_docker bash
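The remaining Solr steps from the agenda can be sketched as follows, assuming the official solr image started above; the core name gettingstarted is illustrative, and the path to the bundled example documents may differ between Solr versions.

# create a core (run as the solr user inside the container)
docker exec -it --user=solr solr_on_docker solr create_core -c gettingstarted
# load the sample documents bundled with Solr into the core
docker exec -it --user=solr solr_on_docker bin/post -c gettingstarted example/exampledocs/manufacturers.xml
# query the core over the REST API from the host
curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json"
# delete all documents from the core
curl "http://localhost:8983/solr/gettingstarted/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"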

Apache Kafka
• Apache Kafka is a messaging system based on the publish-subscribe model.
• A Kafka cluster consists of one or more servers called brokers. Kafka keeps messages categorized by "topics".
• Producers produce messages and publish the messages to topics.
• Consumers subscribe to specific topic(s) and consume feeds of messages published to the topics.
• The messages published to a topic do not have to be consumed as they are produced; they are stored in the topic for a configurable duration.
• A consumer may choose to consume the messages in a topic from the beginning.
• An Apache ZooKeeper server is used to coordinate a Kafka cluster.

Kafka + Zookeeper

Setting up the Environment for Kafka
• Starting Docker Containers for Apache Kafka
• Finding IP Addresses
• Listing the Kafka Logs
• Creating a Kafka Topic
• Starting the Kafka Producer
• Starting the Kafka Consumer
• Producing and Consuming Messages
• Stopping and Removing the Docker Containers

Starting Docker Containers for Apache Kafka
• docker pull dockerkafka/zookeeper
• docker pull dockerkafka/kafka
• docker run -d --name zookeeper -p 2181:2181 dockerkafka/zookeeper
• docker run --name kafka -p 9092:9092 --link zookeeper:zookeeper dockerkafka/kafka

Finding IP Addresses
• export ZK_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper)
• export KAFKA_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' kafka)
• echo $ZK_IP
• echo $KAFKA_IP
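The remaining Kafka steps from the agenda can be sketched with the standard Kafka console tools, assuming they are on the PATH inside the dockerkafka/kafka container and using the IP addresses captured above; the topic name kafka-on-docker is illustrative.

# create a topic (the --zookeeper style matches Kafka versions of this era)
kafka-topics.sh --create --zookeeper $ZK_IP:2181 --replication-factor 1 --partitions 1 --topic kafka-on-docker
# start a console producer and type a few messages
kafka-console-producer.sh --broker-list $KAFKA_IP:9092 --topic kafka-on-docker
# in another shell, start a console consumer that reads from the beginning of the topic
kafka-console-consumer.sh --zookeeper $ZK_IP:2181 --topic kafka-on-docker --from-beginning
# finally, stop and remove the containers
docker stop kafka zookeeper
docker rm kafka zookeeper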

Running Docker on SLES 12 SP1
• Docker is officially supported only on SLES version 12.
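A minimal sketch of installing and starting Docker on SLES 12, assuming the Containers module (which provides the docker package) has been added to the system:

# install the docker package and start the daemon
sudo zypper install docker
sudo systemctl start docker
sudo systemctl enable docker
# verify that the daemon is running
sudo docker info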