distributed data processing workshop - sbu


Distributed Data Processing Workshop

Shahid Beheshti Campus

School of Computer Science and Engineering

Course: Distributed Databases

Instructor: Dr. Hadi Tabatabaei

Presenter: Abolfazl Sedighi, Aban 1393 (October-November 2014)

Distributed Data Processing

School of Computer Science and Engineering

A. Sedighi

@amirsedighi

Hexican.com


https://twitter.com/amirsedighi

http://hexican.com/


Every Game Needs Its Playing Yard


What can I do on a Single Machine?

● MVC Programming

● Regular business apps

● 100 GB of data

● Web Surfing

● ...


Linux Cluster


Introduction

This is a four-session, hands-on, step-by-step tutorial on setting up a Linux cluster on your own machine (notebook or PC) to try out a few big-data processing frameworks and tools.


What are we going to do?

● Your notebook or PC is enough to get started.

– Setting up your Linux cluster.

● Distributed log management and real-time search engines.

– What is Elasticsearch?

– Elasticsearch on the cluster.

– Monitoring and usage.

● The most popular distributed data processing framework.

– What is Apache Hadoop?

– Apache Hadoop on the cluster.

– Usage scenarios.

http://elasticsearch.org/

http://hadoop.apache.org/


What will we learn?

● Leveraging our knowledge of Big-Data.

● Getting familiar with distributed data processing.

● Maximizing availability and reliability.

● Increasing data storage capacity.

● Improving data processing performance.

● Data locality is a silver bullet.

● Increasing cluster utilization.

● Taming giants by giving them a try.


Preparing the Linux Cluster - VirtualBox


Preparing the Cluster - Hosting

● VirtualBox

– Memory Size, Disk Capacity and CPU cores.

– Network interfaces: NAT (provides Internet access) and Host-Only (provides cluster communication).


Preparing the Cluster – Adding a Host-Only Network


Preparing the Cluster – Adding a NAT Interface


Preparing the Cluster – Adding a Host-Only Interface
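
The VirtualBox settings from the last few slides (memory, CPU cores, a NAT adapter and a Host-Only adapter) can also be applied from the command line with VBoxManage. The following is only a sketch: the VM name `node1`, the sizes, the adapter name `vboxnet0` and the address 192.168.56.1 are assumptions, and the option names are those of the classic `hostonlyif` and `modifyvm` subcommands, so check `VBoxManage help` on your version. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: VM name, sizes and adapter name below are assumptions.
VM_NAME="node1"          # hypothetical name of the first node
MEMORY_MB=2048           # RAM per node
CPUS=2                   # CPU cores per node
HOSTONLY_IF="vboxnet0"   # usual name of the first host-only network on Linux/macOS hosts

# Print each command; execute it only when APPLY=1 is set.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

configure_vm() {
    # Create the host-only network and give the host side an address.
    run VBoxManage hostonlyif create
    run VBoxManage hostonlyif ipconfig "$HOSTONLY_IF" --ip 192.168.56.1 --netmask 255.255.255.0
    # Size the VM (it must be powered off for modifyvm).
    run VBoxManage modifyvm "$VM_NAME" --memory "$MEMORY_MB" --cpus "$CPUS"
    # Adapter 1: NAT for Internet access. Adapter 2: Host-Only for cluster traffic.
    run VBoxManage modifyvm "$VM_NAME" --nic1 nat
    run VBoxManage modifyvm "$VM_NAME" --nic2 hostonly --hostonlyadapter2 "$HOSTONLY_IF"
}

configure_vm
```

Review the printed commands first, then run the script again with `APPLY=1` in the environment to apply them.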


Preparing the Cluster – First Node

● Creating a Linux machine inside VirtualBox.

● Installing Linux. (I've used Ubuntu 12.04)

– Check Samba

– Check OpenSSH

● Give the first node everything it needs.

– Create an “install” folder on it.

– Install prerequisites such as Java on it.

● Shut down the first node.
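
Because every later node is a clone of this one, it may be worth checking the first node before shutting it down. The sketch below assumes the usual daemon names `sshd` and `smbd` for OpenSSH and Samba, and assumes the “install” folder lives in the home directory; neither detail comes from the slides.

```shell
#!/bin/sh
# Report whether a command is available on this node.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Run all checks and return non-zero if anything is missing.
check_node() {
    rc=0
    # sshd and smbd are assumed daemon names for OpenSSH and Samba.
    for tool in java sshd smbd; do
        check_tool "$tool" || rc=1
    done
    # The location of the "install" folder in $HOME is an assumption.
    if [ -d "$HOME/install" ]; then
        echo "ok: $HOME/install"
    else
        echo "missing: $HOME/install"
        rc=1
    fi
    return "$rc"
}

check_node || echo "Fix the missing items before cloning this node."
```

Anything fixed here is fixed once; anything missed has to be repeated on every clone.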


Preparing the Cluster – Cloning, the VirtualBox Side

● Cloning the first node. (tutorial)

http://blog.hexican.com/2014/10/adding-new-node-to-cluster/
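
As an alternative to the GUI steps in the linked tutorial, cloning can be scripted with `VBoxManage clonevm`. This is a minimal sketch, not the tutorial's method: the VM names `node1`, `node2` and `node3` are hypothetical, and the source VM should be powered off first. The script only prints one command per new node.

```shell
#!/bin/sh
# Print one clone command per new node.
# Usage: clone_commands SOURCE_VM NEW_VM...
clone_commands() {
    src=$1
    shift
    for name in "$@"; do
        # --register adds the clone to the VirtualBox VM list.
        echo "VBoxManage clonevm $src --name $name --register"
    done
}

# Hypothetical names: node1 is the prepared first node.
clone_commands node1 node2 node3
```

Once the printed commands look right, they can be pasted into a terminal or piped to `sh`.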


Preparing the Cluster – Cloning, the Linux side

● Turning the new node on.

● Network configuration

– sudo nano /etc/hosts

– sudo nano /etc/hostname

– sudo nano /etc/network/interfaces

– sudo rm /etc/udev/rules.d/70-persistent-net.rules

● sudo reboot
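
What goes into those files depends on your network, so the following is a sketch under stated assumptions: node names `node1` and `node2`, addresses in 192.168.56.0/24 (VirtualBox's default Host-Only range), `eth0` as the NAT adapter and `eth1` as the Host-Only adapter. It writes drafts into a staging directory instead of touching /etc. The udev rules file is removed because the clone's adapters get new MAC addresses while the file still binds eth0 and eth1 to the old ones; Ubuntu regenerates it on the next boot.

```shell
#!/bin/sh
# Write draft versions of the three files for one node into a staging directory.
# Usage: stage_node_config DIR HOSTNAME NODE_IP
stage_node_config() {
    dir=$1
    name=$2
    ip=$3
    mkdir -p "$dir"

    # /etc/hostname holds just the node's own name.
    echo "$name" > "$dir/hostname"

    # /etc/hosts maps every cluster node to its Host-Only address
    # (hypothetical names and addresses; extend the list as the cluster grows).
    cat > "$dir/hosts" <<EOF
127.0.0.1       localhost
192.168.56.101  node1
192.168.56.102  node2
EOF

    # /etc/network/interfaces: eth0 is the NAT adapter (DHCP),
    # eth1 the Host-Only adapter (static address).
    cat > "$dir/interfaces" <<EOF
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

auto eth1
iface eth1 inet static
    address $ip
    netmask 255.255.255.0
EOF
}

stage_node_config ./staging-node2 node2 192.168.56.102
```

After reviewing the drafts, copy them over the real files with sudo and reboot as the slide says.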


Preparing the Cluster – No Password Login

● Do this:

– ssh-keygen

– ssh-copy-id -i ~/.ssh/id_rsa.pub user@host

● Or this:

– ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

– scp ~/.ssh/id_rsa.pub user@host:~/master_key

– ssh user@host

– cat ~/master_key >> ~/.ssh/authorized_keys
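
The manual route has two common pitfalls: appending the same key twice, and leaving permissions that sshd rejects. With its default StrictModes setting, sshd ignores authorized_keys when the file or the ~/.ssh directory is writable by other users. The helper below is a sketch for the last step on the remote node; the function name `install_key` is made up for illustration, and `~/master_key` is the file name used in the slide.

```shell
#!/bin/sh
# Append a public key to authorized_keys once, with the permissions sshd expects.
# Usage: install_key PUBKEY_FILE SSH_DIR
install_key() {
    pubkey_file=$1
    ssh_dir=$2
    auth_file="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth_file"

    # Skip the append if this exact key line is already present.
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}

# On the remote node, after the key has been copied there as ~/master_key:
if [ -f "$HOME/master_key" ]; then
    install_key "$HOME/master_key" "$HOME/.ssh"
fi
```

Running it a second time changes nothing, so it is safe to repeat on every clone.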


Preparing the Cluster – Distributed Shell

● Do it like a Commander

– Installing DSH (Optional)

http://www.tecmint.com/using-dsh-distributed-shell-to-run-linux-commands-across-multiple-machines/
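
As a rough illustration of what DSH usage can look like: dsh reads host groups from files under `~/.dsh/group/`, one host per line. The sketch below assumes the Debian/Ubuntu `dsh` package, a group called `cluster` and nodes named `node1` to `node3`; all three are assumptions. It only touches the real configuration when dsh is actually installed.

```shell
#!/bin/sh
# Write a dsh group file with one host per line.
# Usage: write_dsh_group DSH_DIR GROUP HOST...
write_dsh_group() {
    dir=$1
    group=$2
    shift 2
    mkdir -p "$dir/group"
    printf '%s\n' "$@" > "$dir/group/$group"
}

# Only create the group and call dsh when dsh is installed.
if command -v dsh >/dev/null 2>&1; then
    # Hypothetical node names; they must resolve via /etc/hosts.
    write_dsh_group "$HOME/.dsh" cluster node1 node2 node3
    # -g: group to use, -M: prefix output with the machine name,
    # -c: run on all machines concurrently, -r ssh: use ssh as the remote shell.
    dsh -g cluster -M -c -r ssh -- uptime
fi
```

Combined with passwordless login, this runs one command on every node without prompting for passwords.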


Preparing the Cluster – Enjoy it

● To scale your cluster just repeat the cloning step.


Next?

● An introduction to distributed log management and analytical search engines.

– How does Elasticsearch work?

– Workshop.

● An introduction to Apache Hadoop.

– How does Apache Hadoop work?

– Workshop.

http://www.slideshare.net/AmirSedighi/an-introduction-to-elasticsearch


http://www.slideshare.net/AmirSedighi/an-introduction-to-bigdata-processing-applying-hadoop
