Savanna - Elastic Hadoop on OpenStack

Post on 26-Jan-2015

Category:

Technology

DESCRIPTION

Slide deck for the talk at local meetup.

TRANSCRIPT

Savanna - Hadoop on OpenStack

Mirantis, 2013

Sergey Lukjanov

Savanna Technical Lead

Agenda

● Savanna Overview
● Savanna Use Cases
● Roadmap & Current Status
● Architecture & Features Overview
● Hadoop vs. Virtualization

Savanna - Elastic Hadoop on OpenStack

● Open source native OpenStack component
● Supports different Hadoop distributions
● Covers both the bare cluster provisioning use case and "analytics as a service"
● Managed through REST API
● Web UI as part of the OpenStack Dashboard
● Flexible templates of Hadoop configurations

Savanna - Elastic Hadoop on OpenStack

● Project home - https://launchpad.net/savanna
  ○ bug tracking
  ○ blueprints
  ○ answers
● Code review (gerrit) - https://review.openstack.org
● Sources - https://github.com/stackforge/savanna
● Mailing list - savanna-all@lists.launchpad.net
● CI - https://jenkins.openstack.org and http://jenkins.savanna.mirantis.com

Savanna - Participants

● Contributors:
  ○ large core team from Mirantis
  ○ teams from Red Hat, Hortonworks
  ○ several minor contributors
● Intel joined recently
● Several upcoming customers

Savanna Use Cases

● Administrators - centralized cluster management and monitoring
● Dev and QA teams - fast cluster provisioning
● Data Scientists/Analysts - API to run analytics jobs, with infrastructure provisioning happening under the hood
● Making resources dedicated to the IaaS cloud available for Hadoop workloads

Administrators Use Case

● Central point of control over infrastructure
● Enables self-service capabilities, including the choice of Hadoop distribution to use
● Integration with vendor tooling:
  ○ Ambari for Apache/Hortonworks
  ○ Cloudera Management Console
  ○ Intel Hadoop
● Utilization of free IaaS capacity for Hadoop tasks

Dev and QA Use Cases

● Fast on-demand provisioning of environments
● Increased agility and speed of innovation
● Controlled access to data from production

Analytics Use Cases

● Simplified task execution: the complexity of provisioning and managing a cluster is hidden under the hood
  ○ Access to higher-level interfaces (e.g. Pig, Hive)
● Bursty workloads: ad-hoc queries requiring significant resources only for a short period
● Utilization of free IaaS capacity for Hadoop tasks

Roadmap for Hadoop in Cloud

Phase 1: Basic cluster provisioning of Apache Hadoop

Phase 2: Cluster operation support and integration with tooling, advanced configuration (HDFS, Swift, etc.)

Phase 3: "Analytics as a service": job execution framework, support for different scripting languages, deeper integration with OS

Phase 1 - Basic Cluster Operation

● Cluster provisioning
● Deployment Engine implementation for pre-installed images
● Templates for Hadoop cluster configuration
● REST API for cluster startup and operations
● Web UI integrated into OpenStack Dashboard

Phase 2 - Advanced Configuration

● Hadoop cluster configuration support:
  ○ Solutions for HDFS data reliability issue
  ○ Configurable DN storage location
  ○ Configurable topology of DN, NN, TT, JT
  ○ Add/remove nodes
  ○ More Hadoop parameters
● Integration with vendor deployment/management tooling
● Basic monitoring support

Phase 3 - Analytics as a Service

● API to execute Map/Reduce jobs without exposing details of underlying infrastructure (similar to AWS EMR)

● User-friendly UI for ad-hoc analytics queries based on Hive or Pig

Roadmap for Hadoop in Cloud

Phase 1 [Released - April 10]: Basic cluster provisioning of Apache Hadoop

Phase 2 [In progress - July 15]: Cluster operation support and integration with tooling, advanced configuration (HDFS, Swift, etc.)

Phase 3 [Planned - October 15]: "Analytics as a service": job execution framework, support for different scripting languages, deeper integration with OS

Further Roadmap

● Autoscaling
● HA for NameNode
● Deeper HDFS and Swift integration
  ○ Caching of Swift data on HDFS
● Integration with logging and error handling
● HBase support

Architecture Overview

[Diagram. Component labels: Horizon, Savanna Pages, Savanna Python Client, REST API, Auth, Cluster Configuration Manager, DAL, Provisioning Plugin, Instance Interop Helper, Image Registry, Keystone, Nova, Glance, Swift, and four Hadoop VMs.]

Hadoop vs. Virtualization

● HDFS Reliability
● Data Persistence
● I/O Performance
● etc.

HDFS Reliability: the issue

[Diagram: two Compute hosts, each running three DataNode (DN) VMs, and a Data Block.]
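The diagram itself does not survive in this transcript. Assuming the issue it illustrates is that several DataNode VMs can share one physical compute host, so that all replicas of a block may end up on the same hardware, the following minimal sketch shows the failure mode. The host names, VM names, and helper functions are hypothetical and are not part of Savanna or Hadoop.

```python
# Hypothetical mapping of DataNode VMs to the physical compute hosts they run on.
DN_TO_HOST = {
    "dn1": "compute-1", "dn2": "compute-1", "dn3": "compute-1",
    "dn4": "compute-2", "dn5": "compute-2", "dn6": "compute-2",
}


def hosts_for_block(replica_dns, dn_to_host=DN_TO_HOST):
    """Return the set of physical hosts that hold a replica of one block."""
    return {dn_to_host[dn] for dn in replica_dns}


def survives_host_failure(replica_dns, failed_host, dn_to_host=DN_TO_HOST):
    """True if at least one replica lives outside the failed host."""
    return any(dn_to_host[dn] != failed_host for dn in replica_dns)


# Three replicas on three different VMs, but all on one physical host.
bad_placement = ["dn1", "dn2", "dn3"]
# Three replicas spread over both physical hosts.
good_placement = ["dn1", "dn2", "dn4"]

print(hosts_for_block(bad_placement))
print(survives_host_failure(bad_placement, "compute-1"))
print(survives_host_failure(good_placement, "compute-1"))
```

From the point of view of HDFS, both placements look like three replicas on three nodes; only the VM-to-host mapping reveals that the first one is lost when a single compute host fails.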

HDFS Reliability: single DN per host

[Diagram: Compute hosts, each running a single DataNode (DN) VM, one of them combined as TT | DN. Other labels: Cluster A, Cluster B.]

HDFS Reliability: Hadoop-8468, hypervisor-awareness for HDFS scheduler

[Diagram: three Compute hosts running DataNode (DN) VMs, and an HDFS Data Block.]
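As an illustration of what hypervisor awareness could mean for replica placement, here is a minimal sketch that prefers one replica per physical host and reuses a host only when there are fewer hosts than replicas. It is not the algorithm from Hadoop-8468; the function name, the VM names, and the host names are assumptions made for this example.

```python
def place_replicas(dn_to_host, replication=3):
    """Pick DataNodes for one block, preferring physical hosts not used yet."""
    chosen, used_hosts = [], set()
    # First pass: at most one replica per physical host.
    for dn, host in sorted(dn_to_host.items()):
        if len(chosen) == replication:
            break
        if host not in used_hosts:
            chosen.append(dn)
            used_hosts.add(host)
    # Second pass: if there are fewer hosts than replicas, reuse hosts.
    for dn in sorted(dn_to_host):
        if len(chosen) == replication:
            break
        if dn not in chosen:
            chosen.append(dn)
    return chosen


# Hypothetical cluster: five DataNode VMs on three compute hosts.
three_hosts = {
    "dn1": "compute-1", "dn2": "compute-1",
    "dn3": "compute-2", "dn4": "compute-2",
    "dn5": "compute-3",
}
print(place_replicas(three_hosts))
```

With the mapping above, each of the three replicas lands on a different compute host, so the loss of any single host leaves two copies of the block.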

HDFS Reliability: Hadoop-8545, enables Swift for Hadoop

[Diagram labels: Swift, initial input, Hadoop Job #1, HDFS, Hadoop Job #2, ..., Hadoop Job #N, final output.]

Configurable topology of DN, NN, TT, JT

● Master node(s)
● Worker nodes

[Diagram labels: JT | NN, JT, NN, TT, TT | DN, DN, and the node counts 10, 6, 8.]
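One way to think about a configurable topology is as a list of node groups, each naming the Hadoop processes it runs and how many nodes it has. The sketch below checks such a description. The tuple layout and the rule of exactly one NameNode and one JobTracker are assumptions made for this example (NameNode HA is listed under the further roadmap); this is not Savanna's template format.

```python
KNOWN_PROCESSES = ("NN", "JT", "DN", "TT")


def count_processes(node_groups):
    """Sum up how many instances of each process a topology would start."""
    totals = dict.fromkeys(KNOWN_PROCESSES, 0)
    for processes, count in node_groups:
        for name in processes:
            if name not in totals:
                raise ValueError(f"unknown process: {name}")
            totals[name] += count
    return totals


def validate_topology(node_groups):
    """Check the assumed rules and return the per-process totals."""
    totals = count_processes(node_groups)
    if totals["NN"] != 1 or totals["JT"] != 1:
        raise ValueError("expected exactly one NameNode and one JobTracker")
    if totals["DN"] < 1 or totals["TT"] < 1:
        raise ValueError("expected at least one DataNode and one TaskTracker")
    return totals


# Hypothetical topologies: masters combined or split, workers combined or split.
combined = [(("JT", "NN"), 1), (("TT", "DN"), 4)]
split = [(("JT",), 1), (("NN",), 1), (("TT",), 2), (("DN",), 3)]

print(validate_topology(combined))
print(validate_topology(split))
```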

HDFS Placement Options

● Ephemeral drive: /var/lib/nova/instances/instance-xxx/disk -> /mnt/ephemeral
● Block storage volume: Cinder Volume -> /mnt/volume
● Bare hard drive support: /dev/sdb -> /mnt/sdb
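The three options above differ only in what backs the mount point inside the VM. As a rough sketch, the chosen mount points could be turned into a value for `dfs.data.dir`, the Hadoop 1.x property that lists DataNode storage directories. The dictionary keys, the `hdfs/data` subdirectory, and the helper name are assumptions made for this example.

```python
# Mount points taken from the slide; the keys are hypothetical labels.
MOUNT_POINTS = {
    "ephemeral": "/mnt/ephemeral",
    "volume": "/mnt/volume",
    "bare": "/mnt/sdb",
}


def dfs_data_dir(storage_types, subdir="hdfs/data"):
    """Build a comma-separated dfs.data.dir value for the chosen storage types."""
    paths = [f"{MOUNT_POINTS[kind]}/{subdir}" for kind in storage_types]
    return ",".join(paths)


print(dfs_data_dir(["ephemeral", "volume"]))
```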

Q&A

We are hiring!

Phase 1 deployment mechanism

[Diagram: Savanna and four Hadoop VMs. Step labels: "Provision VMs with pre-installed Hadoop" and "Configure Hadoop Cluster".]

Tool usage scenarios

[Diagram with two scenarios. Scenario I: the tool manages a Hadoop cluster on Hadoop VMs ("Manage Hadoop Cluster"). Scenario II: the tool provisions and manages a Hadoop cluster on plain VMs ("Provision & Manage Hadoop Cluster").]

Extensible Provisioning

[Diagram: a Plugin box and a Savanna box.]

Plugin
● get extra configs
● validate input
● launch/terminate cluster
● add/remove nodes

Instance Interop
● launch/terminate VMs
● get VM status
● ssh/scp to VM

Image registry
● register image in Savanna
● add/remove tags
● get image by tag
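The plugin operations listed on this slide can be read as a small interface that each provisioning plugin implements. The sketch below expresses that idea as a Python abstract base class with a toy in-memory implementation. Every class name, method name, config key, and dictionary field here is hypothetical and chosen only to mirror the slide; it is not Savanna's actual plugin API.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical plugin contract mirroring the operations on the slide."""

    @abstractmethod
    def get_extra_configs(self):
        """Return plugin-specific config names with default values."""

    @abstractmethod
    def validate(self, cluster):
        """Raise ValueError if the requested cluster is not acceptable."""

    @abstractmethod
    def launch_cluster(self, cluster):
        """Start the cluster and return its updated description."""

    @abstractmethod
    def terminate_cluster(self, cluster):
        """Stop the cluster and return its updated description."""

    @abstractmethod
    def resize(self, cluster, delta):
        """Add (delta > 0) or remove (delta < 0) nodes."""


class InMemoryPlugin(ProvisioningPlugin):
    """Toy implementation that only updates a dictionary."""

    def get_extra_configs(self):
        return {"replication": 3}  # hypothetical config key

    def validate(self, cluster):
        if cluster["nodes"] < 1:
            raise ValueError("a cluster needs at least one node")

    def launch_cluster(self, cluster):
        self.validate(cluster)
        cluster["status"] = "active"
        return cluster

    def terminate_cluster(self, cluster):
        cluster["status"] = "terminated"
        return cluster

    def resize(self, cluster, delta):
        # Validate a copy first so a rejected resize leaves the cluster as is.
        resized = dict(cluster, nodes=cluster["nodes"] + delta)
        self.validate(resized)
        cluster["nodes"] = resized["nodes"]
        return cluster


plugin = InMemoryPlugin()
cluster = plugin.launch_cluster({"nodes": 3})
plugin.resize(cluster, 2)
print(cluster)
```

Keeping validation separate from launch and resize lets the caller check user input before any VM is started.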

Provisioning Interaction

[Sequence diagram between User, Savanna, and Plugin. Message labels: get extra parameters; get extra parameters for the plugin; launch cluster; validate cluster parameters; add/remove nodes.]

Provisioning: Launching a Cluster

[Diagram: the plugin gets an image by tag from the Image Registry, launches VMs through the Instance Interop Helper, then installs and configures Hadoop on four Hadoop VMs by passing commands via ssh and scp.]
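The launch flow on this slide can be sketched as three steps: look up an image by tag, launch VMs from it, then push configuration commands to each VM. The version below uses in-memory stand-ins so that it runs without OpenStack. All class names, method names, the image id, the tag, and the `configure-hadoop` command are hypothetical placeholders, not Savanna code.

```python
class ImageRegistry:
    """In-memory stand-in for the image registry (register, get by tag)."""

    def __init__(self):
        self._tags = {}

    def register(self, image_id, tags):
        self._tags[image_id] = set(tags)

    def get_image_by_tag(self, tag):
        for image_id, tags in sorted(self._tags.items()):
            if tag in tags:
                return image_id
        raise LookupError(f"no image tagged {tag!r}")


class InteropHelper:
    """In-memory stand-in that records what would be sent to the VMs."""

    def __init__(self):
        self.log = []

    def launch_vms(self, image_id, count):
        self.log.append(("launch", image_id, count))
        return [f"vm-{i}" for i in range(count)]

    def run(self, vm, command):
        # A real helper would execute the command over ssh.
        self.log.append(("ssh", vm, command))


def launch_cluster(registry, helper, tag, count):
    """Get image by tag, launch VMs, then configure Hadoop on each VM."""
    image_id = registry.get_image_by_tag(tag)
    vms = helper.launch_vms(image_id, count)
    for vm in vms:
        helper.run(vm, "configure-hadoop")  # placeholder command
    return vms


registry = ImageRegistry()
registry.register("image-1", ["hadoop"])  # hypothetical image id and tag
helper = InteropHelper()
print(launch_cluster(registry, helper, "hadoop", 2))
print(helper.log)
```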
