Janos Matyas / CTO / SequenceIQ Inc.
OVERVIEW
GOAL / MOTIVATION
TECHNOLOGY STACK
PROBLEM RESOLUTION / HOW IT WORKS
RESULTS / ACHIEVEMENTS
GOAL / MOTIVATION
Ease Hadoop provisioning – everywhere
Automate and unify the process
Arbitrary cluster size
Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
(Auto) scaling Hadoop
QoS
OUR APPROACH
Use Docker
Build cloud-specific ‘Dockerized’ images
Provision the cluster
Use Ambari
DOCKER
Lightweight, portable
Build once, run anywhere
VM – without the overhead of a VM
Isolated containers
Automated and scripted
DOCKER – CONTAINERS vs. VMs
Containers are isolated, but share OS and, where appropriate, bins/libraries
APACHE AMBARI – ARCHITECTURE
Easy Hadoop cluster provisioning
Management and monitoring
Key features – blueprints
REST API
APACHE AMBARI – CREATE CLUSTER
Define a blueprint (POST /api/v1/blueprints)
Create cluster (POST /api/v1/clusters/mycluster)
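The two calls above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the slide: the host, the default port 8080, the admin/admin credentials, the blueprint name `single-node`, the HDP 2.1 stack and the component list are all placeholders. The blueprint name is appended to the blueprints path, and Ambari expects an `X-Requested-By` header on modifying requests. The requests are built but not sent.

```python
import base64
import json
import urllib.request

AMBARI_URL = "http://ambari.example.com:8080"  # assumption: placeholder host, default port
BLUEPRINT_NAME = "single-node"                 # assumption: illustrative blueprint name


def ambari_post(path, payload, user="admin", password="admin"):
    """Build (but do not send) a POST request against the Ambari REST API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=AMBARI_URL + path,
        data=json.dumps(payload).encode(),
        method="POST",
        headers={
            "Authorization": "Basic " + token,
            "X-Requested-By": "ambari",  # required by Ambari for modifying calls
        },
    )


# Step 1: the blueprint describes host groups and the components they run.
blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.1"},  # assumption
    "host_groups": [
        {
            "name": "host_group_1",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
        }
    ],
}

# Step 2: the cluster maps real hosts onto the blueprint's host groups.
cluster = {
    "blueprint": BLUEPRINT_NAME,
    "host_groups": [
        {"name": "host_group_1", "hosts": [{"fqdn": "node1.example.com"}]}
    ],
}

requests_to_send = [
    ambari_post("/api/v1/blueprints/" + BLUEPRINT_NAME, blueprint),
    ambari_post("/api/v1/clusters/mycluster", cluster),
]
```

Passing each request to `urllib.request.urlopen` would submit it; the order matters, because the cluster request refers to the blueprint by name.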
HADOOP PROVISIONING ISSUES
Each cloud provider has a proprietary API
Create images for each provider
Network configuration
Service discovery
Resize, failover, member join support
OUR APPROACH – DETAILS
Build your Docker image
Install or pre-install Hadoop services with Ambari
Install Serf and dnsmasq
Build your cloud image
Use Ansible to create an image
Provision the cluster
BUILD DOCKER IMAGES
Create the Dockerfile
Have Docker.io build the image
Optionally pre-install services
Use Ambari
Push image to Docker.io
Licensing questions
BUILD CLOUD IMAGES
Use a Docker-ready base image
Use Ansible to provision the image template
Pull the Docker images
Apply custom infrastructure
Use cloud-provider-specific playbooks
AWS EC2
Azure
ANSIBLE
Configuration as data
Simplest way to automate IT
Secure and agentless
Goal oriented
One playbook – multiple modules
We use it to “burn” cloud images/templates
PROVISIONING – ISSUES
FQDN
/etc/hosts is read-only in Docker
Everybody needs to know everybody
DNS
Single point of failure
Dynamic cluster – nodes joining, leaving, failing
Routing
Cloud – ability to route between containers on different hosts
Collision-free private IP range for Docker bridge
We need predefined host names/IP addresses
Start a DNS server
Use it as a reference: docker run --dns <IP_OF_DNS>
Nodes need to know each other
PROVISIONING – SOLUTION
FQDN
Use -h and --dns Docker params
DNS
dnsmasq is running on each Docker container
Serf member-xxx events trigger dnsmasq reconfiguration
Routing
Docker bridge configuration – follows a convention
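A minimal sketch of the FQDN and routing points above. The deck does not spell out its bridge convention, so the one used here (one /24 per host, carved out of a single private /16) is an assumption, as are the pool `172.18.0.0/16`, the image name and the host names. Only the `-h` and `--dns` flags come from the slide.

```python
import ipaddress

# Assumption: every host's Docker bridge gets its own /24 from one private /16,
# so container addresses cannot collide across hosts.
BRIDGE_POOL = ipaddress.ip_network("172.18.0.0/16")


def bridge_subnet(host_index):
    """Return the /24 reserved for the Docker bridge on host number host_index."""
    subnets = list(BRIDGE_POOL.subnets(new_prefix=24))
    return subnets[host_index]


def docker_run_args(image, hostname, dns_ip):
    """Build the argument list for a detached container with a fixed FQDN and DNS server."""
    return ["docker", "run", "-d", "-h", hostname, "--dns", dns_ip, image]


if __name__ == "__main__":
    print(bridge_subnet(3))
    print(" ".join(docker_run_args("example/hadoop-node", "node1.example.com", "172.18.3.2")))
```

Deriving the subnet from a host index keeps the scheme deterministic: any node can compute any other node's range without a central registry.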
SERF
Gossip based membership
Service discovery
Decentralized
Lightweight, fault tolerant
Highly available
DevOps friendly
Keep an eye on Consul, Open vSwitch, pipework
SERF – DECENTRALIZED SERVICE DISCOVERY
Gossip instead of heartbeat
LAN, WAN profiles
Provides membership information
Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
Query
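How a membership event could drive a dnsmasq update, as a hedged sketch. Serf passes the event name to a handler in the `SERF_EVENT` environment variable and writes the affected members to its stdin, one per line, tab-separated as name, address, role, tags. The function names and the hosts-file rendering below are illustrative assumptions, not the deck's actual handler.

```python
ADD_EVENTS = {"member-join", "member-update"}
REMOVE_EVENTS = {"member-leave", "member-failed", "member-reap"}


def apply_event(hosts, event, payload):
    """Return a new name -> address map after applying one Serf member event.

    payload is the text Serf writes to the handler's stdin: one member per
    line, tab-separated as name, address, role, tags.
    """
    updated = dict(hosts)
    for line in payload.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, address = fields[0], fields[1]
        if event in ADD_EVENTS:
            updated[name] = address
        elif event in REMOVE_EVENTS:
            updated.pop(name, None)
    return updated


def render_hosts(hosts):
    """Render the map in hosts-file format, one 'address name' pair per line."""
    return "".join(f"{addr} {name}\n" for name, addr in sorted(hosts.items()))
```

A real handler would read `os.environ["SERF_EVENT"]` and `sys.stdin`, write the rendered text to a file that dnsmasq loads through its `addn-hosts` option, and then signal dnsmasq to re-read it.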
SERF – GOSSIPING
SERF – MEMBERSHIP, EVENT HANDLERS
DNSMASQ
Network infrastructure for small networks
Lightweight DNS, DHCP server
Comes with most Linux distributions
AWS EC2 – HADOOP CLUSTER
Use EC2 REST API to provision instances (from Dockerized image)
Start Docker containers
One Ambari server
N-1 Ambari agents connecting to server
Connect ambari-shell to
Define blueprint
Provision the cluster
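The container layout above (one Ambari server, N-1 agents pointing at it) can be sketched as a small planning function. The host naming scheme and the `example.com` domain are assumptions made for illustration.

```python
def plan_containers(node_count, domain="example.com"):
    """Assign one Ambari server and node_count - 1 agents that connect to it."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    server = f"ambari-server.{domain}"
    plan = [{"hostname": server, "role": "ambari-server"}]
    for i in range(1, node_count):
        plan.append(
            {
                "hostname": f"ambari-agent{i}.{domain}",
                "role": "ambari-agent",
                "server": server,  # every agent registers with the single server
            }
        )
    return plan
```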
AWS EC2 – NETWORK SECURITY
Create a VPC
Configure subnets
Routing tables
Security gateway
Set ACL
Configure VPN
AWS EC2 – CLOUDFORMATION
Setting up a VPC manually is too complicated
Use CloudFormation
Manage the stack together
Template-based
Environments under version control
Customizable at runtime
No extra charge
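The template excerpt on this slide constrains two of its parameters: `SSHLocation` (length 9 to 18, matching a CIDR pattern) and `SecondaryIPAddressCount` (1 to 5). A rough Python equivalent of those checks is sketched here for local validation before a stack is submitted. The function names are hypothetical, and the sketch assumes the pattern has to match the whole value.

```python
import re

# Same expression as the template's AllowedPattern, without the JSON escaping.
SSH_LOCATION_PATTERN = re.compile(
    r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})"
)


def valid_ssh_location(value):
    """Check MinLength 9, MaxLength 18 and the CIDR-shaped pattern."""
    return 9 <= len(value) <= 18 and SSH_LOCATION_PATTERN.fullmatch(value) is not None


def valid_secondary_ip_count(value):
    """Check MinValue 1 and MaxValue 5."""
    return 1 <= value <= 5
```

The pattern only checks the shape of the value, so `999.0.0.0/0` passes; a stricter check could use the standard `ipaddress` module.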
"VpcId" : { "Type" : "String", "Description" : "VpcId of your existing Virtual Private Cloud (VPC)" },
"SubnetId" : { "Type" : "String", "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)" },
"SecondaryIPAddressCount" : { "Type" : "Number", "Default" : "1", "MinValue" : "1", "MaxValue" : "5", "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)", "ConstraintDescription": "must be a number from 1 to 5." },
"SSHLocation" : { "Description" : "The IP address range that can be used to SSH to the EC2 instances", "Type": "String", "MinLength": "9", "MaxLength": "18", "Default": "0.0.0.0/0", "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})", "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x." } },
"Mappings" : { "RegionMap" : { "us-east-1" : { "AMI" : "ami-7f418316" },
CLOUDBREAK
Cloudbreak is a powerful left surf break over a coral reef, a mile southwest of the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts provisioning and eases the management and monitoring of on-demand clusters.
Provisioning Hadoop has never been easier
CLOUDBREAK
Benefits
Elastic
Scalable
Blueprints
Flexible
Main REST resources
/template – specifies a cluster infrastructure
/stack – creates a cloud infrastructure built from a template
/blueprint – describes a Hadoop cluster
/cluster – creates a Hadoop cluster
RESULTS AND ACHIEVEMENTS
Hadoop as a Service API
Available for EC2 and Azure cloud
OpenStack and bare metal are coming soon
Open source under the Apache 2 licence
Same goals as the Apache Ambari Launchpad project
What's next?
HADOOP SERVICES – AS A SERVICE
Leverage YARN
Slider (Hoya) providers
HBase, Accumulo
SequenceIQ providers - Flume, Tomcat
YARN-1964
QoS for YARN – heuristic scheduler
Platform as a Service API
BANZAI PIPELINE
Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful application development platform for building on-demand data and job pipelines running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
THANK YOU
Get the code: https://github.com/sequenceiq
Read about: http://blog.sequenceiq.com
Facebook: http://facebook.com/sequenceiq
Twitter: http://twitter.com/sequenceiq
LinkedIn: http://linkedin.com/sequenceiq
Contact: [email protected]
FEEL FREE TO CONTRIBUTE