Extending DevOps to Big Data Applications with Kubernetes
TRANSCRIPT
EXTENDING DEVOPS TO BIG DATA APPLICATIONS WITH KUBERNETES
ABOUT ME
NICOLA FERRARO
Software Engineer at Red Hat
Working on Apache Camel, Fabric8, JBoss Fuse, Fuse Integration Services for OpenShift
Previously Solution Architect for Big Data systems at Engineering (IT)
@ni_ferraro
SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and OpenShift Origin
Demo: Oshinko, Kafka, Spring Boot and Fabric8 tools
BIG DATA
Big Data ≈ Machine Learning at Scale
Big Data systems are capable of handling data with high:
Volume
Terabytes, petabytes, ...
Log files
Transactions
Velocity
Streaming data
Micro batching
Near real-time
Variety
Structured data
Images
Videos
Free text
[Diagram: Venn of Volume, Velocity and Variety; Big Data sits at the intersection ("Here"), typically on more than 100 machines]
EVOLUTION OF BIG DATA SOFTWARE
http://www.slideshare.net/nicolaferraro/a-brief-history-of-big-data-48525942
Hadoop → Pig, Hive → (a whole zoo) → Spark
EVOLUTION OF BIG DATA INFRASTRUCTURE
Commodity hardware (Really ?) → Specialized hardware (Appliances) → The Cloud
EVOLUTION OF ARCHITECTURES: BATCH
The first Big Data architecture was heavily based on Hadoop: MapReduce + HDFS.
Some commercial advertising motivations:
Extract meaningful information from raw data: "companies analyze around <<put-a-random-number-between-0-and-20-here>> % of the data they produce"
More data = more precision for machine learning algorithms = better insights
Assist the decision making process: predict the future using all the available data
SELECT page, count(*) FROM logs
ARCHITECTURES: BATCH
[Diagram: Client → App Server → Logs + Data (Flume) → Hadoop (HDFS + MapReduce) nodes → Batch Processing (Hive / MR) → Report (DWH) → Client]
EVOLUTION OF ARCHITECTURES: HYBRID
Second generation architectures were focused on improving user experience through machine learning:
Not everything could be executed in streaming, for performance reasons
Execute heavyweight steps offline in batches (generate the "view", e.g. a machine learning model)
Execute lightweight steps online with streaming applications
The view must be refreshed periodically
ARCHITECTURES: HYBRID (LAMBDA)
[Diagram: Client → App Server → NoSQL + Messaging + Other → Batch Processing and Streaming Processing on the Hadoop nodes]
EVOLUTION OF ARCHITECTURES: STREAMING
The new generation of architectures is streaming-only:
Provide immediate feedback to the user
No need to execute offline work
Provide a personalized experience to each user
Here we focus (mainly) on streaming applications ...
ARCHITECTURES: STREAMING (KAPPA)
[Diagram: Client → App Server → NoSQL + Messaging + Other → Streaming Processing on the Hadoop nodes]
APACHE SPARK
One of the main reasons for Spark's success is that it allows you to define, by design, with a unified model (a sketch follows the list):
Batch processing and streaming (MapReduce / Storm)
Multi-language: core in Scala, can also be used in Java, Python and R
Declarative semantics (SQL), procedural and functional programming (Hive / MapReduce / Pig)
General purpose data processing and machine learning (MapReduce / Mahout)
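For example, the declarative and programmatic styles coexist in the same program. A minimal sketch with the Spark 2.x Java API, reusing the earlier log query (the input file name is hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LogQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("log-query").getOrCreate();
        // Load structured data and expose it as a SQL view
        Dataset<Row> logs = spark.read().json("logs.json"); // hypothetical input
        logs.createOrReplaceTempView("logs");
        // Declarative semantics: same query as the earlier SQL example
        spark.sql("SELECT page, count(*) FROM logs GROUP BY page").show();
        spark.stop();
    }
}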
[Chart: logistic regression runtime in Spark vs. Hadoop (it was so in 2013)]
And the performance is really impressive!
SPARK ARCHITECTURE
[Diagram: logical and physical view. Each application has its own Driver; a second app gets a second Driver.]
Driver App (main): defines the app workflow
Cluster Manager: assigns executors to the driver app
Workers: each single Worker hosts Executors, which do the dirty job
Tasks: "do something on a data partition"
SPARK ARCHITECTURE: DATA DISTRIBUTION
[Diagram: a Spark Driver coordinating Spark Executors; the executors read/write HDFS / S3 and consume Kafka partitions]
Why distribution? ... to spread the work across thousands of workers
SPARK DEMO
A simple quickstart showing a batchapplication doing a word count.
The purpose is to show which parts of the software are executed in the driver and which parts are sent to the remote executors.
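A minimal sketch of such a word count (Spark 2.x Java API; the input path is hypothetical), with comments marking which parts run where:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // This code runs in the driver: it only defines the workflow
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> counts = sc
                .textFile("hdfs:///data/input.txt") // hypothetical path
                // The lambdas below are serialized and sent to the remote executors
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
            // collect() pulls the result back to the driver:
            // fine in a demo, dangerous with big data (next slide)
            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}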
SPARK: WHAT CAN GO WRONG ???
Programming with Spark is awesome, but there are a lot of problems you need to solve:
Most of the problems do not happen in a standalone cluster (serialization issues, classloading issues, network issues, ...)
Something that works well with a small dataset hangs forever in production: you must test it earlier with a lot of data!
Spark applications must be tuned: decide caching steps, decide partitioning, reduce the number of shuffles, do not "collect()"
For streaming apps:
Exactly-once message semantics cannot be guaranteed at all times: you have to write idempotent operations to take duplicate messages into account (a sketch follows)
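A minimal sketch of the idempotency idea, using an in-memory map in place of a real store (names hypothetical; in practice this would be an upsert keyed by message id in a database):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentStore {
    private final Map<String, Double> ratingsByMessageId = new ConcurrentHashMap<>();

    // A re-delivered message overwrites the same entry
    // instead of being counted twice
    public void upsert(String messageId, double rating) {
        ratingsByMessageId.put(messageId, rating);
    }
}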
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.map(RDD.scala:323)
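The trace above is the classic "Task not serializable" failure: a lambda shipped to the executors captures something that cannot be serialized, typically the enclosing object. A minimal sketch of the pattern and a common fix (class and field names are hypothetical):

import org.apache.spark.api.java.JavaRDD;

public class Processor { // not Serializable
    private final String prefix = ">> ";

    public JavaRDD<String> broken(JavaRDD<String> lines) {
        // Referencing the field drags `this` into the closure:
        // Spark tries to serialize the whole Processor and fails
        return lines.map(line -> prefix + line);
    }

    public JavaRDD<String> fixed(JavaRDD<String> lines) {
        // Copy the field into a local variable,
        // so only the String is captured
        final String p = prefix;
        return lines.map(line -> p + line);
    }
}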
AND SPARK IS THE MINOR PROBLEM
Every operation is executed by different software distributed across hundreds of machines.
Production is different from the development environment: it is difficult to keep software aligned (patches and upgrades of NoSQL databases, message brokers, ...): you need to test the integration with different versions of external systems
The whole system randomly doesn't work: e.g. you expect a message in a queue or you expect a result in a query, and your expectations are not satisfied. Debugging is hard: it's better to fix bugs earlier, but bugs appear late in production!
There are consistency problems in some NoSQL databases: the CAP theorem
Too many steps from development to release, with different teams involved: development, QA, operations. I can continue...
THE 2 MAIN PROBLEMS
1. The local development environment is usually too different from production (replication and versions), so you find bugs late
2. Big Data applications are difficult to test effectively, because configuring a good testing environment is a hard and long task: it's difficult to debug bugs all together
Easy Solution:
Test early
Test more
Test the production software
How? We can apply DevOps practices with Kubernetes ...
SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and OpenShift Origin
Demo: Oshinko, Kafka, Spring Boot and Fabric8 tools
DEVOPS
No common definition
Combination of software development, QA and operations: working together to release working software frequently
Applying common dev and agile techniques to operations
It's a cultural change, rather than just a change in development and administration tools
Common practices:
Infrastructure as code
Continuous integration
Continuous delivery
WHY DEVOPS
Increased communication and collaboration among: dev, ops, QA, business
Fast time to market: accommodating business and customer needs quickly to gain competitive advantage
Embrace change, do not fear it
Embrace innovation (and team engagement), through usage of new technologies
Improved quality of software through automated testing
A DEVOPS WORKFLOW
The key point of DevOps is being able to collaborate on the infrastructure the same way you collaborate on software:
Common shared repository (SCM)
Pull requests
Peer review
Automated checks (correctness, quality)
Continuous integration
Continuous delivery
Test the infrastructure like you test software: scalability, fault tolerance, backup, recovery options.
DevOps practices become difficult to apply and ineffective when your infrastructure does not follow the software release cycle.
DOCKER
Docker has revolutionized the way we build and ship software.
Containers can be created using a Dockerfile (example below)
There are at least 10 different ways to create Docker containers
A container is not a VM (even if it seems really close to it)
A container hosts a single application
Containers need to be orchestrated to deliver their full power
FROM ubuntu:16.04

RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv EA312927
# Register the MongoDB 3.2 apt repository
RUN echo "deb http://repo.mongodb.org/apt/ubuntu $(cat /etc/lsb-release | grep DISTRIB_CODENAME | cut -d= -f2)/mongodb-org/3.2 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-3.2.list
RUN apt-get update && apt-get install -y mongodb-org

# Create the MongoDB data directory
RUN mkdir -p /data/db

# Expose port #27017 from the container to the host
EXPOSE 27017

# Set /usr/bin/mongod as the dockerized entry-point application
ENTRYPOINT ["/usr/bin/mongod"]
DOCKER ARCHITECTURE
Docker is the building block of infrastructure as code:
The image is completely described by the Dockerfile
Any container is stateless (although stateful data can be attached with volumes). This means:
You don't ssh into your server and change the configuration
You rebuild a new version of the application when you need to change something
"Deployment" basically means replacing a container with a different container
KUBERNETES
Cloud platform, with several deployment options (including local and private cloud), to orchestrate (Docker) containers:
Born at Google
Production ready (Pokemon Go!)
Provides:
Application composition in namespaces
Virtual networks
Service discovery
Load balancing
Auto (and manual) scaling
Automatic recovery
Health checking
Rolling upgrades
Monitoring
Logging
...
kubectl run --image=nginx nginx-app --port=80
oc run --image=nginx nginx-app --port=80
# or better..
# oc new-app nginx --name nginx-app2
Both Kubernetes and OpenShift Origin are Open Source projects!
KUBERNETES ARCHITECTURE
Very high level architecture of Kubernetes.
Note: it resembles the Spark architecture seen in previous slides (with the Kubernetes master playing the role of the cluster manager), but we want to deploy a whole Spark application inside Kubernetes (including the driver).
KUBERNETES FOR LOCAL DEV
Minikube
https://github.com/kubernetes/minikube
OPENSHIFT FOR LOCAL DEV
Minishift
"oc cluster up"
https://github.com/minishift/minishift
$ minikube start
Starting local Kubernetes cluster...
Running pre-create checks...
Creating machine...
Starting local Kubernetes cluster...
$ kubectl run hello-minikube --image=gcr.io/google_containers/echoserver:1.4 --port=8080
$ minishift start
...
$ oc cluster up
-- Checking OpenShift client ... OK
-- Checking Docker client ... OK
...
SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and OpenShift Origin
Demo: Oshinko, Kafka, Spring Boot and Fabric8 tools
OPENSHIFT ORIGIN DEMO
A simple recommender system using collaborative filtering on OpenShift Origin.
Deploy into OpenShift Origin:
Spring Boot microservice
Kafka broker
Spark cluster
Source code:
https://github.com/nicolaferraro/voxxed-bigdata-spark
[Diagram: clients send ratings (****) and receive recommendations, backed by the Spark cluster]
SPARK ON KUBERNETES
The problem:
A Spark application needs a cluster of n worker machines to run
You need 1 cluster manager node
The application runs on 1 driver node
We want to deploy everything inside Kubernetes or OpenShift Origin
Basically, every piece of the Spark architecture should become a POD.
There are multiple solutions. I'm going to introduce the Oshinko project.
http://radanalytics.io/
https://github.com/radanalyticsio
Recall the Spark architecture: the Driver, the Cluster Manager and the Workers should all become PODs.
OSHINKO
Making Apache Spark Cloud Native on OpenShift Origin.
[Diagram: two independent Spark clusters, each with a Driver (D), a Cluster Manager (CM) and Workers (W)]
1 cluster per application! Deployed in the application namespace
Oshinko console
DEMO TIME
We will deploy the Oshinko REST server and then deploy the Spark streaming application. It will connect to Kafka to retrieve ratings and return personalized (not really) recommendations (a sketch of the Kafka side follows the sources link).
Sources:
https://github.com/nicolaferraro/voxxed-bigdata-spark
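To give an idea of the streaming side, here is a minimal sketch of reading ratings from Kafka with the spark-streaming-kafka-0-10 integration (broker address, topic and group id are hypothetical, not necessarily what the demo repository uses):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class RatingsStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("ratings-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092"); // hypothetical service name
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "ratings-app"); // hypothetical group id

        JavaInputDStream<ConsumerRecord<String, String>> ratings =
            KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("ratings"), kafkaParams));

        // Each micro-batch of ratings would feed the recommender here
        ratings.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}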
I used a basic implementation of the slope one algorithm (https://en.wikipedia.org/wiki/Slope_One) for collaborative filtering.
It's just an example of a "real life" custom Spark application that sometimes doesn't work :)
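For reference, weighted slope one predicts a rating from the average differences between item pairs. A minimal, non-Spark sketch of the idea (the data layout is hypothetical, not the demo's actual code):

import java.util.Map;

public class SlopeOne {
    // ratings: user -> (item -> rating)
    public static double predict(Map<String, Map<String, Double>> ratings,
                                 String user, String targetItem) {
        double num = 0, den = 0;
        for (Map.Entry<String, Double> rated : ratings.get(user).entrySet()) {
            String item = rated.getKey();
            if (item.equals(targetItem)) continue;
            // Average difference (targetItem - item) over users who rated both
            double diffSum = 0;
            int count = 0;
            for (Map<String, Double> other : ratings.values()) {
                if (other.containsKey(targetItem) && other.containsKey(item)) {
                    diffSum += other.get(targetItem) - other.get(item);
                    count++;
                }
            }
            if (count > 0) {
                // Weight each co-rated pair by how many users support it
                num += (rated.getValue() + diffSum / count) * count;
                den += count;
            }
        }
        return den == 0 ? Double.NaN : num / den;
    }
}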
TESTING
One of the major advantages of making a cloud native Spark application is the possibility of executing system tests on software identical to production.
Production pods
Optional testing tools or load generators
Test driver (inside or outside the namespace)
You can create virtual test environments on the fly and do assertions.
Demo: using fabric8-arquillian and the fabric8 kubernetes client
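As a flavor of what such a test can do, here is a minimal sketch with the Fabric8 Kubernetes Client, listing pods and their phase (the namespace name is hypothetical):

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PodHealthCheck {
    public static void main(String[] args) {
        // Connects using the local kubeconfig or the in-cluster service account
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            for (Pod pod : client.pods().inNamespace("myproject").list().getItems()) {
                System.out.println(pod.getMetadata().getName()
                    + " -> " + pod.getStatus().getPhase());
            }
        }
    }
}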
TESTING: SOME SCENARIOS
Some tests that you may want to run:
Deploy the production infrastructure (or part of it) and run a suite of functional tests to ensure every piece works as expected to produce the result
Deploy the production infrastructure and do performance tests using external or injected data
Deploy the production infrastructure (or part of it) and check that every piece is healthy. Then randomly kill pods while sending a load to test the system availability (system tests)
The Chaos Monkey!
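A minimal sketch of the "randomly kill pods" idea, again with the Fabric8 Kubernetes Client (namespace hypothetical; run it only against a test environment):

import java.util.List;
import java.util.Random;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class TinyChaosMonkey {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            List<Pod> pods = client.pods().inNamespace("myproject").list().getItems();
            if (!pods.isEmpty()) {
                Pod victim = pods.get(new Random().nextInt(pods.size()));
                System.out.println("Killing " + victim.getMetadata().getName());
                // Kubernetes should recreate the pod if a controller manages it
                client.pods().inNamespace("myproject")
                      .withName(victim.getMetadata().getName()).delete();
            }
        }
    }
}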
FABRIC8
We have used some tools from Fabric8:
Fabric8 Maven Plugin: to build Kubernetes resources for Java projects
Docker Maven Plugin: used by the fabric8 maven plugin to create docker images
Fabric8 Arquillian: to create system tests from artifacts using the fabric8 plugin
Fabric8 Kubernetes Client: to interact with Kubernetes resources (pods, services, etc.)
Fabric8 is a complete development platform for Kubernetes!
Uses Jenkins Pipelines to provide:
Continuous integration
Continuous delivery
Out of the box!
https://fabric8.io/
QUESTIONS?
Q/A
What about storage and data locality? In cloud native applications, using a storage system like S3 has many advantages over HDFS (costs, availability, usage patterns in testing). The performance of HDFS is superior (6x), but you can use larger clusters for a limited amount of time to gain the same performance. More info: http://qr.ae/TAF4cN
Can you scale the application directly from OpenShift, or do you always need to change the code? Sure, you can use OpenShift to scale the application. Beware that, in a pure DevOps approach, changing the configuration of a system at runtime is not something you should do, especially in production. You lose the benefits of having all the software and configuration in an "executable" state, and you lose the effectiveness of system tests because tests will not run on an environment identical to production. As an alternative, you can use the auto-scaling feature (enabling it in the code).
Is the Spark demo using a machine learning algorithm? The demo is using a basic collaborative filtering algorithm called "slope one" (the focus of this talk is not machine learning). Spark provides many algorithms out of the box, also for collaborative filtering. Many of them are supposed to be executed in a batch application, rather than in a streaming app.
THANKS
Follow me on Twitter!
@ni_ferraro