Extending DevOps to Big Data Applications with Kubernetes
TRANSCRIPT
EXTENDING DEVOPS TO BIG DATA APPLICATIONS WITH KUBERNETES
ABOUT ME
NICOLA FERRARO
Software Engineer at Red Hat
Working on Apache Camel, Fabric8, JBoss Fuse, Fuse Integration Services for OpenShift
Previously Solution Architect for Big Data systems at Engineering (IT)
@ni_ferraro
SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and OpenShift Origin
Demo: Oshinko, Kafka, Spring Boot and Fabric8 tools
BIG DATA
Big Data ≈ Machine Learning at Scale
Big Data systems are capable of handling data with high:
Volume
Terabytes, petabytes, ...
Log files
Transactions
Velocity
Streaming data
Micro batching
Near real-time
Variety
Structured data
Images
Videos
Free text
[Diagram: Venn of Volume, Velocity and Variety; Big Data sits at the intersection ("Here"), typically on more than 100 machines]
EVOLUTION OF BIG DATA SOFTWARE
http://www.slideshare.net/nicolaferraro/a-brief-history-of-big-data-48525942
Hadoop → Pig, Hive → (a whole zoo) → Spark
EVOLUTION OF BIG DATA INFRASTRUCTURE
Commodity hardware (Really ?) → Specialized hardware (Appliances) → The Cloud
EVOLUTION OF ARCHITECTURES: BATCH
The first Big Data architecture was heavily based on Hadoop: MapReduce + HDFS.
Some commercial advertising motivations:
Extract meaningful information from raw data: "companies analyze around <<put-a-random-number-between-0-and-20-here>> % of the data they produce"
More data = more precision for machine learning algorithms = better insights
Assist the decision making process: predict the future using all the available data
SELECT page, count(*) FROM logs
ARCHITECTURES: BATCH
[Diagram: Client → App Server → Logs + Data (Flume) → Hadoop (HDFS + MapReduce) nodes → Batch Processing (Hive / MR) → Report (DWH) → Client]
EVOLUTION OF ARCHITECTURES: HYBRID
Second generation architectures were focused on improving user experience through machine learning:
Not everything could be executed in streaming, for performance reasons
Execute heavyweight steps offline in batches (generate the "view", e.g. a machine learning model)
Execute lightweight steps online with streaming applications
The view must be refreshed periodically
ARCHITECTURES: HYBRID (LAMBDA)
[Diagram: Client → App Server → NoSQL + Messaging + Other → Batch Processing and Streaming Processing on the Hadoop nodes]
EVOLUTION OF ARCHITECTURES: STREAMING
The new generation of architectures is streaming-only:
Provide immediate feedback to the user
No need to execute offline work
Provide a personalized experience to each user
Here we focus (mainly) on streaming applications ...
ARCHITECTURES: STREAMING (KAPPA)
[Diagram: Client → App Server → NoSQL + Messaging + Other → Streaming Processing on the Hadoop nodes]
APACHE SPARK
One of the main reasons for Spark's success is that it allows you to define, by design, with a unified model (a sketch follows the list):
Batch processing and streaming (MapReduce / Storm)
Multi-language: core in Scala, can also be used in Java, Python and R
Declarative semantics (SQL), procedural and functional programming (Hive / MapReduce / Pig)
General purpose data processing and machine learning (MapReduce / Mahout)
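For example, the declarative and programmatic styles coexist in the same program. A minimal sketch with the Spark 2.x Java API, reusing the earlier log query (the input file name is hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LogQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("log-query").getOrCreate();
        // Load structured data and expose it as a SQL view
        Dataset<Row> logs = spark.read().json("logs.json"); // hypothetical input
        logs.createOrReplaceTempView("logs");
        // Declarative semantics: same query as the earlier SQL example
        spark.sql("SELECT page, count(*) FROM logs GROUP BY page").show();
        spark.stop();
    }
}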
[Chart: logistic regression runtime in Spark vs. Hadoop (it was so in 2013)]
And the performance is really impressive!
SPARK ARCHITECTURE
[Diagram: logical and physical view. Each application has its own Driver; a second app gets a second Driver.]
Driver App (main): defines the app workflow
Cluster Manager: assigns executors to the driver app
Workers: each single Worker hosts Executors, which do the dirty job
Tasks: "do something on a data partition"
SPARK ARCHITECTURE: DATA DISTRIBUTION
[Diagram: a Spark Driver coordinating Spark Executors; the executors read/write HDFS / S3 and consume Kafka partitions]
Why distribution? ... to spread the work across thousands of workers
SPARK DEMO
A simple quickstart showing a batchapplication doing a word count.
The purpose is to show which parts of the software are executed in the driver and which parts are sent to the remote executors.
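A minimal sketch of such a word count (Spark 2.x Java API; the input path is hypothetical), with comments marking which parts run where:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // This code runs in the driver: it only defines the workflow
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> counts = sc
                .textFile("hdfs:///data/input.txt") // hypothetical path
                // The lambdas below are serialized and sent to the remote executors
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
            // collect() pulls the result back to the driver:
            // fine in a demo, dangerous with big data (next slide)
            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}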
SPARK: WHAT CAN GO WRONG ???
Programming with Spark is awesome, but there are a lot of problems you need to solve:
Most of the problems do not happen in a standalone cluster (serialization issues, classloading issues, network issues, ...)
Something that works well with a small dataset hangs forever in production: you must test it earlier with a lot of data!
Spark applications must be tuned: decide caching steps, decide partitioning, reduce the number of shuffles, do not "collect()"
For streaming apps:
Exactly-once message semantics cannot be guaranteed at all times: you have to write idempotent operations to take duplicate messages into account (a sketch follows)
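A minimal sketch of the idempotency idea, using an in-memory map in place of a real store (names hypothetical; in practice this would be an upsert keyed by message id in a database):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentStore {
    private final Map<String, Double> ratingsByMessageId = new ConcurrentHashMap<>();

    // A re-delivered message overwrites the same entry
    // instead of being counted twice
    public void upsert(String messageId, double rating) {
        ratingsByMessageId.put(messageId, rating);
    }
}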
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.map(RDD.scala:323)
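The trace above is the classic "Task not serializable" failure: a lambda shipped to the executors captures something that cannot be serialized, typically the enclosing object. A minimal sketch of the pattern and a common fix (class and field names are hypothetical):

import org.apache.spark.api.java.JavaRDD;

public class Processor { // not Serializable
    private final String prefix = ">> ";

    public JavaRDD<String> broken(JavaRDD<String> lines) {
        // Referencing the field drags `this` into the closure:
        // Spark tries to serialize the whole Processor and fails
        return lines.map(line -> prefix + line);
    }

    public JavaRDD<String> fixed(JavaRDD<String> lines) {
        // Copy the field into a local variable,
        // so only the String is captured
        final String p = prefix;
        return lines.map(line -> p + line);
    }
}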
AND SPARK IS THE MINOR PROBLEM
Every operation is executed by different software distributed across hundreds of machines.
Production is different from the development environment: it is difficult to keep software aligned (patches and upgrades of NoSQL databases, message brokers, ...): you need to test the integration with different versions of external systems
The whole system randomly doesn't work: e.g. you expect a message in a queue or you expect a result in a query, and your expectations are not satisfied. Debugging is hard: it's better to fix bugs earlier, but bugs appear late in production!
There are consistency problems in some NoSQL databases: the CAP theorem
Too many steps from development to release, with different teams involved: development, QA, operations. I can continue...
THE 2 MAIN PROBLEMS
1. The local development environment is usually too different from production (replication and versions), so you find bugs late
2. Big Data applications are difficult to test effectively, because configuring a good testing environment is a hard and long task: it's difficult to debug bugs all together
Easy Solution:
Test early
Test more
Test the production software
How? We can apply DevOps practices with Kubernetes ...
SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and OpenShift Origin
Demo: Oshinko, Kafka, Spring Boot and Fabric8 tools
DEVOPS
No common definition
Combination of software development, QA and operations: working together to release working software frequently
Applying common dev and agile techniques to operations
It's a cultural change, rather than just a change in development and administration tools
Common practices:
Infrastructure as code
Continuous integration
Continuous delivery
WHY DEVOPS
Increased communication and collaboration among: dev, ops, QA, business
Fast time to market: accommodating business and customer needs quickly to gain competitive advantage
Embrace change, do not fear it
Embrace innovation (and team engagement), through usage of new technologies
Improved quality of software through automated testing
A DEVOPS WORKFLOW
The key point of DevOps is being able to collaborate on the infrastructure the same way you collaborate on software:
Common shared repository (SCM)
Pull requests
Peer review
Automated checks (correctness, quality)
Continuous integration
Continuous delivery
Test the infrastructure like you test software: scalability, fault tolerance, backup, recovery options.
DevOps practices become difficult to apply and ineffective when your infrastructure does not follow the software release cycle.
DOCKER
Docker has revolutionized the way we build and ship software.
Containers can be created using a Dockerfile (example below)
There are at least 10 different ways to create Docker containers
A container is not a VM (even if it seems really close to it)
A container hosts a single application
Containers need to be orchestrated to deliver their full power
FROM ubuntu:16.04

RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv EA312927
# Register the MongoDB 3.2 apt repository
RUN echo "deb http://repo.mongodb.org/apt/ubuntu $(cat /etc/lsb-release | grep DISTRIB_CODENAME | cut -d= -f2)/mongodb-org/3.2 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-3.2.list
RUN apt-get update && apt-get install -y mongodb-org

# Create the MongoDB data directory
RUN mkdir -p /data/db

# Expose port #27017 from the container to the host
EXPOSE 27017

# Set /usr/bin/mongod as the dockerized entry-point application
ENTRYPOINT ["/usr/bin/mongod"]
DOCKER ARCHITECTURE
Docker is the building block of infrastructure as code:
The image is completely described by the Dockerfile
Any container is stateless (although stateful data can be attached with volumes). This means:
You don't ssh into your server and change the configuration
You rebuild a new version of the application when you need to change something
"Deployment" basically means replacing a container with a different container
KUBERNETES
Cloud platform, with several deployment options (including local and private cloud), to orchestrate (Docker) containers:
Born at Google
Production ready (Pokemon Go!)
Provides:
Application composition in namespaces
Virtual networks
Service discovery
Load balancing
Auto (and manual) scaling
Automatic recovery
Health checking
Rolling upgrades
Monitoring
Logging
...
kubectl run --image=nginx nginx-app --port=80
oc run --image=nginx nginx-app --port=80
# or better..
# oc new-app nginx --name nginx-app2
Both Kubernetes and OpenShift Origin are Open Source projects!
KUBERNETES ARCHITECTURE
Very high level architecture of Kubernetes.
Note: it resembles the Spark architecture seen in previous slides (with the Kubernetes master playing the role of the cluster manager), but we want to deploy a whole Spark application inside Kubernetes (including the driver).
KUBERNETES FOR LOCAL DEV
Minikube
https://github.com/kubernetes/minikube
OPENSHIFT FOR LOCAL DEV
Minishift
"oc cluster up"
https://github.com/minishift/minishift
$ minikube start
Starting local Kubernetes cluster...
Running pre-create checks...
Creating machine...
Starting local Kubernetes cluster...
$ kubectl run hello-minikube --image=gcr.io/google_containers/echoserver:1.4 --port=8080
$ minishift start
...
$ oc cluster up
-- Checking OpenShift client ... OK
-- Checking Docker client ... OK
...
SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and OpenShift Origin
Demo: Oshinko, Kafka, Spring Boot and Fabric8 tools
OPENSHIFT ORIGIN DEMO
A simple recommender system using collaborative filtering on OpenShift Origin.
Deploy into OpenShift Origin:
Spring Boot microservice
Kafka broker
Spark cluster
Source code:
https://github.com/nicolaferraro/voxxed-bigdata-spark
[Diagram: clients send ratings (****) and receive recommendations, backed by the Spark cluster]
SPARK ON KUBERNETES
The problem:
A Spark application needs a cluster of n worker machines to run
You need 1 cluster manager node
The application runs on 1 driver node
We want to deploy everything inside Kubernetes or OpenShift Origin
Basically, every piece of the Spark architecture should become a POD.
There are multiple solutions. I'm going to introduce the Oshinko project.
http://radanalytics.io/
https://github.com/radanalyticsio
Recall the Spark architecture: the Driver, the Cluster Manager and the Workers should all become PODs.
OSHINKO
Making Apache Spark Cloud Native on OpenShift Origin.
[Diagram: two independent Spark clusters, each with a Driver (D), a Cluster Manager (CM) and Workers (W)]
1 cluster per application! Deployed in the application namespace
Oshinko console
DEMO TIME
We will deploy the Oshinko REST server and then deploy the Spark streaming application. It will connect to Kafka to retrieve ratings and return personalized (not really) recommendations (a sketch of the Kafka side follows the sources link).
Sources:
https://github.com/nicolaferraro/voxxed-bigdata-spark
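To give an idea of the streaming side, here is a minimal sketch of reading ratings from Kafka with the spark-streaming-kafka-0-10 integration (broker address, topic and group id are hypothetical, not necessarily what the demo repository uses):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class RatingsStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("ratings-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092"); // hypothetical service name
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "ratings-app"); // hypothetical group id

        JavaInputDStream<ConsumerRecord<String, String>> ratings =
            KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("ratings"), kafkaParams));

        // Each micro-batch of ratings would feed the recommender here
        ratings.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}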
I used a basic implementation of the slope one algorithm (https://en.wikipedia.org/wiki/Slope_One) for collaborative filtering.
It's just an example of a "real life" custom Spark application that sometimes doesn't work :)
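For reference, weighted slope one predicts a rating from the average differences between item pairs. A minimal, non-Spark sketch of the idea (the data layout is hypothetical, not the demo's actual code):

import java.util.Map;

public class SlopeOne {
    // ratings: user -> (item -> rating)
    public static double predict(Map<String, Map<String, Double>> ratings,
                                 String user, String targetItem) {
        double num = 0, den = 0;
        for (Map.Entry<String, Double> rated : ratings.get(user).entrySet()) {
            String item = rated.getKey();
            if (item.equals(targetItem)) continue;
            // Average difference (targetItem - item) over users who rated both
            double diffSum = 0;
            int count = 0;
            for (Map<String, Double> other : ratings.values()) {
                if (other.containsKey(targetItem) && other.containsKey(item)) {
                    diffSum += other.get(targetItem) - other.get(item);
                    count++;
                }
            }
            if (count > 0) {
                // Weight each co-rated pair by how many users support it
                num += (rated.getValue() + diffSum / count) * count;
                den += count;
            }
        }
        return den == 0 ? Double.NaN : num / den;
    }
}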
TESTING
One of the major advantages of making a cloud native Spark application is the possibility of executing system tests on software identical to production.
Production pods
Optional testing tools or load generators
Test driver (inside or outside the namespace)
You can create virtual test environments on the fly and do assertions.
Demo: using fabric8-arquillian and the fabric8 kubernetes client
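As a flavor of what such a test can do, here is a minimal sketch with the Fabric8 Kubernetes Client, listing pods and their phase (the namespace name is hypothetical):

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PodHealthCheck {
    public static void main(String[] args) {
        // Connects using the local kubeconfig or the in-cluster service account
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            for (Pod pod : client.pods().inNamespace("myproject").list().getItems()) {
                System.out.println(pod.getMetadata().getName()
                    + " -> " + pod.getStatus().getPhase());
            }
        }
    }
}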
TESTING: SOME SCENARIOS
Some tests that you may want to run:
Deploy the production infrastructure (or part of it) and run a suite of functional tests to ensure every piece works as expected to produce the result
Deploy the production infrastructure and do performance tests using external or injected data
Deploy the production infrastructure (or part of it) and check that every piece is healthy. Then randomly kill pods while sending a load to test the system availability (system tests)
The Chaos Monkey!
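A minimal sketch of the "randomly kill pods" idea, again with the Fabric8 Kubernetes Client (namespace hypothetical; run it only against a test environment):

import java.util.List;
import java.util.Random;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class TinyChaosMonkey {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            List<Pod> pods = client.pods().inNamespace("myproject").list().getItems();
            if (!pods.isEmpty()) {
                Pod victim = pods.get(new Random().nextInt(pods.size()));
                System.out.println("Killing " + victim.getMetadata().getName());
                // Kubernetes should recreate the pod if a controller manages it
                client.pods().inNamespace("myproject")
                      .withName(victim.getMetadata().getName()).delete();
            }
        }
    }
}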
FABRIC8
We have used some tools from Fabric8:
Fabric8 Maven Plugin: to build Kubernetes resources for Java projects
Docker Maven Plugin: used by the fabric8 maven plugin to create docker images
Fabric8 Arquillian: to create system tests from artifacts using the fabric8 plugin
Fabric8 Kubernetes Client: to interact with Kubernetes resources (pods, services, etc.)
Fabric8 is a complete development platform for Kubernetes!
Uses Jenkins Pipelines to provide:
Continuous integration
Continuous delivery
Out of the box!
https://fabric8.io/
QUESTIONS?
Q/A
What about storage and data locality? In cloud native applications, using a storage system like S3 has many advantages over HDFS (costs, availability, usage patterns in testing). The performance of HDFS is superior (6x), but you can use larger clusters for a limited amount of time to gain the same performance. More info: http://qr.ae/TAF4cN
Can you scale the application directly from OpenShift, or do you always need to change the code? Sure, you can use OpenShift to scale the application. Beware that, in a pure DevOps approach, changing the configuration of a system at runtime is not something you should do, especially in production. You lose the benefits of having all the software and configuration in an "executable" state, and you lose the effectiveness of system tests because tests will not run on an environment identical to production. As an alternative, you can use the auto-scaling feature (enabling it in the code).
Is the Spark demo using a machine learning algorithm? The demo is using a basic collaborative filtering algorithm called "slope one" (the focus of this talk is not machine learning). Spark provides many algorithms out of the box, also for collaborative filtering. Many of them are supposed to be executed in a batch application, rather than in a streaming app.
THANKS
Follow me on Twitter!
@ni_ferraro