Posted on 15-Apr-2017

1

Gerrit + Jenkins = Continuous Delivery for Big Data

Mountain View, CA, November 2015

Stefano Galarraga, GerritForge

stefano@gerritforge.com
http://www.gerritforge.com

Real-life case study and future developments

2

The Team

Luca Milanesio
• Co-founder and Director of GerritForge
• Over 20 years in Agile Development and ALM
• Open source contributor to many projects (Big Data, Continuous Integration, Git/Gerrit)

Antonios Chalkiopoulos
• Author of Programming MapReduce with Scalding
• Open source contributor to many Big Data projects
• Working on the "land of Hadoop" (landoop.com)

Tiago Palma
• Data Warehouse & Big Data development
• Senior Data Modeler
• Big Data infrastructure specialist

Stefano Galarraga
• 20 years of Agile Development
• Middleware, Big Data, Reactive Distributed Systems
• Open source contributor to Big Data projects

3

Agenda

• What's special in Big Data
– General lack of support for unit/integration testing
– Testing the "real thing" (a.k.a. the cluster)
• Why Gerrit for continuous deployment on Big Data?
• Our development lifecycle ingredients
– Gerrit, Jenkins, Mesos, Marathon, CDH / Spark
• Gerrit role and components
– What we used, why, and what we would like to have
• New developments
– Using topics with microservices for "atomic" multi-service changes
• Live (minimised) demo
• Open points and discussion

4

WHY Gerrit?

• Fast paced
• Distributed team
• Relatively a "niche" technology
– A lot of "junior" developers
– Need for strong ownership
– Validation rules
– CD => we need to have green builds and consistent code quality

5

Code-Review Lifecycle

• Git used by distributed teams (UK, Israel, India)
• Topics and code review
• Jenkins build on every patch-set
• Commits reviewed / approved via Gerrit Submit
• Submitting a topic automatically:
– merges all patch-sets (semi-atomically)
– triggers a longer chain of CI steps
– promotes a release candidate if everything passes
• Jenkins automation via the Gerrit Trigger Plugin
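The slides don't show the plumbing, but the lifecycle above rests on Gerrit's `stream-events` feed, which emits one JSON object per event (types such as `patchset-created` and `change-merged`, with the topic under `change.topic`). A minimal sketch of how a CI driver could group merged changes by topic; the grouping helper itself is our illustration, not part of Gerrit:

```python
import json

def parse_event(line):
    """Parse one JSON event from `gerrit stream-events` output."""
    try:
        return json.loads(line)
    except ValueError:
        return None

def merged_changes_by_topic(lines):
    """Group change-merged events by topic, so a CI job can react
    once per topic instead of once per change."""
    topics = {}
    for line in lines:
        event = parse_event(line)
        if not event or event.get("type") != "change-merged":
            continue
        change = event.get("change", {})
        topic = change.get("topic") or "(no topic)"
        topics.setdefault(topic, []).append(change.get("id"))
    return topics
```

In production the lines would come from `ssh -p 29418 user@gerrit gerrit stream-events` rather than a list in memory.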

6

Build Steps and Solutions

• Unit tests abstracting from dependencies
• Integration tests:
– Using Docker to run dependencies on the CI
– "Micro" Hadoop cluster or other dependencies (DBs, messaging) => Jenkins Docker plugin
– When possible, "dockerizing" just the required components and driving them from the test framework
• Performance/acceptance tests required a real cluster
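"Unit tests abstracting from dependencies" means keeping the ETL logic in pure functions that run without Spark, Hadoop or a database. A sketch with a hypothetical transform (the function and field names are ours, purely illustrative):

```python
import unittest

def normalise_record(row):
    """Hypothetical ETL transform: trim the key and coerce the amount.
    Kept free of any Spark/Hadoop dependency so it unit-tests in-process."""
    return {
        "id": row["id"].strip(),
        "amount": float(row["amount"]),
    }

class NormaliseRecordTest(unittest.TestCase):
    def test_trims_and_coerces(self):
        out = normalise_record({"id": " a1 ", "amount": "3.5"})
        self.assertEqual(out, {"id": "a1", "amount": 3.5})

if __name__ == "__main__":
    unittest.main()
```

The same function is later exercised inside the real Spark job by the integration tests.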

7

Fitting CDH Into this Picture

• Acceptance / performance tests with short-lived CDH clusters
• Solution: Mesos, Marathon and Docker
– Ephemeral clusters with defined capacity
– Automatic cluster configuration
– All controlled via Docker/Mesos
• This was quite a long process!
– mostly because of CDH cluster configuration

8

Mesos + Marathon

• Apache Mesos
– Abstracts CPU, memory, storage and other compute resources away from machines
• Marathon framework
– Runs on top of Mesos
– Guarantees that long-running applications never stop
– REST API for managing and scaling services
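Marathon's REST API takes a JSON app definition via `POST /v2/apps`. A sketch of what the definition for the CDH agent containers could look like; the host, image name and resource sizes are placeholders, not values from the talk:

```python
import json
import urllib.request

MARATHON_URL = "http://marathon.example.com:8080"  # placeholder host

def cdh_agent_app(instances):
    """Build a Marathon v2 app definition that runs N Docker containers.
    Image name and resource sizes are illustrative only."""
    return {
        "id": "/cdh/agent",
        "instances": instances,
        "cpus": 2,
        "mem": 8192,
        "container": {
            "type": "DOCKER",
            "docker": {"image": "example/cloudera-agent:5.4.1",
                       "network": "HOST"},
        },
    }

def submit(app):
    """POST the app definition to Marathon's REST API."""
    req = urllib.request.Request(
        MARATHON_URL + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

Scaling the ephemeral cluster up or down is then just re-posting the definition with a different `instances` value.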

9

CDH Components

• CDH 5.4.1 distribution– Apache Spark– Hadoop HDFS– YARN

10

Integration/Performance Test Flow on CDH Cluster

Components (from the slide diagram): Jenkins Master, Mesos Master, Marathon framework, a private Docker registry, and Mesos Slaves running Docker on slave hosts.

1. Jenkins POSTs to the Marathon REST API to start one Docker container with Cloudera Manager and N Docker containers with Cloudera agents
2. The Marathon framework receives resource offers from the Mesos Master and submits the tasks
3. The tasks are sent to the Mesos Slaves
4. Each Mesos Slave starts its Docker container; the Docker image is fetched from the private Docker registry if not already present on the slave host
5. Jenkins waits for the Docker containers to come up
6. Once the containers are up, Cloudera packages are installed via the Cloudera Manager API using Python
7. Deploy the ETL, run the ETL and the acceptance tests
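The "waiting for Dockers" step can be automated by polling Marathon's `GET /v2/apps/{id}` endpoint, which reports `tasksRunning` for the app. A sketch; the host and the timeout values are placeholders:

```python
import json
import time
import urllib.request

MARATHON_URL = "http://marathon.example.com:8080"  # placeholder host

def all_tasks_running(app_payload, expected):
    """True once Marathon reports every requested task as running."""
    app = app_payload["app"]
    return app.get("tasksRunning", 0) >= expected

def wait_for_app(app_id, expected, timeout=600, interval=10):
    """Poll Marathon until the app's tasks are up, or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(MARATHON_URL + "/v2/apps" + app_id) as resp:
            payload = json.load(resp)
        if all_tasks_running(payload, expected):
            return True
        time.sleep(interval)
    return False
```

Only when this returns True does the pipeline move on to installing Cloudera packages and running the ETL.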

11

Unit and Integration Tests sample

• Test project:
– Test Spark project
– ETL from Oracle to HDFS
• Unit tests directly on the Spark logic
• Integration tests for every patch-set:
– VERY small dataset, just for this demo
– CDH and Oracle Docker images
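Starting the per-patch-set dependency containers can be as simple as shelling out to `docker run`. A sketch; the Oracle image name and container name are placeholders, not the images used in the talk:

```python
import subprocess

ORACLE_IMAGE = "example/oracle-xe:11.2"  # placeholder image name

def start_oracle_cmd(name="it-oracle", port=1521):
    """Build the `docker run` argument list for a throwaway Oracle
    container used by an integration-test run."""
    return ["docker", "run", "-d", "--name", name,
            "-p", "%d:1521" % port, ORACLE_IMAGE]

def run(cmd):
    """Execute the command; `docker run -d` prints the container id."""
    return subprocess.check_output(cmd).decode().strip()
```

Keeping the command construction separate from execution makes the wiring itself unit-testable without a Docker daemon.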

12

Unit and Integration Tests

[Diagram: a Jenkins build job starts Oracle and CDH containers (Hadoop in pseudo-distributed mode, Spark standalone), initialises and reads HDFS, and submits the Spark job.]

13

DEMO

14

Open Points and Discussion

• Topic-based build of multiple artifacts
– Demo implementation is naïve and difficult to maintain
– Race conditions on builds of dependent artifacts
• Need for a more advanced triggering system (Zuul might fit)
– Race condition on submit of a topic
• Stream event: "topic-submitted" instead of, or in addition to, many "patch-submitted" events
• The Gerrit Trigger plugin should listen to this event to coordinate
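Until a "topic-submitted" stream event exists, one workaround is to collapse the many per-change events into a single per-topic trigger on the consumer side. A sketch of such a gate; this is our illustration of the idea, not an existing Gerrit or plugin feature:

```python
def topic_gate():
    """Collapse many per-change "change-merged" events into one
    per-topic trigger by remembering which topics already fired.
    Does not solve the underlying race on topic submit, only the
    duplicate-trigger symptom."""
    fired = set()

    def should_trigger(event):
        topic = event.get("change", {}).get("topic")
        if not topic or topic in fired:
            return False
        fired.add(topic)
        return True

    return should_trigger
```

A real implementation would also need to know when a topic is complete, which is exactly why a server-side "topic-submitted" event would be the cleaner fix.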

Questions?
