Gerrit + Jenkins = Continuous Delivery for Big Data

Mountain View, CA, November 2015. Stefano Galarraga, GerritForge, [email protected], http://www.gerritforge.com. Real-life case study and future developments.

Upload: stefano-galarraga

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Gerrit + Jenkins = Continuous Delivery For Big Data

Gerrit + Jenkins = Continuous Delivery for Big Data

Mountain View, CA, November 2015

Stefano Galarraga
GerritForge

[email protected]
http://www.gerritforge.com

Real-life case study and future developments

Page 2: Gerrit + Jenkins = Continuous Delivery For Big Data

The Team

Luca Milanesio
• Co-founder and Director of GerritForge
• Over 20 years in Agile Development and ALM
• Open Source contributor to many projects (BigData, Continuous Integration, Git/Gerrit)

Antonios Chalkiopulos
• Author of Programming MapReduce with Scalding
• Open Source contributor to many BigData projects
• Working on the "land-of-Hadoop" (landoop.com)

Tiago Palma
• Data Warehouse & Big Data Development
• Senior Data Modeler
• Big Data infrastructure specialist

Stefano Galarraga
• 20 years of Agile Development
• Middleware, Big Data, Reactive Distributed Systems
• Open Source contributor to BigData projects

Page 3: Gerrit + Jenkins = Continuous Delivery For Big Data

Agenda

• What’s special in Big Data
  – General lack of support for Unit/Integration testing
  – Testing the "real thing" (aka the Cluster)
• Why Gerrit for continuous deployment on BigData?
• Our Development Lifecycle ingredients
  – Gerrit, Jenkins, Mesos, Marathon, CDH / Spark
• Gerrit Role and Components
  – What we used, why, and what we would like to have
• New developments
  – Using Topics with microservices for “atomic” multi-service changes
• Live (minimised) Demo
• Open points and discussion

Page 4: Gerrit + Jenkins = Continuous Delivery For Big Data

WHY Gerrit?

• Fast paced
• Distributed team
• A relatively “niche” technology
  – A lot of “junior” developers
  – Need for strong ownership
  – Validation rules
  – CD => We need to have green builds and consistent code quality

Page 5: Gerrit + Jenkins = Continuous Delivery For Big Data

Code-Review Lifecycle

• Git used by distributed teams (UK, Israel, India)
• Topics and Code Review
• Jenkins build on every patch-set
• Commits reviewed / approved via Gerrit Submit
• Submitting a Topic automatically:
  – merges all patch-sets (semi-atomically)
  – triggers a longer chain of CI steps
  – promotes an RC if everything passes
• Jenkins automation via Gerrit Trigger Plugin

Page 6: Gerrit + Jenkins = Continuous Delivery For Big Data

Build Steps and Solutions

• Unit tests abstracting from dependencies
• Integration Tests:
  – Using Docker to run dependencies on the CI
    • “Micro” Hadoop cluster or other dependencies (DBs, messaging) => Jenkins Docker plugin
    • When possible, “dockerizing” just the required components and driving them from the test framework
• Performance/Acceptance required a real cluster
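To illustrate "unit tests abstracting from dependencies", here is a minimal sketch in Python. It assumes the ETL logic can be written as a plain function over rows, so the test needs neither Oracle nor a cluster. The function `clean_rows` and its field names are hypothetical and do not come from the project shown in the talk.

```python
import unittest
from typing import Iterable, Iterator


def clean_rows(rows: Iterable[dict]) -> Iterator[dict]:
    """Hypothetical ETL step: drop rows without an id and normalise names."""
    for row in rows:
        if row.get("id") is None:
            continue  # rows without a key cannot be loaded downstream
        yield {
            "id": int(row["id"]),
            "name": str(row.get("name", "")).strip().upper(),
        }


class CleanRowsTest(unittest.TestCase):
    """Runs on any CI node: the input is an in-memory list, not a database."""

    def test_drops_rows_without_id_and_normalises_names(self):
        source = [{"id": "1", "name": " alice "}, {"id": None, "name": "bob"}]
        self.assertEqual(list(clean_rows(source)), [{"id": 1, "name": "ALICE"}])
```

The same function could later be applied inside a cluster job; only the code that reads from the source and writes to HDFS would then need the Docker-based integration tests.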

Page 7: Gerrit + Jenkins = Continuous Delivery For Big Data

Fitting CDH Into this Picture

• Acceptance / performance tests with short-lived CDH clusters
• Solution: Mesos, Marathon and Docker:
  – Ephemeral clusters with defined capacity
  – Automatic cluster configuration
  – All controlled via Docker/Mesos
• This was quite a long process!
  – mostly because of CDH cluster configuration

Page 8: Gerrit + Jenkins = Continuous Delivery For Big Data

Mesos + Marathon

• Apache Mesos
  – Abstracts CPU, memory, storage, and other compute resources away from machines
• Marathon Framework
  – Runs on top of Mesos
  – Guarantees that long-running applications never stop
  – REST API for managing and scaling services
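As a rough illustration of the Marathon REST API mentioned above, the following Python sketch builds (but does not send) the HTTP requests for creating and scaling a Docker-based application. The host name, application ids, image names, and resource figures are placeholders. The JSON fields (`id`, `instances`, `cpus`, `mem`, `container.docker.image`) follow the Marathon app definition as commonly documented; check them against the Marathon version in use.

```python
import json
import urllib.request

# Assumption: placeholder address, not the one used in the talk.
MARATHON_URL = "http://marathon.example.com:8080"


def docker_app(app_id, image, instances, cpus, mem_mb):
    """Build a Marathon app definition for a Docker container."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,
        "mem": mem_mb,
        "container": {"type": "DOCKER", "docker": {"image": image}},
    }


def create_app_request(app):
    """Prepare POST /v2/apps, which asks Marathon to start the application."""
    return urllib.request.Request(
        url=f"{MARATHON_URL}/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scale_app_request(app_id, instances):
    """Prepare PUT /v2/apps/<id>, which changes the number of instances."""
    return urllib.request.Request(
        url=f"{MARATHON_URL}/v2/apps/{app_id.strip('/')}",
        data=json.dumps({"instances": instances}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical images: one Cloudera Manager container plus N agent containers.
manager = docker_app("/cdh/manager", "registry.example.com/cdh-manager", 1, 2, 8192)
agents = docker_app("/cdh/agent", "registry.example.com/cdh-agent", 3, 2, 8192)
requests_to_send = [create_app_request(manager), create_app_request(agents)]
```

Passing each prepared request to `urllib.request.urlopen` would submit it. The sketch stops short of that so it can be read and tested without a running Marathon.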

Page 9: Gerrit + Jenkins = Continuous Delivery For Big Data

CDH Components

• CDH 5.4.1 distribution
  – Apache Spark
  – Hadoop HDFS
  – YARN

Page 10: Gerrit + Jenkins = Continuous Delivery For Big Data

Integration/Performance Test Flow on CDH Cluster

[Diagram: Jenkins Master, Mesos Master, Marathon, Private Docker Registry, and a Slave Host running Mesos Slave and Docker]

• POST to Marathon REST API to start 1 Docker container with Cloudera Manager and N Docker containers with Cloudera agents
• Marathon Framework receives resource offers from Mesos Master and submits the tasks
• The task is sent to the Mesos Slave
• Mesos Slave starts the Docker container
• Docker image is fetched from the Docker registry if not present on the Slave host
• Waiting for Dockers
• Dockers UP
• Install Cloudera packages via Cloudera Manager API using Python
• Deploy the ETL, run the ETL and the Acceptance Tests
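The "Waiting for Dockers" / "Dockers UP" stage of the flow above could be sketched as a polling loop like the one below. This is an assumption about how such a wait might be written, not the code used in the talk. It relies on the `instances` and `tasksRunning` fields that Marathon reports for an application. The `fetch_app` callable is hypothetical and would wrap a GET on the application's `/v2/apps/<id>` resource.

```python
import time
from typing import Callable


def all_tasks_running(app: dict) -> bool:
    """True when Marathon reports as many running tasks as requested instances."""
    return app.get("tasksRunning", 0) >= app.get("instances", 1)


def wait_until_up(
    fetch_app: Callable[[], dict],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every container is running, or give up after timeout_s."""
    deadline = clock() + timeout_s
    while True:
        if all_tasks_running(fetch_app()):
            return True  # "Dockers UP": safe to install the Cloudera packages
        if clock() >= deadline:
            return False  # fail the Jenkins step instead of hanging forever
        sleep(poll_s)
```

Injecting `sleep` and `clock` keeps the function testable without real delays. In a Jenkins job the defaults would be used and a `False` result would fail the build.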

Page 11: Gerrit + Jenkins = Continuous Delivery For Big Data

Unit and Integration Tests sample

• Test project:
  – Test Spark project
  – ETL from Oracle to HDFS
• Unit tests directly on Spark logic
• Integration tests for every patch-set:
  – VERY small dataset just for this demo
  – CDH and Oracle Docker images

Page 12: Gerrit + Jenkins = Continuous Delivery For Big Data

Unit and Integration Tests

[Diagram. Components: Jenkins (Build Job), Oracle, CDH, Hadoop pseudo-distributed mode, Spark Standalone. Arrow labels: init, Submit job, Init/read HDFS]

Page 13: Gerrit + Jenkins = Continuous Delivery For Big Data


DEMO

Page 14: Gerrit + Jenkins = Continuous Delivery For Big Data

Open Points and Discussion

• Topic-based build of multiple artifacts
  – Demo implementation is naïve and difficult to maintain
  – Race conditions on build of dependent artifacts
    • Need a more advanced triggering system (Zuul might fit)
  – Race condition on submit of topic
    • Stream event: “topic-submitted” instead of, or in addition to, many “patch-submitted” events
    • Gerrit Trigger plugin should listen to this event to coordinate
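The "topic-submitted" idea could be approximated outside Gerrit with a small aggregator like the sketch below. It is only an illustration under stated assumptions: it consumes per-change events shaped like Gerrit's `change-merged` stream events (with `change.topic` and `change.number`), it assumes the set of changes in each topic is known up front, and the `topic-submitted` event it emits is hypothetical, since the slide lists such an event as a wish rather than an existing feature.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class TopicSubmitAggregator:
    """Collapse many per-change merge events into one event per topic."""

    def __init__(self, expected: Dict[str, Set[int]]):
        # Assumption: topic -> change numbers in that topic, known in advance.
        self._expected = expected
        self._merged = defaultdict(set)
        self._fired = set()

    def on_event(self, event: dict) -> Optional[dict]:
        """Return a synthetic topic-level event once a topic is fully merged."""
        if event.get("type") != "change-merged":
            return None
        change = event.get("change", {})
        topic = change.get("topic")
        if topic not in self._expected or topic in self._fired:
            return None  # unknown topic, no topic, or already reported
        self._merged[topic].add(int(change["number"]))
        if self._merged[topic] >= self._expected[topic]:
            self._fired.add(topic)  # fire exactly once per topic
            return {
                "type": "topic-submitted",  # hypothetical event name
                "topic": topic,
                "changes": sorted(self._merged[topic]),
            }
        return None
```

A CI trigger that listens only to the aggregated event would start the long build chain once per topic rather than once per merged change, which is the coordination problem the slide describes.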

Page 15: Gerrit + Jenkins = Continuous Delivery For Big Data

Questions?