![Page 1: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/1.jpg)
DevOps for Big Data
Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau
Max Martynov, VP of Technology, Grid Dynamics
![Page 2: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/2.jpg)
Introductions
• Grid Dynamics
─ Solutions company, specializing in eCommerce
─ Experts in mission-critical applications (IMDGs, Big Data)
─ Implementing Continuous Integration and Continuous Delivery for 5+ years
• Qubell
─ Enterprise DevOps platform
─ Focused on self-service environments, service orchestration, and continuous upgrades
─ Targets web-scale and big data applications
![Page 3: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/3.jpg)
State of DevOps and Continuous Delivery
Continuous Delivery Value
• Agility
• Transparency
• Efficiency
• Consistency
• Quality
• Control
Findings from The 2014 State of DevOps Report
• Strong IT performance is a competitive advantage
• DevOps practices improve IT performance
• Organizational culture matters
• Job satisfaction is the No. 1 predictor of organizational performance
![Page 4: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/4.jpg)
Continuous Delivery Infrastructure
• Environments
─ Reliable and repeatable deployment automation
─ Database schema management
─ Data management
─ Application properties management
─ Dynamic environments
• Quality
─ Test automation
─ Test data management (again)
─ Code analysis and review
• Process
─ Source code management, branching strategy
─ Agile requirements and project management
─ CI/CD pipeline
* Big Data applications bring additional challenges in these areas because of large data volumes, complex business logic, and large-scale environments.
![Page 5: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/5.jpg)
Implementing Continuous Delivery for Big Data: Initial State of the Project
• Medium-sized distributed development team
• Diverse technology stack – Hadoop + Vertica + Tableau
• Only one environment existed, and it was production
• Procurement of hardware for a new environment took months
• Delivery pipeline: Development Team → Production
![Page 6: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/6.jpg)
Development in Production
It is fun until somebody misses the nail
![Page 7: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/7.jpg)
Hadoop Analytical Application
Diagram: Manager, Master, Database, Slaves 1 - N
10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers
How to quickly reproduce this environment for dev-test purposes?
![Page 8: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/8.jpg)
1. Stop Gap Measure
• Same hardware, different logical “zones” implemented on the file system
• Automated build and deployment
• Delivery pipeline: Development Team → Production cluster
• Zones on the production cluster: /test1-N, /stage, /prod
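A minimal sketch of how file-system zones like these could be addressed from deployment scripts, assuming each zone is simply a top-level directory. The function names (`zone_root`, `resolve`) and the path-checking rule are illustrative assumptions, not taken from the talk.

```python
from pathlib import PurePosixPath

# Hypothetical zone roots, mirroring the /test1-N, /stage, /prod layout on the slide.
FIXED_ZONES = {"stage": "/stage", "prod": "/prod"}


def zone_root(zone: str) -> PurePosixPath:
    """Return the file-system root of a logical zone such as 'test3' or 'prod'."""
    if zone in FIXED_ZONES:
        return PurePosixPath(FIXED_ZONES[zone])
    # Test zones are numbered: test1, test2, ..., testN.
    suffix = zone[4:]
    if zone.startswith("test") and suffix.isdigit() and int(suffix) >= 1:
        return PurePosixPath("/" + zone)
    raise ValueError(f"unknown zone: {zone!r}")


def resolve(zone: str, relative_path: str) -> str:
    """Map a zone-independent relative path to its location inside one zone."""
    rel = PurePosixPath(relative_path)
    if rel.is_absolute() or ".." in rel.parts:
        # Refuse paths that could escape the zone and touch another zone's data.
        raise ValueError(f"path must stay inside the zone: {relative_path!r}")
    return str(zone_root(zone) / rel)


if __name__ == "__main__":
    # The same job configuration can point at any zone by changing one argument.
    print(resolve("test2", "jobs/daily/output"))
    print(resolve("prod", "jobs/daily/output"))
```

Path prefixes only separate the zones logically. All zones still share the same hardware, so this does nothing to isolate load or failures between them.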
![Page 9: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/9.jpg)
1. Stop Gap Measure: Pros and Cons
Pros
• Better than before: code can be tested before it goes to production
• All logical environments have access to the same production data
• Zero additional environment costs
Cons
• Stability, security, and compliance issues: dev, test, and prod environments share the same hardware
• Performance issues: tests affect production performance
• Impossible to run “destructive” tests that affect shared production data
• Impossible to test upgrades of middleware (new versions of H* components)
![Page 10: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/10.jpg)
2. Hadoop Dynamic Environments
Diagram: Dev/QA/Ops → Request Environment → Orchestrate environment provisioning and application deployment → Environment (Dev, QA, Stage, Prod)
Inputs: Components (Data, Custom Application), Services, Environment Policies
![Page 11: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/11.jpg)
2. Hadoop Dynamic Environments (continued)
• Dev/QA/Ops teams got a self-service portal to
─ provision environments
─ deploy applications
• A new environment can be created from scratch in 2-3 hours
─ single-node dev sandbox
─ multi-node QA
─ big clusters for scalability and performance
• An application can be deployed to an environment within 10 minutes
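The environment sizes above can be thought of as named profiles that a self-service request expands into provisioning steps. The sketch below is only an illustration of that idea. The profile names, node counts, data sources, and step wording are all hypothetical placeholders, and it does not represent the Qubell platform's actual model or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EnvironmentSpec:
    name: str
    nodes: int         # total nodes, including the master
    data_source: str   # where the environment's data comes from


# Hypothetical profiles; node counts are placeholders, not figures from the talk.
PROFILES: Dict[str, EnvironmentSpec] = {
    "dev-sandbox": EnvironmentSpec("dev-sandbox", nodes=1, data_source="test-generated"),
    "qa": EnvironmentSpec("qa", nodes=3, data_source="obfuscated-snapshot"),
    "performance": EnvironmentSpec("performance", nodes=10, data_source="obfuscated-snapshot"),
}


def provisioning_plan(profile: str) -> List[str]:
    """Return the ordered steps to build an environment from scratch."""
    spec = PROFILES[profile]
    steps = ["provision node 1 (master)"]
    if spec.nodes == 1:
        # Single-node sandbox: master and worker services share one machine.
        steps.append("co-locate worker services on node 1")
    else:
        steps += [f"provision node {i} (worker)" for i in range(2, spec.nodes + 1)]
    steps.append(f"load data: {spec.data_source}")
    steps.append("deploy applications")
    return steps


if __name__ == "__main__":
    for step in provisioning_plan("qa"):
        print(step)
```

Keeping the profile as data rather than as a script means the same orchestration logic serves a one-node sandbox and a large performance cluster.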
![Page 12: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/12.jpg)
3. Vertica and Tableau Dynamic Environments
Diagram: Dev/QA/Ops → Request Environment → Orchestrate environment provisioning and application deployment → Environment (Dev, QA, Stage, Prod)
Inputs: Components (Data, UDF, VSQL, Config), Services, Environment Policies, Shared service
![Page 13: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/13.jpg)
4. Tests & Test Data
• Dev and QA teams implemented automated tests
• Two options to handle data on dev-test environments:
1. Tests generate data for themselves
2. A reduced representative snapshot of obfuscated production data (10 TB → 10 GB)
Test levels:
• Unit Tests
─ Java code, auto-generated data; build-time validation
• Component Tests
─ Auto tests on “API” level, testing job output; test-generated data
• Integration Tests (integration with data)
─ Auto tests on “API” level, validating job output; snapshot of production data
• Exploratory Tests
─ Manual tests; snapshot of production data
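One possible way to build the reduced, obfuscated snapshot is to sample records by a hash of their key and replace sensitive values with stable pseudonyms. This is a sketch under assumptions: the talk does not say how the snapshot was produced, and the salt, field names, and the 1-in-1000 ratio (roughly 10 TB → 10 GB) are illustrative. Salted hashing is pseudonymization only, so a real pipeline would need a compliance review.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List

SALT = b"example-salt"  # hypothetical; a real salt would be kept secret


def _digest(value: str) -> int:
    """Stable 64-bit integer derived from a salted SHA-256 of the value."""
    raw = hashlib.sha256(SALT + value.encode("utf-8")).digest()
    return int.from_bytes(raw[:8], "big")


def keep_record(key: str, keep_one_in: int = 1000) -> bool:
    """Deterministically keep about 1 in `keep_one_in` keys."""
    return _digest(key) % keep_one_in == 0


def obfuscate(value: str) -> str:
    """Replace a sensitive value with a stable pseudonym so joins still work."""
    return "u" + format(_digest("pii:" + value), "016x")


def reduce_snapshot(
    records: Iterable[Dict[str, str]],
    key_field: str,
    sensitive_fields: List[str],
    keep_one_in: int = 1000,
) -> Iterator[Dict[str, str]]:
    """Yield a sampled, masked copy of the input without mutating it."""
    for record in records:
        if keep_record(record[key_field], keep_one_in):
            masked = dict(record)
            for field in sensitive_fields:
                masked[field] = obfuscate(masked[field])
            yield masked
```

Sampling by key rather than by row keeps every record of a sampled key together, which helps the snapshot stay representative for joins and aggregations.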
![Page 14: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/14.jpg)
5. CI/CD pipeline
With all components ready, implementing a CI/CD pipeline is easy:
1. Develop & Experiment
2. Commit
3. Build & unit test
4. Deploy
5. Test
6. Release
Diagram elements: Development Team, Dev Sandbox, QA Environment, GitHub Flow
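The numbered steps can be read as a sequence of gates, where a failing stage stops everything after it. The sketch below shows that control flow only. The stage actions are stubs and the `run_pipeline` helper is a hypothetical name, since the slides do not show the actual CI tooling.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (success, names of the stages that completed).
    """
    completed: List[str] = []
    for name, action in stages:
        if not action():
            # Later stages never run, so a broken build cannot be deployed or released.
            return False, completed
        completed.append(name)
    return True, completed


if __name__ == "__main__":
    # Stub actions standing in for real build, deploy, and test commands.
    stages: List[Stage] = [
        ("commit", lambda: True),
        ("build & unit test", lambda: True),
        ("deploy", lambda: True),
        ("test", lambda: False),   # simulate a failing test run
        ("release", lambda: True),
    ]
    ok, done = run_pipeline(stages)
    print(ok, done)
```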
![Page 15: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/15.jpg)
6. Release Button
Diagram: Release Candidate → Release → Production (Ops/RE)
![Page 16: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/16.jpg)
Assembly Line
![Page 17: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/17.jpg)
Results
• Reduced risk and higher quality
─ No more development in production
─ Developers have sandboxes; tests run in separate environments
─ Features are deployed to production only after validation
• Increased efficiency
─ A new environment can be provisioned within 2 hours
─ Developers can freely experiment with new changes
─ No resource contention
• Reduced costs
─ No need to procure in-house hardware or manage an in-house datacenter
─ Dynamic environments save money because they run only when they are needed
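The cost argument for on-demand environments is simple arithmetic: cost scales with the hours an environment actually runs. The sketch below makes that explicit. The usage pattern (8 hours on each of 22 working days) and any rate passed in are hypothetical inputs for illustration, not figures from the project.

```python
def environment_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Cost of running `nodes` machines for `hours` at a flat hourly rate."""
    return nodes * hourly_rate_per_node * hours


HOURS_ALWAYS_ON = 24 * 30   # one 30-day month, running continuously
HOURS_ON_DEMAND = 8 * 22    # hypothetical: 8 hours on each of 22 working days


def monthly_savings(nodes: int, hourly_rate_per_node: float) -> float:
    """Difference between an always-on and an on-demand environment of the same size."""
    always_on = environment_cost(nodes, hourly_rate_per_node, HOURS_ALWAYS_ON)
    on_demand = environment_cost(nodes, hourly_rate_per_node, HOURS_ON_DEMAND)
    return always_on - on_demand
```

Under these assumed hours, the on-demand environment runs 176 of 720 hours, so it costs about a quarter of the always-on one regardless of the rate.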
![Page 18: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/18.jpg)
Enabling Technologies
Agile Software Factory
Software Engineering Assembly Line
griddynamics.com
Qubell
Enterprise DevOps Platform
qubell.com
![Page 19: DevOps for Big Data - Data 360 2014 Conference](https://reader035.vdocuments.mx/reader035/viewer/2022062614/54705038af7959c5208b478a/html5/thumbnails/19.jpg)
April 8, 2023
Thank You
Max Martynov, VP of Technology, Grid Dynamics
Victoria Livschitz, CEO and Founder, Qubell