StackStorm: If-This-Then-That for DevOps Automation
TRANSCRIPT
If-This-Then-That for DevOps Automation
@Stack_Storm http://github.com/StackStorm/st2
Dmitri Zimine @dzimine
BS BINGO
DevOps Automation
Cross-Domain Integration
Event Driven
Orchestration ChatOps Serverless
Auto Remediation
artefacts are code
IFTTT, for DevOps
cat /opt/stackstorm/packs/st2-demos/rules/demo.yaml
---
name: "sensu_crit_to_slack"
pack: "st2-demos"
description: "Post all critical alerts to the demo"
enabled: true
trigger:
  type: "sensu.event_handler"
criteria:
  trigger.check.status:
    pattern: 2
    type: "equals"
  trigger.check.name:
    pattern: "demo_.*"
    type: "matchregex"
action:
  ref: "slack.post_message"
  parameters:
    message: >
      [ALERT]{{trigger.client.name}} {{trigger.check.output}}
    channel: "#demos"
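To make the "if this" half of the rule concrete, here is a minimal Python sketch of how criteria like the ones above could be evaluated against a trigger payload. It is an illustration, not StackStorm's own implementation: the helper names (`lookup`, `rule_matches`) and the sample payloads are assumptions, and only the operator names and payload keys are taken from the rule.

```python
import re

# Hypothetical operator table; only the names "equals" and "matchregex"
# come from the rule above.
OPERATORS = {
    "equals": lambda value, pattern: value == pattern,
    "matchregex": lambda value, pattern: re.match(pattern, str(value)) is not None,
}


def lookup(payload, dotted_key):
    """Resolve a key such as 'trigger.check.status' inside a nested dict."""
    value = {"trigger": payload}
    for part in dotted_key.split("."):
        value = value[part]
    return value


def rule_matches(criteria, payload):
    """Return True only if every criterion holds for the trigger payload."""
    return all(
        OPERATORS[spec["type"]](lookup(payload, key), spec["pattern"])
        for key, spec in criteria.items()
    )


CRITERIA = {
    "trigger.check.status": {"pattern": 2, "type": "equals"},
    "trigger.check.name": {"pattern": "demo_.*", "type": "matchregex"},
}

# Made-up Sensu-style payloads for illustration.
critical_demo = {
    "check": {"status": 2, "name": "demo_disk", "output": "disk 95%"},
    "client": {"name": "web301"},
}
warning_demo = {
    "check": {"status": 1, "name": "demo_disk", "output": "disk 80%"},
    "client": {"name": "web301"},
}

print(rule_matches(CRITERIA, critical_demo))  # critical demo check: rule fires
print(rule_matches(CRITERIA, warning_demo))   # status is not 2: rule does not fire
```

Only when every criterion matches would the "then that" half run, here the `slack.post_message` action.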
Ingredients
IT Domains
Config mgmt Storage Networking Containers Cloud Infra Monitoring
Actions Sensors
Workflows Rules
Ops Support
Trigger Call
Automation Example
Automation
Engineer Service
Monitoring Incident Management
Event: “low disk on web301”
Web301 is “low disk”
Resolve known cases, fast. Is it /var/log? Clean up!
Unknown problem, need a human
Wake up, buddy. Something real is going on…
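The flow on this slide can be sketched in a few lines of Python. This is an assumed shape, not code from the talk: the event fields (`host`, `path`) and the `cleanup` and `page_human` callables are hypothetical stand-ins for a remediation action and an incident-management call.

```python
def handle_low_disk(event, cleanup, page_human):
    """Route a 'low disk' event: fix the known case, escalate the rest."""
    path = event["path"]
    # Known case from the slide: the log directory filled up.
    if path == "/var/log" or path.startswith("/var/log/"):
        cleanup(event["host"], path)
        return "remediated"
    # Unknown problem: a human is needed.
    page_human(event["host"], f"low disk on {path}: unknown cause")
    return "escalated"


# Record calls in lists instead of touching real systems.
cleaned, paged = [], []


def record_cleanup(host, path):
    cleaned.append((host, path))


def record_page(host, message):
    paged.append((host, message))


known = handle_low_disk({"host": "web301", "path": "/var/log"}, record_cleanup, record_page)
unknown = handle_low_disk({"host": "web301", "path": "/data"}, record_cleanup, record_page)
print(known, unknown)
```

The engineer is paged only for the second event; the first is resolved without waking anyone up.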
What is being automated with StackStorm
• Security checks
  – On malware detection in a VM, isolate the network port on a switch
• App blue-green deployment
  – On Jenkins tests passed, bring up a new VM cluster, deploy and configure the app, set the load balancer to send % of traffic to the new app, monitor, roll forward or back out
• Networking
  – On BGP peer goes down: collect troubleshooting data, post on Slack & create JIRA ticket
  – On link aggregation member error: check load; if the capacity of the rest of the LAG bundle is enough, disable the link with the error
• OpenStack
  – Orphan VM clean-up: on orphans detected, shut down, email owner, keep for a few days, delete
  – VM evacuation on HW failures: on host RAID failure, get list of impacted VMs, email VM owners, evacuate VMs, create JIRA ticket for hardware replacement
• Service remediation
  – Cassandra “node down” recovery: on ring node dying, deploy new node, configure, add to the ring
  – Remediating RabbitMQ, Galera cluster, MySQL, and more…
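As one example from the list, the link aggregation case hinges on a capacity check before the faulty link is disabled. The sketch below shows one way such a check might look. The function name, the per-member capacity model, and the `headroom` safety factor are all assumptions made for illustration; the slide only says "if capacity of rest of LAG bundle enough".

```python
def can_disable_link(member_loads_gbps, member_capacity_gbps, failing_member, headroom=0.8):
    """Decide whether the remaining LAG members can carry the whole load.

    member_loads_gbps: current load per member link, e.g. {"eth1": 3.0}
    member_capacity_gbps: capacity of each member (assumed equal for all)
    headroom: assumed fraction of remaining capacity we are willing to fill
    """
    remaining = [m for m in member_loads_gbps if m != failing_member]
    if not remaining:
        return False  # never disable the last link in the bundle
    total_load = sum(member_loads_gbps.values())
    usable_capacity = len(remaining) * member_capacity_gbps * headroom
    return total_load <= usable_capacity


lightly_loaded = {"eth1": 3.0, "eth2": 2.0, "eth3": 1.0}
heavily_loaded = {"eth1": 9.0, "eth2": 8.0}

print(can_disable_link(lightly_loaded, 10.0, "eth3"))  # 6 Gbps fits into 16 Gbps usable
print(can_disable_link(heavily_loaded, 10.0, "eth2"))  # 17 Gbps does not fit into 8 Gbps
```

In an automation like the one described, a check of this kind would sit between the error trigger and the action that disables the link.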
Auto Remediation
FBAR (saving 13,680 hours/day)
Naoru
Nurse
Winston (powered by StackStorm)
Azure Automation
Mistral workflow service
StackStorm automation platform
OBSERVE
ORIENT
DECIDE
ACT
“Auto-Remediation & Automation at Facebook” @ Auto-Remediation Meetup SF https://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/236704012/
43% problem fixed, 51% false positives, 94% automated
“Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona
Mirantis: Auto-remediating 2,000-node OpenStack cluster at Symantec with StackStorm
Engineer Wakes up
Logs in and ACK
Checks runbook
Studies the alert
Fixes the problem
Runs diagnostics
PagerDuty Alert
2:00 AM 2:02 AM 2:07 AM 2:10 AM 2:15 AM 2:20 AM 2:30 AM
On-call, Without Automation
Source: “Winston: Helping Netflix Engineers Sleep at Night” @ Qcon ‘16 SF https://goo.gl/lHzq4r
False Positive
Winston
2:00 AM
2:05 AM
2:05 AM
2:15 AM
Assisted Diagnostics
Fixed the problem
On-call With Winston
Source: “Winston: Helping Netflix Engineers Sleep at Night” @ Qcon ‘16 SF https://goo.gl/lHzq4r
“Innovation at Dimension Data: Optimizing Operations with Event Driven Automation” https://stackstorm.com/2016/12/15/dimension-data-devops-beyond-deployment/
© 2016 BROCADE COMMUNICATIONS SYSTEMS, INC. 24
StackStorm Enterprise
eval license
automation • 5 cloud services
• Integrated with on-prem systems
StackStorm is like …
Actions Sensors
Workflows Rules
IT Domains
Config mgmt Storage Networking Containers Cloud Infra Monitoring Ops Support
Step Functions
AWS Lambda
Serverless with Swarm for Genomic Annotation Computing Dmitri Zimine http://github.com/dzimine/serverless-swarm @dzimine
Image by Miki Yoshihito, Creative Commons license
Why not Scripts?
Workflows Better in Operations
• Simple to define, reason, visualize
• Transparent – state is clear, execution is trackable: running, complete, failed steps
• Reliable
  – Workflows are long-running
  – Crash tolerance
  – “Restart from point of failure”
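The difference from a plain script can be shown with a toy workflow runner in Python. It is only a sketch of the idea, not how StackStorm or Mistral execute workflows: `run_workflow`, the in-memory `state` dict, and the three step functions are hypothetical, and a real engine would persist the state so that it survives a crash.

```python
def run_workflow(steps, state):
    """Run steps in order, recording each step's status in `state`.

    Steps already marked 'complete' are skipped, so calling this again
    with the same state restarts from the point of failure.
    """
    for name, func in steps:
        if state.get(name) == "complete":
            continue
        state[name] = "running"
        try:
            func()
        except Exception:
            state[name] = "failed"
            return False
        state[name] = "complete"
    return True


calls = []
attempts = {"configure": 0}


def deploy():
    calls.append("deploy")


def configure():
    # Fails on the first attempt to simulate a transient error.
    attempts["configure"] += 1
    if attempts["configure"] == 1:
        raise RuntimeError("transient failure")
    calls.append("configure")


def add_to_ring():
    calls.append("add_to_ring")


STEPS = [("deploy", deploy), ("configure", configure), ("add_to_ring", add_to_ring)]

state = {}
first = run_workflow(STEPS, state)
state_after_failure = dict(state)
second = run_workflow(STEPS, state)

print(first, state_after_failure)
print(second, state)
print(calls)
```

After the first run the state shows exactly which step failed; the second run skips `deploy` and resumes at `configure`, which a linear script could not do without extra bookkeeping.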
Infrastructure as code
Case Study
• Service catalog backed by workflows
• Automate provisioning on VMware/OpenStack, 4 data centers
• Before: CPO, operator updates via GUI, click and pray, x4
• After: StackStorm, dev -> code review -> staging -> QA -> prod
The top predictor of IT performance?
Version control used by Ops, for Ops artifacts!
• Use StackStorm. Try it, find automation, nail a POC. Let us know, good & bad.
  curl -sSL https://stackstorm.com/packages/install.sh | bash -s
  docs.stackstorm.com/install
• Commit code. Become a “community maintainer”. It is not hard (weekend hack?). We help & support.
• Spread the word. Blog. Tweet. Talk. Mention. Bug. GitHub star!
Contribute! Everything counts