
Post on 26-Jul-2018


Ansible at Scale

Ansible Israel, May 9, 2016

David Melamed

Senior Research Engineer, CTO Office, CloudLock

dmelamed@cloudlock.com @dvdmelamed

Who is this guy?

4 B

Where is he working?

Founded: 2011

Corporate Headquarters: Waltham, Mass. (U.S.A.)

R&D Headquarters: Tel Aviv

Employees: 140 (30 in TLV)

Trusted by major brands:

157K APPS

10 MUSERS ACTIVITIES

01 Ansible main notions

What is Ansible?

● Open-source configuration automation tool
● Written in Python and easily extensible
● Agentless (only requires SSH / WinRM)
● Idempotent modules
● Ad hoc task execution
● Reusable lists of tasks
● Code deployment

Inventory

WEB SERVERS DAEMON SERVERS FILE SERVERS

COMPUTING CLUSTER

[webservers]
192.168.1.12
192.168.1.13
192.168.1.19

[daemonservers]
192.168.1.34
192.168.4.24

[vpc:children]
webservers
daemonservers

Static inventory

VPC

Task, play & playbook

- name: check server is alive
  action: ping

- name: update app configuration
  action: copy src=myapp.conf dest=/etc/myapp/prod.conf

...

task

play

playbook

Role

- tasks/main.yml

- handlers/main.yml

- templates/template.conf.j2

- files/file1.txt

- vars/main.yml

Vault

● Put all secrets in one place
● Store secrets in git

02 Our requirements

CloudLock requirements

● Multiple environments (AIO vs. VPC, AWS vs. AppEngine)
● Multiple environment types (local / stage / prod)
● 10 different VPCs with different access levels
● VPCs with ~100 machines of several types
● Multiple small repos (Python packages) with dependencies
● Zero-downtime deployment as much as possible

Multiple stacks & environments

Web server (Angular app)

My laptop (OSX)

Your laptop (Ubuntu)

Multi-tier env. in AWS

AIO in AWS

Multi-tier env. in AWS

LOCAL STAGE PROD

API server (Flask app)

Database (PostgreSQL or RDS)

Cache server (Redis or ElastiCache)

Message Queue (RabbitMQ)

PRE-PROD

Multi-tier env. in AWS

03 Ansible profiling

Profiling Ansible (2)

PLAY [Deploy | Ensure database and user] ***************************
Thursday 15 October 2015 09:51:01 +0000 (0:00:01.786) 0:00:12.318 ******
===============================================================================

TASK: [storage/postgresql-database | Create | Ensure database from database variable] ***
Thursday 15 October 2015 09:51:01 +0000 (0:00:00.011) 0:00:12.329 ******
ok: [sandbox]

TASK: [storage/postgresql-database | Create | Ensure database user from database.user variable] ***
Thursday 15 October 2015 09:51:01 +0000 (0:00:00.163) 0:00:12.493 ******
ok: [sandbox]

TASK: [storage/pgbouncer | Start pgBouncer] ***********************************
Thursday 15 October 2015 09:51:09 +0000 (0:00:00.242) 0:00:20.782 ******
ok: [sandbox]

TASK: [storage/pgbouncer | Bump file descriptor limits] ***********************
Thursday 15 October 2015 09:51:09 +0000 (0:00:00.177) 0:00:20.960 ******
changed: [sandbox] => (item=hard)
changed: [sandbox] => (item=soft)

...

PLAY RECAP ********************************************************************
module1 | Install | Ensure modules ------------------------------------- 13.14s
module2 | Install pgBouncer --------------------------------------------- 7.51s
module3 | Install | Clean/uninstall modules ----------------------------- 6.85s
module4 | Install | Ensure core installed ------------------------------ 4.66s
...
Thursday 15 October 2015 09:52:49 +0000 (0:00:00.023) 0:02:00.236 ******
===============================================================================
sandbox : ok=142 changed=82 unreachable=0 failed=0
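
The per-task timings in the recap above are what point at the slow spots. As a minimal sketch (not part of the original talk), the following Python snippet ranks tasks from timing lines in that format. The function name `slowest_tasks` and the regular expression are illustrative assumptions based only on the sample lines shown here.

```python
import re

# Matches timing lines such as "name ------- 13.14s".
# The pattern is an assumption derived from the sample recap, not an official format.
TIMING_LINE = re.compile(r"^(?P<task>.+?)\s+-{2,}\s+(?P<seconds>\d+(?:\.\d+)?)s$")


def slowest_tasks(recap_text, limit=3):
    """Return (task, seconds) pairs sorted from slowest to fastest."""
    timings = []
    for line in recap_text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            timings.append((match.group("task"), float(match.group("seconds"))))
    timings.sort(key=lambda pair: pair[1], reverse=True)
    return timings[:limit]


# Timing lines copied from the recap shown in the slides.
SAMPLE = """\
module1 | Install | Ensure modules ------------------------------------- 13.14s
module2 | Install pgBouncer --------------------------------------------- 7.51s
module3 | Install | Clean/uninstall modules ----------------------------- 6.85s
module4 | Install | Ensure core installed ------------------------------ 4.66s
"""

if __name__ == "__main__":
    for task, seconds in slowest_tasks(SAMPLE, limit=2):
        print(f"{seconds:6.2f}s  {task}")
```

Lines that do not look like timing lines (timestamps, separators, host summaries) are skipped, so the whole recap can be passed in unchanged.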

04 Tips for scale support

(faster & easier to maintain)

Factors impacting ansible speed

● SSH connection
● Facts gathering
● Tasks performed serially
● Redundant tasks

Improving SSH speed

● Persistent connection (default on for SSH)
  ○ ControlMaster=auto
  ○ ControlPersist=60s
● SSH pipelining (1 connection per task)
  ○ Requires disabling requiretty

Ansible configuration

● Commit your ansible.cfg
● Control facts gathering (gathering)
  ○ implicit (default) - always discover the facts
  ○ explicit - do not gather facts unless the play asks for them
  ○ smart - use the facts cache, discover facts only for new hosts
● Control the number of parallel processes (forks)
  ○ default is 5
  ○ we use 25
● SSH args / SSH pipelining
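
Since the configuration file is committed, the settings listed above can be checked automatically. The sketch below is an illustration, not something shown in the talk: it parses an `ansible.cfg` with Python's `configparser` and reports the speed-related values. The key names (`gathering`, `forks` under `[defaults]`; `pipelining`, `ssh_args` under `[ssh_connection]`) are standard Ansible settings. The sample values follow the slides, while `read_speed_settings` is a hypothetical helper name.

```python
import configparser

# Sample ansible.cfg reflecting the values mentioned in the slides.
SAMPLE_CFG = """\
[defaults]
gathering = smart
forks = 25

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
"""


def read_speed_settings(cfg_text):
    """Extract the speed-related settings, falling back to the defaults named in the slides."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        "gathering": parser.get("defaults", "gathering", fallback="implicit"),
        "forks": parser.getint("defaults", "forks", fallback=5),
        "pipelining": parser.getboolean("ssh_connection", "pipelining", fallback=False),
        "ssh_args": parser.get("ssh_connection", "ssh_args", fallback=""),
    }


if __name__ == "__main__":
    for key, value in read_speed_settings(SAMPLE_CFG).items():
        print(f"{key} = {value}")
```

A check like this could run in CI so that a change to forks or pipelining shows up in review rather than as a slower deployment.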

Inventory

● Make your Ansible code environment agnostic
● Group machines by environment or by “role” type
● Hierarchical inventory
● Vault per environment
● Dynamic inventory for better cloud support
● Use a dedicated machine to deploy (ansible-workstation)

CloudLock static inventory overview

inventory/
|
|---- environments
|     |----- allinone
|     |----- beta
|     |----- demo
|     |----- dev1
|     |----- dev2
|     |----- qa1
|     |----- qa2
|
|---- group_vars
      |----- allinone/
      |----- beta/
      |----- demo/
      |----- dev1/
      |----- dev2/
      |----- qa1/
      |----- qa2/

+ use of route53 for internal DNS

EC2 dynamic inventory

● Python script using boto
● List of instances + hostvars
● Use instance names or IPs
● Groups by instance tags, vpc, …
● List cached
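
The grouping step of such a script can be sketched as follows. This is a simplified illustration, not the actual EC2 inventory script: the instance records are hard-coded hypothetical data (a real script would fetch them with boto), the IPs come from a documentation-only range, and `build_inventory` and `to_group_name` are made-up names. The `_meta` / `hostvars` layout is the JSON structure Ansible expects from a dynamic inventory script.

```python
import json
import re
from collections import defaultdict


def to_group_name(text):
    """Replace characters outside [A-Za-z0-9_] so a tag becomes a safe group name."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text)


def build_inventory(instances):
    """Group hosts under "ec2" and under one tag_<key>_<value> group per tag."""
    groups = defaultdict(list)
    hostvars = {}
    for instance in instances:
        host = instance["ip"]
        groups["ec2"].append(host)
        for key, value in sorted(instance["tags"].items()):
            groups[to_group_name(f"tag_{key}_{value}")].append(host)
        hostvars[host] = {"ec2_tags": instance["tags"]}
    inventory = dict(groups)
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory


# Hypothetical instance records standing in for the result of an EC2 API call.
SAMPLE_INSTANCES = [
    {"ip": "203.0.113.10", "tags": {"Name": "prod-bastion", "Environment": "prod"}},
    {"ip": "203.0.113.11", "tags": {"Name": "devpi", "Environment": "prod"}},
]

if __name__ == "__main__":
    # Ansible runs an inventory script with --list and reads JSON from stdout.
    print(json.dumps(build_inventory(SAMPLE_INSTANCES), indent=2))
```

Caching the generated JSON for a few minutes, as the last bullet suggests, avoids querying the cloud API on every run.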

"ec2": [
  "52….",
  "52….",
  "52…."
],
"tag_Environment_prod": [
  "52….",
  "52…..",
  "54….."
],
"tag_Name_prod_bastion": [
  "54…."
],
"tag_Name_Report_Decryptor": [
  "52….."
],
"tag_Name_devpi": [
  "52….."
]

Playbooks

● Tasks executed synchronously
  ○ Segment roles/groups to leverage parallel forks
● Use tags to add modularity (e.g. config, deploy…)
● Name each task
● Limit conditional execution in roles, put them in the playbooks instead
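
To see why the number of forks matters for synchronous tasks, here is a rough back-of-the-envelope model (an assumption for illustration, not a measurement from the talk): each task is run across the targeted hosts in batches of at most `forks` hosts, so the number of batches per task is the host count divided by forks, rounded up. The figure of 100 hosts comes from the "~100 machines" requirement earlier, and `batches_per_task` is a hypothetical helper.

```python
import math


def batches_per_task(host_count, forks):
    """Number of sequential batches needed to run one task on all hosts."""
    return math.ceil(host_count / forks)


if __name__ == "__main__":
    # Compare the default fork count with the value used in the slides.
    for forks in (5, 25):
        print(f"forks={forks}: {batches_per_task(100, forks)} batches per task")
```

Under this model, going from 5 to 25 forks cuts the batches per task on 100 hosts from 20 to 4. Real timings also depend on how long each host takes and on the control machine's capacity.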

Tasks & Roles

● Make your roles generic and simple
● Roles should be decoupled from the inventory
● Keep your configuration separate
● Tasks should be idempotent
● Use “include” for sub-roles
● Try to avoid redundant tasks (use AMI)
● Share handlers with a global role
● Avoid command and shell; use the appropriate modules instead

- roles/

ci/

jenkins/

jobs/

monitor/

cloudwatch/

nagios/

platform/

base/

component-a/

component-b/

events/

setup/

teardown/

system/

web/

Vault

● Encrypt only what is necessary
● No way to merge 2 encrypted files
● Several tools to improve vault management
  ○ https://github.com/building5/ansible-vault-tools
  ○ https://gist.github.com/benzado/7bf5aa15e15d2d0d0380

ansible-playbook vs ansible-pull

● Regular mode: connect to server and deploy
● “Pull” mode: pull from repo on remote and execute
● Syntax: ansible-pull -U git://github.com/REPO.git -d DEST_DIR
● Example of cron install using ansible: https://github.com/ansible/ansible-examples/blob/master/language_features/ansible_pull.yml

CI for Ansible

● Test locally with vagrant / docker
● PR reviews (issue with vault changes)
● Jenkins job deploying to AIO + github hook
● Coming soon: unit tests (ansible-kitchen)

Ansible 1.9 vs. Ansible 2.0

● Some breaking changes
● A lot of new cloud modules (e.g. ECS, VPC)

Results

● Before: deployment to VPC took several hours
● After: ~20 min for a full deployment

More about Ansible

● Awesome Ansible: https://github.com/jdauphant/awesome-ansible
● Ansible for DevOps: https://leanpub.com/ansible-for-devops

CloudLock is looking for talent

Questions/feedback
