Scaling Up Lookout
R. Tyler Croy (github.com/rtyler)


DESCRIPTION

Scaling Up Lookout was originally presented at Lookout's Scaling for Mobile event on July 25, 2013. R. Tyler Croy is a Senior Software Engineer at Lookout, Inc. Lookout has grown immensely in the last year. We've doubled the size of the company, adding more than 80 engineers to the team; we support 45+ million users, have over 1,000 machines in production, and see over 125,000 QPS and more than 2.6 billion requests per month. Our analysts use Hadoop, Hive, and MySQL to interactively manipulate multibillion-row tables. With that, there are bound to be some growing pains and lessons learned.

TRANSCRIPT

Page 1: Scaling Up Lookout

Scaling Up Lookout

R. Tyler Croy (github.com/rtyler)

Page 2: Scaling Up Lookout

Hello everybody, welcome to Lookout! I'm excited to be up here talking about one of my favorite subjects, scaling.

Not just scaling in a technical sense, but scaling *everything*. Scaling people, scaling projects, scaling services, scaling hardware, everything needs to scale up as your company grows, and I'm going to talk about what we've been doing here.

First, I should talk about ->

Page 3: Scaling Up Lookout

this guy

Page 4: Scaling Up Lookout

Who I am.

- I've spoken a lot before about continuous deployment and automation, generally via Jenkins. As part of the Jenkins community, I help run the project infrastructure and pitch in as the marketing events coordinator, cheerleader, blogger, and anything else that Kohsuke (the founder) doesn't want to do.

Prior to Lookout, I worked almost entirely on consumer web applications: not in a controllers-and-views sense, but rather building out backend services and APIs to help handle growth.

At Lookout, I've worked a lot on the Platform and Infrastructure team, before being promoted, or demoted depending on how you look at it, to Engineering Lead for ->

Page 5: Scaling Up Lookout

OMG Serious Business

Page 6: Scaling Up Lookout

The Lookout for Business team

I could easily talk for over 30 minutes about some of the challenges that building business products presents, but suffice it to say, it's chock full of tough problems to be solved.

Not many companies grow to the point where they're building out multiple product lines and revenue streams, but at Lookout we've now got Consumer, Data Platform, and now Business projects underway.

It's pretty exciting, but not what I want to talk about.

Let's start by ->

Page 7: Scaling Up Lookout

Let's travel back in time

Page 8: Scaling Up Lookout

Talking about the past at Lookout. I've been here for a couple years now, so my timeline starts in ->

Page 9: Scaling Up Lookout

2011

Page 10: Scaling Up Lookout

2011

In the olden days, we did things pretty differently, in almost all aspects. I joined as the sixth member of the server engineering team, a group that has 20-30 engineers today.

-> Coming in with a background in continuous deployment, the first thing that caught my eye was

Page 11: Scaling Up Lookout

release process

Page 12: Scaling Up Lookout

Our release process was like running a gauntlet every couple weeks, and maybe we'd ship at the end of those two weeks, maybe not. It was terribly error-prone and really wasn't that great.

James ran the numbers for me at one point, and during this time-period we were experiencing a "successful" deployment rate of ->

Page 13: Scaling Up Lookout

36% of deployments failed

Page 14: Scaling Up Lookout

This means that a third of the time, when we tried to deploy code into production, something would go wrong and we would have to roll back the deploy and figure out what had happened.

Unfortunately, since it took us two or more weeks to get the release out, we had on average ->

Page 15: Scaling Up Lookout

68 commits per deployment

Page 16: Scaling Up Lookout

68 commits per deployment, so one or more commits out of 68 could have caused the failure.

After a rollback, we'd have to sift through all those commits and find the bug, fix it and then re-deploy.

Because of this ->

Page 17: Scaling Up Lookout

62% of deployments slipped

Page 18: Scaling Up Lookout

About 2/3rds of our deployments slipped their planned deployment dates. As an engineering organization, we couldn't really tell the product owner when changes were going to be live for customers with *any* confidence!

Page 19: Scaling Up Lookout
Page 20: Scaling Up Lookout

There were a myriad of reasons for these problems, including:

- lack of test automation (tests existed, but they weren't running reliably; we were using Bitten with practically zero developer feedback)
- a painful deployment process

To make things more difficult, all our back-end application code was in a ->

Page 21: Scaling Up Lookout

monorails

Page 22: Scaling Up Lookout

A monolithic Rails application. It served its purpose while the company was bootstrapping itself, but it was starting to show its age and prove challenging with more and more developers interacting with the repository.

Page 23: Scaling Up Lookout
Page 24: Scaling Up Lookout

The team was at an interesting juncture during this time: problems with the way things were done were readily acknowledged, but the bandwidth and buy-in to fix them were difficult to come by.

I think every startup that grows from 20 to 100 people goes through this phase, where it is in denial about its own growing pains.

As more people joined the team, though, we pushed past the denial and started working on ->

Page 25: Scaling Up Lookout

Scaling the Workflow

Page 26: Scaling Up Lookout

Scaling the workflow. Our two-ish week release cycle was first on the chopping block; we started with what became known as the ->

Page 27: Scaling Up Lookout

The Burgess Challenge

Page 28: Scaling Up Lookout

The Burgess Challenge. While having beers one night with James and Dave, the server team lead, James asked if we could fix our release process and get from two-ish-week deployments to *daily* deployments, in ->

Page 29: Scaling Up Lookout

60 days

Page 30: Scaling Up Lookout

60 days. This was right at the end of the year; with Thanksgiving and Christmas breaks coming up, we had some slack in the product pipeline, so we decided to take the project on and enter 2012 a different engineering org than the one that had left 2011.

We started the process by bringing in some ->

Page 31: Scaling Up Lookout

New Tools

Page 32: Scaling Up Lookout

New tools, starting with ->

Page 33: Scaling Up Lookout

JIRA

Page 34: Scaling Up Lookout

JIRA. While I could rant about how much I hate JIRA, I think it's a better tool than Pivotal Tracker was for us. Pivotal Tracker worked well when the team and the backlog were much smaller, and less interdependent, than they were in late 2011.

Another tool we introduced was ->

Page 35: Scaling Up Lookout

Jenkins

Page 36: Scaling Up Lookout

Jenkins. It took a significant amount of work just to get tests passing *consistently* in Jenkins, and it was a big change in developer feedback on test runs compared to what we had before.

We also moved our code from Subversion into ->

Page 37: Scaling Up Lookout

Git + Gerrit

Page 38: Scaling Up Lookout

Git and Gerrit, Gerrit being a fantastic Git-based code-review tool. At the time the security team was already using GitHub:Firewall for their work. We discussed at great length whether the vanilla GitHub branch, pull request, merge process would be sufficient for our needs and whether or not a "second tool" like Gerrit would provide any value.

I could, and have in the past, given entire presentations on the benefits of the Gerrit-based workflow, so I'll try to condense as much as possible into this slide of our new code workflow ->

Page 39: Scaling Up Lookout
Page 40: Scaling Up Lookout

The new workflow: changes are pushed to Gerrit as small, reviewable commits and verified before they merge, in contrast to the previous SVN-based process with its giant commits and loose reviews.
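To make that concrete, here is a hypothetical sketch of how a change travels in a Gerrit-style flow. The Rake task and the "gerrit" remote name are made up for illustration; the only standard piece is that pushing to refs/for/<branch> asks Gerrit to open a review instead of updating the branch directly.

    # Hypothetical Rakefile helper wrapping Gerrit's push-for-review convention.
    # Nothing lands on the real branch until the change has been reviewed
    # (and, as described below, verified by Jenkins).
    desc 'Push the current HEAD to Gerrit for review (illustrative sketch)'
    task :review, [:branch] do |_t, args|
      branch = args[:branch] || 'master'
      # 'gerrit' is an assumed remote name pointing at the Gerrit server
      sh "git push gerrit HEAD:refs/for/#{branch}"
    end

Reviewers then score the change in Gerrit, and only once it is approved does it get merged to master.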

Page 41: Scaling Up Lookout
Page 42: Scaling Up Lookout

With Jenkins in the mix, our fancy Gerrit workflow had the added value of ensuring all our commits passed tests before even entering the main tree.

We were doing a much better job of consistently getting higher-quality code into the repository, but we still couldn't get it to production easily.

Next on the fix-it-list was ->

Page 43: Scaling Up Lookout

The Release Process

Page 44: Scaling Up Lookout

The release process itself.

At the time, our release process was a mix of manual steps and Capistrano tasks (a sketch of that style of task follows below). We moved toward:

- Automation through Jenkins
- Consistency with stages (no more update_faithful)
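For context on what "a mix of manual steps and Capistrano tasks" looked like, here is a minimal sketch of a Capistrano 2-era task; the application name, paths, and restart mechanism are illustrative assumptions, not our real deploy code. The shift was to have Jenkins drive this process consistently instead of an engineer running steps like these by hand.

    # config/deploy.rb (sketch): a Capistrano 2-style task of the kind that
    # made up the old, hand-run release process. Names and paths are placeholders.
    set :application, 'example-app'
    set :deploy_to,   '/srv/example-app'

    namespace :deploy do
      desc 'Restart the application servers once the new release is current'
      task :restart, :roles => :app do
        # Passenger-style restart: touching restart.txt makes the app server
        # reload the code on the next request.
        run "touch #{current_path}/tmp/restart.txt"
      end
    end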

We've managed to change the entire engineering organization such that ->

Page 45: Scaling Up Lookout

2% of deployments failed

Page 46: Scaling Up Lookout

14 commits per deployment

Page 47: Scaling Up Lookout

3% of deployments slipped

Page 48: Scaling Up Lookout

neat

Page 49: Scaling Up Lookout

Automating Internal Tooling

Page 50: Scaling Up Lookout

- Introducing OpenStack to provide developer-accessible internal VM management
- Managing Jenkins build slaves via Puppet
- Introduction of MI Stages

Page 51: Scaling Up Lookout

OpenStack

Page 52: Scaling Up Lookout

If you're going to use a pre-tested commit workflow with an active engineering organization such as ours, make sure you plan ahead and have plenty of hardware, or virtualized hardware, for Jenkins.

We've started to invest in OpenStack infrastructure and the jclouds plugin for provisioning hosts to run all our jobs on.
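To show the shape of the OpenStack calls involved in spinning up a build machine, here is a rough Ruby sketch using the fog gem. This is a stand-in for illustration only: the real provisioning goes through the jclouds Jenkins plugin, and the credentials, image, and flavor IDs below are assumed environment variables.

    require 'fog'

    # Illustrative only: provision one build slave against an OpenStack endpoint.
    compute = Fog::Compute.new(
      :provider           => 'OpenStack',
      :openstack_username => ENV['OS_USERNAME'],
      :openstack_api_key  => ENV['OS_PASSWORD'],
      :openstack_auth_url => ENV['OS_AUTH_URL'],
      :openstack_tenant   => ENV['OS_TENANT_NAME']
    )

    slave = compute.servers.create(
      :name       => "build-slave-#{Time.now.to_i}",
      :image_ref  => ENV['SLAVE_IMAGE_ID'],   # assumed base image with the build toolchain
      :flavor_ref => ENV['SLAVE_FLAVOR_ID'],
      :key_name   => 'jenkins'
    )

    slave.wait_for { ready? }
    puts "provisioned #{slave.name} at #{slave.addresses.inspect}"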

With over 100 build slaves now, we also had to make sure we had ->

Page 53: Scaling Up Lookout

Automated Build Slaves

Page 54: Scaling Up Lookout

Automated the management of those build slaves; nobody has time to hand-craft hundreds of machines and ensure that they're consistent. Additionally, we didn't want to waste developer time playing the "it's probably the machine's fault" game every time a test failed.

Page 55: Scaling Up Lookout

Per-Developer Test Instances

Page 56: Scaling Up Lookout

Scaling the People

Page 57: Scaling Up Lookout

Not much to say here; every company is going to be different, but you can't just ignore that there are social and cultural challenges in taking a small engineering team and growing it to 100+ people.

Page 58: Scaling Up Lookout
Page 59: Scaling Up Lookout

- Transition from talking about the workflow to the tech stack

Page 60: Scaling Up Lookout

Scaling the Tech Stack

Page 61: Scaling Up Lookout

With regard to scaling the technical stack, I'm not going to spend too much time on this, since the other people here tonight will speak to it in more detail than I probably should get into, but there are some major highlights from a server engineering standpoint.

Starting with the databases ->

Page 62: Scaling Up Lookout

Shard the Love

Page 63: Scaling Up Lookout

- Global Derpbase woes
- Moving more and more data out of non-sharded tables
- Experimenting with various connection pooling mechanisms
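To make "sharded versus non-sharded" concrete, here is a toy illustration of modulus-based shard routing in Ruby. The shard count, hostnames, and routing by user id are assumptions for illustration, not a description of how our tables are actually partitioned.

    # Toy shard router: map a user id onto one of N logical database shards.
    # Real setups also have to handle resharding, replicas, and connection
    # pooling; this only shows the routing idea.
    class ShardRouter
      def initialize(shard_configs)
        @shards = shard_configs  # e.g. an array of connection settings hashes
      end

      def shard_for(user_id)
        @shards[user_id % @shards.size]
      end
    end

    router = ShardRouter.new([
      { :host => 'db-shard-0.internal', :database => 'app_shard_0' },
      { :host => 'db-shard-1.internal', :database => 'app_shard_1' },
      { :host => 'db-shard-2.internal', :database => 'app_shard_2' },
    ])

    router.shard_for(42)  # => { :host => 'db-shard-0.internal', :database => 'app_shard_0' }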

Page 64: Scaling Up Lookout

Undoing Rails Katamari

Page 65: Scaling Up Lookout

- Diagnosing a big ball of mud
- Migrating code onto the first service (Pushcart)
- Slowly extracting more and more code from the monorail, a project which is ongoing
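One common pattern when peeling a service out of a monolith is to hide the new service behind a small client class in the old codebase, so callers don't care whether the logic is local or remote. This is a hypothetical sketch of that pattern; the class name, URL, endpoint, and payload are made up and are not Pushcart's real interface.

    require 'net/http'
    require 'json'
    require 'uri'

    # Hypothetical client wrapper: the monolith calls this instead of the code
    # that used to live in-process, so the extraction is invisible to callers.
    class PushServiceClient
      def initialize(base_url)
        @base = URI(base_url)
      end

      # POST a notification request to the extracted service, return parsed JSON.
      def enqueue(device_id, message)
        uri = URI.join(@base.to_s, '/notifications')
        response = Net::HTTP.post(
          uri,
          { :device_id => device_id, :message => message }.to_json,
          'Content-Type' => 'application/json'
        )
        JSON.parse(response.body)
      end
    end

    client = PushServiceClient.new('http://pushcart.internal.example')
    client.enqueue('abc123', 'locate device')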

Page 66: Scaling Up Lookout

Modern JavaScript

Page 67: Scaling Up Lookout

I never thought this would have a big impact on scaling the technical stack, but modernizing our front-end applications has helped tremendously.

The JavaScript community has changed dramatically since the company was founded; the ecosystem is much more mature and the web in general has changed.

By rebuilding front-end code as single-page JavaScript applications (read: Backbone, etc.), we are able to reduce complexity considerably on the backend by turning everything into more or less JSON API services.
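The backend half of that split ends up looking like thin JSON endpoints. Here is a minimal sketch of the kind of Rails-style controller this leads to; the controller, model, and current_user helper are hypothetical, not our actual code.

    # Hypothetical Rails controller: once the front end is a single-page
    # Backbone app, the server side mostly just serves and accepts JSON.
    class DevicesController < ApplicationController
      # GET /devices.json : list the current user's devices
      def index
        render :json => current_user.devices
      end

      # PUT /devices/:id.json : update one device from a JSON payload
      def update
        device = current_user.devices.find(params[:id])
        if device.update_attributes(params[:device])
          render :json => device
        else
          render :json => { :errors => device.errors }, :status => :unprocessable_entity
        end
      end
    end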

Page 68: Scaling Up Lookout
Page 69: Scaling Up Lookout

Infinity and Beyond

Page 70: Scaling Up Lookout

The future at Lookout is going to be very interesting, both technically and otherwise.

On the technical side of things, we're seeing more of a ->

Page 71: Scaling Up Lookout

Diversifying the technical portfolio

Page 72: Scaling Up Lookout

Diversified technical portfolio. Before the year is out, we'll have services running in Java, Ruby, and even Node.

To support more varied services, we're getting much more friendly ->

Page 73: Scaling Up Lookout

Hello JVM

Page 74: Scaling Up Lookout

with the JVM, via JRuby or other JVM-based languages. More things are being developed for and deployed on top of the JVM, which offers some interesting opportunities to change our workflow further, with things like:

- Remote debugging
- Live profiling
- Better parallelism
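As a small example of the "better parallelism" point: under JRuby, Ruby threads are real JVM threads with no global interpreter lock, so CPU-bound work can actually run on multiple cores. This toy sketch is plain Ruby, but it only parallelizes effectively when run under JRuby.

    # Toy CPU-bound workload split across threads. Under JRuby these are native
    # JVM threads and run in parallel; under MRI the GIL serializes them.
    def count_primes(range)
      range.count do |n|
        n > 1 && (2..Math.sqrt(n)).none? { |d| (n % d).zero? }
      end
    end

    threads = 4.times.map do |i|
      Thread.new { count_primes((i * 250_000 + 1)..((i + 1) * 250_000)) }
    end

    total = threads.map(&:value).reduce(:+)
    puts "primes up to 1,000,000: #{total}"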

Page 75: Scaling Up Lookout
Page 76: Scaling Up Lookout

With an increasingly diverse technical stack and a stratified services architecture, we're going to be faced with the technical and organizational challenges of operating ->

Page 77: Scaling Up Lookout

100 services

Page 78: Scaling Up Lookout

100 services at once.

When the team which owns a service is across the office, or across the country, what does that mean for clearly expressing service dependencies, contracts, and interactions on an ongoing basis?

With all these services floating around, how do we maintain our ->

Page 79: Scaling Up Lookout

Institutional Knowledge

Page 80: Scaling Up Lookout

Institutional knowledge amongst the engineering team.

Growth means the size of our infrastructure exceeds the capacity of any single engineer to understand each component in detail.

Page 81: Scaling Up Lookout
Page 82: Scaling Up Lookout

We're not alone in this adventure; we have much to learn from companies like Amazon or Netflix, who have traveled this path before.

I wish I could say that the hard work is over, and that it's just smooth sailing and printing money from here on out, but that's not true.

There's still a lot of hard work to be done, and difficult problems to talk about as we move into a much more service-oriented, and multi-product architecture.

I'd like to ->

Page 83: Scaling Up Lookout

Thank you

Page 84: Scaling Up Lookout

Thank you for your time. If you have any questions for me, I'll be sticking around afterwards.

Thank you