algolia's fury road to a worldwide api - take off conference 2016

ALGOLIA’S FURY ROAD TO A WORLDWIDE API

Build Unique Search Experiences

Olivier Lance Solutions Engineer

@olance

Take Off Conference 2016

@algolia

A hosted search API that focuses on Developer and User Experience

A hosted search API

With intuitive relevance

From anywhere

Replies in milliseconds

Algolia Today

15 regions 47 data centers

2000+ customers in 100+ countries

30B+ Write operations per month

15B+ User-generated queries per month

.1 March 2013 High Availability was designed… but not implemented

A single machine in 2 different locations: Canada/East and Europe/West

Focus on performance, searching over indexing

First customer in production ⚠

RAM: 32GB Proc: 4 cores, 3.4-3.8 GHz SSD: 2x 120 GB Raid-0 (Intel 320)

.2 June 2013 Implementation of high availability in our architecture

3 machines with a consensus on write… but in the same data center

API clients handled automatic retries in case of error:
APPID-1.algolia.io, APPID-2.algolia.io, APPID-3.algolia.io

RAM: 64GB Proc: 6 cores, 3.2-3.8 GHz SSD: 2x 300 GB Raid-0 (Intel 320)

.3 August 2013 Official launch of the service

Two locations: Europe/West and Canada/East

Same provider but different network equipment and power units (cheap multi-AZ)

10 API clients, developed manually (HTTPS keep-alive, using TLS correctly, retry strategy…)

RAM: 128GB Proc: 8 cores, 3.1-3.8 GHz SSD: 2x 300 GB Raid-0 (Intel S3500)

.4 January 2014 Deployment is a big risk for high availability

Agile development, 6000+ unit tests, 200+ non-regression tests… But no instant rollback!

Result: 8 minutes of indexing downtime ☂

From then on:
• start with test clusters
• instant rollback
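The "test clusters first, instant rollback" rule can be sketched as a small rollout loop. This is only an illustration, not Algolia's deployment tooling: `staged_rollout`, the `deploy` and `healthy` callbacks and the cluster names are all hypothetical.

```python
from typing import Callable, List


def staged_rollout(
    version: str,
    previous: str,
    stages: List[str],
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Deploy `version` cluster by cluster, test clusters listed first.

    If a cluster is unhealthy after the deploy, every cluster touched so far
    is put back on `previous` right away and the rollout stops.
    """
    done: List[str] = []
    for cluster in stages:
        deploy(cluster, version)
        done.append(cluster)
        if not healthy(cluster):
            # Instant rollback: newest deployment is reverted first.
            for touched in reversed(done):
                deploy(touched, previous)
            return False
    return True


if __name__ == "__main__":
    state = {}
    ok = staged_rollout(
        "v2", "v1", ["test-1", "prod-1"],
        deploy=lambda cluster, v: state.__setitem__(cluster, v),
        healthy=lambda cluster: cluster != "test-1",  # simulate a bad build
    )
    print(ok, state)
```

Because the test cluster comes first in `stages`, a bad build is caught and reverted before any production cluster is touched.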

.5 October 2014 Automation via Chef

Significant increase in managed machines

Shell Scripts -> Chef

Automation is great but s**t happens… A typo in a cookbook nearly broke our prod!

From then on: 2 versions of the cookbooks deployed to different servers of the same cluster

.6 November 2014 DNS is a SPOF in the architecture

Service was intermittently slow in Asia… Culprit = .io TLD

Migration to .net TLD and a new DNS provider

Extensive testing but… nothing goes as planned!

☁ Black Thursday ☁ (see http://bit.ly/algoliablackthursday)

.7 February 2015 Launch of our synchronized worldwide infrastructure

8 new regions! Low latency everywhere with automatic replication

12 regions

Distributed Search Network - Worldwide Synchronization

.8 March 2015 Better high availability per region

Spread our US clusters across two completely different providers
• 2 different data centers in close locations (24 miles, 1ms latency)
• 3 different machines
• 2 completely different autonomous systems

.9 May 2015 Introducing several DNS providers

Retry strategy in API clients, again!
1. APPID-dsn.algolia.net
2. Retry randomly: APPID-1.algolianet.com, APPID-2.algolianet.com, APPID-3.algolianet.com
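A minimal sketch of that retry order, assuming the client tries the DSN host first and then the three fallback hosts in random order. The host names come from the slide; `build_hosts`, `request_with_retry` and the `send` callback are hypothetical names, not the real API clients' interface.

```python
import random
from typing import Callable, List, Optional


def build_hosts(app_id: str, rng: Optional[random.Random] = None) -> List[str]:
    """Return the DSN host first, then the three fallback hosts shuffled."""
    rng = rng or random.Random()
    fallbacks = [f"{app_id}-{i}.algolianet.com" for i in (1, 2, 3)]
    rng.shuffle(fallbacks)  # spread retries across the fallback hosts
    return [f"{app_id}-dsn.algolia.net"] + fallbacks


def request_with_retry(
    app_id: str,
    send: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> str:
    """Call `send(host)` on each host in turn until one succeeds.

    `send` is expected to raise OSError (or a subclass such as
    ConnectionError or TimeoutError) when a host cannot be reached.
    """
    last_error: Optional[Exception] = None
    for host in build_hosts(app_id, rng):
        try:
            return send(host)
        except OSError as exc:
            last_error = exc  # remember the failure, move to the next host
    raise ConnectionError(f"all hosts failed for {app_id}") from last_error
```

The two domains sit on different TLDs (.net and .com), so a failure affecting `algolia.net` still leaves the `algolianet.com` hosts resolvable.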

.10 July 2015 Three completely independent providers per cluster

With 2 providers we could still lose indexing

Clusters spanning multiple data centers, autonomous systems and upstream providers.
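Why two providers are not enough can be checked with a little arithmetic. The sketch below assumes that indexing (writes) needs a majority of the cluster's machines to agree; the slides only say "consensus on write", so the majority rule, the function name and the provider labels are assumptions made for illustration.

```python
from typing import Dict


def survives_any_provider_outage(machines_per_provider: Dict[str, int]) -> bool:
    """True if a write majority remains after losing any single provider."""
    total = sum(machines_per_provider.values())
    majority = total // 2 + 1  # e.g. 2 out of 3 machines
    return all(
        total - count >= majority
        for count in machines_per_provider.values()
    )


# 3 machines on 2 providers: one provider necessarily hosts 2 of them.
print(survives_any_provider_outage({"provider_a": 2, "provider_b": 1}))  # False

# 3 machines on 3 providers: any single outage still leaves 2 machines.
print(survives_any_provider_outage(
    {"provider_a": 1, "provider_b": 1, "provider_c": 1}
))  # True
```

Under that assumption, search can still be served by the surviving machine in the two-provider layout, but indexing stops until the larger provider comes back.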

.11 April 2016 Finer grained monitoring

Our monitoring was at the minute granularity (with ServerDensity)

Moved to Wavefront to enable drilling down at the second level (on demand)

500 metrics/server monitored

.12 September 2016 Algolia Vault

Algolia’s response to security challenges of larger organizations

Restrict API access to specific IP addresses to get your own « private cloud »

Encryption at rest for sensitive data, in addition to encrypting all communications
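The IP restriction idea can be illustrated with Python's standard `ipaddress` module. This is a generic allowlist check, not how Algolia Vault is implemented; `is_allowed` and the example networks (taken from the documentation address ranges) are hypothetical.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_networks: Iterable[str]) -> bool:
    """Return True if `client_ip` falls inside one of the allowed networks.

    Networks are given in CIDR notation; a bare address counts as a
    single-host network.
    """
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(network)
        for network in allowed_networks
    )


allowlist = ["203.0.113.0/24", "198.51.100.7"]
print(is_allowed("203.0.113.42", allowlist))  # True
print(is_allowed("192.0.2.1", allowlist))     # False
```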

Design early

Do not over-engineer

Focus on execution

Building an HA architecture takes time

THANK YOU! QUESTIONS?

Full version on Medium http://bit.ly/algoliafuryroad
