algolia's fury road to a worldwide api - take off conference 2016
ALGOLIA’S FURY ROAD TO A WORLDWIDE API
Build Unique Search Experiences
Olivier Lance Solutions Engineer
@olance
Take Off Conference 2016
@algolia
Algolia Today
15 regions 47 data centers
2000+ customers in 100+ countries
30B+ Write operations per month
15B+ User-generated queries per month
.1 March 2013 High Availability was designed… but not implemented
A single machine in 2 different locations: Canada/East and Europe/West
Focus on performance, searching over indexing
First customer in prod ⚠
RAM: 32GB Proc: 4 cores, 3.4-3.8 GHz SSD: 2x 120 GB Raid-0 (Intel 320)
.2 June 2013 Implementation of high availability in our architecture
3 machines with a consensus on write… but in the same data center
API clients handled automatic retries in case of error
APPID-1.algolia.io, APPID-2.algolia.io, APPID-3.algolia.io
RAM: 64GB Proc: 6 cores, 3.2-3.8 GHz SSD: 2x 300 GB Raid-0 (Intel 320)
.3 August 2013 Official launch of the service
Two locations: Europe/West and Canada/East
Same provider but different network equipment and power units (cheap multi-AZ)
10 API clients, developed manually (HTTPS keep-alive, using TLS correctly, retry strategy…)
RAM: 128GB Proc: 8 cores, 3.1-3.8 GHz SSD: 2x 300 GB Raid-0 (Intel S3500)
.4 January 2014 Deployment is a big risk for high availability
Agile development, 6000+ unit tests, 200+ non-regression tests… But no instant rollback!
Result: 8 minutes of indexing downtime ☂
From then on:
• start with test clusters
• instant rollback
.5 October 2014 Automation via Chef
Significant increase in managed machines
Shell scripts -> Chef
Automation is great but s**t happens…
A typo in a cookbook nearly broke our prod!
From then on: 2 versions of the cookbooks deployed to different servers of the same cluster
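The slide does not say how the two cookbook versions were split across servers. As a minimal sketch, assume the newer (candidate) version goes to one server per cluster and the rest stay on the known-good version, so a bad cookbook can only break a minority of each cluster. The function name, server names, and version strings below are hypothetical.

```python
from typing import Dict, List


def assign_cookbook_versions(
    clusters: Dict[str, List[str]], stable: str, candidate: str
) -> Dict[str, str]:
    """Map each server to a cookbook version.

    Assumption: the first server of every cluster gets the candidate
    version, all other servers keep the stable one.
    """
    assignment: Dict[str, str] = {}
    for servers in clusters.values():
        for index, server in enumerate(servers):
            assignment[server] = candidate if index == 0 else stable
    return assignment


if __name__ == "__main__":
    clusters = {"c1": ["c1-1", "c1-2", "c1-3"]}
    for server, version in assign_cookbook_versions(clusters, "1.4.0", "1.5.0").items():
        print(server, version)
```

With three servers per cluster, a typo in the candidate cookbook would affect one server while the other two keep serving.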
.6 November 2014 DNS is a SPOF in the architecture
Service was intermittently slow in Asia… Culprit = .io TLD
Migration to .net TLD and a new DNS provider
Extensive testing but… nothing goes as planned!
☁ Black Thursday ☁ (see http://bit.ly/algoliablackthursday)
.7 February 2015 Launch of our synchronized worldwide infrastructure
8 new regions! Low latency everywhere with automatic replication
12 regions
.8 March 2015 Better high availability per region
Spread our US clusters across two completely different providers
• 2 different data centers in close locations (24 miles, 1ms latency)
• 3 different machines
• 2 completely different autonomous systems
.9 May 2015 Introducing several DNS providers
Retry strategy in API clients, again!
1. APPID-dsn.algolia.net
2. Retry randomly: APPID-1.algolianet.com, APPID-2.algolianet.com, APPID-3.algolianet.com
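The retry order on this slide can be sketched as follows: try the DSN host first, then the three fallback hosts in random order. This is an illustration and not the code of the actual API clients. The `send` callable is a hypothetical stand-in for whatever performs the HTTPS request, and treating `OSError` as the retryable error is an assumption.

```python
import random
from typing import Callable, List, Optional


def candidate_hosts(app_id: str, rng: Optional[random.Random] = None) -> List[str]:
    """Return hosts in the order to try them: DSN host, then shuffled fallbacks."""
    rng = rng or random.Random()
    fallbacks = [f"{app_id}-{i}.algolianet.com" for i in (1, 2, 3)]
    rng.shuffle(fallbacks)  # spread retries evenly over the three machines
    return [f"{app_id}-dsn.algolia.net"] + fallbacks


def request_with_retry(
    app_id: str,
    send: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> str:
    """Call send(host) on each candidate host until one succeeds."""
    last_error: Optional[Exception] = None
    for host in candidate_hosts(app_id, rng):
        try:
            return send(host)
        except OSError as exc:  # assumed to cover DNS and connection failures
            last_error = exc
    raise ConnectionError(f"all hosts failed for {app_id}") from last_error
```

The fallback hosts sit on a different domain (algolianet.com) than the primary one (algolia.net), so a client that cannot resolve the first name still has names under a second domain to try.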
.10 July 2015 Three completely independent providers per cluster
With 2 providers we could still lose indexing
Clusters spanning multiple data centers, autonomous systems and upstream providers.
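Why two providers are not enough can be worked through with a small calculation. The sketch below assumes that a write needs a strict majority of the cluster's machines, which is one common reading of "consensus on write"; the talk does not name the actual protocol. Provider names are placeholders.

```python
from collections import Counter
from typing import Dict, List


def quorum_after_outage(placement: List[str]) -> Dict[str, bool]:
    """For each provider, report whether a strict majority of machines
    is still up when that provider alone goes down.

    placement lists the provider hosting each machine of the cluster.
    """
    total = len(placement)
    majority = total // 2 + 1
    machines_per_provider = Counter(placement)
    return {
        provider: total - count >= majority
        for provider, count in machines_per_provider.items()
    }


if __name__ == "__main__":
    # 3 machines on 2 providers: losing provider A leaves 1 of 3 machines.
    print(quorum_after_outage(["A", "A", "B"]))  # {'A': False, 'B': True}
    # 3 machines on 3 providers: any single outage leaves 2 of 3 machines.
    print(quorum_after_outage(["A", "B", "C"]))  # {'A': True, 'B': True, 'C': True}
```

With three machines on two providers, one provider necessarily hosts two of them, and its outage removes the majority needed to accept writes. Search can still be served by the remaining machine, but indexing stops.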
.11 April 2016 Finer grained monitoring
Our monitoring was at the minute granularity (with ServerDensity)
Moved to Wavefront to enable drilling down at the second level (on demand)
500 metrics/server monitored
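A short example of what per-second granularity adds: a one-second latency spike nearly disappears in a per-minute average. The numbers below are made up for illustration and are not Algolia measurements.

```python
# Hypothetical latencies in milliseconds, one sample per second for one minute:
# 59 normal seconds and a single slow one.
latencies_ms = [20] * 59 + [2000]

minute_average = sum(latencies_ms) / len(latencies_ms)
peak = max(latencies_ms)

print(minute_average)  # 53.0
print(peak)            # 2000
```

At minute granularity this minute looks only mildly slow (53 ms on average), while the per-second view shows a 2-second stall.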
.12 September 2016 Algolia Vault
Algolia’s response to security challenges of larger organizations
Restrict API access to specific IP addresses to get your own « private cloud »
Encryption at rest for sensitive data, in addition to encrypting all communications
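The IP restriction mentioned above amounts to an allowlist check on the caller's address. The sketch below shows the idea with Python's standard `ipaddress` module; it is not how Algolia Vault is implemented, and the function name and the networks (taken from the documentation address ranges) are hypothetical.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_networks: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any allowed network (CIDR notation)."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(network) for network in allowed_networks
    )


if __name__ == "__main__":
    allowed = ["203.0.113.0/24", "198.51.100.7/32"]
    print(is_allowed("203.0.113.42", allowed))  # True
    print(is_allowed("192.0.2.1", allowed))     # False
```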
Design early
Do not over-engineer
Focus on execution
Building an HA architecture takes time
THANK YOU! QUESTIONS?
Full version on Medium http://bit.ly/algoliafuryroad