
Sharding @ Instagram
SFPUG, April 2012
Mike Krieger, Instagram

me
- Co-founder, Instagram
- Previously UX & Front-end @ Meebo
- Stanford HCI BS/MS
- mikeyk on everything

pug

communicating and sharing in the real world

30+ million users in less than 2 years

at its heart, Postgres-driven

a glimpse at how a startup with a small eng team scaled with PG

a brief tangent

the beginning

2 product guys

no real back-end experience

(you should have seen my first time finding my way around psql)

analytics & python @ Meebo

CouchDB

CrimeDesk SF

early mix: PG, Redis, Memcached

but we're hosted on a single machine somewhere in LA

less powerful than my MacBook Pro

okay, we launched. now what?

25k signups in the first day

everything is on fire

best & worst day of our lives so far

load was through the roof

friday rolls around

not slowing down

let's move to EC2

PG upgrade to 9.0

scaling = replacing all components of a car while driving it at 100mph

this is the story of how our usage of PG has evolved

Phase 1: All ORM, all the time

why pg? at first, postgis

manage.py syncdb

ORM made it too easy to not really think through primary keys

pretty good for getting off the ground

Media.objects.get(pk = 4)

first version of our feed (pre-launch)

friends = Relationships.objects.filter(source_user = user)
recent_photos = Media.objects.filter(user_id__in = friends).order_by('-pk')[0:20]

main feed at launch

Redis: user 33 posts
friends = SMEMBERS followers:33
for user in friends:
    LPUSH feed:<user_id> <media_id>

for reading: LRANGE feed:4 0 20
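(For concreteness, a rough sketch of that fanout-on-write in Python with redis-py; the followers:<id> and feed:<id> key names follow the slide, while the function names and connection details are illustrative.)

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    def fanout_new_media(user_id, media_id):
        # push the new media id onto the feed list of every follower
        for follower_id in r.smembers('followers:%s' % user_id):
            r.lpush('feed:%s' % int(follower_id), media_id)

    def read_feed(user_id):
        # newest items are at the head of the list
        return r.lrange('feed:%s' % user_id, 0, 20)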

canonical data: PG
feeds/lists/sets: Redis
object cache: memcache

post-launch

moved db to its own machine

at the time, largest table: photo metadata

ran master-slave from the beginning, with streaming replication

backups: stop the replica, xfs_freeze drives, and take EBS snapshot

AWS tradeoff

3 early problems we hit with PG

1. oh, that setting was the problem

work_mem

shared_buffers

cost_delay

2. Django-specific: <idle in transaction>

3. connection pooling

(we use PGBouncer)

somewhere in this crazy couple of months, Christophe to the rescue

photos kept growing and growing

and only 68GB of RAM on biggest machine in EC2

so what now?

Phase 2: Vertical Partitioning

django db routers make it pretty easy

def db_for_read(self, model):
    if app_label == 'photos':
        return 'photodb'
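(The snippet above is the slide's shorthand; a fuller Django database router for this kind of vertical partitioning might look like the sketch below. The 'photodb' alias comes from the slide; the class name, the model._meta app-label check, and the settings line are standard Django router conventions, not Instagram's actual code.)

    class PhotoRouter(object):
        """Send the photos app to its own database, leave everything else on default."""

        def db_for_read(self, model, **hints):
            if model._meta.app_label == 'photos':
                return 'photodb'
            return None

        def db_for_write(self, model, **hints):
            if model._meta.app_label == 'photos':
                return 'photodb'
            return None

        def allow_relation(self, obj1, obj2, **hints):
            # only allow relations between objects living in the same database
            return obj1._state.db == obj2._state.db

    # settings.py
    # DATABASE_ROUTERS = ['path.to.PhotoRouter']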

once you untangle all your foreign key relationships

(all of those user/user_id interchangeable calls bite you now)

plenty of time spent in PGFouine

read slaves (using streaming replicas) where we need to reduce contention

a few months later

photosdb > 60GB

precipitated by being on cloud hardware, but likely to have hit limit eventually either way

what now?

horizontal partitioning

Phase 3: sharding

"surely we'll have hired someone experienced before we actually need to shard"

never true about scaling

1. choosing a method
2. adapting the application
3. expanding capacity

evaluated solutions

at the time, none were up to the task of being our primary DB

NoSQL alternatives

Skype's sharding proxy

Range/date-based partitioning

did it in Postgres itself

requirements

1. low operational & code complexity

2. easy expanding of capacity

3. low performance impact on application

schema-based logical sharding

many, many, many (thousands) of logical shards

that map to fewer physical ones

8 logical shards on 2 machines
user_id % 8 = logical shard
logical shards -> physical shard map
0: A, 1: A, 2: A, 3: A, 4: B, 5: B, 6: B, 7: B

8 logical shards on 2 -> 4 machines
user_id % 8 = logical shard
logical shards -> physical shard map
0: A, 1: A, 2: C, 3: C, 4: B, 5: B, 6: D, 7: D

¡schemas!

all that 'public' stuff I'd been glossing over for 2 years

- database
  - schema
    - table
      - columns

spun up set of machines

using fabric, created thousands of schemas

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineB:
  shard4.photos_by_user
  shard5.photos_by_user
  shard6.photos_by_user
  shard7.photos_by_user
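(A minimal sketch of the kind of fabric task that could create those schemas. The host names, shard-to-host split, and column list are placeholders for illustration; the real run created thousands of schemas.)

    from fabric.api import env, run, parallel

    env.hosts = ['machineA', 'machineB']
    # which logical shards live on which host (hypothetical split)
    SHARDS_FOR_HOST = {'machineA': range(0, 4), 'machineB': range(4, 8)}

    @parallel
    def create_shard_schemas():
        for i in SHARDS_FOR_HOST[env.host_string]:
            run('psql -d photosdb -c "CREATE SCHEMA shard%d"' % i)
            run('psql -d photosdb -c "CREATE TABLE shard%d.photos_by_user '
                '(id bigint PRIMARY KEY, user_id bigint NOT NULL, media_id bigint NOT NULL)"' % i)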

(fabric or similar parallel task executor is essential)

application-side logic

SHARD_TO_DB = {}
SHARD_TO_DB[0] = 0
SHARD_TO_DB[1] = 0
SHARD_TO_DB[2] = 0
SHARD_TO_DB[3] = 0
SHARD_TO_DB[4] = 1
SHARD_TO_DB[5] = 1
SHARD_TO_DB[6] = 1
SHARD_TO_DB[7] = 1
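(Putting that map to work might look roughly like this; the helper names and the NUM_LOGICAL_SHARDS constant are mine, not from the deck.)

    NUM_LOGICAL_SHARDS = 8  # thousands in the real deployment

    def shard_for_user(user_id):
        return user_id % NUM_LOGICAL_SHARDS

    def db_and_schema_for_user(user_id):
        # physical database index plus the schema to query on it
        shard_id = shard_for_user(user_id)
        return SHARD_TO_DB[shard_id], 'shard%d' % shard_id

    # db_and_schema_for_user(33) -> (0, 'shard1')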

instead of Django ORM, wrote really simple db abstraction layer

select/update/insert/delete

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id
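(The real layer isn't shown in the deck; here's a guess at roughly what it does with those arguments. connection_for_db() is an assumed helper that returns a connection for a physical database index, and error handling is elided.)

    def select(fields, table_name, shard_key, where_statement, where_parameters):
        shard_id = shard_key % NUM_LOGICAL_SHARDS
        conn = connection_for_db(SHARD_TO_DB[shard_id])
        sql = 'SELECT %s FROM shard%d.%s WHERE %s' % (
            ', '.join(fields), shard_id, table_name, where_statement)
        cursor = conn.cursor()
        cursor.execute(sql, where_parameters)  # values still go through bind parameters
        return cursor.fetchall()

    # select(['id', 'media_id'], 'photos_by_user', user_id, 'user_id = %s', [user_id])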

in most cases, user_id for us

custom Django test runner to create/tear-down sharded DBs

most queries involve visiting a handful of shards over one or two machines

if mapping across shards on a single DB, UNION ALL to aggregate

clients to library pass in ((shard_key, id), (shard_key, id)) etc

library maps sub-selects to each shard and each machine
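(A sketch of that fan-out, per the UNION ALL and per-machine grouping described above. It reuses the SHARD_TO_DB map and the assumed connection_for_db() helper from the earlier sketches; the function and argument names are illustrative.)

    from collections import defaultdict

    def select_many(fields, table_name, keys, id_column='id'):
        # keys is a sequence of (shard_key, id) pairs
        by_db = defaultdict(lambda: defaultdict(list))
        for shard_key, row_id in keys:
            shard_id = shard_key % NUM_LOGICAL_SHARDS
            by_db[SHARD_TO_DB[shard_id]][shard_id].append(row_id)

        results = []
        for db_index, shards in by_db.items():
            # one UNION ALL statement per physical machine
            subselects, params = [], []
            for shard_id, ids in shards.items():
                placeholders = ', '.join(['%s'] * len(ids))
                subselects.append('SELECT %s FROM shard%d.%s WHERE %s IN (%s)' % (
                    ', '.join(fields), shard_id, table_name, id_column, placeholders))
                params.extend(ids)
            cursor = connection_for_db(db_index).cursor()
            cursor.execute(' UNION ALL '.join(subselects), params)
            results.extend(cursor.fetchall())
        return results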

parallel execution (per-machine, at least)

-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
  -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
    -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
  -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
    -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc)

eventually would be nice to parallelize across machines

next challenge: unique IDs

requirements

1. should be time-sortable without requiring a lookup

2. should be 64-bit

3. low operational complexity

surveyed the options

ticket servers

UUID

twitter snowflake

application-level IDs a la Mongo

hey, the db is already pretty good about incrementing sequences

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23
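(The same arithmetic in Python, for reference; the extra shift to separate the 13-bit shard ID from the 10-bit sequence is my addition, and OUR_EPOCH matches the constant in the function above.)

    OUR_EPOCH = 1314220021721  # custom epoch, in milliseconds

    def decode_id(media_id):
        low_bits = media_id ^ ((media_id >> 23) << 23)  # low 23 bits: shard ID + sequence
        shard_id = low_bits >> 10
        seq_id = low_bits & ((1 << 10) - 1)
        timestamp_millis = OUR_EPOCH + (media_id >> 23)
        return shard_id, seq_id, timestamp_millis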

pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user
machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user
machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no IO hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow?
who follows me?
do I follow X?
does X follow me?
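(With the select() sketch from earlier, the forward-direction questions might look like this; the 'follows' table name, the 'following' status value, and sharding by source_id are assumptions for illustration, not details from the deck.)

    # who do I follow?
    following = select(['target_id'], 'follows', user_id,
                       'source_id = %s AND status = %s', [user_id, 'following'])

    # do I follow X? (x_id: the other user's id)
    follows_x = select(['target_id'], 'follows', user_id,
                       'source_id = %s AND target_id = %s AND status = %s',
                       [user_id, x_id, 'following'])

    # "who follows me?" / "does X follow me?" run against the reverse direction,
    # which needs its own index or copy keyed by target_id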

DB was busy, so we started storing parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching

next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any q's?

Page 2: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

me

- Co-founder Instagram

- Previously UX amp Front-end Meebo

- Stanford HCI BSMS

- mikeyk on everything

Monday April 23 12

pug

Monday April 23 12

communicating and sharing in the real world

Monday April 23 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 3: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

pug

Monday April 23 12

communicating and sharing in the real world

Monday April 23 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 4: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

communicating and sharing in the real world

Monday April 23 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 5: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Monday April 23 12

Monday April 23 12

Monday April 23 12

30+ million users in less than 2 years

Monday April 23 12

at its heart Postgres-driven

Monday April 23 12

a glimpse at how a startup with a small eng team scaled with PG

Monday April 23 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of the Django ORM, wrote a really simple db abstraction layer:

select / update / insert / delete

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id

in most cases, the shard_key is user_id for us
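
(the layer's internals aren't shown in the deck; a hedged sketch of how such a select() could resolve shard, schema, and machine — names and return shape are assumed, execution is elided)

NUM_LOGICAL_SHARDS = 8
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

def select(fields, table_name, shard_key, where_statement, where_parameters):
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    db = SHARD_TO_DB[shard_id]          # which physical machine to hit
    sql = "SELECT %s FROM shard%d.%s WHERE %s" % (
        ", ".join(fields), shard_id, table_name, where_statement)
    return db, sql, where_parameters    # execution against the connection pool elided

# e.g. select(['id', 'media_id'], 'photos_by_user', user_id,
#             'user_id = %s', [user_id])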

custom Django test runner to create / tear down sharded DBs

most queries involve visiting a handful of shards over one or two machines

if mapping across shards on a single DB, UNION ALL to aggregate

clients to the library pass in ((shard_key, id), (shard_key, id)), etc

library maps sub-selects to each shard and each machine
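
(a sketch of that fan-out, reusing NUM_LOGICAL_SHARDS and SHARD_TO_DB from the sketch above: group the (shard_key, id) pairs by machine, then UNION ALL the per-shard sub-selects that share a database)

from collections import defaultdict

def plan_queries(pairs, table_name):
    by_db = defaultdict(lambda: defaultdict(list))
    for shard_key, id_ in pairs:
        shard_id = shard_key % NUM_LOGICAL_SHARDS
        by_db[SHARD_TO_DB[shard_id]][shard_id].append(id_)

    queries = {}
    for db, shards in by_db.items():
        subselects = ["SELECT * FROM shard%d.%s WHERE id IN (%s)"
                      % (shard_id, table_name, ", ".join(str(i) for i in ids))
                      for shard_id, ids in shards.items()]
        queries[db] = " UNION ALL ".join(subselects)
    return queries   # one aggregated statement per physical machine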

parallel execution (per-machine, at least)

-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
   -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
      -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
   -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
      -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
   (etc)

eventually it would be nice to parallelize across machines

next challenge: unique IDs

requirements:

1. should be time-sortable without requiring a lookup

2. should be 64-bit

3. low operational complexity

surveyed the options:

ticket servers

UUID

Twitter snowflake

application-level IDs, à la Mongo

hey, the db is already pretty good about incrementing sequences

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch  bigint := 1314220021721;
    seq_id     bigint;
    now_millis bigint;
    shard_id   int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

pulling the shard ID back out of an ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)
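
(to make the bit layout concrete, a small Python sketch that packs and unpacks an ID the same way — the epoch and field widths are the deck's, the helper names are mine; note the one-liner above pulls the shard and sequence fields out together as the low 23 bits)

EPOCH_MS = 1314220021721   # custom epoch used in the function above

def make_id(now_millis, shard_id, seq_id):
    # [ 41 bits time ][ 13 bits shard ID ][ 10 bits sequence ]
    return ((now_millis - EPOCH_MS) << 23) | (shard_id << 10) | (seq_id % 1024)

def split_id(id_):
    millis_since_epoch = id_ >> 23
    shard_id = (id_ >> 10) & 0x1FFF    # 13-bit shard field
    seq_id = id_ & 0x3FF               # 10-bit sequence field
    return EPOCH_MS + millis_since_epoch, shard_id, seq_id

# round-trip check: split_id(make_id(EPOCH_MS + 1, 5, 1)) == (EPOCH_MS + 1, 5, 1)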

pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our DBs

especially useful on EC2

but sometimes you just have to re-shard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA:                    machineA' (streaming replica):
  shard0.photos_by_user        shard0.photos_by_user
  shard1.photos_by_user        shard1.photos_by_user
  shard2.photos_by_user        shard2.photos_by_user
  shard3.photos_by_user        shard3.photos_by_user

machineA:                    machineC (the replica, promoted):
  shard0.photos_by_user        shard0.photos_by_user
  shard1.photos_by_user        shard1.photos_by_user
  shard2.photos_by_user        shard2.photos_by_user
  shard3.photos_by_user        shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones
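
(in terms of the earlier SHARD_TO_DB map, a re-shard is just a remap plus a PGBouncer config change — the new DB indexes below are assumptions standing in for the new machines)

# before: 8 logical shards on 2 physical DBs
SHARD_TO_DB = {0: 0, 1: 0, 2: 2, 3: 2, 4: 1, 5: 1, 6: 3, 7: 3}  # after
# ...where the "before" map was {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1};
# after promoting the replicas and repointing PGBouncer, half of each machine's
# logical shards are served by the new boxes (indexes 2 and 3 here are hypothetical)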

the beauty of schemas is that they are physically different files

(no IO hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: the follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
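
(against that table, the four questions are plain indexed lookups; the SQL strings below are hypothetical — the deck doesn't show the real table name, indexes, or status values)

# hypothetical queries against the v1 follow table; psycopg2-style %s parameters
WHO_DO_I_FOLLOW = "SELECT target_id FROM follows WHERE source_id = %s AND status = 'following'"
WHO_FOLLOWS_ME  = "SELECT source_id FROM follows WHERE target_id = %s AND status = 'following'"
# "do I follow X?" and "does X follow me?" are the same query with the arguments swapped
IS_FOLLOWING    = "SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s AND status = 'following'"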

the DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from Twitter's book

PG can handle tens of thousands of requests, with very light memcached caching

next steps

isolating services to minimize open connections

investigate physical hardware, etc, to reduce the need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within the constraints of the cloud

a PG success story

(we're really excited about 9.2)

thanks! any questions?


- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 11: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

a brief tangent

Monday April 23 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 12: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

the beginning

Monday April 23 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 13: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Text

Monday April 23 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 14: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

2 product guys

Monday April 23 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12
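e.g. once shards 2 and 3 are served from the new machine, the stale copies on the old one can be dropped outright (a sketch; schema names illustrative):

-- sketch: on the old machine, after traffic for these logical shards has moved,
-- dropping the schemas unlinks their files, so there is no bloat left to vacuum
DROP SCHEMA shard2 CASCADE;
DROP SCHEMA shard3 CASCADE;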

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
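spelled out as a sketch (column types and the status encoding are illustrative):

-- sketch of the v1 follow table and its four core lookups
CREATE TABLE followers (
    source_id bigint   NOT NULL,   -- who is doing the following
    target_id bigint   NOT NULL,   -- who is being followed
    status    smallint NOT NULL,   -- e.g. 1 = active
    PRIMARY KEY (source_id, target_id)
);
CREATE INDEX followers_target_idx ON followers (target_id);

SELECT target_id FROM followers WHERE source_id = 33 AND status = 1;              -- who do I follow?
SELECT source_id FROM followers WHERE target_id = 33 AND status = 1;              -- who follows me?
SELECT 1 FROM followers WHERE source_id = 33 AND target_id = 44 AND status = 1;   -- do I follow X?
SELECT 1 FROM followers WHERE source_id = 44 AND target_id = 33 AND status = 1;   -- does X follow me?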

DB was busy, so we started storing a parallel version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12
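what "very light" looks like, roughly (a sketch; the cache client only needs get/set, and the key format and TTL are illustrative):

# sketch of cache-aside for "does X follow me?"; cache is anything with get/set
# (e.g. a memcached client), db_check is the real query against the followers table
def does_follow(cache, db_check, source_id, target_id, ttl=30):
    key = "follows:%d:%d" % (source_id, target_id)
    cached = cache.get(key)
    if cached is not None:
        return cached == "1"
    result = db_check(source_id, target_id)
    cache.set(key, "1" if result else "0", ttl)   # short TTL keeps staleness bounded
    return result

if __name__ == "__main__":
    class DictCache(dict):                 # stand-in for a real memcached client
        def set(self, key, value, ttl=0):
            self[key] = value
    cache = DictCache()
    db = lambda s, t: (s, t) == (33, 44)
    print(does_follow(cache, db, 33, 44))  # miss -> hits the DB
    print(does_follow(cache, db, 33, 44))  # hit  -> served from cache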

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc. to reduce need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any q's?

Monday April 23 12

Page 15: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

no real back-end experience

Monday April 23 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 16: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

(you should have seen my first time finding my

way around psql)

Monday April 23 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 17: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

analytics amp python meebo

Monday April 23 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 18: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

CouchDB

Monday April 23 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 19: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

CrimeDesk SF

Monday April 23 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12


CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)

Monday April 23 12
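
worked out in Python with the 41 / 13 / 10 layout above and the epoch from next_id(); the sample ID is made up for illustration:

EPOCH = 1314220021721          # custom epoch used in next_id()

def decode_id(photo_id):
    low_bits = photo_id ^ ((photo_id >> 23) << 23)   # shard ID + sequence
    shard_id = low_bits >> 10                         # top 13 of the low 23 bits
    seq_id = low_bits & 1023                          # bottom 10 bits
    timestamp_ms = EPOCH + (photo_id >> 23)
    return shard_id, seq_id, timestamp_ms

# decode_id((1000 << 23) | (5 << 10) | 7) -> (5, 7, 1314220022721)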

pros: guaranteed unique in 64-bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about "re-sharding"

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12
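
the only application-side change for that move is the map, matching the 2 -> 4 machine example earlier (machine C's DB index of 2 is illustrative):

SHARD_TO_DB[2] = 2    # logical shard 2 now lives on machine C
SHARD_TO_DB[3] = 2    # logical shard 3 now lives on machine C
# shards 0 and 1 keep pointing at machine A; queries and schemas are untouched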

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any Qs?

Monday April 23 12

Page 20: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Monday April 23 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 21: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

early mix PG Redis Memcached

Monday April 23 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 22: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

but were hosted on a single machine

somewhere in LA

Monday April 23 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 23: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Monday April 23 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 24: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

less powerful than my MacBook Pro

Monday April 23 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to create/tear-down sharded DBs

most queries involve visiting a handful of shards over one or two machines

if mapping across shards on a single DB, UNION ALL to aggregate
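(for shards that share a physical DB, the library can stitch the per-schema sub-selects together; a sketch of what the generated statement could look like -- names illustrative:)

def union_all_query(fields, table_name, shard_ids, where_statement):
    """Build one statement reading the same table from several schemas
    that live on the same physical DB."""
    field_list = ', '.join(fields)
    sub_selects = [
        'SELECT %s FROM shard%d.%s WHERE %s'
        % (field_list, shard_id, table_name, where_statement)
        for shard_id in shard_ids
    ]
    return ' UNION ALL '.join(sub_selects)

# union_all_query(['id'], 'photos_by_user', [0, 1], 'user_id = %s')
# -> 'SELECT id FROM shard0.photos_by_user WHERE user_id = %s
#     UNION ALL SELECT id FROM shard1.photos_by_user WHERE user_id = %s'
# (the parameter list then has to be repeated once per sub-select)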

clients to library pass in ((shard_key, id), (shard_key, id)) etc

library maps sub-selects to each shard and each machine

parallel execution (per-machine at least)

->  Append  (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
      ->  Limit  (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
            ->  Index Scan Backward using index on table  (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
      ->  Limit  (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
            ->  Index Scan using index on table  (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc)

eventually would be nice to parallelize across machines

next challenge: unique IDs

requirements

1. should be time-sortable without requiring a lookup
2. should be 64-bit
3. low operational complexity

surveyed the options

ticket servers

UUID

twitter snowflake

application-level IDs a la Mongo

hey, the db is already pretty good about incrementing sequences

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
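(to make the bit layout concrete, here is the same packing in plain Python -- purely illustrative; in production the PL/pgSQL function above runs inside each schema, so inserts get IDs without an extra round trip:)

OUR_EPOCH_MS = 1314220021721  # custom epoch from the function above

def make_id(now_millis, shard_id, seq_id):
    """Pack (timestamp, shard, sequence) into one 64-bit integer."""
    result = (now_millis - OUR_EPOCH_MS) << 23   # 41 bits of milliseconds
    result |= (shard_id << 10)                   # 13 bits of shard id
    result |= (seq_id % 1024)                    # 10 bits of per-shard sequence
    return result

# e.g. 5 seconds after the epoch, shard 5, 8th value of the sequence:
# make_id(OUR_EPOCH_MS + 5000, 5, 8) == (5000 << 23) | (5 << 10) | 8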

pulling shard ID from ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)
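(extending the slide's expression into a full unpack, under the 41/13/10 bit layout above -- helper names are mine:)

OUR_EPOCH_MS = 1314220021721

def decode_id(photo_id):
    """Unpack a 64-bit ID back into (millis, shard_id, seq_id)."""
    millis = (photo_id >> 23) + OUR_EPOCH_MS         # top 41 bits
    low_bits = photo_id ^ ((photo_id >> 23) << 23)   # the slide's expression: low 23 bits
    shard_id = low_bits >> 10                        # 13 shard bits
    seq_id = low_bits & 0x3FF                        # 10 sequence bits
    return millis, shard_id, seq_id

# round-trips with make_id from the sketch above:
# decode_id(make_id(OUR_EPOCH_MS + 5000, 5, 8)) == (OUR_EPOCH_MS + 5000, 5, 8)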

pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA:  shard0.photos_by_user  shard1.photos_by_user  shard2.photos_by_user  shard3.photos_by_user
machineA': shard0.photos_by_user  shard1.photos_by_user  shard2.photos_by_user  shard3.photos_by_user

machineA:  shard0.photos_by_user  shard1.photos_by_user  shard2.photos_by_user  shard3.photos_by_user
machineC:  shard0.photos_by_user  shard1.photos_by_user  shard2.photos_by_user  shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no IO hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow?
who follows me?
do I follow X?
does X follow me?
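(those four questions are simple lookups through the sharded select() layer; a rough sketch -- the follows table name, status value, and sharding by source_id are my assumptions, not the production layout:)

def following_ids(source_id):
    """Who does source_id follow? Shard key: the follower."""
    rows = select(['target_id'], 'follows', source_id,
                  'source_id = %s AND status = %s', [source_id, 'active'])
    return [r[0] for r in rows]

def follows(source_id, target_id):
    """Does source_id follow target_id?"""
    rows = select(['target_id'], 'follows', source_id,
                  'source_id = %s AND target_id = %s AND status = %s',
                  [source_id, target_id, 'active'])
    return len(rows) > 0

# "who follows me?" needs to be queryable by target_id as well (e.g. a second
# copy keyed by the followee), which this sketch glosses over.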

DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching

next steps

isolating services to minimize open conns

investigate physical hardware etc to reduce the need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within the constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any q's?

Page 25: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

okay we launchednow what

Monday April 23 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 26: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

25k signups in the first day

Monday April 23 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 27: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

everything is on fire

Monday April 23 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 28: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

best amp worst day of our lives so far

Monday April 23 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 29: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

load was through the roof

Monday April 23 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12
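
A minimal sketch of how such a call could be routed end to end, reusing SHARD_TO_DB and the shardN schemas from above; get_connection() stands in for whatever pooled connection lookup the real layer uses (hypothetical), and nothing here is the actual library code.

    NUM_LOGICAL_SHARDS = 8  # illustrative; the real deployment used thousands

    def select(fields, table_name, shard_key, where_statement, where_parameters):
        shard_id = shard_key % NUM_LOGICAL_SHARDS      # shard key -> logical shard
        db_index = SHARD_TO_DB[shard_id]               # logical shard -> physical DB
        conn = get_connection(db_index)                # hypothetical pool lookup
        sql = "SELECT %s FROM shard%d.%s WHERE %s" % (
            ", ".join(fields), shard_id, table_name, where_statement)
        cur = conn.cursor()
        cur.execute(sql, where_parameters)             # parameters bound by the driver
        return cur.fetchall()

    # e.g. select(['id', 'created_at'], 'photos_by_user', user_id, 'user_id = %s', [user_id])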

custom Django test runner to create/tear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12
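
For instance, the per-machine query could be stitched together roughly like this (an illustrative helper, not the library's actual code); the WHERE parameters then have to be repeated once per sub-select.

    def union_all_sql(shard_ids, fields, table_name, where_statement):
        # one sub-select per logical shard living on this machine, glued with UNION ALL
        sub_selects = [
            "SELECT %s FROM shard%d.%s WHERE %s" % (
                ", ".join(fields), shard_id, table_name, where_statement)
            for shard_id in shard_ids
        ]
        return " UNION ALL ".join(sub_selects)

    # union_all_sql([0, 1], ['id'], 'photos_by_user', 'user_id = %s')
    # -> "SELECT id FROM shard0.photos_by_user WHERE user_id = %s
    #     UNION ALL SELECT id FROM shard1.photos_by_user WHERE user_id = %s"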

clients to library pass in ((shard_key, id),

(shard_key, id)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

->  Append  (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
      ->  Limit  (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
            ->  Index Scan Backward using index on table  (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
      ->  Limit  (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
            ->  Index Scan using index on table  (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12
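
A sketch of what that cross-machine parallelism might look like, issuing each machine's UNION ALL query on its own thread and merging the results; purely illustrative, since the talk notes this hadn't been built yet.

    from threading import Thread

    def _run(conn, sql, params, out, i):
        cur = conn.cursor()
        cur.execute(sql, params)
        out[i] = cur.fetchall()

    def parallel_select(per_machine_queries):
        # per_machine_queries: list of (connection, sql, params), one per physical machine
        out = [None] * len(per_machine_queries)
        threads = [Thread(target=_run, args=(conn, sql, params, out, i))
                   for i, (conn, sql, params) in enumerate(per_machine_queries)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return [row for rows in out for row in rows]   # flatten per-machine results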

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23

Monday April 23 12
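
In Python, the same bit layout can be unpacked a little more explicitly; EPOCH_MILLIS is the custom epoch from the function above, and the masks follow the 41/13/10 split. (In practice the function would typically be wired in as the id column's DEFAULT, though the slides don't show that step.)

    # [ 41 bits: millis since custom epoch ][ 13 bits: shard ID ][ 10 bits: sequence ]
    EPOCH_MILLIS = 1314220021721  # custom epoch used by insta5.next_id above

    def decode_id(generated_id):
        millis_since_epoch = generated_id >> 23            # drop the shard + sequence bits
        shard_id = (generated_id >> 10) & ((1 << 13) - 1)  # middle 13 bits
        seq_id = generated_id & ((1 << 10) - 1)            # low 10 bits
        return EPOCH_MILLIS + millis_since_epoch, shard_id, seq_id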

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about "re-sharding"?

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting, no 'swiss cheese')

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of "read-

only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table (source_id, target_id,

status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
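
With the sharded select() layer above, the two directions need different shard keys; a sketch of what the lookups might look like, where the table names, the reverse-sharded copy, and the status value are assumptions for illustration rather than anything stated in the talk.

    # "who do I follow?"  -- shard by the asking user's id (source side)
    following = select(['target_id'], 'followers', source_id,
                       'source_id = %s AND status = %s', [source_id, 'active'])

    # "who follows me?"   -- needs the edge reachable by target_id too, e.g. a
    # mirrored table sharded on target_id (an assumed design, not from the slides)
    followers = select(['source_id'], 'followers_by_target', target_id,
                       'target_id = %s AND status = %s', [target_id, 'active'])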

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests with very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any q's?

Monday April 23 12

Page 30: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

friday rolls around

Monday April 23 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 31: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

not slowing down

Monday April 23 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 32: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

letrsquos move to EC2

Monday April 23 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 33: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Monday April 23 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 34: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

PG upgrade to 90

Monday April 23 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
   -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
      -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
   -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
      -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
   (etc)

eventually would be nice to parallelize across machines

next challenge: unique IDs

requirements:

1 should be time-sortable without requiring a lookup

2 should be 64-bit

3 low operational complexity

surveyed the options:

ticket servers

UUID

twitter snowflake

application-level IDs a la Mongo

hey, the db is already pretty good about incrementing sequences

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
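
A function like this is typically wired in as the primary-key default for the tables in that logical shard's schema; a hedged sketch of the hookup (illustrative DDL issued from Python to stay with the examples above; it assumes the insta5 schema and insta5.table_id_seq already exist, and the column layout is invented):

import psycopg2

conn = psycopg2.connect('host=machineB dbname=insta')   # wherever logical shard 5 lives
cur = conn.cursor()
# new IDs are minted inside the shard's own schema, no central ticket server needed
cur.execute('CREATE TABLE insta5.photos_by_user ('
            ' id bigint NOT NULL DEFAULT insta5.next_id(),'
            ' user_id bigint NOT NULL,'
            ' PRIMARY KEY (id))')
conn.commit()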

pulling shard ID from ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)
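
A hedged Python sketch of unpacking an ID under the 41/13/10 layout above (the epoch matches the function's our_epoch; note that the low 23 bits hold both the shard ID and the per-shard sequence, so the shard ID itself sits 10 bits up):

OUR_EPOCH_MS = 1314220021721    # same custom epoch as next_id()

def unpack_id(packed_id):
    millis = (packed_id >> 23) + OUR_EPOCH_MS         # 41 bits of time
    shard_id = (packed_id >> 10) & ((1 << 13) - 1)    # 13 bits of shard ID
    seq_id = packed_id & ((1 << 10) - 1)              # 10 bits of sequence
    return millis, shard_id, seq_id

# round-trip check against the packing order used in next_id()
packed = (1000 << 23) | (5 << 10) | 7
assert unpack_id(packed) == (1000 + OUR_EPOCH_MS, 5, 7)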

pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our DBs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA:  shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow?
who follows me?
do I follow X?
does X follow me?
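
A minimal sketch of that v1 shape and the four lookups (the deck only names the three columns; the index choices and status values here are assumptions):

FOLLOWS_DDL = '''
CREATE TABLE follows (
    source_id bigint NOT NULL,    -- the follower
    target_id bigint NOT NULL,    -- the account being followed
    status    smallint NOT NULL,  -- e.g. active / pending / blocked (assumed values)
    PRIMARY KEY (source_id, target_id)  -- serves "who do I follow" and "do I follow X"
);
CREATE INDEX follows_by_target ON follows (target_id, source_id);  -- serves "who follows me"
'''

WHO_DO_I_FOLLOW  = 'SELECT target_id FROM follows WHERE source_id = %s'
WHO_FOLLOWS_ME   = 'SELECT source_id FROM follows WHERE target_id = %s'
# same statement for both direction checks, with the parameters swapped: (me, x) vs (x, me)
DO_I_FOLLOW_X    = 'SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s'
DOES_X_FOLLOW_ME = 'SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s'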

DB was busy, so we started storing parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching
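
A hedged sketch of what that light caching could look like for the does-X-follow-me check (cache-aside with a short TTL; the python-memcached client, key format and TTL are assumptions, not something the deck specifies):

import memcache                     # python-memcached

mc = memcache.Client(['127.0.0.1:11211'])

def follows(source_id, target_id, db_check, ttl_seconds=30):
    key = 'follows:%d:%d' % (source_id, target_id)
    cached = mc.get(key)
    if cached is not None:
        return cached == '1'
    result = db_check(source_id, target_id)     # falls through to the sharded follows table
    mc.set(key, '1' if result else '0', time=ttl_seconds)
    return result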

next steps

isolating services to minimize open conns

investigate physical hardware, etc, to reduce need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?

Page 35: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Monday April 23 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 36: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

scaling = replacing all components of a car

while driving it at 100mph

Monday April 23 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 37: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

this is the story of how our usage of PG has

evolved

Monday April 23 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 38: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Phase 1 All ORM all the time

Monday April 23 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 39: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

why pg at first postgis

Monday April 23 12

managepy syncdb

Monday April 23 12

ORM made it too easy to not really think through

primary keys

Monday April 23 12

pretty good for getting off the ground

Monday April 23 12

Mediaobjectsget(pk = 4)

Monday April 23 12

first version of our feed (pre-launch)

Monday April 23 12

friends = Relationshipsobjectsfilter(source_user = user)

recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]

Monday April 23 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user shard1.photos_by_user shard2.photos_by_user shard3.photos_by_user

machineA': shard0.photos_by_user shard1.photos_by_user shard2.photos_by_user shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user shard1.photos_by_user shard2.photos_by_user shard3.photos_by_user

machineC: shard0.photos_by_user shard1.photos_by_user shard2.photos_by_user shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12
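
As an illustration of that indirection (hostnames and database names here are made up, not from the talk): if the app only ever connects through PGBouncer aliases, repointing an alias at the promoted replica and issuing RELOAD on the PGBouncer admin console moves a set of logical shards without touching application code.

; pgbouncer.ini, [databases] section
; before: these logical shards resolve to machineA
shards_a = host=machineA port=5432 dbname=shards

; after cloning machineA to machineC with streaming replication, add an
; alias for the clone and update SHARD_TO_DB so the moved logical shards
; resolve here; drop the now-unused schemas on each side afterwards
shards_c = host=machineC port=5432 dbname=shards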

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow / who follows me

do I follow X / does X follow me

Monday April 23 12
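
A sketch of what that v1 might look like, with the four questions above as queries (column types, indexes, and the status values are assumptions; the talk only names the three columns):

CREATE TABLE followers (
    source_id bigint   NOT NULL,  -- the user doing the following
    target_id bigint   NOT NULL,  -- the user being followed
    status    smallint NOT NULL   -- e.g. 0 = active, 1 = blocked
);
CREATE INDEX ON followers (source_id);
CREATE INDEX ON followers (target_id);

-- who do I follow / who follows me (user 33)
SELECT target_id FROM followers WHERE source_id = 33 AND status = 0;
SELECT source_id FROM followers WHERE target_id = 33 AND status = 0;

-- do I follow X / does X follow me (X = 44)
SELECT 1 FROM followers WHERE source_id = 33 AND target_id = 44 AND status = 0;
SELECT 1 FROM followers WHERE source_id = 44 AND target_id = 33 AND status = 0;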

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached

caching

Monday April 23 12
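
One plausible reading of "very light memcached caching" is a cache-aside layer in front of the sharded table, along the lines of the sketch below (python-memcached client; the key format, TTL, and load_follower_ids helper are assumptions):

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def follower_ids(user_id, load_follower_ids):
    # load_follower_ids(user_id) is an assumed helper that queries the
    # right shard (e.g. via the select() sketch earlier)
    key = "followers:%d" % user_id
    ids = mc.get(key)
    if ids is None:
        ids = load_follower_ids(user_id)
        mc.set(key, ids, time=60)  # short TTL keeps invalidation forgiving
    return ids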

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12


SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 46: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

main feed at launch

Monday April 23 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 47: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt

for readingLRANGE feed4 0 20

Monday April 23 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 48: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

canonical data PGfeedslistssets Redis

object cache memcache

Monday April 23 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 49: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

post-launch

Monday April 23 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 50: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

moved db to its own machine

Monday April 23 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
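
(one way to use the generator — an assumption, the slides don't show this DDL — is as the primary-key default of each sharded table, so inserts pick up globally-unique IDs transparently)

PHOTOS_DDL = """
CREATE TABLE insta5.photos_by_user (
    id      bigint NOT NULL DEFAULT insta5.next_id(),
    user_id bigint NOT NULL,
    PRIMARY KEY (id)
)
"""
# cur.execute(PHOTOS_DDL)  # run once per logical shard's schema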

pulling shard ID from ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23
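
(a small Python sketch of the same arithmetic with made-up values; the XOR above leaves shard_id << 10 | seq_id, so one more shift isolates the shard)

OUR_EPOCH_MS = 1314220021721    # same epoch as the function above

def next_id(now_ms, shard_id, seq_id):
    return ((now_ms - OUR_EPOCH_MS) << 23) | (shard_id << 10) | (seq_id % 1024)

id_ = next_id(now_ms=1335225600000, shard_id=5, seq_id=321)

low_23 = id_ ^ ((id_ >> 23) << 23)   # shard_id << 10 | seq_id, per the slide
print(low_23 >> 10)                  # -> 5    (shard ID)
print(low_23 & 0x3FF)                # -> 321  (sequence)
print(OUR_EPOCH_MS + (id_ >> 23))    # -> 1335225600000 (millis timestamp)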

pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic
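
(the idea, with hypothetical names: the app keeps a stable PGBouncer alias, and only pgbouncer.ini knows which physical host currently backs it)

import psycopg2

# pgbouncer.ini (illustrative):
#   [databases]
#   shardA = host=machineA port=5432 dbname=instagram
# moving those shards to machineC later means editing that one line and
# reloading PGBouncer; the app keeps connecting to "shardA" unchanged.
conn = psycopg2.connect(host="pgbouncer.internal", port=6432,
                        dbname="shardA", user="app")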

can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
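
(a sketch of the v1 shape — column types and status values assumed — showing how each question is a simple indexed lookup)

FOLLOWS_QUERIES = {
    "who do I follow":  "SELECT target_id FROM follows WHERE source_id = %s AND status = 1",
    "who follows me":   "SELECT source_id FROM follows WHERE target_id = %s AND status = 1",
    "do I follow X":    "SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s",
    "does X follow me": "SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s",
}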

DB was busy, so we started storing parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests with very light memcached caching

next steps:

isolating services to minimize open conns

investigate physical hardware, etc., to reduce the need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any Qs?

Page 51: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

at time largest table photo metadata

Monday April 23 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 52: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

ran master-slave from the beginning with

streaming replication

Monday April 23 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 53: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

backups stop the replica xfs_freeze drives and take EBS snapshot

Monday April 23 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 54: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

AWS tradeoff

Monday April 23 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 55: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

3 early problems we hit with PG

Monday April 23 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 56: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

1 oh that setting was the problem

Monday April 23 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12
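
(not from the talk, since the real layer isn't shown: a minimal sketch of how a select() call like that could pick the schema and machine; the SHARD_TO_DB values match the earlier slide, while DB_CONN_INFO and the shard%d schema naming are assumptions)

import psycopg2

NUM_LOGICAL_SHARDS = 8
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
DB_CONN_INFO = {0: 'host=machineA dbname=insta', 1: 'host=machineB dbname=insta'}

def select(fields, table_name, shard_key, where_statement, where_parameters):
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    conn = psycopg2.connect(DB_CONN_INFO[SHARD_TO_DB[shard_id]])
    cur = conn.cursor()
    # the table lives inside the shard's schema, e.g. shard1.photos_by_user
    sql = 'SELECT %s FROM shard%d.%s WHERE %s' % (
        ', '.join(fields), shard_id, table_name, where_statement)
    cur.execute(sql, where_parameters)
    rows = cur.fetchall()
    conn.close()
    return rows

# e.g. all photos for user 17: logical shard 17 % 8 = 1, physical db 0
rows = select(['id', 'user_id'], 'photos_by_user', 17,
              'user_id = %s', (17,))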

custom Django test runner to create/tear-down sharded DBs

Monday April 23 12

most queries involve visiting a handful of shards over one or two machines

Monday April 23 12

if mapping across shards on a single DB, UNION ALL to aggregate

Monday April 23 12

clients to library pass in ((shard_key, id), (shard_key, id)), etc

Monday April 23 12

library maps sub-selects to each shard and each machine

Monday April 23 12
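
(hypothetical sketch of that fan-out: group the (shard_key, id) pairs by physical DB, then UNION ALL the per-schema sub-selects that live on the same machine; the table name and helper names are made up)

from collections import defaultdict

NUM_LOGICAL_SHARDS = 8
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

def plan_queries(pairs):
    """pairs: ((shard_key, id), (shard_key, id), ...) as passed by clients.

    Returns {physical_db: sql}, one statement per machine, aggregating the
    per-schema sub-selects with UNION ALL."""
    by_db = defaultdict(lambda: defaultdict(list))
    for shard_key, obj_id in pairs:
        shard_id = shard_key % NUM_LOGICAL_SHARDS
        by_db[SHARD_TO_DB[shard_id]][shard_id].append(obj_id)

    plans = {}
    for db, shards in by_db.items():
        subselects = []
        for shard_id, ids in shards.items():
            id_list = ', '.join(str(int(i)) for i in ids)
            subselects.append(
                'SELECT * FROM shard%d.photos_by_user WHERE id IN (%s)'
                % (shard_id, id_list))
        plans[db] = '\nUNION ALL\n'.join(subselects)
    return plans

# shard keys 17 and 4 land on different machines, so two statements come back
print(plan_queries(((17, 101), (22, 102), (4, 103))))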

parallel execution (per-machine at least)

Monday April 23 12

-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
   -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
      -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
   -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
      -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
   (etc)

Monday April 23 12

eventually would be nice to parallelize across machines

Monday April 23 12

next challenge: unique IDs

Monday April 23 12

requirements

Monday April 23 12

1. should be time-sortable without requiring a lookup

Monday April 23 12

2. should be 64-bit

Monday April 23 12

3. low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey, the db is already pretty good about incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12


CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)

Monday April 23 12
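
(illustrative Python equivalent of the ID layout and the extraction above; make_id/split_id are made-up helper names, and the epoch matches our_epoch in the function)

EPOCH = 1314220021721  # custom epoch, in milliseconds

def make_id(now_millis, shard_id, seq_id):
    # [ 41 bits time ][ 13 bits shard ID ][ 10 bits sequence ID ]
    return ((now_millis - EPOCH) << 23) | (shard_id << 10) | (seq_id % 1024)

def split_id(id):
    shard_and_seq = id ^ ((id >> 23) << 23)   # low 23 bits
    return {
        'timestamp_millis': EPOCH + (id >> 23),
        'shard_id': shard_and_seq >> 10,
        'seq_id': id & 1023,
    }

some_id = make_id(1335225600000, 5, 317)
print(split_id(some_id))
# {'timestamp_millis': 1335225600000, 'shard_id': 5, 'seq_id': 317}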

pros: guaranteed unique in 64 bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this scheme, no issues

Monday April 23 12

well, what about "re-sharding"?

Monday April 23 12

first recourse: pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the app logic

Monday April 23 12

can do this as long as you have more logical shards than physical ones

Monday April 23 12
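
(again hypothetical, reusing the SHARD_TO_DB / DB_CONN_INFO names from the earlier sketch: after the replica is promoted as machine C, the app-side cut-over is just re-pointing the moved logical shards)

# before the move: shards 0-3 all live on machine A (physical db 0)
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
DB_CONN_INFO = {0: 'host=machineA dbname=insta', 1: 'host=machineB dbname=insta'}

# after promoting the streaming replica as machine C: introduce physical db 2
# and repoint half of machine A's logical shards at it; the
# user_id -> logical shard mapping never changes
DB_CONN_INFO[2] = 'host=machineC dbname=insta'
SHARD_TO_DB.update({2: 2, 3: 2})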

beauty of schemas is that they are physically different files

Monday April 23 12

(no I/O hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
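
(not from the deck: the four lookups against that v1 table as plain psycopg2 queries; the follows table name, the ACTIVE status value, and the connection string are assumptions)

import psycopg2

conn = psycopg2.connect('dbname=insta')  # illustrative connection string
cur = conn.cursor()

me, x = 33, 42  # hypothetical user ids
ACTIVE = 1      # assumed value of the status column for a live follow

# who do I follow?
cur.execute('SELECT target_id FROM follows WHERE source_id = %s AND status = %s',
            (me, ACTIVE))
# who follows me?
cur.execute('SELECT source_id FROM follows WHERE target_id = %s AND status = %s',
            (me, ACTIVE))
# do I follow X?
cur.execute('SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s AND status = %s',
            (me, x, ACTIVE))
# does X follow me?
cur.execute('SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s AND status = %s',
            (x, me, ACTIVE))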

DB was busy, so we started storing parallel version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12
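
(not from the deck: one way "very light memcached caching" in front of PG could look, as a cache-aside check with python-memcached; the key format, TTL, table name, and connection strings are assumptions)

import memcache
import psycopg2

mc = memcache.Client(['127.0.0.1:11211'])
pg = psycopg2.connect('dbname=insta')  # illustrative

def db_check_follows(source_id, target_id):
    cur = pg.cursor()
    cur.execute('SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s',
                (source_id, target_id))
    return cur.fetchone() is not None

def follows(source_id, target_id, ttl=30):
    """Cache-aside: a short-lived memcached entry in front of the PG lookup."""
    key = 'follows:%d:%d' % (source_id, target_id)
    cached = mc.get(key)
    if cached is not None:
        return cached == '1'
    result = db_check_follows(source_id, target_id)
    mc.set(key, '1' if result else '0', time=ttl)
    return result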

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware, etc., to reduce need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any Qs?

Monday April 23 12

Page 57: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

work_mem

Monday April 23 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 58: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

shared_buffers

Monday April 23 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 59: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

cost_delay

Monday April 23 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 60: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

2 Django-specific ltidle in transactiongt

Monday April 23 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 61: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

3 connection pooling

Monday April 23 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 62: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

(we use PGBouncer)

Monday April 23 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12
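
Putting that together, the lookup from a shard key to a physical connection might look roughly like this; the DSN list, connection helper, and 8-shard count are illustrative assumptions (the real deployment used thousands of logical shards):

import psycopg2  # assumed driver; any DB-API driver would do

NUM_LOGICAL_SHARDS = 8
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
DATABASES = [                      # hypothetical DSNs, one per physical machine
    "host=machineA dbname=instagram",
    "host=machineB dbname=instagram",
]

def shard_for(shard_key):
    """shard key (usually user_id) -> logical shard number."""
    return shard_key % NUM_LOGICAL_SHARDS

def connection_for(shard_key):
    """logical shard -> connection to the physical machine that hosts it."""
    return psycopg2.connect(DATABASES[SHARD_TO_DB[shard_for(shard_key)]])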

instead of Django ORM, wrote really simple db abstraction layer

Monday April 23 12

select/update/insert/delete

Monday April 23 12

select(fields, table_name, shard_key, where_statement, where_parameters)

Monday April 23 12

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id

Monday April 23 12
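
A minimal sketch of that select() call, reusing the helpers from the earlier sketch; the schema-per-shard naming follows the slides, and the query-building details are assumptions:

def select(fields, table_name, shard_key, where_statement, where_parameters):
    """Route a read to the right schema on the right machine (sketch)."""
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    conn = connection_for(shard_key)
    sql = "SELECT %s FROM shard%d.%s WHERE %s" % (
        ", ".join(fields), shard_id, table_name, where_statement)
    with conn.cursor() as cur:
        cur.execute(sql, where_parameters)  # values still bound as parameters
        return cur.fetchall()

# usage (illustrative):
# select(['id', 'media_id'], 'photos_by_user', user_id,
#        'user_id = %s ORDER BY id DESC LIMIT 20', (user_id,))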

in most cases user_id for us

Monday April 23 12

custom Django test runner to create/tear-down sharded DBs

Monday April 23 12

most queries involve visiting a handful of shards over one or two machines

Monday April 23 12

if mapping across shards on a single DB, UNION ALL to aggregate

Monday April 23 12
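
For example, fanning the same logical query out over two schemas that live on one machine could look like this; table and column names follow the earlier sketches and are assumptions:

-- sketch: aggregate across two shards that live on the same DB
SELECT id, media_id FROM shard0.photos_by_user WHERE user_id = 3
UNION ALL
SELECT id, media_id FROM shard2.photos_by_user WHERE user_id = 10
ORDER BY id DESC
LIMIT 30;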

clients to library pass in ((shard_key, id), (shard_key, id)), etc

Monday April 23 12

library maps sub-selects to each shard and each machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

->  Append  (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..1600.35 rows=30 loops=1)
      ->  Limit  (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..1599.13 rows=14 loops=1)
            ->  Index Scan Backward using index on table  (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..1598.85 rows=14 loops=1)
      ->  Limit  (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
            ->  Index Scan using index on table  (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
      (etc)

Monday April 23 12

eventually would be nice to parallelize across machines

Monday April 23 12

next challenge: unique IDs

Monday April 23 12

requirements

Monday April 23 12

1. should be time-sortable without requiring a lookup

Monday April 23 12

2. should be 64-bit

Monday April 23 12

3. low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs a la Mongo

Monday April 23 12

hey, the db is already pretty good about incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12
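
Wiring that up is just a column default; a sketch under the slide's insta5 schema (the sequence must exist, and the table/column names here are illustrative):

-- sketch: each sharded table draws its IDs from its shard's next_id()
CREATE SEQUENCE insta5.table_id_seq;
CREATE TABLE insta5.photos_by_user (
    id       bigint NOT NULL DEFAULT insta5.next_id() PRIMARY KEY,
    user_id  bigint NOT NULL,
    media_id bigint NOT NULL
);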

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)

Monday April 23 12
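
Decoding in the application follows the same bit layout; a small sketch (EPOCH_MS is the custom epoch from the function above; note the low 23 bits hold both the shard ID and the sequence, so the shard proper is bits 10-22):

EPOCH_MS = 1314220021721  # custom epoch used by next_id() above

def decode_id(photo_id):
    """Split a 64-bit ID into (timestamp_ms, shard_id, seq_id) -- sketch."""
    timestamp_ms = (photo_id >> 23) + EPOCH_MS
    shard_id = (photo_id >> 10) & ((1 << 13) - 1)  # 13 shard bits
    seq_id = photo_id & ((1 << 10) - 1)            # 10 sequence bits
    return timestamp_ms, shard_id, seq_id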

pros: guaranteed unique in 64 bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this scheme, no issues

Monday April 23 12

well, what about “re-sharding”?

Monday April 23 12

first recourse: pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

Monday April 23 12
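
Once the clone is promoted, the application-side map just gets repointed; a sketch continuing the 8-shard example (DB index 2 standing for machineC is an assumption of this illustration):

# before: shards 0-3 all lived on machineA (DB index 0)
# after promoting machineC from machineA's replica, point half of them at it
SHARD_TO_DB[2] = 2   # machineC
SHARD_TO_DB[3] = 2   # machineC
# the now-unused shard2/shard3 schemas on machineA can be dropped later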

PGBouncer abstracts moving DBs from the app logic

Monday April 23 12
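
One way to get that indirection: the app always connects to PGBouncer aliases, and the host mapping lives in its config; a hypothetical pgbouncer.ini fragment (names and hosts are illustrative):

[databases]
; re-pointing a shard at a new machine is a config edit plus a reload
shard0 = host=machineA port=5432 dbname=instagram
shard2 = host=machineC port=5432 dbname=instagram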

can do this as long as you have more logical shards than physical ones

Monday April 23 12

beauty of schemas is that they are physically different files

Monday April 23 12

(no I/O hit when deleting, no ‘swiss cheese’)

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

Monday April 23 12

(could be solved by having a concept of “read-only” mode for some DBs)

Monday April 23 12

not great for range-scans that would span across shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
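
A sketch of that v1 table and the indexes those four queries want; column types and index names are assumptions:

-- sketch of the v1 follow-graph table
CREATE TABLE follows (
    source_id bigint   NOT NULL,  -- who is following
    target_id bigint   NOT NULL,  -- who is being followed
    status    smallint NOT NULL,  -- e.g. active / pending / blocked
    PRIMARY KEY (source_id, target_id)  -- "who do I follow?", "do I follow X?"
);
CREATE INDEX follows_target_idx ON follows (target_id, source_id);
-- (target_id, source_id) serves "who follows me?" and "does X follow me?"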

DB was busy, so we started storing a parallel version in Redis

Monday April 23 12

follow_all(300-item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter’s book

Monday April 23 12

PG can handle tens of thousands of requests with very light memcached caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware, etc., to reduce the need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don’t need to give up PG’s durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

“don’t shard until you have to”

Monday April 23 12

(but don’t over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we’re really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12

Page 63: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

somewhere in this crazy couple of months

Christophe to the rescue

Monday April 23 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 64: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

photos kept growing and growing

Monday April 23 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 65: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

and only 68GB of RAM on biggest machine in EC2

Monday April 23 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 66: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

so what now

Monday April 23 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 67: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Phase 2 Vertical Partitioning

Monday April 23 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 68: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

django db routers make it pretty easy

Monday April 23 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1. should be time sortable without requiring a lookup

Monday April 23 12

2. should be 64-bit

Monday April 23 12

3. low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12


CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

Monday April 23 12
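
For context, a function like that is typically attached as a column default so every insert picks up a unique, roughly time-ordered 64-bit id with no extra round trip; a hedged sketch (the table and columns are hypothetical, only the next_id() call mirrors the function above):

CREATE TABLE insta5.photos_by_user (
    id bigint NOT NULL DEFAULT insta5.next_id(),
    user_id bigint NOT NULL,
    PRIMARY KEY (id)
);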

pulling shard ID from ID

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23

Monday April 23 12
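
Decoding one of these ids in application code is plain bit arithmetic; a minimal Python sketch following the 41/13/10 bit layout above (the epoch constant is the one hard-coded in next_id()):

OUR_EPOCH = 1314220021721  # millis, from the next_id() function

def decode_id(generated_id):
    millis = (generated_id >> 23) + OUR_EPOCH           # 41 bits of time
    shard_id = (generated_id >> 10) & ((1 << 13) - 1)   # 13 bits of shard ID
    seq_id = generated_id & ((1 << 10) - 1)             # 10 bits of sequence
    return millis, shard_id, seq_id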

pros: guaranteed unique in 64-bits, not much of a CPU overhead

Monday April 23 12

cons: large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well, what about "re-sharding"?

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12
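
As a hedged illustration of that indirection (hostnames and database names here are made up, not Instagram's actual config): the app keeps connecting to the same logical name through PgBouncer, and only the [databases] mapping changes when a shard moves to a new machine.

[databases]
; before the split, shard2 pointed at machineA:
;   shard2 = host=machineA dbname=shards
; after promoting the clone, only this line changes:
shard2 = host=machineC dbname=shards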

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?

do I follow X? does X follow me?

Monday April 23 12
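
A minimal sketch of that v1 table and the indexes that answer those four questions (types, names, and the status values are assumptions, not the production schema):

CREATE TABLE followers (
    source_id bigint NOT NULL,   -- the user doing the following
    target_id bigint NOT NULL,   -- the user being followed
    status smallint NOT NULL,    -- e.g. active / pending / blocked
    PRIMARY KEY (source_id, target_id)
);
-- "who do I follow?" and "do I follow X?" use the primary key;
-- "who follows me?" and "does X follow me?" need the reverse index:
CREATE INDEX followers_target_idx ON followers (target_id, source_id);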

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12

Page 69: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

def db_for_read(self model) if app_label == photos return photodb

Monday April 23 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 70: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

once you untangle all your foreign key

relationships

Monday April 23 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 71: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

(all of those useruser_id interchangeable calls bite

you now)

Monday April 23 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 72: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

plenty of time spent in PGFouine

Monday April 23 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 73: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

read slaves (using streaming replicas) where

we need to reduce contention

Monday April 23 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 74: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

a few months later

Monday April 23 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps


isolating services to minimize open conns


investigate physical hardware, etc., to reduce need to re-shard

Wrap up


you don’t need to give up PG’s durability & features to shard

continue to let the DB do what the DB is great at


“don’t shard until you have to”

(but don’t over-estimate how hard it will be either)

scaled within constraints of the cloud


PG success story


(we’re really excited about 9.2)

thanks! any Qs?

Page 75: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

photosdb gt 60GB

Monday April 23 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 76: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

precipitated by being on cloud hardware but likely to have hit limit eventually

either way

Monday April 23 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 77: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

what now

Monday April 23 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 78: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

horizontal partitioning

Monday April 23 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 79: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Phase 3 sharding

Monday April 23 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 80: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

ldquosurely wersquoll have hired someone experienced before we actually need

to shardrdquo

Monday April 23 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 81: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

never true about scaling

Monday April 23 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros: guaranteed unique in 64-bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

PGBouncer abstracts moving DBs from the app logic
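one way that can look (purely illustrative PgBouncer [databases] entries; aliases and hostnames invented, not the production config): the app always connects through a stable alias, and pointing a set of logical shards at a new machine is a one-line change plus a reload:

[databases]
; shards 0-3 just moved from machineA to its promoted replica machineC
shards_0_3 = host=machineC port=5432 dbname=instagram
shards_4_7 = host=machineB port=5432 dbname=instagram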

can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no IO hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me? do I follow X? does X follow me?
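roughly what that v1 table and its four lookups look like (a sketch: table/column names, types and indexes are assumptions, not the production schema):

CREATE TABLE followers (
    source_id bigint NOT NULL,    -- the user doing the following
    target_id bigint NOT NULL,    -- the user being followed
    status smallint NOT NULL,
    PRIMARY KEY (source_id, target_id)
);
CREATE INDEX followers_target_idx ON followers (target_id);

-- who do I follow? / who follows me?
SELECT target_id FROM followers WHERE source_id = 33;
SELECT source_id FROM followers WHERE target_id = 33;
-- do I follow X? / does X follow me?
SELECT 1 FROM followers WHERE source_id = 33 AND target_id = 44;
SELECT 1 FROM followers WHERE source_id = 44 AND target_id = 33;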

DB was busy, so we started storing parallel version in Redis

follow_all(300 item list)
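the pain was the dual write: every edge had to land in Postgres and in Redis. a rough sketch of that shape (redis-py; key names here are invented, not the production layout):

import redis

r = redis.StrictRedis()

def follow_all(source_id, target_ids):        # target_ids: up to ~300 users
    # canonical write to the sharded Postgres table happens first (not shown)
    pipe = r.pipeline()
    for target_id in target_ids:
        pipe.sadd("following:%d" % source_id, target_id)
        pipe.sadd("followers:%d" % target_id, source_id)
    pipe.execute()
    # any failure between the two writes is where inconsistency creeps in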

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching

next steps

isolating services to minimize open conns

investigate physical hardware etc to reduce need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any q's?

Page 82: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

1 choosing a method2 adapting the application

3 expanding capacity

Monday April 23 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 83: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

evaluated solutions

Monday April 23 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 84: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

at the time none were up to task of being our

primary DB

Monday April 23 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 85: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

NoSQL alternatives

Monday April 23 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 86: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Skypersquos sharding proxy

Monday April 23 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 87: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Rangedate-based partitioning

Monday April 23 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 88: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

did in Postgres itself

Monday April 23 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')
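You can see the "physically different files" part directly from SQL: each schema's copy of a table has its own file on disk, which is why dropping a moved-away schema is an unlink rather than a bulk DELETE. An illustrative check (pg_relation_filepath() is available in PostgreSQL 9.0+; DSN and schema/table names are assumed):

    import psycopg2

    with psycopg2.connect("host=db-a dbname=insta") as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_relation_filepath('shard0.photos_by_user')")
        print(cur.fetchone()[0])  # e.g. base/16384/16423 -- its own file under the data dir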

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
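Those four questions map onto two access paths over that table, one keyed by source and one by target, plus point lookups. A minimal illustrative sketch; the table name, column names, and status value are assumptions:

    # assumed: follows(source_id, target_id, status), indexed on
    # (source_id, status) and (target_id, status)
    WHO_DO_I_FOLLOW = "SELECT target_id FROM follows WHERE source_id = %s AND status = 'active'"
    WHO_FOLLOWS_ME  = "SELECT source_id FROM follows WHERE target_id = %s AND status = 'active'"
    DO_I_FOLLOW_X   = ("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s"
                       " AND status = 'active' LIMIT 1")
    # "does X follow me?" is the same point lookup with the arguments swapped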

DB was busy, so we started storing parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching
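A hedged sketch of what "very light" caching can look like for the hot point lookup: read-through with a short TTL, with PG staying the source of truth (the client library, key format, TTL, and table are assumptions):

    import memcache       # python-memcached
    import psycopg2

    mc = memcache.Client(["127.0.0.1:11211"])
    CACHE_TTL = 30        # seconds; short, so stale answers age out quickly

    def do_i_follow(conn, source_id, target_id):
        key = "follows:%d:%d" % (source_id, target_id)
        cached = mc.get(key)
        if cached is not None:
            return cached == "1"
        with conn.cursor() as cur:
            cur.execute("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s"
                        " AND status = 'active' LIMIT 1", (source_id, target_id))
            result = cur.fetchone() is not None
        mc.set(key, "1" if result else "0", time=CACHE_TTL)
        return result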

next steps

isolating services to minimize open conns

investigate physical hardware etc. to reduce need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any q's?

Page 89: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

requirements

Monday April 23 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 90: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

1 low operational amp code complexity

Monday April 23 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 91: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

2 easy expanding of capacity

Monday April 23 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 92: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

3 low performance impact on application

Monday April 23 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 93: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

schema-based logical sharding

Monday April 23 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 94: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

many many many (thousands) of logical

shards

Monday April 23 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 95: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

that map to fewer physical ones

Monday April 23 12

8 logical shards on 2 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B

Monday April 23 12

8 logical shards on 2 4 machines

user_id 8 = logical shard

logical shards -gt physical shard map

0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D

Monday April 23 12

iexclschemas

Monday April 23 12

all that lsquopublicrsquo stuff Irsquod been glossing over for 2

years

Monday April 23 12

- database - schema - table - columns

Monday April 23 12

spun up set of machines

Monday April 23 12

using fabric created thousands of schemas

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user

Monday April 23 12

(fabric or similar parallel task executor is

essential)

Monday April 23 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12



caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 105: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

application-side logic

Monday April 23 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 106: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

SHARD_TO_DB =

SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1

Monday April 23 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 107: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

instead of Django ORM wrote really simple db

abstraction layer

Monday April 23 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 108: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

selectupdateinsertdelete

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 109: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

select(fields table_name shard_key where_statement where_parameters)

Monday April 23 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 110: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

select(fields table_name shard_key where_statement where_parameters)

shard_key num_logical_shards = shard_id

Monday April 23 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 111: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

in most cases user_id for us

Monday April 23 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 112: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

custom Django test runner to createtear-down sharded DBs

Monday April 23 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 113: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

most queries involve visiting handful of shards

over one or two machines

Monday April 23 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone <master>

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineC: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the app logic

Monday April 23 12

can do this as long as you have more logical shards than physical ones

Monday April 23 12
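
A sketch of why that ratio matters: the app (or the PGBouncer config behind it) only needs a logical-shard-to-database map, so re-homing one schema is a one-entry change here plus the clone/promote dance above. Every name below is illustrative:

# illustrative only: logical shards are schemas, PGBouncer exposes one
# "database" per physical box, and this map is the single thing that moves
SHARD_TO_DB = {
    "shard0": "machineA",
    "shard1": "machineA",
    "shard2": "machineC",   # lived on machineA before the re-shard
    "shard3": "machineC",
}

def database_for(logical_shard):
    # the real system would hand back a PGBouncer connection here
    return SHARD_TO_DB[logical_shard]

print(database_for("shard2"))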

beauty of schemas is that they are physically different files

Monday April 23 12

(no IO hit when deleting, no 'swiss cheese')

Monday April 23 12

downside: requires ~30 seconds of maintenance to roll out new schema mapping

Monday April 23 12

(could be solved by having a concept of "read-only" mode for some DBs)

Monday April 23 12

not great for range-scans that would span across shards

Monday April 23 12

latest project: follow graph

Monday April 23 12

v1: simple DB table (source_id, target_id, status)

Monday April 23 12

who do I follow? who follows me?
do I follow X? does X follow me?

Monday April 23 12
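
A sketch of the v1 table and the four questions it has to answer; the three column names are from the slide, while the table name, index choice, and status values are assumptions:

# illustrative DDL and queries for the v1 follow graph
DDL = """
CREATE TABLE followers (
    source_id bigint   NOT NULL,   -- the user doing the following
    target_id bigint   NOT NULL,   -- the user being followed
    status    smallint NOT NULL,   -- e.g. active / blocked (assumed)
    PRIMARY KEY (source_id, target_id)
);
CREATE INDEX ON followers (target_id);
"""

QUERIES = {
    "who do I follow":  "SELECT target_id FROM followers WHERE source_id = %s",
    "who follows me":   "SELECT source_id FROM followers WHERE target_id = %s",
    "do I follow X":    "SELECT 1 FROM followers WHERE source_id = %s AND target_id = %s",
    "does X follow me": "SELECT 1 FROM followers WHERE source_id = %s AND target_id = %s",  # params swapped
}

for question, sql in QUERIES.items():
    print(question, "->", sql)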

DB was busy, so we started storing a parallel version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

redesign took a page from twitter's book

Monday April 23 12

PG can handle tens of thousands of requests, very light memcached caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware, etc., to reduce the need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you don't need to give up PG's durability & features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

"don't shard until you have to"

Monday April 23 12

(but don't over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(we're really excited about 9.2)

Monday April 23 12

thanks! any qs?

Monday April 23 12

Page 114: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

if mapping across shards on single DB UNION

ALL to aggregate

Monday April 23 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 115: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

clients to library pass in((shard_key id)

(shard_keyid)) etc

Monday April 23 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 116: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

library maps sub-selects to each shard and each

machine

Monday April 23 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 117: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

parallel execution (per-machine at least)

Monday April 23 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 118: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)

Monday April 23 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 119: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

eventually would be nice to parallelize across

machines

Monday April 23 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 120: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

next challenge unique IDs

Monday April 23 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 121: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

requirements

Monday April 23 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 122: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

1 should be time sortable without requiring

a lookup

Monday April 23 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 123: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

2 should be 64-bit

Monday April 23 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 124: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

3 low operational complexity

Monday April 23 12

surveyed the options

Monday April 23 12

ticket servers

Monday April 23 12

UUID

Monday April 23 12

twitter snowflake

Monday April 23 12

application-level IDs ala Mongo

Monday April 23 12

hey the db is already pretty good about

incrementing sequences

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]

Monday April 23 12

CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id

SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL

Monday April 23 12

pulling shard ID from ID

shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23

Monday April 23 12

pros guaranteed unique in 64-bits not much of a

CPU overhead

Monday April 23 12

cons large IDs from the get-go

Monday April 23 12

hundreds of millions of IDs generated with this

scheme no issues

Monday April 23 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user
machineA': shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user

machineA: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user
machineC: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user

PGBouncer abstracts moving DBs from the app logic

can do this as long as you have more logical shards than physical ones
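A minimal sketch of that indirection at the application level, assuming made-up shard counts and helper names; in the talk the indirection actually lives in the PGBouncer configuration rather than in app code.

LOGICAL_SHARDS = 8  # made-up count; the real system has many more logical shards than machines

SHARD_TO_MACHINE = {
    0: "machineA", 1: "machineA", 2: "machineA", 3: "machineA",
    4: "machineC", 5: "machineC", 6: "machineC", 7: "machineC",
}

def locate(user_id):
    shard = user_id % LOGICAL_SHARDS                    # which shardN schema holds the row
    return SHARD_TO_MACHINE[shard], "shard%d" % shard   # (machine, schema) to send the query to

# Moving a schema from machineA to machineC only changes SHARD_TO_MACHINE
# (or, in practice, the PGBouncer config); the application queries stay the same.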

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no "swiss cheese")

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards

latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me? do I follow X? does X follow me?
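For concreteness, a hypothetical sketch of those lookups against the v1 table using psycopg2; the columns follow the slide, while the table name, connection string, and dropping the status filter are assumptions made for brevity.

import psycopg2

conn = psycopg2.connect("dbname=insta")  # connection details made up for the example
cur = conn.cursor()

def following(user_id):                  # who do I follow?
    cur.execute("SELECT target_id FROM follows WHERE source_id = %s", (user_id,))
    return [row[0] for row in cur.fetchall()]

def followers(user_id):                  # who follows me?
    cur.execute("SELECT source_id FROM follows WHERE target_id = %s", (user_id,))
    return [row[0] for row in cur.fetchall()]

def is_follow(source_id, target_id):     # do I follow X? / does X follow me? (swap args)
    cur.execute("SELECT 1 FROM follows WHERE source_id = %s AND target_id = %s",
                (source_id, target_id))
    return cur.fetchone() is not None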

DB was busy, so we started storing a parallel version in Redis
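A hypothetical sketch of that parallel Redis copy using redis-py sets; key names and the dual-write shape are assumptions, not from the talk.

import redis

r = redis.StrictRedis(host="localhost", port=6379)

def record_follow(source_id, target_id):
    # the canonical write still goes to the PG follows table first (not shown)
    r.sadd("following:%d" % source_id, target_id)
    r.sadd("followers:%d" % target_id, source_id)

def redis_is_follow(source_id, target_id):
    return r.sismember("following:%d" % source_id, target_id)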

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching
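A sketch of what "very light memcached caching" in front of PG can look like, using the python-memcached client; the key format, TTL, and reuse of the followers() query sketched earlier are assumptions.

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def followers_cached(user_id, ttl=60):
    key = "followers:%d" % user_id
    cached = mc.get(key)
    if cached is None:
        cached = followers(user_id)   # the plain PG query sketched earlier
        mc.set(key, cached, time=ttl)
    return cached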

next steps

isolating services to minimize open conns

investigate physical hardware etc. to reduce need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any q's?


machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 140: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

well what about ldquore-shardingrdquo

Monday April 23 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 141: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

first recourse pg_reorg

Monday April 23 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 142: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

rewrites tables in index order

Monday April 23 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 143: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

only requires brief locks for atomic table renames

Monday April 23 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 144: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

20+GB savings on some of our dbs

Monday April 23 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 145: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

Monday April 23 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 146: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

especially useful on EC2

Monday April 23 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 147: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

but sometimes you just have to reshard

Monday April 23 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 148: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

streaming replication to the rescue

Monday April 23 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 149: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

(btw repmgr is awesome)

Monday April 23 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 150: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

repmgr standby clone ltmastergt

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 151: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 152: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user

Monday April 23 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 153: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

PGBouncer abstracts moving DBs from the

app logic

Monday April 23 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 154: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

can do this as long as you have more logical shards than physical

ones

Monday April 23 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 155: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

beauty of schemas is that they are physically

different files

Monday April 23 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 156: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

(no IO hit when deleting no lsquoswiss cheesersquo)

Monday April 23 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 157: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

downside requires ~30 seconds of maintenance to roll out new schema

mapping

Monday April 23 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs

Monday April 23 12

Page 158: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12

(could be solved by having concept of ldquoread-

onlyrdquo mode for some DBs)

Monday April 23 12

not great for range-scans that would span across

shards

Monday April 23 12

latest project follow graph

Monday April 23 12

v1 simple DB table(source_id target_id

status)

Monday April 23 12

who do I followwho follows me

do I follow Xdoes X follow me

Monday April 23 12

DB was busy so we started storing parallel

version in Redis

Monday April 23 12

follow_all(300 item list)

Monday April 23 12

inconsistency

Monday April 23 12

extra logic

Monday April 23 12

so much extra logic

Monday April 23 12

exposing your support team to the idea of cache invalidation

Monday April 23 12

Monday April 23 12

redesign took a page from twitterrsquos book

Monday April 23 12

PG can handle tens of thousands of requests very light memcached

caching

Monday April 23 12

next steps

Monday April 23 12

isolating services to minimize open conns

Monday April 23 12

investigate physical hardware etc to reduce

need to re-shard

Monday April 23 12

Wrap up

Monday April 23 12

you donrsquot need to give up PGrsquos durability amp features to shard

Monday April 23 12

continue to let the DB do what the DB is great at

Monday April 23 12

ldquodonrsquot shard until you have tordquo

Monday April 23 12

(but donrsquot over-estimate how hard it will be either)

Monday April 23 12

scaled within constraints of the cloud

Monday April 23 12

PG success story

Monday April 23 12

(wersquore really excited about 92)

Monday April 23 12

thanks any qs
