choosing a shard key

Post on 15-Jan-2015

4.723 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Choosing a shard key can be difficult, and the factors involved largely depend on your use case. In fact, there is no such thing as a perfect shard key; there are design tradeoffs inherent in every decision. This presentation goes through those tradeoffs, as well as the different types of shard keys available in MongoDB, such as hashed and compound shard keys

TRANSCRIPT

Sever Engineer, 10gen

Shaun Verch

#oscon

Choosing a Shard Key: Four Real-World Use Cases

Sever Engineer, 10gen

Shaun Verch

#oscon

Schema DesignFour Real-World Use Cases

Single Table En

Agenda

• Why is schema design important

• 4 Real World Schemas– Inbox– History– Indexed Attributes– Multiple Identities

• Conclusions

Why is Schema Design important?

• Largest factor for a performant system

• Schema design with MongoDB is different

• RDBMS – "What answers do I have?"• MongoDB – "What question will I have?"

#1 - Message Inbox

Let’s getSocial

Sending Messages

?

Design Goals

• Efficiently send new messages to recipients

• Efficiently read inbox

Reading my Inbox

?

3 Approaches (there are more)• Fan out on Read

• Fan out on Write

• Fan out on Write with Bucketing

// Shard on "from"db.shardCollection( ”oscon.inbox", { from: 1 } )

// Make sure we have an index to handle inbox readsdb.inbox.ensureIndex( { to: 1, sent: 1 } )

msg = {from: "Joe",to: [ "Bob", "Jane" ],

sent: new Date(), message: "Hi!",

}

// Send a messagedb.inbox.save( msg )

// Read my inboxdb.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )

Fan out on read

Fan out on read – I/O

Shard 1

Shard 2

Shard 3

Send Message

Fan out on read – I/O

Shard 1

Shard 2

Shard 3

Read Inbox

Send Message

Considerations

• Write: One document per message sent

• Read: Find all messages with my own name in the recipient field

• Read: Requires scatter-gather on sharded cluster

• A lot of random I/O on a shard to find everything

// Shard on “recipient” and “sent” db.shardCollection( ”oscon.inbox", { ”recipient”: 1, ”sent”: 1 } )

msg = {from: "Joe",to: [ "Bob", "Jane" ],

sent: new Date(), message: "Hi!",

}

// Send a messagefor (var i = 0; i < msg.to.length; i++) {

msg.recipient = msg.to[i]db.inbox.save( msg );

}

// Read my inboxdb.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )

Fan out on write

Fan out on write – I/O

Shard 1

Shard 2

Shard 3

Send Message

Fan out on write – I/O

Read Inbox

Send Message

Shard 1

Shard 2

Shard 3

Considerations

• Write: One document per recipient

• Read: Find all of the messages with me as the recipient

• Can shard on recipient, so inbox reads hit one shard

• But still lots of random I/O on the shard

// Shard on "owner / sequence"db.shardCollection( ”oscon.inbox", { owner: 1, sequence: 1 } )db.shardCollection( ”oscon.users", { user_name: 1 } )

msg = {from: "Joe",to: [ "Bob", "Jane" ],

sent: new Date(), message: "Hi!",

}

Fan out on write with buckets

// Send a messagefor(var i = 0; i < msg.to.length; i++) { count = db.users.findAndModify({

query: { user_name: msg.to[i] }, update: { "$inc": { "msg_count": 1 } }, upsert: true, new: true }).msg_count;

sequence = Math.floor(count / 50);

db.inbox.update({ owner: msg.to[i], sequence: sequence }, { $push: { "messages": msg } },{ upsert: true } );

}

// Read my inboxdb.inbox.find( { owner: "Joe" } ) .sort ( { sequence: -1 } ).limit( 2 )

Fan out on write with buckets

Fan out on write with buckets• Each “inbox” document is an array of

messages

• Append a message onto “inbox” of recipient

• Bucket inboxes so there’s not too many messages per document

• Can shard on recipient, so inbox reads hit one shard

• 1 or 2 documents to read the whole inbox

Fan out on write with buckets – I/O

Shard 1

Shard 2

Shard 3

Send Message

Shard 1

Shard 2

Shard 3

Fan out on write with buckets – I/O

Read Inbox

Send Message

#2 – History

Design Goals

• Need to retain a limited amount of history e.g.– Hours, Days, Weeks– May be legislative requirement (e.g. HIPPA, SOX,

DPA)

• Need to query efficiently by – match– ranges

3 Approaches (there are more)• Bucket by Number of messages

• Fixed size array

• Bucket by date + TTL collections

db.inbox.find() { owner: "Joe", sequence: 25, messages: [ { from: "Joe", to: [ "Bob", "Jane" ], sent: ISODate("2013-03-01T09:59:42.689Z"), message: "Hi!" }, …] }

// Query with a date rangedb.inbox.find ({owner: "friend1", messages: { $elemMatch: {sent:{$gte: ISODate("…") }}}})

// Remove elements based on a datedb.inbox.update({owner: "friend1" }, { $pull: { messages: { sent: { $gte: ISODate("…") } } } } )

Bucket by number of messages

Considerations

• Shrinking documents, space can be reclaimed with– db.runCommand ( { compact: '<collection>' } )

• Removing the document after the last element in the array as been removed– { "_id" : …, "messages" : [ ], "owner" : "friend1", "sequence" : 0 }

msg = { from: "Your Boss", to: [ "Bob" ],

sent: new Date(), message: "CALL ME NOW!"

}

// 2.4 Introduces $each, $sort and $slice for $pushdb.messages.update(

{ _id: 1 }, { $push: { messages: { $each: [ msg ],

$sort: { sent: 1 },

$slice: -50 }

} })

Fixed Size Array

Considerations

• Need to compute the size of the array based on retention period

// messages: one doc per user per day

db.inbox.findOne(){

_id: 1, to: "Joe", sequence: ISODate("2013-02-04T00:00:00.392Z"), messages: [ ] }

// Auto expires data after 31536000 seconds = 1 yeardb.messages.ensureIndex( { sequence: 1 }, { expireAfterSeconds: 31536000 } )

TTL Collections

#3 – Indexed Attributes

Design Goal

• Application needs to stored a variable number of attributes e.g.– User defined Form– Meta Data tags

• Queries needed– Equality– Range based

• Need to be efficient, regardless of the number of attributes

2 Approaches (there are more)• Attributes as Embedded Document

• Attributes as Objects in an Array

db.files.insert( { _id: "local.0", attr: { type: "text", size: 64, created: ISODate("..." } } )

db.files.insert( { _id: "local.1", attr: { type: "text", size: 128} } )

db.files.insert( { _id: "mongod", attr: { type: "binary", size: 256, created: ISODate("...") } } )

// Need to create an index for each item in the sub-documentdb.files.ensureIndex( { "attr.type": 1 } )db.files.find( { "attr.type": "text"} )

// Can perform range queriesdb.files.ensureIndex( { "attr.size": 1 } )db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )

Attributes as a Sub-Document

Considerations

• Each attribute needs an Index

• Each time you extend, you add an index

• Lots and lots of indexes

db.files.insert( {_id: "local.0", attr: [ { type: "text" },

{ size: 64 },

{ created: ISODate("...") } ] } )

db.files.insert( { _id: "local.1", attr: [ { type: "text" },

{ size: 128 } ] } )

db.files.insert( { _id: "mongod", attr: [ { type: "binary" },

{ size: 256 }, { created: ISODate("...") } ] } )

db.files.ensureIndex( { attr: 1 } )

Attributes as Objects in Array

Considerations

• Only one index needed on attr

• Can support range queries, etc.

• Index can be used only once per query

#4 – Multiple Identities

Design Goal

• Ability to look up by a number of different identities e.g.

• Username• Email address• FB Handle• LinkedIn URL

2 Approaches (there are more)• Identifiers in a single document

• Separate Identifiers from Content

db.users.findOne(){ _id: "joe", email: "joe@example.com, fb: "joe.smith", // facebook li: "joe.e.smith", // linkedin other: {…}}

// Shard collection by _iddb.shardCollection(”oscon.users", { _id: 1 } )

// Create indexes on each keydb.users.ensureIndex( { email: 1} )db.users.ensureIndex( { fb: 1 } )db.users.ensureIndex( { li: 1 } )

Single Document by User

Read by _id (shard key)

Shard 1 Shard 2 Shard 3

find( { _id: "joe"} )

Read by email (non-shard key)

Shard 1 Shard 2 Shard 3

find ( { email: joe@example.com } )

Considerations

• Lookup by shard key is routed to 1 shard

• Lookup by other identifier is scatter gathered across all shards

• Secondary keys cannot have a unique index

// Create unique indexdb.identities.ensureIndex( { identifier : 1} , { unique: true} )

// Create a document for each users documentdb.identities.save( { identifier : { hndl: "joe" }, user: "1200-42" } )db.identities.save( { identifier : { email: "joe@abc.com" }, user: "1200-42" } )db.identities.save( { identifier : { li: "joe.e.smith" }, user: "1200-42" } )

// Shard collection by _iddb.shardCollection( "mydb.identities", { identifier : 1 } )

// Create unique indexdb.users.ensureIndex( { _id: 1} , { unique: true} )

// Shard collection by _iddb.shardCollection( "mydb.users", { _id: 1 } )

Document per Identity

Read requires 2 reads

Shard 1 Shard 2 Shard 3

db.identities.find({"identifier" : { "hndl" : "joe" }})

db.users.find( { _id: "1200-42"} )

Considerations

• Lookup to Identities is a routed query

• Lookup to Users is a routed query

• Unique indexes available

• Must do two queries per lookup

Conclusion

Summary

• Multiple ways to model a domain problem

• Understand the key uses cases of your app

• Balance between ease of query vs. ease of write

• Random I/O should be avoided

Server Engineer, 10gen

Shaun Verch

#oscon

Thank You

top related