schema tricks & tips

Technical Director, 10gen

@jonnyeight [email protected] alvinonmongodb.com

Alvin Richards

#MongoDBdays

Schema Design 4 Real World Use Cases

One size fits all?

Single Table En

Agenda

•  Why is schema design important

•  4 Real World Schemas
   –  Inbox
   –  History
   –  Indexed Attributes
   –  Multiple Identities

•  Conclusions

Why is Schema Design important?

•  Largest factor for a performant system

•  Schema design with MongoDB is different
   •  RDBMS – "What answers do I have?"
   •  MongoDB – "What question will I have?"

#1 - Message Inbox

Let’s get Social

Sending Messages

?

Design Goals

•  Efficiently send new messages to recipients

•  Efficiently read inbox

Reading my Inbox

?

3 Approaches (there are more)

•  Fan out on Read

•  Fan out on Write

•  Fan out on Write with Bucketing

// Shard on "from"
sh.shardCollection( "mongodbdays.inbox", { from: 1 } )

// Make sure we have an index to handle inbox reads
db.inbox.ensureIndex( { to: 1, sent: 1 } )

msg = {
    from: "Joe",
    to: [ "Bob", "Jane" ],
    sent: new Date(),
    message: "Hi!"
}

// Send a message
db.inbox.save( msg )

// Read my inbox
db.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )

Fan out on read

Fan out on read – Send Message

Shard 1 Shard 2 Shard 3

Send Message

Fan out on read – Inbox Read


Read Inbox

Considerations

•  1 document per message sent

•  Multiple recipients in an array key

•  Reading an inbox is finding all messages with my own name in the recipient field

•  Requires scatter-gather on sharded cluster

•  Then a lot of random IO on a shard to find everything
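The scatter-gather cost listed above can be illustrated with a small in-memory model. This is plain JavaScript, not MongoDB code: the three arrays standing in for shards, the `shardFor` placement function, and the second sample message are assumptions made for illustration only.

```javascript
// In-memory model of fan out on read (illustrative, not the MongoDB API).
// Messages are placed on a shard by sender, as with a shard key on "from".
const NUM_SHARDS = 3;
const readShards = Array.from({ length: NUM_SHARDS }, () => []);

// Hypothetical placement function: sum of character codes modulo shard count.
function shardFor(key) {
  let sum = 0;
  for (const ch of key) sum += ch.charCodeAt(0);
  return sum % NUM_SHARDS;
}

// Sending writes exactly one document, to the sender's shard.
function sendOnRead(msg) {
  readShards[shardFor(msg.from)].push(msg);
}

// Reading cannot be routed by recipient, so every shard is queried.
function readInboxOnRead(user) {
  let shardsQueried = 0;
  const found = [];
  for (const shard of readShards) {
    shardsQueried++;
    for (const m of shard) {
      if (m.to.includes(user)) found.push(m);
    }
  }
  found.sort((a, b) => b.sent - a.sent); // newest first
  return { shardsQueried, messages: found };
}

sendOnRead({ from: "Joe", to: ["Bob", "Jane"], sent: new Date(1), message: "Hi!" });
sendOnRead({ from: "Jane", to: ["Bob"], sent: new Date(2), message: "Lunch?" });
const bobInbox = readInboxOnRead("Bob");
```

In this model a send touches one shard, while `readInboxOnRead` always visits all of them, however few messages the user has.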

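// Shard on "recipient" and "sent"
sh.shardCollection( "mongodbdays.inbox", { recipient: 1, sent: 1 } )

msg = {
    from: "Joe",
    to: [ "Bob", "Jane" ],
    sent: new Date(),
    message: "Hi!"
}

// Send a message: save one copy per recipient.
// for..in yields array indexes, so look the name up with msg.to[i]
for ( i in msg.to ) {
    msg.recipient = msg.to[i]
    delete msg._id    // save() adds _id to msg; clear it so each copy is a new document
    db.inbox.save( msg )
}

// Read my inbox
db.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )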

Fan out on write

Fan out on write – Send Message


Send Message

Fan out on write – Read Inbox


Read Inbox

Considerations

•  1 document per recipient

•  Reading my inbox is just finding all of the messages with me as the recipient

•  Can shard on recipient, so inbox reads hit one shard

•  But still lots of random IO on the shard
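The trade-off in these considerations (more writes, routed reads) can be sketched with an in-memory model. As an assumption for illustration, shards are plain arrays and `shardForRecipient` is a made-up placement function; none of this is MongoDB API.

```javascript
// In-memory model of fan out on write (illustrative, not the MongoDB API).
const SHARD_COUNT = 3;
const writeShards = Array.from({ length: SHARD_COUNT }, () => []);

// Hypothetical placement function: sum of character codes modulo shard count.
function shardForRecipient(key) {
  let sum = 0;
  for (const ch of key) sum += ch.charCodeAt(0);
  return sum % SHARD_COUNT;
}

// Sending writes one copy per recipient, each to that recipient's shard.
function sendOnWrite(msg) {
  for (const recipient of msg.to) {
    writeShards[shardForRecipient(recipient)].push({ ...msg, recipient });
  }
  return msg.to.length; // number of documents written
}

// Reading is routed: only the shard that owns the recipient is queried.
function readInboxOnWrite(user) {
  return writeShards[shardForRecipient(user)]
    .filter((m) => m.recipient === user)
    .sort((a, b) => b.sent - a.sent); // newest first
}

const written = sendOnWrite({ from: "Joe", to: ["Bob", "Jane"], sent: new Date(1), message: "Hi!" });
const janeInbox = readInboxOnWrite("Jane");
```

The write cost grows with the number of recipients, but each inbox read inspects a single shard.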

Fan out on write with buckets

•  Each “inbox” document is an array of messages

•  Append a message onto “inbox” of recipient

•  Bucket inbox documents so there aren’t too many messages per document

•  Can shard on recipient, so inbox reads hit one shard

•  A few documents to read the whole inbox
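The bucketing arithmetic can be tried out in isolation. The sketch below is plain JavaScript with `Map` objects standing in for the inbox and user-counter collections (an assumption for illustration); the bucket size of 50 is the value used in this deck's example.

```javascript
// Bucket arithmetic (illustrative, not the MongoDB API).
const BUCKET_SIZE = 50;

// msgCount is the recipient's counter after it has been incremented
// for the new message, so the first message arrives with msgCount = 1.
function bucketSequence(msgCount) {
  return Math.floor(msgCount / BUCKET_SIZE);
}

// In-memory stand-ins: "owner/sequence" -> messages, and owner -> counter.
const buckets = new Map();
const counters = new Map();

function appendToInbox(owner, msg) {
  const count = (counters.get(owner) || 0) + 1;
  counters.set(owner, count);
  const key = owner + "/" + bucketSequence(count);
  if (!buckets.has(key)) buckets.set(key, []);
  buckets.get(key).push(msg);
  return key;
}

for (let i = 1; i <= 120; i++) {
  appendToInbox("Bob", { from: "Joe", message: "msg " + i });
}
```

Because the counter is already 1 for the first message, bucket 0 ends up with 49 messages and every later full bucket with 50. After 120 messages the whole inbox is three documents.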

// Shard on "owner / sequence"
sh.shardCollection( "mongodbdays.inbox", { owner: 1, sequence: 1 } )
sh.shardCollection( "mongodbdays.users", { user_name: 1 } )

msg = {
    from: "Joe",
    to: [ "Bob", "Jane" ],
    sent: new Date(),
    message: "Hi!"
}

// Send a message
for ( recipient in msg.to ) {
    // Atomically increment the recipient's message counter and read the new value
    count = db.users.findAndModify({
        query: { user_name: msg.to[recipient] },
        update: { "$inc": { "msg_count": 1 } },
        upsert: true,
        new: true
    }).msg_count;

    // 50 messages per bucket
    sequence = Math.floor(count / 50);

    // Append to the recipient's current bucket, creating it if needed
    db.inbox.update(
        { owner: msg.to[recipient], sequence: sequence },
        { $push: { "messages": msg } },
        { upsert: true }
    );
}

// Read my inbox: the two most recent buckets
db.inbox.find( { owner: "Joe" } ).sort( { sequence: -1 } ).limit( 2 )

Fan out on write – with buckets

Bucketed fan out on write - Send


Send Message

Bucketed fan out on write - Read


Read Inbox

#2 – History

Design Goals

•  Need to retain a limited amount of history e.g.
   –  Hours, Days, Weeks
   –  May be a legislative requirement (e.g. HIPAA, SOX, DPA)

•  Need to query efficiently by
   –  match
   –  ranges


•  Bucket by Number of messages

•  Fixed size Array

•  Bucket by Date + TTL Collections

db.inbox.find()
{
    owner: "Joe",
    sequence: 25,
    messages: [
        { from: "Joe",
          to: [ "Bob", "Jane" ],
          sent: ISODate("2013-03-01T09:59:42.689Z"),
          message: "Hi!" },
        …
    ]
}

// Query with a date range
db.inbox.find( { owner: "friend1",
                 messages: { $elemMatch: { sent: { $gte: ISODate("2013-04-04…") } } } } )

// Remove elements based on a date
db.inbox.update( { owner: "friend1" },
                 { $pull: { messages: { sent: { $gte: ISODate("2013-04-04…") } } } } )

Inbox – Bucket by # messages

Considerations

•  Shrinking documents, space can be reclaimed with
   –  db.runCommand ( { compact: '<collection>' } )

•  Removing the document after the last element in the array has been removed
   –  { "_id" : …, "messages" : [ ], "owner" : "friend1", "sequence" : 0 }

msg = {
    from: "Your Boss",
    to: [ "Bob" ],
    sent: new Date(),
    message: "CALL ME NOW!"
}

// 2.4 Introduces $each, $sort and $slice for $push
db.messages.update(
    { _id: 1 },
    { $push: { messages: { $each: [ msg ],
                           $sort: { sent: 1 },
                           $slice: -50 } } }
)

Maintain the latest – Fixed Size Array

Considerations

•  Need to compute the size of the array based on retention period
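A rough way to size the array is expected message rate multiplied by retention period. The sketch below shows that arithmetic together with an in-memory imitation of `$push` with `$each`, `$sort` and a negative `$slice`. The rate of 20 messages per day and the 7-day retention are hypothetical numbers, and `pushSorted` is an illustrative helper, not a MongoDB function.

```javascript
// Sizing sketch (illustrative): slots needed = messages per day * days retained.
function arraySizeFor(messagesPerDay, retentionDays) {
  return Math.ceil(messagesPerDay * retentionDays);
}

// In-memory imitation of $push with $each, $sort { sent: 1 } and $slice: -maxSize:
// append, sort ascending by "sent", keep only the last maxSize elements.
function pushSorted(messages, newMsgs, maxSize) {
  const merged = messages.concat(newMsgs);
  merged.sort((a, b) => a.sent - b.sent);
  return merged.slice(-maxSize);
}

const maxSize = arraySizeFor(20, 7); // hypothetical rate and retention
let history = [];
for (let i = 1; i <= 200; i++) {
  history = pushSorted(history, [{ sent: i, message: "msg " + i }], maxSize);
}
```

With these numbers the array is capped at 140 entries, so after 200 pushes only the 140 most recent messages remain. If the real message rate is higher than the estimate, history is dropped earlier than the retention period intends.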

// messages: one doc per user per day
db.messages.findOne()
{
    _id: 1,
    to: "Joe",
    sequence: ISODate("2013-02-04T00:00:00.392Z"),
    messages: [ ]
}

// Auto expires data after 31536000 seconds = 1 year
db.messages.ensureIndex( { sequence: 1 }, { expireAfterSeconds: 31536000 } )

TTL Collections
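The `expireAfterSeconds` value is just days converted to seconds. This sketch checks the one-year figure and models when a daily bucket becomes eligible for removal. `isExpired` is an illustrative helper: MongoDB evaluates this server-side in a periodic background task, so actual removal can lag behind the computed time.

```javascript
// TTL arithmetic (illustrative; the server's TTL monitor does the real work).
const SECONDS_PER_DAY = 24 * 60 * 60;

function expireAfterSeconds(days) {
  return days * SECONDS_PER_DAY;
}

// A document is eligible for removal once indexedDate + ttlSeconds has passed.
function isExpired(indexedDate, ttlSeconds, now) {
  return indexedDate.getTime() + ttlSeconds * 1000 <= now.getTime();
}

const oneYear = expireAfterSeconds(365);
const bucketDate = new Date("2013-02-04T00:00:00Z");
const dayBefore = isExpired(bucketDate, oneYear, new Date("2014-02-03T23:59:59Z"));
const onTheDay = isExpired(bucketDate, oneYear, new Date("2014-02-04T00:00:00Z"));
```

Because each document covers one user for one day, expiry removes a whole day of history at a time rather than individual messages.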

#3 – Indexed Attributes

Design Goal

•  Application needs to store a variable number of attributes e.g.
   –  User defined Form
   –  Meta Data tags

•  Queries needed
   –  Equality
   –  Range based

•  Need to be efficient, regardless of the number of attributes


•  Attributes as a Sub-Document

•  Attributes as Objects in an Array

db.files.insert( { _id: "local.0",
                   attr: { type: "text", size: 64,
                           created: ISODate("2013-03-01T09:59:42.689Z") } } )

db.files.insert( { _id: "local.1",
                   attr: { type: "text", size: 128 } } )

db.files.insert( { _id: "mongod",
                   attr: { type: "binary", size: 256,
                           created: ISODate("2013-04-01T18:13:42.689Z") } } )

// Need to create an index for each item in the sub-document
db.files.ensureIndex( { "attr.type": 1 } )
db.files.find( { "attr.type": "text" } )

// Can perform range queries
db.files.ensureIndex( { "attr.size": 1 } )
db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )

Attributes as a Sub-Document

Considerations

•  Each attribute needs an Index

•  Each time you extend, you add an index

•  Lots and lots of indexes
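The index growth can be made concrete by counting the distinct attribute names in a few sample documents, and by reshaping the same data into the array-of-objects form that this deck lists as its second approach. Both helpers below are illustrative plain JavaScript, not MongoDB functions, and `Date` stands in for `ISODate`.

```javascript
// Sample documents in the sub-document form.
const files = [
  { _id: "local.0", attr: { type: "text", size: 64, created: new Date("2013-03-01T09:59:42.689Z") } },
  { _id: "local.1", attr: { type: "text", size: 128 } },
  { _id: "mongod", attr: { type: "binary", size: 256, created: new Date("2013-04-01T18:13:42.689Z") } },
];

// Sub-document form: one index key per distinct attribute name.
function indexKeysForSubDocument(docs) {
  const keys = new Set();
  for (const doc of docs) {
    for (const name of Object.keys(doc.attr)) keys.add("attr." + name);
  }
  return [...keys].sort();
}

// Array form: { type: "text", size: 64 } becomes [ { type: "text" }, { size: 64 } ],
// so a single index on "attr" covers every attribute.
function toAttrArray(attrs) {
  return Object.keys(attrs).map((name) => ({ [name]: attrs[name] }));
}

const subDocIndexKeys = indexKeysForSubDocument(files);
const reshaped = files.map((f) => ({ _id: f._id, attr: toAttrArray(f.attr) }));
```

Three attribute names already mean three indexes in the sub-document form, and each new user-defined attribute adds another.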

db.files.insert( { _id: "local.0",
                   attr: [ { type: "text" },
                           { size: 64 },
                           { created: ISODate("2013-03-01T09:59:42.689Z") } ] } )

db.files.insert( { _id: "local.1",
                   attr: [ { type: "text" },
                           { size: 128 } ] } )

db.files.insert( { _id: "mongod",
                   attr: [ { type: "binary" },
                           { size: 256 },
                           { created: ISODate("2013-04-01T18:13:42.689Z") } ] } )

// One index covers every attribute
db.files.ensureIndex( { attr: 1 } )

Attributes as Objects in Array

// Range queries
db.files.find( { attr: { $gt: { size: 64 }, $lte: { size: 16384 } } } )

db.files.find( { attr: { $gte: { created: ISODate("2013-02-01T00:00:01.689Z") } } } )

// Multiple conditions – only the first predicate in the query can use the index,
// so ensure that this is the most selective.
// Index Intersection will allow multiple indexes, see SERVER-3071
db.files.find( { $and: [
    { attr: { $gte: { created: ISODate("2013-02-01T…") } } },
    { attr: { $gt: { size: 128 }, $lte: { size: 16384 } } }
] } )

// Each $or clause can use an index
db.files.find( { $or: [
    { attr: { $gte: { created: ISODate("2013-02-01T…") } } },
    { attr: { $gt: { size: 128 }, $lte: { size: 16384 } } }
] } )

Queries

#4 – Multiple Identities

Design Goal

•  Ability to look up by a number of different identities e.g.
   •  Username
   •  Email address
   •  FB Handle
   •  LinkedIn URL


•  Identifiers in a single document

•  Separate Identifiers from Content

db.users.findOne()
{
    _id: "joe",
    email: "[email protected]",
    fb: "joe.smith",     // facebook
    li: "joe.e.smith",   // linkedin
    other: {…}
}

// Shard collection by _id
sh.shardCollection( "mongodbdays.users", { _id: 1 } )

// Create indexes on each key
db.users.ensureIndex( { email: 1 } )
db.users.ensureIndex( { fb: 1 } )
db.users.ensureIndex( { li: 1 } )

Single Document by User

Read by _id (shard key)


find( { _id: "joe"} )

Read by email (non-shard key)


find( { email: "[email protected]" } )

Considerations

•  Lookup by shard key is routed to 1 shard

•  Lookup by other identifier is scatter gathered across all shards

•  Secondary keys cannot have a unique index (on a sharded collection, a unique index must be prefixed by the shard key)

// Create unique index
db.identities.ensureIndex( { identifier : 1 }, { unique: true } )

// Create a document for each of the user's identities
db.identities.save( { identifier : { hndl: "joe" }, user: "1200-42" } )
db.identities.save( { identifier : { email: "[email protected]" }, user: "1200-42" } )
db.identities.save( { identifier : { li: "joe.e.smith" }, user: "1200-42" } )

// Shard collection by identifier
sh.shardCollection( "mongodbdays.identities", { identifier : 1 } )

// Create unique index
db.users.ensureIndex( { _id: 1 }, { unique: true } )

// Create a document that holds all the other user attributes
db.users.save( { _id: "1200-42", ... } )

// Shard collection by _id
sh.shardCollection( "mongodbdays.users", { _id: 1 } )

Document per Identity

Read requires 2 reads


db.identities.find({"identifier" : { "hndl" : "joe" }})

db.users.find( { _id: "1200-42"} )

Solution

•  Lookup to Identities is a routed query

•  Lookup to Users is a routed query

•  Unique indexes available
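The two-step lookup can be sketched in memory. As an assumption for illustration, two `Map` objects stand in for the `identities` and `users` collections, the `"kind:value"` key string replaces the embedded identifier document, and the `name` field is made up. None of this is MongoDB API.

```javascript
// In-memory stand-ins for the two collections (illustrative only).
const identities = new Map(); // "<kind>:<value>" -> user id
const users = new Map();      // user id -> user document

// Registering fails on a duplicate, imitating the unique index on "identifier".
function addIdentity(kind, value, userId) {
  const key = kind + ":" + value;
  if (identities.has(key)) throw new Error("duplicate identifier: " + key);
  identities.set(key, userId);
}

// Two keyed reads: identifier -> user id, then user id -> user document.
function findUser(kind, value) {
  const userId = identities.get(kind + ":" + value);
  if (userId === undefined) return null;
  return users.get(userId) || null;
}

users.set("1200-42", { _id: "1200-42", name: "Joe" }); // "name" is a made-up field
addIdentity("hndl", "joe", "1200-42");
addIdentity("li", "joe.e.smith", "1200-42");

const byHandle = findUser("hndl", "joe");
const byLinkedIn = findUser("li", "joe.e.smith");
```

Each step is a lookup by the key its collection is sharded on, which is why both reads can be routed to a single shard instead of being broadcast.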

Conclusion

Summary

•  Multiple ways to model a domain problem

•  Understand the key use cases of your app

•  Balance between ease of query vs. ease of write

•  Random IO should be avoided


Thank You
