mongo db
About me
● Delta Electronic CTBD Senior Engineer
● Main developer of http://loltw.net
  ○ Website built on MongoDB, with about 600k page views per day
  ○ Data grows every day through automated crawler bots
MongoDB - Simple Introduction
● Document-based NoSQL (Not Only SQL) database
● Started in 2007 by the company 10gen
● Written in C++
● Fast (but uses a lot of memory)
● Stores JSON documents in BSON format
● Full index support on any document attribute
● Horizontal scalability with auto-sharding
● High availability, replica ready
What is database?
● Raw data
  ○ John is a student, he's 12 years old.
● Data
  ○ Student
    ■ name = "John"
    ■ age = 12
● Records
  ○ Student(name="John", age=12)
  ○ Student(name="Alice", age=11)
● Database
  ○ Student Table
  ○ Grades Table
Example of (relational) database
Student: Student ID, Name, Age, Class ID
Class: Class ID, Name
Student Grade: Grade ID, Student ID, Grade
Grade: Grade ID, Name
SQL Language - How to find data?
● Find the student whose name is John
  ○ select * from student where name="John"
● Find the class name of John
  ○ select s.name, c.name as class_name from student s, class c where s.name="John" and s.class_id=c.class_id
Why NOSQL?
● Big data
  ○ Modern data sets are too big for a single DB server
  ○ Google search engine
● Connectivity
  ○ Facebook like button
● Semi-structured data
  ○ Car equipment database
● High availability
  ○ The basis of cloud services
Common NOSQL DB characteristics
● Schemaless
● No join, stores pre-joined/embedded data
● Horizontal scalability
● Replica ready - High availability
Common types of NOSQL DB
● Key-Value
  ○ Based on Amazon's Dynamo paper
  ○ Stores K-V pairs
  ○ Example:
    ■ Dynomite
    ■ Voldemort
Common types of NOSQL DB
● Bigtable clones
  ○ Based on Google Bigtable paper
  ○ Column oriented, but handles semi-structured data
  ○ Data keyed by: row, column, time, index
  ○ Example:
    ■ Google Big Table
    ■ HBase
    ■ Cassandra (FB)
Common types of NOSQL DB
● Document based
  ○ Stores multi-level K-V pairs
  ○ Usually uses JSON as document format
  ○ Example:
    ■ MongoDB
    ■ CouchDB (Apache)
    ■ Redis
Common types of NOSQL DB
● Graph
  ○ Focus on modeling the structure of data - interconnectivity
  ○ Example:
    ■ Neo4j
    ■ AllegroGraph
Start using MongoDB - Installation
● From apt-get (Debian / Ubuntu only)
  ○ sudo apt-get install mongodb
● Using the 10gen MongoDB repository
  ○ http://docs.mongodb.org/manual/tutorial/install-mongodb-on-debian-or-ubuntu-linux/
● From pre-built binary or source
  ○ http://www.mongodb.org/downloads
● Note: 32-bit builds are limited to around 2GB of data
Manually start your MongoDB
mkdir -p /tmp/mongo
mongod --dbpath /tmp/mongo
or
mongod -f mongodb.conf
Verify your MongoDB installation
$ mongo
MongoDB shell version: 2.2.0
connecting to: test
> _
--------------------------------------------------------
mongo localhost/test2
mongo 127.0.0.1/test
How many databases do you have?
show dbs
Elements of MongoDB
● Database
  ○ Collection
    ■ Document
What is JSON
● JavaScript Object Notation
● Elements of JSON
  ○ Object: K/V pairs
  ○ Key: string
  ○ Value, could be
    ■ string
    ■ bool
    ■ number
    ■ array
    ■ object
    ■ null
{
  "key1": "value1",
  "key2": 2.0,
  "key3": [1, "str", 3.0],
  "key4": false,
  "key5": {
    "name": "another object"
  }
}
Another sample of JSON
{
  "name": "John",
  "age": 12,
  "grades": {
    "math": 4.0,
    "english": 5.0
  },
  "registered": true,
  "favorite subjects": ["math", "english"]
}
Insert document into MongoDB
s = {
  "name": "John",
  "age": 12,
  "grades": {
    "math": 4.0,
    "english": 5.0
  },
  "registered": true,
  "favorite subjects": ["math", "english"]
}
db.students.insert(s);
Verify inserted document
db.students.find()
also try
db.student.insert(s)
show collections
Save document into MongoDB
s.name = "Alice"
s.age = 14
s.grades.math = 2.0
db.students.save(s)
What is _id / ObjectId ?
● _id is the default primary key used to index documents; it can be any JSON-acceptable value.
● By default, MongoDB auto-generates an ObjectId as _id
● An ObjectId is a 12-byte value that uniquely identifies a document
● Use ObjectId().getTimestamp() to read back the timestamp stored in an ObjectId
Byte layout of an ObjectId:
bytes 0-3: unix timestamp
bytes 4-6: machine
bytes 7-8: process id
bytes 9-11: increment
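The byte layout above can be checked without a database. The sketch below is plain JavaScript for Node.js, not the mongo shell. `decodeObjectId` is a hypothetical helper name and the sample hex string is only an illustration value; it assumes the 12-byte layout shown on this slide.

```javascript
// Hypothetical helper: split a 24-character hex ObjectId string into the
// four fields of the layout above (4 + 3 + 2 + 3 bytes).
function decodeObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);   // bytes 0-3
  return {
    timestamp: new Date(seconds * 1000),           // unix seconds -> Date
    machine: hex.slice(8, 14),                     // bytes 4-6, kept as hex
    processId: parseInt(hex.slice(14, 18), 16),    // bytes 7-8
    increment: parseInt(hex.slice(18, 24), 16),    // bytes 9-11
  };
}

const decoded = decodeObjectId("507f1f77bcf86cd799439011");
console.log(decoded.timestamp.toISOString()); // 2012-10-17T21:13:27.000Z
```

Because the timestamp occupies the leading bytes, ObjectIds generated later sort after earlier ones, which matters when an ObjectId is used as a sharding key.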
Save document with id into MongoDB
s.name = "Bob"
s.age = 11
s['favorite subjects'] = ["music", "math", "art"]
s.grades.chinese = 3.0
s._id = 1
db.students.save(s)
Save document with existing _id
delete s.registered
db.students.save(s)
How to find documents?
● db.xxxx.find()
  ○ lists all documents in the collection
● db.xxxx.find(
    find spec,    // what the document looks like
    find fields,  // which parts I want to see
    ...
  )
● db.xxxx.findOne()
  ○ returns only the first document matching the find spec.
find by id
db.students.find({_id: 1})
db.students.find({_id: ObjectId('xxx....')})
find and filter return fields
db.students.find({_id: 1}, {_id: 1})
db.students.find({_id: 1}, {name: 1})
db.students.find({_id: 1}, {_id: 1, name: 1})
db.students.find({_id: 1}, {_id: 0, name: 1})
find by name - equal or not equal
db.students.find({name: "John"})
db.students.find({name: "Alice"})
db.students.find({name: {$ne: "John"}})
● $ne : not equal
find by name - ignorecase ($regex)
db.students.find({name: "john"}) => X
db.students.find({name: /john/i}) => O
db.students.find({name: {
$regex: "^b", $options: "i"
}})
find by range of names - $in, $nin
db.students.find({name: {$in: ["John", "Bob"]}})
db.students.find({name: {$nin: ["John", "Bob"]}})
● $in : in range (array of items)
● $nin : not in range
find by age - $gt, $gte, $lt, $lte
db.students.find({age: {$gt: 12}})
db.students.find({age: {$gte: 12}})
db.students.find({age: {$lt: 12}})
db.students.find({age: {$lte: 12}})
● $gt : greater than
● $gte : greater than or equal
● $lt : less than
● $lte : less than or equal
find by field existence - $exists
db.students.find({registered: {$exists: true}})
db.students.find({registered: {$exists: false}})
find by field type - $type
db.students.find({_id: {$type: 7}})
db.students.find({_id: {$type: 1}})
Number  Type
1       Double
2       String
3       Object
4       Array
5       Binary data
7       Object id
8       Boolean
9       Date
10      Null
11      Regular expression
13      JavaScript code
14      Symbol
15      JavaScript code with scope
16      32-bit integer
17      Timestamp
18      64-bit integer
255     Min key
127     Max key
find in multi-level fields
db.students.find({"grades.math": {$gt: 2.0}})
db.students.find({"grades.math": {$gte: 2.0}})
find by remainder - $mod
db.students.find({age: {$mod: [10, 2]}})
db.students.find({age: {$mod: [10, 3]}})
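The slide does not spell out what the two numbers in `$mod` mean: the first is the divisor and the second is the expected remainder. A minimal plain-JavaScript model (runs in Node.js without MongoDB; `matchesMod` and the sample array are illustration names, with ages taken from the earlier slides):

```javascript
// Toy model of {field: {$mod: [divisor, remainder]}}.
function matchesMod(value, [divisor, remainder]) {
  return typeof value === "number" && value % divisor === remainder;
}

const students = [
  { name: "John", age: 12 },
  { name: "Alice", age: 14 },
  { name: "Bob", age: 11 },
];

// Same idea as db.students.find({age: {$mod: [10, 2]}})
const hits = students.filter((s) => matchesMod(s.age, [10, 2]));
console.log(hits.map((s) => s.name)); // [ 'John' ] because 12 % 10 === 2
```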
find in array - $size
db.students.find({'favorite subjects': {$size: 2}})
db.students.find({'favorite subjects': {$size: 3}})
find in array - $all
db.students.find({'favorite subjects': {$all: ["music", "math", "art"]}})
db.students.find({'favorite subjects': {$all: ["english", "math"]}})
find in array - find value in array
db.students.find({"favorite subjects": "art"})
db.students.find({"favorite subjects": "math"})
find with bool operators - $and, $or
db.students.find({$or: [
  {age: {$lt: 12}},
  {age: {$gt: 12}}
]})
db.students.find({$and: [
  {age: {$lt: 12}},
  {age: {$gte: 11}}
]})
find with bool operators - $and, $or
db.students.find({$and: [
  {age: {$lt: 12}},
  {age: {$gte: 11}}
]})
is equivalent to
db.students.find({age: {$lt: 12, $gte: 11}})
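To see why the two forms select the same documents, here is a toy matcher in plain JavaScript (Node.js, no MongoDB needed). It is only a sketch covering `$lt`, `$lte`, `$gt`, `$gte`, `$and` and `$or`; `matches`, `compare` and the sample data are illustration names, not MongoDB internals.

```javascript
// Comparison operators supported by this toy matcher.
const compare = {
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
};

function matches(doc, spec) {
  return Object.entries(spec).every(([key, cond]) => {
    if (key === "$and") return cond.every((sub) => matches(doc, sub));
    if (key === "$or") return cond.some((sub) => matches(doc, sub));
    // Several operators on one field must all hold (implicit AND).
    return Object.entries(cond).every(([op, arg]) => compare[op](doc[key], arg));
  });
}

const students = [
  { name: "John", age: 12 },
  { name: "Alice", age: 14 },
  { name: "Bob", age: 11 },
];

const explicit = { $and: [{ age: { $lt: 12 } }, { age: { $gte: 11 } }] };
const implicit = { age: { $lt: 12, $gte: 11 } };
const namesFor = (spec) => students.filter((s) => matches(s, spec)).map((s) => s.name);

console.log(namesFor(explicit), namesFor(implicit)); // [ 'Bob' ] [ 'Bob' ]
```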
find with bool operators - $not
$not can only be applied to another operator expression, not to a plain value
X db.students.find({registered: {$not: false}})
O db.students.find({registered: {$ne: false}})
O db.students.find({age: {$not: {$gte: 12}}})
find with JavaScript- $where
db.students.find({$where: "this.age > 12"})
db.students.find({$where: "this.grades.chinese"})
find cursor functions
● count
  db.students.find().count()
● limit
  db.students.find().limit(1)
● skip
  db.students.find().skip(1)
● sort
  db.students.find().sort({age: -1})
  db.students.find().sort({age: 1})
combine find cursor functions
db.students.find().skip(1).limit(1)
db.students.find().skip(1).sort({age: -1})
db.students.find().skip(1).limit(1).sort({age: -1})
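One detail worth knowing when chaining these: the order in which the functions are chained does not matter, because the server applies sort first, then skip, then limit. The toy cursor below models that behavior in plain JavaScript (Node.js). `ToyCursor` is a hypothetical class for illustration and supports only a single numeric sort field.

```javascript
// Toy cursor: records the modifiers, then applies them in a fixed order
// (sort, then skip, then limit), whatever order they were chained in.
class ToyCursor {
  constructor(docs) {
    this.docs = docs;
    this.sortSpec = null;
    this.skipCount = 0;
    this.limitCount = Infinity;
  }
  sort(spec) { this.sortSpec = spec; return this; }
  skip(n) { this.skipCount = n; return this; }
  limit(n) { this.limitCount = n; return this; }
  toArray() {
    const out = [...this.docs];
    if (this.sortSpec) {
      const [[field, dir]] = Object.entries(this.sortSpec); // single field only
      out.sort((a, b) => (a[field] - b[field]) * dir);
    }
    return out.slice(this.skipCount, this.skipCount + this.limitCount);
  }
}

const docs = [
  { name: "John", age: 12 },
  { name: "Alice", age: 14 },
  { name: "Bob", age: 11 },
];
const a = new ToyCursor(docs).skip(1).limit(1).sort({ age: -1 }).toArray();
const b = new ToyCursor(docs).sort({ age: -1 }).skip(1).limit(1).toArray();
console.log(a[0].name, b[0].name); // John John (the second oldest student)
```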
more cursor functions
● snapshot
  ensures the cursor
  ○ returns no duplicates
  ○ misses no object
  ○ returns all matching objects that were present at both the beginning and the end of the query.
  ○ usually used for export/dump
more cursor functions
● batchSize
  tells MongoDB how many documents to send to the client at once
● explain
  for performance profiling
● hint
  tells MongoDB which index to use for querying/sorting
list current running operations
● list operations
  db.currentOp()
● cancel an operation
  db.killOp(opid)
MongoDB index - when to use index?
● when running complicated finds
● when sorting lots of data
MongoDB index - sort() example
for (i = 0; i < 1000000; i++) {
  db.many.save({value: i});
}
db.many.find().sort({value: -1})
error: {
  "$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
  "code" : 10128
}
MongoDB index - how to build index
db.many.ensureIndex({value: 1})
● Index options
  ○ background
  ○ unique
  ○ dropDups
  ○ sparse
MongoDB index - index commands
● list indexes
  db.many.getIndexes()
● drop index
  db.many.dropIndex({value: 1})
  db.many.dropIndexes() <-- DANGER!
MongoDB Index - find() example
db.many.dropIndex({value: 1})
db.many.find({value: 5555}).explain()
db.many.ensureIndex({value: 1})
db.many.find({value: 5555}).explain()
MongoDB Index - Compound Index
db.xxx.ensureIndex({a:1, b:-1, c:1})
query/sort with fields
● a
● a, b
● a, b, c
will be accelerated by this index
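The rule above is the "leading prefix" rule: the index helps when the queried fields form a prefix of the index's field list. A deliberately simplified check in plain JavaScript (Node.js); `usesIndexPrefix` is a hypothetical helper, and the real query planner is more nuanced (for example, a query on a and c can still use the index for the a part).

```javascript
// Returns true when the queried fields are exactly a leading prefix of
// the index fields, ignoring the order in which the query lists them.
function usesIndexPrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  if (wanted.size === 0) return false;
  // Walk the index from the left and count the fields the query covers.
  let covered = 0;
  while (covered < indexFields.length && wanted.has(indexFields[covered])) {
    covered++;
  }
  return covered === wanted.size;
}

const index = ["a", "b", "c"]; // field order of ensureIndex({a:1, b:-1, c:1})
console.log(usesIndexPrefix(index, ["a"]));           // true
console.log(usesIndexPrefix(index, ["b", "a"]));      // true
console.log(usesIndexPrefix(index, ["b", "c"]));      // false, "a" is missing
```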
Remove/Drop data from MongoDB
● Remove
  db.many.remove({value: 5555})
  db.many.find({value: 5555})
  db.many.remove()
● Drop
  db.many.drop()
● Drop database
  db.dropDatabase() EXTREMELY DANGEROUS!!!
How to update data in MongoDB
Easiest way:
s = db.students.findOne({_id: 1})
s.registered = true
db.students.save(s)
In place update - update()
update({find spec}, {update spec}, upsert=false)
db.students.update({_id: 1}, {$set: {registered: false}})
Update a non-existent document
db.students.update({_id: 2}, {name: 'Mary', age: 9}, true)
db.students.update({_id: 2}, {$set: {name: 'Mary', age: 9}}, true)
set / unset field value
db.students.update({_id: 1}, {$set: {"age": 15}})
db.students.update({_id: 1}, {$set: {registered: {2012: false, 2011: true}}})
db.students.update({_id: 1}, {$unset: {registered: 1}})
increase/decrease value
db.students.update({_id: 1}, {$inc: {
  "grades.math": 1.1,
  "grades.english": -1.5,
  "grades.history": 3.0
}})
push value(s) into array
db.students.update({_id: 1}, {$push: {tags: "lazy"}})
db.students.update({_id: 1}, {$pushAll: {tags: ["smart", "cute"]}})
add only values not already in the array
db.students.update({_id: 1}, {$push: {tags: "lazy"}})
db.students.update({_id: 1}, {$addToSet: {tags: "lazy"}})
db.students.update({_id: 1}, {$addToSet: {tags: {$each: ["tall", "thin"]}}})
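The difference between the two operators is easiest to see on a plain array. The sketch below models them in plain JavaScript (Node.js); `push` and `addToSet` here are illustration helpers, not driver functions, and the duplicate check uses simple value equality, which is enough for strings.

```javascript
// Toy model of $push: always appends, duplicates allowed.
function push(doc, field, value) {
  doc[field] = doc[field] || [];
  doc[field].push(value);
}

// Toy model of $addToSet (with $each): appends only values not yet present.
function addToSet(doc, field, values) {
  doc[field] = doc[field] || [];
  for (const v of values) {
    if (!doc[field].includes(v)) doc[field].push(v);
  }
}

const pushed = { _id: 1, tags: ["lazy"] };
push(pushed, "tags", "lazy");
console.log(pushed.tags); // [ 'lazy', 'lazy' ]

const asSet = { _id: 1, tags: ["lazy"] };
addToSet(asSet, "tags", ["lazy"]);
addToSet(asSet, "tags", ["tall", "thin"]);
console.log(asSet.tags); // [ 'lazy', 'tall', 'thin' ]
```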
remove value from array
db.students.update({_id: 1}, {$pull: {tags: "lazy"}})
db.students.update({_id: 1}, {$pull: {tags: {$ne: "smart"}}})
db.students.update({_id: 1}, {$pullAll: {tags: ["lazy", "smart"]}})
pop value from array
a = []; for (i = 0; i < 20; i++) { a.push(i); }
db.test.save({_id: 1, value: a})
db.test.update({_id: 1}, {$pop: {value: 1}})   // removes the last element
db.test.update({_id: 1}, {$pop: {value: -1}})  // removes the first element
rename field
db.test.update({_id: 1}, {$rename: {value: "values"}})
Practice: add comments to student
Add a field to the student with {_id: 1}:
● field name: comments
● field type: array of dictionaries
● field content:
  ○ {
      by: author name, string
      text: content of comment, string
    }
● add at least 3 comments to this field
Example answer to practice
db.students.update({_id: 1}, {
  $addToSet: {comments: {$each: [
    {by: "teacher01", text: "text 01"},
    {by: "teacher02", text: "text 02"},
    {by: "teacher03", text: "text 03"}
  ]}}
})
The $ position operator (for array)
db.students.update({_id: 1,"comments.by": "teacher02"
}, {$inc: {"comments.$.vote": 1}
})
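In the update above, `$` stands for the index of the first array element that matched the find spec. A rough plain-JavaScript model of that one update (Node.js, no MongoDB); `incFirstMatchingComment` is a hypothetical helper written only for this illustration.

```javascript
// Toy model of:
//   find spec   {"comments.by": author}
//   update spec {$inc: {"comments.$.vote": 1}}
function incFirstMatchingComment(student, author) {
  // "$" resolves to the index of the first matching array element.
  const index = student.comments.findIndex((c) => c.by === author);
  if (index === -1) return false;            // no match: nothing is updated
  const comment = student.comments[index];
  comment.vote = (comment.vote || 0) + 1;    // $inc creates a missing field
  return true;
}

const student = {
  _id: 1,
  comments: [
    { by: "teacher01", text: "text 01" },
    { by: "teacher02", text: "text 02" },
    { by: "teacher03", text: "text 03" },
  ],
};

incFirstMatchingComment(student, "teacher02");
incFirstMatchingComment(student, "teacher02");
console.log(student.comments[1]); // { by: 'teacher02', text: 'text 02', vote: 2 }
```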
Atomically update - findAndModify
● Atomically updates a SINGLE DOCUMENT and returns it
● By default, the returned document does not contain the modifications made by the findAndModify command.
findAndModify parameters
db.xxx.findAndModify({
  query: filter to query
  sort: how to sort and select the 1st document in the query results
  remove: set true if you want to remove it
  update: update content
  new: set true if you want to get the modified object
  fields: which fields to fetch
  upsert: create object if not exists
})
GridFS
● MongoDB has a 16MB document size limit
● GridFS is for storing large binary objects in MongoDB
● GridFS is a spec, not an implementation
● The implementation is provided by the MongoDB drivers
● Currently supported drivers:
  ○ PHP
  ○ Java
  ○ Python
  ○ Ruby
  ○ Perl
GridFS - command line tools
● List
  mongofiles list
● Put
  mongofiles put xxx.txt
● Get
  mongofiles get xxx.txt
MongoDB config - basic
● dbpath
  ○ Which folder to put MongoDB database files in
  ○ MongoDB must have write permission to this folder
● logpath, logappend
  ○ logpath = log filename
  ○ MongoDB must have write permission to the log file
● bind_ip
  ○ IP(s) MongoDB will bind to; by default, all
  ○ Use commas to separate more than 1 IP
● port
  ○ Port number MongoDB will use
  ○ Default port = 27017
Small tip - rotate MongoDB log
db.getMongo().getDB("admin").runCommand("logRotate")
MongoDB config - journal
● journal
  ○ Set journal on/off
  ○ Usually you should keep this on
MongoDB config - http interface
● nohttpinterface
  ○ Disables the HTTP interface, which by default listens on http://localhost:28017
  ○ The HTTP interface shows statistics info
● rest
  ○ Only used when the HTTP interface is enabled
  ○ Example:
    http://localhost:28017/test/students/
    http://localhost:28017/test/students/?filter_name=John
MongoDB config - authentication
● auth
  ○ By default, MongoDB runs with no authentication
  ○ If no admin account has been created, you can log in without authentication through a local mongo shell and start managing user accounts.
MongoDB account management
● Add admin user
  > mongo localhost/admin
  db.addUser("testadmin", "1234")
● Authenticate as admin user
  use admin
  db.auth("testadmin", "1234")
MongoDB account management
● Add user to test database
  use test
  db.addUser("testrw", "1234")
● Add read only user to test database
  db.addUser("testro", "1234", true)
● List users
  db.system.users.find()
● Remove user
  db.removeUser("testro")
MongoDB config - authentication
● keyFile
  ○ At least 6 characters, and smaller than 1KB
  ○ Used only for replica/sharding servers
  ○ Every replica/sharding server should use the same key file for communication
  ○ On U*ix systems, the key file must grant no permissions to group/everyone, or MongoDB will refuse to start
MongoDB configuration - Replica Set
● replSet
  ○ Indicates the replica set name
  ○ All MongoDB instances in the same replica set should use the same name
  ○ Limitations
    ■ Maximum 12 nodes in a single replica set
    ■ Maximum 7 nodes can vote
  ○ A MongoDB replica set is eventually consistent
How does a MongoDB replica set work?
● Each replica set has a single primary (master) node and multiple slave nodes
● Data is written only to the primary node and then synced to the slave nodes.
● Use getLastError() to confirm that the previous write operation has been committed to the whole replica set; otherwise the write may be rolled back if the primary node goes down before the sync.
How does a MongoDB replica set work?
● Once the primary node is down, the whole replica set is marked as failed and accepts no operations until the other nodes vote and elect a new primary node.
● During failover, any write operation not committed to the whole replica set will be rolled back
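Electing a new primary requires a majority of the voting members, which is why replica sets are usually built with an odd number of voters. A small arithmetic sketch in plain JavaScript (Node.js); `canElectPrimary` is a hypothetical helper and assumes every voting member carries exactly one vote.

```javascript
// A primary can be elected only if a strict majority of all voting
// members (reachable or not) can take part in the vote.
function canElectPrimary(votingMembers, reachableVoters) {
  const majority = Math.floor(votingMembers / 2) + 1;
  return reachableVoters >= majority;
}

console.log(canElectPrimary(3, 2)); // true:  one node of three is down
console.log(canElectPrimary(3, 1)); // false: two nodes of three are down
console.log(canElectPrimary(4, 2)); // false: an even split has no majority
```

With two data nodes, adding an arbiter gives three voters, so losing any single node still leaves a majority.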
Simple replica set configuration
mkdir -p /tmp/db01
mkdir -p /tmp/db02
mkdir -p /tmp/db03
mongod --replSet test --port 29001 --dbpath /tmp/db01
mongod --replSet test --port 29002 --dbpath /tmp/db02
mongod --replSet test --port 29003 --dbpath /tmp/db03
Simple replica set configuration
mongo localhost:29001
Another way to config replica set
rs.initiate()
rs.add("localhost:29001")
rs.add("localhost:29002")
rs.add("localhost:29003")
Extra options for setting replica set
● arbiterOnly
  ○ Arbiter nodes don't receive data and can't become the primary node, but can vote.
● priority
  ○ A node with priority 0 will never be elected as primary node.
  ○ Higher priority nodes are preferred as primary
  ○ To force a node to become primary, do not change its votes; update its priority value and reconfig the replica set.
● buildIndexes
  ○ Can only be set to false on nodes with priority 0
  ○ Use false for backup-only nodes
Extra options for setting replica set
● hidden
  ○ Nodes marked as hidden are not exposed to MongoDB clients.
  ○ Nodes marked as hidden do not receive queries.
  ○ Only use this option for nodes used for reporting, integration, backup, etc.
● slaveDelay
  ○ How many seconds the slave node should stay behind the primary node
  ○ Can only be set on nodes with priority 0
  ○ Used to guard against some human errors
Extra options for setting replica set
● votes
  If set to 1, this node can vote; otherwise it cannot.
Change primary node at runtime
config = rs.conf()
config.members[1].priority = 2
rs.reconfig(config)
What is sharding?
[Figure: a Name/Value table sorted by name (Alice, Amy, Bob, ..., Yoko, Zeus) is split by key range into three shards: A to F, G to N, O to Z]
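The range split in the figure can be sketched as a lookup from the first letter of the key to a shard. This is plain JavaScript (Node.js), not how mongos is implemented; the shard names and `shardFor` are illustration values only.

```javascript
// Key ranges taken from the figure: A to F, G to N, O to Z.
const ranges = [
  { shard: "shard0000", from: "A", to: "F" },
  { shard: "shard0001", from: "G", to: "N" },
  { shard: "shard0002", from: "O", to: "Z" },
];

// Route a name to the shard whose range contains its first letter.
function shardFor(name) {
  const initial = name[0].toUpperCase();
  const hit = ranges.find((r) => initial >= r.from && initial <= r.to);
  return hit ? hit.shard : null;
}

for (const name of ["Alice", "Bob", "Yoko", "Zeus"]) {
  console.log(name, "->", shardFor(name));
}
```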
MongoDB sharding architecture
Elements of MongoDB sharding cluster
● Config Server
  Stores sharding cluster metadata
● mongos Router
  Routes database operations to the correct shard server
● Shard Server
  Holds the real user data
Sharding config - config server
● A config server is a MongoDB instance run with the --configsvr option
● Config servers are synced automatically by the mongos process, so DO NOT run them with the --replSet option
● The synchronous replication protocol is optimized for three machines.
Sharding config - mongos Router
● Use mongos (not mongod) to start a mongos router
● mongos routes database operations to the correct shard servers
● Example command for starting mongos
  mongos --configdb db01,db02,db03
● With the --chunkSize option, you can specify a smaller sharding chunk if you're just testing.
Sharding config - shard server
● A shard server is a MongoDB instance run with the --shardsvr option
● A shard server doesn't need to know where the config server / mongos router is
Example script for building MongoDB shard cluster
mkdir -p /tmp/s00
mkdir -p /tmp/s01
mkdir -p /tmp/s02
mkdir -p /tmp/s03
mongod --configsvr --port 29000 --dbpath /tmp/s00
mongos --configdb localhost:29000 --chunkSize 1 --port 28000
mongod --shardsvr --port 29001 --dbpath /tmp/s01
mongod --shardsvr --port 29002 --dbpath /tmp/s02
mongod --shardsvr --port 29003 --dbpath /tmp/s03
Sharding config - add shard server
mongo localhost:28000/admin
db.runCommand({addshard: "localhost:29001"})
db.runCommand({addshard: "localhost:29002"})
db.runCommand({addshard: "localhost:29003"})
db.printShardingStatus()
db.runCommand({enablesharding: "test"})
db.runCommand({shardcollection: "test.shardtest", key: {_id: 1}, unique: true})
Let us insert some documents
use test
for (i = 0; i < 1000000; i++) {
  db.shardtest.insert({value: i});
}
Remove 1 shard & see what happens
use admin
db.runCommand({removeshard: "shard0002"})
Let's add it back
db.runCommand({addshard: "localhost:29003"})
Pick your sharding key wisely
● The sharding key cannot be changed after sharding is enabled
● To update any document in a sharding cluster, the sharding key MUST BE INCLUDED in the find spec
EX: sharding key = {name: 1, class: 1}
db.xxx.update({name: "xxxx", class: "ooo"}, {..... update spec})
Pick your sharding key wisely
● The sharding key strongly affects your data distribution model
EX: sharding by ObjectId
shard001 => data saved 2 months ago
shard002 => data saved 1 month ago
shard003 => data saved recently
Other sharding key examples
EX: sharding by Username
shard001 => Username starts with a to k
shard002 => Username starts with l to r
shard003 => Username starts with s to z
EX: sharding by md5
completely random distribution
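To get a feel for why an md5-based key spreads data evenly, the sketch below hashes generated usernames and counts how many land in each of three buckets. It uses Node.js's built-in `crypto` module. `shardByMd5`, the bucket-by-modulo rule and the `"user" + i` names are assumptions made for the illustration; in a real cluster the application would store the hash in a field and shard on that field, and MongoDB would split it into chunks by range.

```javascript
const crypto = require("crypto");

// Pick a bucket from the first 4 bytes of the md5 digest of the username.
function shardByMd5(username, shardCount) {
  const digest = crypto.createHash("md5").update(username).digest();
  return digest.readUInt32BE(0) % shardCount;
}

const counts = [0, 0, 0];
for (let i = 0; i < 3000; i++) {
  counts[shardByMd5("user" + i, counts.length)]++;
}
console.log(counts); // three roughly equal counts
```

The trade-off is that documents with neighboring usernames end up on different shards, so range queries on the original field have to visit every shard.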
What is Mapreduce?
● Map, then Reduce
● Map calls a function on each document to emit keys & values, which are sent to the reduce function
● Reduce calls a function that reduces the keys & values emitted by map into a single result per key.
● Example: map students' grades and reduce them into total grades.
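The flow described above (map emits key/value pairs, values are grouped by key, reduce collapses each group) can be simulated in memory. This is a plain-JavaScript sketch for Node.js, not MongoDB's implementation: `runMapReduce` is a hypothetical helper, and `emit` is passed to `map` as an argument here, whereas in MongoDB it is a global function.

```javascript
// Minimal in-memory model: map -> group by key -> reduce.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // As in MongoDB, `this` inside map is the current document.
  docs.forEach((doc) => map.call(doc, emit));
  const result = {};
  for (const [key, values] of groups) result[key] = reduce(key, values);
  return result;
}

const students = [
  { name: "John", grades: { math: 4.0, english: 5.0 } },
  { name: "Bob", grades: { math: 2.0, english: 3.0 } },
];

// Emit one (subject, grade) pair per grade of each student.
function map(emit) {
  for (const subject in this.grades) emit(subject, this.grades[subject]);
}

// Sum all grades emitted for one subject.
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

console.log(runMapReduce(students, map, reduce)); // { math: 6, english: 8 }
```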
How to call mapreduce in MongoDB
db.xxx.mapReduce(
  map function,
  reduce function,
  {
    out: output option,
    query: query filter, optional,
    sort: sort filter, optional,
    finalize: finalize function,
    .... etc
  }
)
Let's generate some data
for (i = 0; i < 10000; i++) {
  db.grades.insert({
    grades: {
      math: Math.random() * 100 % 100,
      art: Math.random() * 100 % 100,
      music: Math.random() * 100 % 100
    }
  });
}
Prepare Map function
function map() {
  for (k in this.grades) {
    emit(k, {
      total: 1,
      pass: this.grades[k] >= 60.0 ? 1 : 0,  // 1 if passed, else 0
      fail: this.grades[k] < 60.0 ? 1 : 0,   // 1 if failed, else 0
      sum: this.grades[k],
      avg: 0
    });
  }
}
Prepare reduce function
function reduce(key, values) {
  var result = {total: 0, pass: 0, fail: 0, sum: 0, avg: 0};
  values.forEach(function(value) {
    result.total += value.total;
    result.pass += value.pass;
    result.fail += value.fail;
    result.sum += value.sum;
  });
  return result;
}
Execute your 1st mapreduce call
db.grades.mapReduce(map, reduce, {out: {inline: 1}})
Add finalize function
function finalize(key, value) {
  value.avg = value.sum / value.total;
  return value;
}
Run mapreduce again with finalize
db.grades.mapReduce(map, reduce, {out: {inline: 1}, finalize: finalize})
Mapreduce output options
● {replace: <result collection name>}
  Replace the result collection if it already exists.
● {merge: <result collection name>}
  Merge into the existing result collection; existing keys are overwritten with the new results.
● {reduce: <result collection name>}
  Run reduce when the same key exists in both the old and the current results. Runs the finalize function, if any.
● {inline: 1}
  Return the result in memory
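The three collection output modes differ only in how new results are combined with what is already in the output collection. A toy model in plain JavaScript (Node.js), using plain objects in place of collections; `applyOutput` and the sample numbers are illustration values, and the finalize step is left out.

```javascript
// existing: results already stored; fresh: results of the current run.
function applyOutput(mode, existing, fresh, reduce) {
  if (mode === "replace") return { ...fresh };               // old results dropped
  if (mode === "merge") return { ...existing, ...fresh };    // new value wins per key
  if (mode === "reduce") {
    const out = { ...existing };
    for (const [key, value] of Object.entries(fresh)) {
      // Keys present in both are combined with the reduce function.
      out[key] = key in existing ? reduce(key, [existing[key], value]) : value;
    }
    return out;
  }
  throw new Error("unknown mode: " + mode);
}

const existing = { math: 10, art: 7 };
const fresh = { math: 6, english: 8 };
const sum = (key, values) => values.reduce((a, b) => a + b, 0);

console.log(applyOutput("replace", existing, fresh, sum)); // { math: 6, english: 8 }
console.log(applyOutput("merge", existing, fresh, sum));   // { math: 6, art: 7, english: 8 }
console.log(applyOutput("reduce", existing, fresh, sum));  // { math: 16, art: 7, english: 8 }
```

The reduce mode is what makes incremental map-reduce possible: each run only processes new documents and folds its totals into the stored ones.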
Other mapreduce output options
● db - put the result collection in a different database
● sharded - the output collection will be sharded using key = _id
● nonAtomic - partial reduce results will be visible while processing.
MongoDB backup & restore
● mongodump
  mongodump -h localhost:27017
● mongorestore
  mongorestore -h localhost:27017 --drop
● mongoexport
  mongoexport -d test -c students -h localhost:27017 > students.json
● mongoimport
  mongoimport -d test -c students -h localhost:27017 < students.json
Conclusion - Pros of MongoDB
● Agile (Schemaless)
● Easy to use
● Built-in replica & sharding
● Mapreduce with sharding
Conclusion - Cons of MongoDB
● Schemaless = everyone needs to know what the data looks like
● Wastes space on keys
● Eats lots of memory
● Mapreduce is hard to handle
Cautions of MongoDB
● Global write lock
  ○ Add more RAM
  ○ Use a newer version (MongoDB 2.2 now has a DB-level write lock)
  ○ Split your database properly
● Removing documents won't free disk space
  ○ You need to run the compact command periodically
● Don't let your MongoDB data disk fill up
  ○ Once the disk used by MongoDB is full, you won't be able to move/delete documents in it.