Intelligent Stream Filtering Using MongoDB
Mihnea Giurgea
CONTENTS
• Who Are We?
• MongoDB On Amazon
• How We Do Stream-Filtering
• First Approach
• Second Approach
• Questions
UBERVU AT A GLANCE
• ~50K updates and inserts per minute
• ~30 Amazon instances
• 32 TB worth of EBS volumes
The force is strong at uberVU
DB INFRASTRUCTURE
• NO single points of failure
• SCALABLE horizontally & vertically
MULTIPLE DB ENVIRONMENTS
• 4 different mongo environments
• each with its own shards, config servers, etc.
• Why?
• isolate problems & bad behavior
• ++reliability
• better resource (hardware) distribution
• different number of shards per database
• some databases need more or fewer replica nodes
MULTIPLE ENVIRONMENTS
• application servers hold 4 mongos, instead of just 1
• each of the 3 config servers has 4 x mongod processes
MONGOD
• run only one mongod process per replica node
• each shard resides on an MDADM RAID 10 array
• consisting of 16 HDDs x 250 GB each
AMAZON EC2 INSTANCES
• mongod primary: High-Memory Double Extra Large (34.2 GB RAM)
• mongod secondary: High-Memory Extra Large (17.1 GB RAM)
• config servers: Large Instance (cheapest 64-bit machine)
• expensive for its purpose :(
THE PROBLEM
• Gather mentions from the web (Twitter, Facebook, etc.)
• Data Stream = mentions around a certain term
• mentions are annotated (language, location, sentiment, etc.)
• data stream is indexed in MongoDB
FILTERING
• filter data stream by time (since & until)
• filter by other attributes:
• platform: Twitter, Facebook
• language: English, French
• location: UK, US, Romania
• sentiment
• gender
• etc.
FILTERING
“MongoDB”
filtered by:
• United States
• gender: female
• sentiment: positive
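The example above could translate into a query document along these lines. This is a minimal sketch in plain JavaScript, not the uberVU code: the helper `buildFilterQuery`, the field names `time`, `location` and `sentiment`, and the dates are assumptions for illustration (the slides only show `stream`, `platform`, `gender` and `language` as field names).

```javascript
// Hypothetical helper: builds a MongoDB-style query document for one stream,
// an optional time window and any number of exact-match attribute filters.
function buildFilterQuery(stream, { since, until, ...filters } = {}) {
  const query = { stream };
  if (since !== undefined || until !== undefined) {
    query.time = {};
    if (since !== undefined) query.time.$gte = since;
    if (until !== undefined) query.time.$lt = until;
  }
  // Every remaining key is treated as an attribute filter.
  return Object.assign(query, filters);
}

const query = buildFilterQuery("MongoDB", {
  since: new Date("2012-01-01T00:00:00Z"), // made-up window
  until: new Date("2012-02-01T00:00:00Z"),
  location: "United States",
  gender: "female",
  sentiment: "positive",
});
console.log(query);
// With a live connection this could be passed to something like
// db.mentions.find(query).sort({ time: -1 })
```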
FIRST APPROACH
• if no filters are needed, 1 index will suffice:
1. stream, time
• 1 filter => 2 indexes
1. stream, time
2. stream, platform, time
• sort attribute must be last in index
FIRST APPROACH
• 2 filters => 4 indexes
1. stream, time
2. stream, platform, time
3. stream, language, time
4. stream, platform, language, time
• ...etc... (F filters => 2^F indexes)
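The growth in index count can be sketched by enumerating every subset of filter attributes. The function name `indexesForFilters` is hypothetical; the rule it follows (start with `stream`, end with the sort attribute `time`) is the one stated on the slides.

```javascript
// Enumerates the compound indexes the first approach needs: one per subset
// of filter attributes, always starting with `stream` and ending with `time`.
function indexesForFilters(filterNames) {
  const indexes = [];
  const subsetCount = 1 << filterNames.length; // 2^F subsets
  for (let mask = 0; mask < subsetCount; mask++) {
    const chosen = filterNames.filter((_, bit) => mask & (1 << bit));
    indexes.push(["stream", ...chosen, "time"]);
  }
  return indexes;
}

// Two filters give the four indexes listed above.
const indexes = indexesForFilters(["platform", "language"]);
indexes.forEach((fields, i) => console.log(`${i + 1}. ${fields.join(", ")}`));

// Five filters already need 32 indexes.
const fiveFilters = ["platform", "language", "location", "sentiment", "gender"];
console.log(indexesForFilters(fiveFilters).length);
```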
IMPROVEMENTS
• don’t really need (stream, platform, language, time)
• when filtering for platform & language, use:
• stream, platform, time OR
• stream, language, time
• which one?
• the one whose filter value matches the fewest mentions (smallest cardinality)
IMPROVEMENTS
• saves index space
• but increases query scanning time
• finding the right indexes is a trade-off between:
• indexing space
• query scanning time
IMPROVEMENTS
• Question: when filtering by platform & language, which index should we use?
• stream, platform, time
• stream, language, time
• Answer: smallest cardinality
• we need to know the share of mentions matching each attribute value:
• platform: twitter - 90%
• language: English - 60%
• location: France - 8%
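Given those shares, picking the index could look like the sketch below. The function `chooseIndex` and the `attribute:value` key format are assumptions made for the example; the percentages are the ones from the slide.

```javascript
// Share of a stream's mentions matching each attribute value (from the slide).
const share = {
  "platform:twitter": 0.90,
  "language:English": 0.60,
  "location:France": 0.08,
};

// Picks the (stream, <attribute>, time) index whose filter value matches the
// fewest mentions; the remaining filters are then checked by scanning.
function chooseIndex(filters, shareByValue) {
  let best = null;
  for (const [attribute, value] of Object.entries(filters)) {
    const s = shareByValue[`${attribute}:${value}`] ?? 1; // unknown: assume worst case
    if (best === null || s < best.share) best = { attribute, share: s };
  }
  return best === null ? ["stream", "time"] : ["stream", best.attribute, "time"];
}

// platform (90%) vs. language (60%): the language index scans less.
console.log(chooseIndex({ platform: "twitter", language: "English" }, share));
```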
ATTRIBUTES
• normalize each attribute
• language: English => 13
• gender: male => 2038, etc.
• numbers use less space & are faster
• each mention now has several attributes:
{ 'platform': 'twitter', 'language': 'English', 'location': 'UK', 'gender': 'female' }
--->
{ 'attributes': [13, 213, 2039, 1] }
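The normalization step might be sketched as follows. The code table is hypothetical: 13 (English) and 2038 (male) are given on the slide, while the other codes are read off the slide's examples and are assumptions, as are the names `CODES`, `ATTRIBUTE_FIELDS` and `normalizeMention`.

```javascript
// Hypothetical code table. 13 (English) and 2038 (male) come from the slide;
// the remaining codes are inferred from its examples and are assumptions.
const CODES = {
  "platform:twitter": 1,
  "language:English": 13,
  "location:UK": 213,
  "gender:male": 2038,
  "gender:female": 2039,
};

// Fields that get folded into the `attributes` array, in output order.
const ATTRIBUTE_FIELDS = ["language", "location", "gender", "platform"];

// Replaces the named attribute fields with one array of integer codes.
function normalizeMention(mention) {
  const normalized = { ...mention };
  const attributes = [];
  for (const field of ATTRIBUTE_FIELDS) {
    if (!(field in mention)) continue;
    delete normalized[field];
    const code = CODES[`${field}:${mention[field]}`];
    if (code === undefined) throw new Error(`no code for ${field}:${mention[field]}`);
    attributes.push(code);
  }
  return { ...normalized, attributes };
}

const normalized = normalizeMention({
  stream: "mongo",
  platform: "twitter",
  language: "English",
  location: "UK",
  gender: "female",
});
console.log(normalized);
```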
MULTIKEY INDEX
• use a multikey index for attributes:
• stream, attributes, time
• use $all to query for multiple filters
db.mentions.find({ 'stream': 'mongo', 'platform': 'twitter', 'gender': 'male', 'language': 'romanian' })
--->
db.mentions.find({ 'stream': 'mongo', 'attributes': { '$all': [1, 2038, 58] } })
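The rewrite from named filters to an `$all` query can be sketched without a database connection. The code table, the ascending index directions and the helper `buildAllQuery` are assumptions for illustration; the codes 1, 2038 and 58 are the ones used in the slide's query.

```javascript
// Hypothetical code table; 1, 2038 and 58 are the values in the slide's query.
const CODES = {
  "platform:twitter": 1,
  "gender:male": 2038,
  "language:romanian": 58,
};

// Key spec for the multikey index (stream, attributes, time). It is multikey
// because `attributes` holds an array. The directions are assumed. It could be
// passed to db.mentions.createIndex(MULTIKEY_INDEX).
const MULTIKEY_INDEX = { stream: 1, attributes: 1, time: 1 };

// Translates named filters into a query on the normalized `attributes` array.
function buildAllQuery(stream, filters) {
  const codes = Object.entries(filters).map(([field, value]) => {
    const code = CODES[`${field}:${value}`];
    if (code === undefined) throw new Error(`no code for ${field}:${value}`);
    return code;
  });
  // Without filters there is nothing to match in `attributes`.
  return codes.length === 0 ? { stream } : { stream, attributes: { $all: codes } };
}

const allQuery = buildAllQuery("mongo", {
  platform: "twitter",
  gender: "male",
  language: "romanian",
});
console.log(JSON.stringify(allQuery));
```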
SORT BY FILTER
• $all: only the first item uses the index!
• the rest are scanned through
• ensure the first item has the smallest cardinality
• for the smallest query scanning time
{ location: france} < { gender: male } < { platform: twitter }
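Ordering the `$all` items might be done with a small sort over per-code estimates. Everything named here is an assumption for the example: the function `orderForAll`, code 77 for "location: france", and the 50% share for "gender: male". Only the 90% and 8% figures come from the slides.

```javascript
// Estimated fraction of a stream's mentions carrying each code.
const SHARE_BY_CODE = new Map([
  [1, 0.90],    // platform: twitter (share from the slides)
  [2038, 0.50], // gender: male (assumed share)
  [77, 0.08],   // location: france (hypothetical code, share from the slides)
]);

// Sorts codes so the rarest attribute comes first in $all, since only the
// first item is looked up in the index and the rest are scanned.
function orderForAll(codes, shareByCode) {
  const shareOf = (code) => shareByCode.get(code) ?? 1; // unknown codes go last
  return [...codes].sort((a, b) => shareOf(a) - shareOf(b));
}

const ordered = orderForAll([1, 2038, 77], SHARE_BY_CODE);
console.log({ stream: "mongo", attributes: { $all: ordered } });
```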
SECOND APPROACH
• now we only need 2 indexes!
• stream, time
• stream, attributes, time
• works for any number of filters
• is far from perfect
• but gets the job done
• with few resources
MORE IMPROVEMENTS
• don’t store all normalized attributes in index
• skip the very big ones:
• platform: twitter - 90%
• language: English - 60%
• 90% selection rate: no index needed
• decreases index size
• no noticeable performance loss
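Deciding which codes go into the indexed array could be sketched like this. The 50% cut-off, the function `indexedAttributes`, and the shares for UK and female are assumptions; how the skipped values are stored and filtered is not covered by the slides, so the sketch only computes the indexed subset.

```javascript
// Hypothetical cut-off: values matching more than half of a stream are skipped.
const MAX_INDEXED_SHARE = 0.5;

// Estimated fraction of a stream's mentions carrying each code.
const SHARE_BY_CODE = new Map([
  [1, 0.90],    // platform: twitter (from the slides)
  [13, 0.60],   // language: English (from the slides)
  [213, 0.05],  // location: UK (assumed)
  [2039, 0.45], // gender: female (assumed)
]);

// Keeps only the codes selective enough to be worth an index entry.
// Codes without an estimate are treated as rare and kept.
function indexedAttributes(codes, shareByCode, maxShare = MAX_INDEXED_SHARE) {
  return codes.filter((code) => (shareByCode.get(code) ?? 0) <= maxShare);
}

const kept = indexedAttributes([13, 213, 2039, 1], SHARE_BY_CODE);
console.log(kept);
```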
USE _ID!
• use _id index instead of (stream, time)
• saves memory!
• Problem: _id must be unique
• (stream, time) index is not!
• Question: how to make (stream, time) unique?
• Answer: add some random number
USE _ID!
• pack stream, time & random into a number
• why: number look-ups are faster
• use all 64 bits available!
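One way to pack the three parts is sketched below with BigInt. The slides do not give the bit layout, so the 24/32/8 split, the numeric stream id and the helper names are all assumptions. Putting the stream in the highest bits and the time next keeps `_id` order equal to (stream, time) order, so a time window becomes a range on `_id`. The sketch treats the value as an unsigned BigInt; mapping it onto the database's integer type is left out.

```javascript
// Hypothetical bit layout, most significant bits first:
//   24 bits stream id | 32 bits time (seconds) | 8 bits random
const STREAM_BITS = 24n;
const TIME_BITS = 32n;
const RANDOM_BITS = 8n;

// Packs the three parts into one 64-bit value, checking that each one fits.
function packId(streamId, timeSeconds, random) {
  const parts = [
    [streamId, STREAM_BITS],
    [timeSeconds, TIME_BITS],
    [random, RANDOM_BITS],
  ];
  let id = 0n;
  for (const [value, bits] of parts) {
    const v = BigInt(value);
    if (v < 0n || v >= 1n << bits) {
      throw new RangeError(`${value} does not fit in ${bits} bits`);
    }
    id = (id << bits) | v;
  }
  return id;
}

// Splits a packed id back into its parts.
function unpackId(id) {
  const random = id & ((1n << RANDOM_BITS) - 1n);
  const timeSeconds = (id >> RANDOM_BITS) & ((1n << TIME_BITS) - 1n);
  const streamId = id >> (RANDOM_BITS + TIME_BITS);
  return { streamId, timeSeconds, random };
}

// A time window on one stream becomes a half-open range on _id.
function idRange(streamId, since, until) {
  return { $gte: packId(streamId, since, 0), $lt: packId(streamId, until, 0) };
}

const id = packId(42, 1325376000, Math.floor(Math.random() * 256));
console.log(id, unpackId(id));
```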
DUPLICATES
• we need to detect duplicates
• modify bit packing
• use mention.url
• instead of random bits
• uniquely identifies a mention
• for fastest index lookup use:
• db.mentions.find({ _id: docid }).count()
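Replacing the random bits with bits derived from the URL could look like the sketch below. The slides do not say how the URL is turned into bits; FNV-1a is swapped in here as one simple choice, and the 16/32/16 layout and the function `mentionId` are assumptions. The inputs are assumed to fit their bit widths. With only a few hash bits, two different URLs in the same stream and second can still collide, so the width is a trade-off.

```javascript
// FNV-1a, 32-bit, over the UTF-8 bytes of the text.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Hypothetical layout, most significant bits first:
//   16 bits stream id | 32 bits time (seconds) | 16 bits URL hash
function mentionId(streamId, timeSeconds, url) {
  const urlBits = BigInt(fnv1a32(url) & 0xffff);
  return (BigInt(streamId) << 48n) | (BigInt(timeSeconds) << 16n) | urlBits;
}

// The same mention gathered twice maps to the same _id, so a lookup by _id
// (as in the count() call above) would find the first copy.
const first = mentionId(42, 1325376000, "http://example.com/post/1");
const again = mentionId(42, 1325376000, "http://example.com/post/1");
console.log(first === again);
```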
QUESTIONS?