finding the right nosql db for the job - the path to a non-rdbms solution at traackr

Upload: george-stathis

Post on 17-Nov-2014

DESCRIPTION

A walkthrough of Traackr's experience in choosing a NoSQL solution and how we ended up switching from HBase to MongoDB. This deck goes through some in-depth technical aspects, like schema design and our use of secondary indexes.

TRANSCRIPT

Page 1: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Finding the right NoSQL DB for the job

The path to a non-RDBMS solution at

Page 2: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Who we are

• A search engine

• A people search engine

• An influencer search engine

• Subscription-based

Page 3: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

George Stathis

VP Engineering

14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.

Page 4: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

What’s this talk about?

• Why we picked a NoSQL database

• How we picked a NoSQL database

• My NoSQL does not do the job! What now?!

• Nirvana = the right tool for the job

Page 5: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Why did we pick a NoSQL DB?

Page 6: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

There are some misconceptions around NoSQL only being appropriate when one needs to achieve

“Web Scale”

Page 8: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Traackr picked NoSQL; are we “Web Scale”?

Page 9: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

• In terms of users/traffic?

Do we fit the “Web scale” profile?

Page 10: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: compete.com

Page 11: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: compete.com

Page 12: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: compete.com

Page 13: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: compete.com

Page 14: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: highscalability.com

Page 15: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr
Page 16: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

• In terms of users/traffic?

• In terms of the amount of data?

Do we fit the “Web scale” profile?

Page 17: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
    "db" : "traackr",
    "collections" : 12,
    "objects" : 68226121,
    "avgObjSize" : 2972.0800625760330,
    "dataSize" : 202773493971,
    "storageSize" : 221491429671,
    "numExtents" : 199,
    "indexes" : 33,
    "indexSize" : 27472394891,
    "fileSize" : 266623699968,
    "nsSizeMB" : 16,
    "ok" : 1
}

That’s a quarter of a terabyte …
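The "quarter of a terabyte" figure can be checked with a little arithmetic. The sketch below copies the byte counts from the db.stats() output and converts them; the helper name `to_tb` and the choice of decimal terabytes are assumptions made for illustration.

```python
# Figures copied from the db.stats() output above (all sizes in bytes).
stats = {
    "objects": 68226121,
    "avgObjSize": 2972.0800625760330,
    "dataSize": 202773493971,
    "storageSize": 221491429671,
    "indexSize": 27472394891,
    "fileSize": 266623699968,
}

TB = 10 ** 12  # decimal terabyte (assumption); use 2 ** 40 for a tebibyte


def to_tb(num_bytes):
    """Convert a byte count to decimal terabytes."""
    return num_bytes / TB


# dataSize should equal objects * avgObjSize, up to rounding.
estimated_data_size = stats["objects"] * stats["avgObjSize"]

for key in ("dataSize", "storageSize", "indexSize", "fileSize"):
    print(f"{key:12s} {to_tb(stats[key]):.3f} TB")
```

The on-disk file size is the number that lands near a quarter of a terabyte; the data itself is closer to a fifth.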

Page 18: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Wait! What? My Synology NAS at home can hold 2TB!

Page 19: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

No need for us to track the entire web

Web Content

Influencer Content

Not at scale :-)

Page 20: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

• In terms of users/traffic?

• In terms of the amount of data?

Do we fit the “Web scale” profile?

Page 21: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Alternate view of “Web Scale”

Web data is:

Heterogeneous

Unstructured (text)

Page 22: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: http://www.opte.org/

Visualization of the Internet, Nov. 23rd 2003

Page 23: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Data sources are isolated islands of rich data with loose links to one another

Page 24: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

How do we build a database that models all possible entities found on the web?

Page 25: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Modeling the web: the RDBMS way

Page 26: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: socialbutterflyclt.com

Page 27: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

or

Page 28: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr
Page 29: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

{
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    {
      "siteUrl": "http://twitter.com/dchancogne",
      "metrics": [
        { "value": 216, "name": "twitter_followers_count" },
        { "value": 2107, "name": "twitter_statuses_count" }
      ]
    },
    {
      "siteUrl": "http://traackr.com/blog/author/david",
      "metrics": [
        { "value": 21, "name": "google_inbound_links" }
      ]
    }
  ]
}

Influencer data as JSON
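One way to see why this shape suits heterogeneous web data: each site reference carries whatever metrics apply to it, and code can read them without a fixed column list. The sketch below uses a trimmed copy of the document above; the function name `metrics_by_site` is made up for illustration and is not from the deck.

```python
import json

# Trimmed copy of the influencer document shown above.
doc = json.loads("""
{
  "realName": "David Chancogne",
  "primaryAffiliation": "Traackr",
  "siteReferences": [
    {"siteUrl": "http://twitter.com/dchancogne",
     "metrics": [{"value": 216, "name": "twitter_followers_count"},
                 {"value": 2107, "name": "twitter_statuses_count"}]},
    {"siteUrl": "http://traackr.com/blog/author/david",
     "metrics": [{"value": 21, "name": "google_inbound_links"}]}
  ]
}
""")


def metrics_by_site(influencer):
    """Return {siteUrl: {metric name: value}} for whatever metrics exist."""
    return {
        ref["siteUrl"]: {m["name"]: m["value"] for m in ref.get("metrics", [])}
        for ref in influencer.get("siteReferences", [])
    }


flat = metrics_by_site(doc)
```

A new kind of site with a new kind of metric needs no schema change; it is just another entry in `siteReferences`.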

Page 30: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

“In the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.”

(Werner Vogels, CTO/VP Amazon.com)

Page 31: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL = schema flexibility

Page 32: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

• In terms of users/traffic?

• In terms of the amount of data?

Do we fit the “Web scale” profile?

Page 33: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

• In terms of users/traffic?

• In terms of the amount of data?

• In terms of the variety of the data

Do we fit the “Web scale” profile?

Page 34: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Traackr’s Datastore Requirements

• Schema flexibility

• Good at storing lots of variable length text

• Batch processing options

Page 35: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Requirement: text storage

Variable text length:

140 character tweets < big variance < multi-page blog posts

Page 36: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Requirement: text storage

RDBMS’ answer to variable text length:

Plan ahead for largest value

CLOB/BLOB

Page 37: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Requirement: text storage

Issues with CLOB/BLOB for us:

No clue what largest value is

CLOB/BLOB for tweets = wasted space

Page 38: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Requirement: text storage

NoSQL solutions are great for text:

No length requirements (automated chunking)

Limited space overhead

Page 39: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Traackr’s Datastore Requirements

• Schema flexibility

• Good at storing lots of variable length text

• Batch processing options

Page 40: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Requirement: batch processing

Some NoSQL solutions come with MapReduce

Source: http://code.google.com/

Page 41: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Requirement: batch processing

MapReduce + RDBMS:

Possible but proprietary solutions

Usually involves exporting data from RDBMS into a NoSQL system anyway.

Defeats data locality benefit of MR

Page 42: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Traackr’s Datastore Requirements

• Schema flexibility

• Good at storing lots of variable length text

• Batch processing options

A NoSQL option is the right fit

Page 43: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

How did we pick a NoSQL DB?

Page 44: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Bewildering number of options

Key/Value Databases
• Distributed hash tables
• Designed for high load
• In-memory or on-disk
• Eventually consistent

Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families

Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema

Graph Databases
• Graph Theory G=(E,V)
• Great for modeling networks
• Great for graph-based query algorithms

Page 46: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

Graph Databases: while we can model our domain as a graph we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.

Page 47: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

Memcache: memory-based; we need true persistence

Page 48: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

Amazon SimpleDB: not willing to store our data in a proprietary datastore.

Page 49: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches

Page 50: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

CouchDB: no ad-hoc queries; maturity concerns in early 2010 made us shy away, although we did try early prototypes.

Page 51: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).

Page 52: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.

Page 53: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

Riak: very close but in early 2010, we had adoption questions.

Page 54: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Trimming options

HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib and support for batch processing using Hadoop/MR.

Page 55: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Climbing the learning curve

Page 56: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

When Big-Data = Big Architectures

Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

Must have a Hadoop HDFS cluster of at least 2x replication factor nodes

Must have an odd number of ZooKeeper quorum nodes

Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.

Master/slave architecture means a single point of failure, so you need to protect your master.

And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
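The "odd number of quorum nodes" rule follows from majority arithmetic: a quorum-based ensemble stays available only while a strict majority of its members can talk to each other. A minimal sketch of that arithmetic (the function names are illustrative, not ZooKeeper API calls):

```python
def quorum_size(nodes):
    """Smallest strict majority of an ensemble with `nodes` members."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many members can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in range(3, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

Going from 3 to 4 nodes, or from 5 to 6, adds a machine to manage without tolerating any additional failure, which is why odd ensemble sizes are the usual choice.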

Page 57: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: socialbutterflyclt.com

Page 58: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Jokes aside, no one said open source was easy to use

Page 59: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

To be expected

• Hadoop/HBase are designed to move mountains

• If you want to move big stuff, be prepared to sometimes use big equipment

Page 60: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

What it means to a startup

Development capacity before

Development capacity after

Congrats, you are now a sysadmin…

Page 61: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Whatever, we can do it!

Source: http://knowyourmeme.com/memes/honey-badger

Page 62: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Mapping an A-List to a column store

Name

Ranks References to influencer records

Page 63: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Mapping an A-List to a column store

Unique key

“attributes” column family

for general attributes

“influencerId” column family for influencer ranks and foreign keys

Page 64: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Mapping an A-List to a column store

Qualifiers (basically attribute names)

Page 65: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Mapping an A-List to a column store

“name” attribute

Influencer ranks can be attribute names as well

Page 66: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Mapping an A-List to a column store

A-List name value

Influencer id values assigned to each rank (basically foreign keys to an influencer table)

Page 67: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Mapping an A-List to a column store

Can get pretty long so needs indexing and pagination
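The mapping described on these slides can be pictured as a nested dictionary: row key, then column family, then qualifier, then value. The sketch below is only a mental model in plain Python, not HBase client code; the row key format, the list name and the `ranked_influencers` helper are hypothetical.

```python
# Hypothetical A-List row: row key -> column family -> qualifier -> value.
alist_row = {
    "row_key": "alist_0001",  # unique key (made-up format)
    "attributes": {           # column family for general attributes
        "name": "Example A-List",
    },
    "influencerId": {         # qualifier = rank, value = influencer id
        "1": "influencer_id_1234",
        "2": "influencer_id_5678",
    },
}


def ranked_influencers(row, offset=0, limit=10):
    """Return one page of (rank, influencer id) pairs in rank order."""
    ranks = sorted(row["influencerId"].items(), key=lambda kv: int(kv[0]))
    return [(int(rank), inf_id) for rank, inf_id in ranks[offset:offset + limit]]
```

Because the ranks live in qualifiers, a long list means a very wide row, which is where the need for indexing and pagination comes from.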

Page 68: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Problem: no out-of-the-box row-based indexing and pagination

Page 69: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Whatever, it’s open-source!

Source: http://knowyourmeme.com/memes/honey-badger

Page 70: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Jumping right into the code

Page 71: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

MapReduce for batch scoring

• Need to re-score our influencer database once a week

• M/R cranked through it in 15 mins
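For readers new to the model, here is a toy, single-process illustration of the map, shuffle and reduce steps. It is not Hadoop code and not Traackr's scoring logic: the input records are made up, and summing values merely stands in for a real scoring function.

```python
from collections import defaultdict

# Made-up input records: (influencer id, metric value).
records = [
    ("influencer_id_1234", 216),
    ("influencer_id_1234", 21),
    ("influencer_id_5678", 40),
]


def map_phase(record):
    """Emit (key, value) pairs; here one pair per input record."""
    influencer_id, value = record
    yield influencer_id, value


def reduce_phase(key, values):
    """Combine all values for one key; summing stands in for real scoring."""
    return key, sum(values)


def run_job(inputs):
    grouped = defaultdict(list)  # the "shuffle" step: group values by key
    for record in inputs:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


scores = run_job(records)
```

In a real cluster the map and reduce calls run in parallel next to the data, which is the data locality benefit mentioned earlier.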

Page 72: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: http://www.charliesheentshirts.info/

Page 73: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

a few months later…

Page 74: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Need to upgrade to HBase 0.90

• Making sure to remain on recent code base

• Performance improvements

• Mostly to get the latest bug fixes

No thanks!

Page 75: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Looks like something is missing

Page 76: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr
Page 77: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Our DB indexes depend on this!

Page 78: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Let’s get this straight

• HBase no longer comes with secondary indexing out-of-the-box

• It’s been moved out of the trunk to GitHub

• Where only one other company besides us seems to care about it

Page 79: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Only one other maintainer besides us

Page 80: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

What it means to a startup

Development capacity

Congrats, you are now an HBase maintainer…

Page 81: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: socialbutterflyclt.com

Page 82: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Whatever, we’ll roll our own indexing!

Source: http://knowyourmeme.com/memes/honey-badger

Page 83: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Homegrown HBase Indexes

Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters

Row ids for Posts

Page 84: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Homegrown HBase Indexes

Find posts for influencer_id_1234

Row ids for Posts

Page 85: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Homegrown HBase Indexes

Find posts for influencer_id_5678

Row ids for Posts
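The idea behind prefix-based row ids can be sketched without a cluster: rows are stored sorted by key, so every post of one influencer sits in a contiguous range bounded by a start row (inclusive) and a stop row (exclusive). The code below imitates that with a sorted Python list; the `|` delimiter and the `post_001` key format are assumptions, not the actual Traackr row id layout.

```python
from bisect import bisect_left

# Hypothetical post row ids, kept sorted the way a row-key-ordered store keeps them.
row_ids = sorted([
    "influencer_id_1234|post_001",
    "influencer_id_1234|post_002",
    "influencer_id_12345|post_001",
    "influencer_id_5678|post_001",
])


def prefix_range(prefix):
    """Return (start_row, stop_row) covering every key that starts with prefix."""
    stop_row = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, stop_row


def scan(sorted_keys, start_row, stop_row):
    """Return keys with start_row <= key < stop_row (stop row is exclusive)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


def posts_for(influencer_id):
    # The "|" delimiter keeps id 1234 from also matching id 12345.
    return scan(row_ids, *prefix_range(influencer_id + "|"))
```

The scan touches only the matching range, which is what makes the prefix approach efficient, but the application now owns the key design and any change to it.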

Page 86: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Homegrown HBase Indexes

• No longer depending on unmaintained code

• Work with out-of-the-box HBase installation

Page 87: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

What it means to a startup

Development capacity

You are back but you still need to maintain indexing logic

Page 88: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: http://www.charliesheentshirts.info/

Application layer indexes are slow and brittle. The DB should be doing this, not us.

Sort of…

Page 89: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

a few months later…

Page 90: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Cracks in the data model

[Diagram: two huffingtonpost.com site records and the six post URLs below, connected by the relationships "writes for", "authored by" and "published under"]

http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html

http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html

Page 91: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Cracks in the data model

Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties

Page 92: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Cracks in the data model

Content attribution logic could sometimes mis-attribute posts because of the duplicated data.

Page 93: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Cracks in the data model

Exacerbated when we started tracking people’s content on a daily basis in mid-2011

Page 94: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Fixing the cracks in the data model

[Diagram: a single huffingtonpost.com site record and the same six post URLs, connected by "writes for", "authored by" and "published under"]

Page 95: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Fixing the cracks in the data model

Normalize the sites

Page 96: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Fixing the cracks in the data model

• Normalization requires stronger secondary indexing

• Our application layer indexing would need revisiting…again!
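Why normalization leans on secondary indexes: once the site is stored once and referenced by id, questions like "which posts belong to this site?" or "which posts belong to this influencer?" are lookups on a non-key field. The sketch below builds such indexes by hand in plain Python, which is roughly the work an application-layer index has to do. All record ids and the pairing of influencer ids with authors are hypothetical.

```python
from collections import defaultdict

# Hypothetical normalized records: the site is stored once and referenced by id.
sites = {"site_1": {"url": "huffingtonpost.com"}}
posts = {
    "post_1": {"url": "http://www.huffingtonpost.com/arianna-huffington/post_1.html",
               "site_id": "site_1", "influencer_id": "influencer_id_1234"},
    "post_2": {"url": "http://www.huffingtonpost.com/shaun-donovan/post1.html",
               "site_id": "site_1", "influencer_id": "influencer_id_5678"},
}


def build_index(records, field):
    """Secondary index: field value -> sorted list of primary keys."""
    index = defaultdict(list)
    for key, record in records.items():
        index[record[field]].append(key)
    return {value: sorted(keys) for value, keys in index.items()}


posts_by_site = build_index(posts, "site_id")
posts_by_influencer = build_index(posts, "influencer_id")
```

Every write to `posts` would also have to update both indexes, and keeping that consistent is the maintenance burden the slides describe; a datastore with built-in secondary indexes takes it over.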

Page 97: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

What it means to a startup

Development capacity

Psych! You are back to writing indexing code.

Page 98: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: socialbutterflyclt.com

Page 99: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Whatever, we’ll change our NoSQL!

Source: http://knowyourmeme.com/memes/honey-badger

Page 100: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Traackr’s Datastore Requirements (Revisited)

• Schema flexibility

• Good at storing lots of variable length text

• Batch processing options (maybe)

• Out-of-the-box SECONDARY INDEX support!

• Simple to use and administer

Page 101: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Page 102: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Nope!

Page 103: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Graph Databases: we looked at Neo4j a bit closer but passed again for the same reasons as before.

Page 104: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Memcache: still no

Page 105: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Amazon SimpleDB: still no.

Page 106: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Redis and LinkedIn’s Project Voldemort: still no

Page 107: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

CouchDB: more mature but still no ad-hoc queries.

Page 108: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.

Page 109: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

Riak: strong contender still but adoption questions remained.

Page 110: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

NoSQL picking – Round 2

MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.

Page 111: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Immediate Benefits

• No more maintaining custom application-layer secondary indexing code

Page 112: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

What it means to a startup

Development capacity

Yay! I’m back!

Page 113: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Immediate Benefits

• No more maintaining custom application-layer secondary indexing code

• Single binary installation greatly simplifies administration

Page 114: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

What it means to a startup

Development capacity

Honestly, I thought I’d never see you

guys again!

Page 115: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Immediate Benefits

• No more maintaining custom application-layer secondary indexing code

• Single binary installation greatly simplifies administration

• Our NoSQL could now support our domain model

Page 116: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

many-to-many relationship

Page 117: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Modeling an influencer

Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)

{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 },
    { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 }
  ]
}

Page 118: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Modeling an influencer

siteId indexed for “find influencers connected to site X”

> db.influencers.ensureIndex({ "siteReferences.siteId": 1 });
> db.influencers.find({ "siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c" });

{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 },
    { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 }
  ]
}
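To make the idea of indexing a field inside an embedded array more concrete, here is a rough in-memory sketch in plain JavaScript. It is not how MongoDB implements its indexes; the helper names (buildArrayIndex, influencersForSite) are hypothetical and the document is a trimmed copy of the example above.

```javascript
// In-memory stand-in for the influencers collection (trimmed example document).
const influencers = [
  {
    _id: "770cf5c54492344ad5e45fb791ae5d52",
    realName: "David Chancogne",
    siteReferences: [
      { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
    ],
  },
];

// Hypothetical helper: map every value found at doc[arrayField][i][keyField]
// to the _ids of the documents that contain it. This loosely mimics what an
// index on "siteReferences.siteId" makes possible: one entry per array element.
function buildArrayIndex(docs, arrayField, keyField) {
  const index = new Map();
  for (const doc of docs) {
    for (const entry of doc[arrayField] || []) {
      const key = entry[keyField];
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(doc._id);
    }
  }
  return index;
}

const bySiteId = buildArrayIndex(influencers, "siteReferences", "siteId");

// "Find influencers connected to site X" becomes a single map lookup.
function influencersForSite(siteId) {
  return bySiteId.get(siteId) || [];
}

console.log(influencersForSite("602dc370945d3b3480fff4f2a541227c"));
```

The point of the sketch is only that each element of the embedded list gets its own index entry, so the lookup does not need to scan every influencer document.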

Page 119: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Modeling a site

Embedded list of influencer references augmented with “usernames” (useful for content attribution)

{
  "_id": "0001e86f73cc3975a29e6a98a41a4280",
  "url": "http://traackr.com/blog/",
  "metrics": [
    { "name": "google_inbound_links", "value": 5432 }
  ],
  "authors": [
    { "username": "dchancogne", "influencerId": "770cf5c54492344ad5e45fb791ae5d52" },
    { "username": "gstathis", "influencerId": "0001e86f73cc3975a29e6a98a41a4280" }
  ]
}

Page 120: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Modeling a site

Indexed for “find sites associated to influencer X”

> db.sites.ensureIndex({ "authors.influencerId": 1 });
> db.sites.find({ "authors.influencerId": "0001e86f73cc3975a29e6a98a41a4280" });

{
  "_id": "0001e86f73cc3975a29e6a98a41a4280",
  "url": "http://traackr.com/blog/",
  "metrics": [
    { "name": "google_inbound_links", "value": 5432 }
  ],
  "authors": [
    { "username": "dchancogne", "influencerId": "770cf5c54492344ad5e45fb791ae5d52" },
    { "username": "gstathis", "influencerId": "0001e86f73cc3975a29e6a98a41a4280" }
  ]
}

Page 121: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Other index uses

Support for alternate site URLs (a.k.a. URL aliases):

{
  "_id": "0001e86f73cc3975a29e6a98a41a4280",
  "url_hash_list": [
    { "url": "http://traackr.com/blog", "hash": "770cf5c54492344ad5e45fb791ae5d52" },
    { "url": "http://blog.traackr.com/", "hash": "0001e86f73cc3975a29e6a98a41a4280" }
  ]
}

Indexed for “find sites associated to

influencer X”

Index on MD5 hash of URL
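As a minimal sketch of how entries for url_hash_list could be produced, the snippet below hashes each URL alias with MD5 using Node's built-in crypto module. The function names are hypothetical, and no URL normalization is applied because the slides do not say whether any was used; the digests it prints are not expected to match the sample hash values shown above.

```javascript
const crypto = require("crypto");

// Return the hex-encoded MD5 digest of a URL string.
// Assumption: the URL is hashed exactly as given, with no normalization.
function urlHash(url) {
  return crypto.createHash("md5").update(url, "utf8").digest("hex");
}

// Build an entry shaped like the items of url_hash_list.
function urlHashEntry(url) {
  return { url: url, hash: urlHash(url) };
}

const aliases = [
  "http://traackr.com/blog",
  "http://blog.traackr.com/",
].map(urlHashEntry);

console.log(aliases);
```

A fixed-length hash keeps the indexed key short and uniform regardless of how long the URL is, which is one plausible reason to index the hash rather than the URL itself.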

Page 122: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Other Benefits

• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.

Page 123: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Ad hoc report example

// File Name: retweetTotal.js
// Purpose: report the count of twitter URLs for which we have
// computed the number of total retweets
print( "NUMBER OF TWITTER URLS where retweetTotal IS SET:" );
print( db.sites.find( { platformName: "twitter.com", retweetTotal: { $exists: true } } ).count() );

• Easy to execute JS report script remotely:

> mongo <hostname>:<port>/traackr --quiet retweetTotal.js

• Run as a cron job, pipe the output to a file and email it out

• Also, more complex MR-based reports are easily accessible to someone with some JavaScript knowledge
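For readers unfamiliar with the shape of a MapReduce-style report, here is a small plain-JavaScript emulation that counts sites per platform. It runs in memory and is only an illustration: the runMapReduce driver and the sample data are made up, and in MongoDB the map function calls a global emit rather than receiving it as an argument.

```javascript
// Sample documents. The field names come from the report above;
// the values are invented for illustration.
const sites = [
  { platformName: "twitter.com", retweetTotal: 12 },
  { platformName: "twitter.com" },
  { platformName: "twitter.com", retweetTotal: 3 },
  { platformName: "wordpress.com" },
];

// Hypothetical in-memory driver: runs map on every document (with `this`
// bound to the document), groups emitted values by key, then reduces each group.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) map.call(doc, emit);
  const result = {};
  for (const [key, values] of groups) result[key] = reduce(key, values);
  return result;
}

// Map: emit one count per site, keyed by platform.
function mapFn(emit) {
  emit(this.platformName, 1);
}

// Reduce: sum the counts emitted for one platform.
function reduceFn(key, values) {
  return values.reduce((total, n) => total + n, 0);
}

const countByPlatform = runMapReduce(sites, mapFn, reduceFn);
console.log(countByPlatform);
```

The map and reduce functions are ordinary JavaScript, which is the sense in which such reports are within reach of anyone who knows the language.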

Page 124: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Other Benefits (cont.)

• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.

• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.

Page 125: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Same binary can be deployed several times for replication & backups

Page 126: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Same binary can be deployed several times for replication & backups

Different Availability Zones for better SPOF

tolerance

Page 127: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Same binary can be deployed several times for replication & backups

priority 0 for backup server so that it never gets elected as primary
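A replica set configuration along these lines might look like the sketch below. The set name and host names are placeholders, not Traackr's actual topology; only the use of priority 0 on the backup member comes from the slide.

```javascript
// Hypothetical replica set layout: the set name and hosts are placeholders.
const replicaSetConfig = {
  _id: "example-rs",
  members: [
    { _id: 0, host: "db-a.example.com:27017" },
    { _id: 1, host: "db-b.example.com:27017" },
    // priority 0: this member replicates data but can never become primary,
    // which makes it a safe target for taking backups.
    { _id: 2, host: "db-backup.example.com:27017", priority: 0 },
  ],
};

// Members eligible for election (priority defaults to 1 when omitted).
const electable = replicaSetConfig.members.filter(
  (member) => (member.priority === undefined ? 1 : member.priority) > 0
);

console.log(electable.map((member) => member.host));

// In the mongo shell, a configuration object like this is what gets
// passed to rs.initiate():
// rs.initiate(replicaSetConfig);
```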

Page 128: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Same binary can be deployed several times for replication & backups

Using xfs_freeze before taking backups

Page 129: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Same binary can be deployed several times for replication & backups

EBS snapshots as backups are portable to new instances (e.g. QA)

Page 130: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Other Benefits (cont.)

• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.

• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.

• Great documentation

• Great adoption and community

Page 131: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Mongo cursors for batch scoring

• Mongo is fast enough for our data size to be able to serially score the DB faster than the MapReduce jobs did in parallel.

• When we grow larger, MapReduce is still available as an option
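Serial scoring over a cursor can be pictured roughly as follows. This is an in-memory sketch, not Traackr's scoring code: a generator stands in for the database cursor, and the scoring rule (summing contribution values) is invented purely for illustration.

```javascript
// Stand-in for a database cursor: yields one document at a time,
// so the whole collection never has to be held by the scoring loop.
function* cursorOver(docs) {
  for (const doc of docs) yield doc;
}

// Hypothetical scoring rule, for illustration only:
// the sum of an influencer's site contributions.
function score(influencer) {
  return influencer.siteReferences.reduce(
    (sum, ref) => sum + ref.contribution,
    0
  );
}

// Walk the cursor serially and record a score per document _id.
function scoreAll(cursor) {
  const scores = new Map();
  for (const doc of cursor) {
    scores.set(doc._id, score(doc));
  }
  return scores;
}

const sampleInfluencers = [
  {
    _id: "770cf5c54492344ad5e45fb791ae5d52",
    siteReferences: [
      { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
    ],
  },
];

const scores = scoreAll(cursorOver(sampleInfluencers));
console.log(scores);
```

A single-threaded loop like this has no job setup or shuffle overhead, which is consistent with the observation that it can beat parallel MapReduce jobs at modest data sizes.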

Page 132: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

looks like we found the right fit!

Page 133: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

We have more of this

Development capacity

Page 134: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

And less of this

Source: socialbutterflyclt.com

Page 135: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Source: http://www.charliesheentshirts.info/

for now…

Page 136: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Additional takeaways

• Fearless refactoring

• The importance of ease of use and administration cannot be overstated for a small startup

Page 137: Finding the right NoSQL DB for the job - The path to a non-RDBMS solution at Traackr

Q&A