neo, titan & cassandra

NEO4J, TITAN & CASSANDRA

Comparisons

JOHN JENSON

12 YEARS BUILDING SOFTWARE FOR THE WEB

Salt Lake City

Software consultant, developer, project lead, on dozens of projects

Extensive experience developing Java apps and using Oracle

Two largest relational databases, banking database, PHI Database

Massive relational databases, tables in excess of 500 million rows

Boston

Cengage Learning – Principal Engineer

MEAN stack development

A year of experience developing with Mongo and Neo4J

Researched Cassandra and Titan

TandemSeven – Principal Architect

Software Consulting

GRAPH DATABASES

Arcs and Nodes

Objects and the relationships between them

Objects (nodes)

Schemaless

Can have arbitrary attributes

Relationships (arcs)

Have a type

Can also have arbitrary attributes
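The node and arc model above can be sketched in a few lines of Python. This is only an illustration of the data model, not how any graph database stores data; the class names `Node` and `Relationship` and the sample movie data are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """An object in the graph; properties are arbitrary (schemaless)."""
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed arc from one node to another, with its own properties."""
    rel_type: str
    start: Node
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


# Two nodes with different attribute sets: no shared schema is required.
keanu = Node(1, {"name": "Keanu Reeves", "born": 1964})
matrix = Node(2, {"title": "The Matrix"})

# The relationship has a type and can carry attributes of its own.
acted_in = Relationship("ACTED_IN", keanu, matrix, {"role": "Neo"})

graph: List[Relationship] = [acted_in]
titles = [r.end.properties["title"] for r in graph if r.rel_type == "ACTED_IN"]
print(titles)
```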

GRAPH DATABASES

Anything that can be modeled in a graph database could also be modeled in a relational database.

Nothing new here

Querying a tree in SQL sucks

The power of a graph database comes from the query language

Oracle provides a “connect by” feature for trees, but it only works for trees and you have to use Oracle

What if your data is highly connected and breaks the rules of a tree?

Good luck bringing all of your data into memory and writing your own algorithm to traverse your data
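To give a feel for what "writing your own algorithm" means, here is a minimal sketch of a breadth-first traversal over data that has already been loaded into memory. The function name `reachable` and the sample adjacency list are hypothetical; the sample deliberately contains a shared child and a cycle, the two things that break the rules of a tree.

```python
from collections import deque
from typing import Dict, List, Set


def reachable(adjacency: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every node that can be reached from start by following arcs."""
    seen: Set[str] = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            # The seen set is what keeps cycles from looping forever.
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


# Not a tree: "d" has two parents and "d" points back to "a".
sample = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["a"],
}
print(sorted(reachable(sample, "a")))
```

Even this small version needs the whole graph in memory and explicit cycle handling, which is the work a graph query language does for you.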

NEO4J

Cypher Query

Powerful query language (this is what sets Neo4J apart from other graph DBs)

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WHERE movie.title =~ "T.*"
RETURN movie.title as title, collect(actor.name) as cast
ORDER BY title ASC LIMIT 10;

http://docs.neo4j.org/chunked/stable/examples-from-sql-to-cypher.html (sql to cypher reference)

http://docs.neo4j.org/refcard/2.0/ (useful cypher reference)

COLUMN DATABASES

Rows are organized into individual tables

Columns are represented as rows in those tables

Designed to reduce IO and seek times when accessing data
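One way to picture "columns represented as rows" is to flatten every logical row into one stored cell per column and keep the cells sorted. The sketch below is an assumption about the general idea only, not a description of any product's on-disk format; `to_cells`, `read_row`, and the sample users are hypothetical.

```python
from typing import Dict, List, Tuple

# One stored cell per column value: (row key, column name, value).
Cell = Tuple[str, str, str]


def to_cells(table: Dict[str, Dict[str, str]]) -> List[Cell]:
    """Flatten logical rows into cells sorted by row key, then column name."""
    return sorted(
        (row_key, column, value)
        for row_key, columns in table.items()
        for column, value in columns.items()
    )


def read_row(cells: List[Cell], row_key: str) -> Dict[str, str]:
    """Rebuild one logical row; its cells sit next to each other when sorted."""
    return {column: value for key, column, value in cells if key == row_key}


# Hypothetical rows; each row may carry a different set of columns.
users = {
    "u2": {"last_name": "jones"},
    "u1": {"last_name": "smith", "first_name": "ann"},
}
cells = to_cells(users)
print(cells)
print(read_row(cells, "u1"))
```

Because the cells of one row are adjacent after sorting, a real store could read them with a single sequential scan, which is the IO and seek saving the slide refers to.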

CASSANDRA

Massively distributed

Can support massive clusters with 75,000+ machines

CQL – Cassandra Query Language

No joins or subqueries

SELECT * FROM users WHERE last_name = 'smith';

MapReduce

Hadoop is all you get

CASSANDRA - PERFORMANCE

"In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" although "this comes at the price of high write and read latencies." – University of Toronto

Absolutely amazing throughput

Not so amazing response times for each individual query

TITAN

Runs on Cassandra, HBase, or Oracle Berkeley DB

Can store massive graphs

Doesn’t support Cypher Query

Supports Gremlin

http://sql2gremlin.com/ (useful gremlin reference)


CASSANDRA VS NEO4J

15 points of comparison

Point 1

CASSANDRA

Cassandra is a non-relational data store that stores data in tables. Cassandra organizes columns into rows and rows into tables.

NEO4J

Neo is a graph database that organizes data into arcs and nodes.

Point 2

CASSANDRA

Because columns are stored as rows, tables can have a huge number of columns (maximum of 2 billion columns).

NEO4J

Neo can house at most 34 billion nodes, 34 billion relationships, and 68 billion properties in total.

Point 3

CASSANDRA

All tables must have a primary key (index), whose partition key is used as the basis for sharding the data.

NEO4J

Indexes can be added and removed wherever desired.
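The idea of using a key as the basis for sharding can be sketched as follows. This is a simplified illustration only: Cassandra places rows on a token ring rather than using the modulo shown here, and the function name `owner`, the node names, and the use of MD5 are all assumptions made for the example.

```python
import hashlib
from typing import List


def owner(partition_key: str, nodes: List[str]) -> str:
    """Pick the node that stores a row, based only on a hash of its key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    # Simplification: a real cluster maps token ranges to nodes on a ring.
    return nodes[token % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]
placement = {key: owner(key, nodes) for key in ["smith", "jones", "lee", "kim"]}
print(placement)
```

Any client can compute where a row lives from the key alone, which is why every table needs such a key and why lookups by other columns are expensive.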

Point 4

CASSANDRA

Cassandra has impressive HA capabilities that can span multiple data centers with little effort.

NEO4J

Neo uses master slave replication.

Point 5

CASSANDRA

Cassandra can elegantly run on huge clusters that replicate and shard data effortlessly.

NEO4J

Neo doesn’t shard your data.

Point 6

CASSANDRA

Cassandra scales linearly by adding more hardware. There is pretty much no limit to the hardware that you can add.

NEO4J

Neo read throughput scales linearly with the number of servers, but the number of servers in a cluster has to stay relatively small.

Point 7

CASSANDRA

The dataset can grow virtually endlessly while still getting the same performance.

NEO4J

The dataset size is limited to at most 34 billion nodes, 34 billion relationships, and 68 billion properties in total.

Point 8

CASSANDRA

Cassandra does not use a master/slave paradigm, so there is no downtime when a machine dies.

NEO4J

There is a brief window of downtime while a new master is elected.

Point 9

CASSANDRA

Cannot do traversal queries

NEO4J

Traversal queries that have exponential cost on traditional RDBMS have linear cost on Neo.

Point 10

CASSANDRA

Write performance is just as good as read performance.

NEO4J

Write performance is slower than read performance.

Point 11

CASSANDRA

Every query has additional latency due to cluster overhead

NEO4J

Individual queries can be serviced much faster with far less latency.

Point 12

CASSANDRA

ACID transactions are mostly supported, but with tunable consistency.

NEO4J

ACID transactions are fully supported and completely consistent, but there is a performance hit for the consistency.

Point 13

CASSANDRA

Cassandra can perform operations completely synchronously, or alternatively at various levels of consistency with corresponding performance, on an operation-by-operation basis.

NEO4J

Consistency is not tunable
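The trade-off behind tunable consistency can be sketched with simple replica arithmetic: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps the latest write. The helper names below are hypothetical; the rule itself (R + W > N) is the standard quorum argument, shown here without any Cassandra client code.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(
    replication_factor: int, write_replicas: int, read_replicas: int
) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return write_replicas + read_replicas > replication_factor


n = 3
# Quorum writes and quorum reads: consistent, but each operation waits on 2 replicas.
print(reads_see_latest_write(n, quorum(n), quorum(n)))
# Write to one replica, read from one replica: fastest, but a read may be stale.
print(reads_see_latest_write(n, 1, 1))
```

Choosing a level per operation lets latency-sensitive reads use the cheap setting while critical writes use the strict one.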

Point 14

CASSANDRA

Cassandra uses its own query language (CQL) that has similar syntax to SQL (no joins)

NEO4J

Neo uses Cypher and also supports Gremlin

Point 15

CASSANDRA

Instead of performing joins at runtime, data must be denormalized beforehand

NEO4J

Graphs are normalized and highly connected. Traversals are very fast.
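Denormalizing beforehand roughly means writing the same data once per query pattern, so that each read is a single key lookup. The sketch below uses plain Python dictionaries to stand in for two tables; the names `users_by_id`, `users_by_last_name`, and `insert_user` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

User = Dict[str, str]

# One "table" per query pattern, each keyed by what the query filters on.
users_by_id: Dict[str, User] = {}
users_by_last_name: Dict[str, List[User]] = defaultdict(list)


def insert_user(user_id: str, first_name: str, last_name: str) -> None:
    """Write the same row into every lookup table at insert time."""
    row = {"user_id": user_id, "first_name": first_name, "last_name": last_name}
    users_by_id[user_id] = row
    users_by_last_name[last_name].append(row)


insert_user("u1", "ann", "smith")
insert_user("u2", "bob", "smith")
insert_user("u3", "cy", "jones")

# No join or scan at read time: the answer is already grouped by the key.
smiths = [user["first_name"] for user in users_by_last_name["smith"]]
print(smiths)
```

The cost moves from read time to write time and storage, which is why the query patterns have to be known up front.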

CASSANDRA

“Unlike in relational databases, it’s not easy to tune or introduce new query patterns in Cassandra by simply creating secondary indexes or building complex SQLs (using joins, order by, group by?) because of its high-scale distributed nature. So think about query patterns up front, and design column families accordingly.” – eBay

http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.U-j6fICwIeY

NEO4J

“A single instance of Neo4j can house at most 34 billion nodes, 34 billion relationships, and 68 billion properties, in total. Businesses like Google obviously push these limits, but in general, this does not pose a limitation in practice. It is also important to understand that these limits were chosen purely as a storage optimization, and do not indicate any particular shortcoming of the product. They are easily, and are in fact being, increased.”

http://info.neotechnology.com/rs/neotechnology/images/Understanding%20Neo4j%20Scalability(2).pdf

Gremlin

g.V('customerId','ALFKI').as('customer')
 .out('ordered').out('contains').out('is').as('products')
 .in('is').in('contains').in('ordered').except('customer')
 .out('ordered').out('contains').out('is').except('products')
 .groupCount().cap().orderMap(T.decr)[0..<5].productName

Cypher

MATCH (c1)-[:ordered]->(o1)-[:contains]->(p1)<-[:contains]-(o2)<-[:ordered]-(c2)-[:ordered]->(o3)-[:contains]->(p2)
WHERE c1.customerId = "ALFKI" AND c1 <> c2 AND p1 <> p2
RETURN p2.productName, count(p2) AS num
ORDER BY num DESC LIMIT 5

SQL

SELECT TOP (5) [t14].[ProductName]
    FROM (SELECT COUNT(*) AS [value],
                 [t13].[ProductName]
            FROM [customers] AS [t0]
     CROSS APPLY (SELECT [t9].[ProductName]
                    FROM [orders] AS [t1]
              CROSS JOIN [order details] AS [t2]
              INNER JOIN [products] AS [t3]
                      ON [t3].[ProductID] = [t2].[ProductID]
              CROSS JOIN [order details] AS [t4]
              INNER JOIN [orders] AS [t5]
                      ON [t5].[OrderID] = [t4].[OrderID]
               LEFT JOIN [customers] AS [t6]
                      ON [t6].[CustomerID] = [t5].[CustomerID]
              CROSS JOIN ([orders] AS [t7]
                          CROSS JOIN [order details] AS [t8]
                          INNER JOIN [products] AS [t9]
                                  ON [t9].[ProductID] = [t8].[ProductID])
                   WHERE NOT EXISTS(SELECT NULL AS [EMPTY]
                                      FROM [orders] AS [t10]
                                CROSS JOIN [order details] AS [t11]
                                INNER JOIN [products] AS [t12]
                                        ON [t12].[ProductID] = [t11].[ProductID]
                                     WHERE [t9].[ProductID] = [t12].[ProductID]
                                       AND [t10].[CustomerID] = [t0].[CustomerID]
                                       AND [t11].[OrderID] = [t10].[OrderID])
                     AND [t6].[CustomerID] <> [t0].[CustomerID]
                     AND [t1].[CustomerID] = [t0].[CustomerID]
                     AND [t2].[OrderID] = [t1].[OrderID]
                     AND [t4].[ProductID] = [t3].[ProductID]
                     AND [t7].[CustomerID] = [t6].[CustomerID]
                     AND [t8].[OrderID] = [t7].[OrderID]) AS [t13]
           WHERE [t0].[CustomerID] = N'ALFKI'
        GROUP BY [t13].[ProductName]) AS [t14]
ORDER BY [t14].[value] DESC
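For readers who want to see what these three queries are trying to compute, here is a minimal in-memory sketch of the same recommendation idea: find other customers who ordered something this customer ordered, then rank the products they bought that this customer has not. It is a simplification (each other customer counts once per product, whereas the graph queries count paths), and the function name `recommend`, the customer IDs other than ALFKI, and the order data are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(orders: Dict[str, Set[str]], customer: str, limit: int = 5) -> List[str]:
    """Rank products bought by similar customers that this customer lacks."""
    mine = orders.get(customer, set())
    counts: Counter = Counter()
    for other, theirs in orders.items():
        # Skip the customer themselves and anyone with no product in common.
        if other == customer or not (mine & theirs):
            continue
        counts.update(theirs - mine)
    return [product for product, _ in counts.most_common(limit)]


# Hypothetical order history: customer id -> set of product names.
orders = {
    "ALFKI": {"chai", "tofu"},
    "B": {"chai", "ikura"},
    "C": {"tofu", "ikura", "konbu"},
    "D": {"pavlova"},
}
print(recommend(orders, "ALFKI"))
```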

CONCLUSION

If you plan on writing recommendation queries, a graph DB is a more elegant fit than a relational DB.

Only use Titan if you need it

You have an insanely large graph

Or you expect an insanely high load

Neo4J

Faster queries

A more straightforward and powerful query language