overiew of cassandra and doradus
DESCRIPTION
Overview of the Doradus database open source project and the Cassandra database on which it is based. This presentation was given to the Orange County Big Data Meetup group on July 16, 2014.TRANSCRIPT
![Page 1: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/1.jpg)
Overview of Cassandra and The Doradus OSS Project
Randy Guck
Principal Engineer, Dell Software
![Page 2: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/2.jpg)
Overview
• What is No SQL? – Common RDB roadblocks – NoSQL database types
• Overview of Cassandra – What's unique – Limitations
• Doradus – Architecture – Features – The OLAP and Spider storage managers – What each is good for – Where to get Doradus
![Page 3: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/3.jpg)
Why RDB Apps Look for Something Else
• Performance – B-trees – Locking – One writable copy of each record
• Scaling costs – RDBs scale "up" – Big boxes, SANs, fiber channel, etc.
• What if you want... – Distributed access – No single points of failure – Instant failover – Sharding – Replication
![Page 4: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/4.jpg)
NoSQL Data Models
Data Model Examples Elastic? Queries? Relationships?
Key–Value LevelDB, Kyoto Cabinet, Redis
No No No
Distributed Key–Value
Dynamo, MemcacheDB, Riak, Voldemort
Yes No No
Column-Oriented Accumulo, Cassandra, HBase
Yes Some No
Document-Oriented
Couchbase, Elasticsearch, MongoDB
Yes Yes Some
Graph Neo4J, OrientDB, Titan No Yes Yes
Sharding + replication
AND/OR/ranges/etc.
Built-in support
![Page 5: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/5.jpg)
NoSQL Data Models
Data Model Examples Elastic? Queries? Relationships?
Key–Value LevelDB, Kyoto Cabinet, Redis
No No No
Distributed Key–Value
Dynamo, MemcacheDB, Riak, Voldemort
Yes No No
Column-Oriented Accumulo, Cassandra, HBase
Yes Some No
Document-Oriented
Couchbase, Elasticsearch, MongoDB
Yes Yes Some
Graph Neo4J, OrientDB, Titan No Yes Yes
Sharding + replication
AND/OR/ranges/etc.
Built-in support
Doradus goals
![Page 6: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/6.jpg)
NoSQL Common Traits
• Distributed cluster of nodes – Commodity, shared-nothing servers – Scales horizontally – Expands elastically
• Replication – Performant local access – Automatic failover
• De-normalized data model
• Schemaless/dynamic columns
• Eventual consistency
N=5, RF=3
![Page 7: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/7.jpg)
Is NoSQL Catching On?
Source: db-engines.com
![Page 8: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/8.jpg)
Overview of Cassandra
• Wide column NoSQL database
• Open sourced by Facebook
• Apache Project with active community
• Commercially support by DataStax, Acunu, others
• Used by 1,500+ companies
• "Pure peer" architecture
• Largest known Cassandra cluster: 300+ TB data and 400+ machines.
![Page 9: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/9.jpg)
What is Cassandra best for?
• Continuous data streams – Logs, events, audit records, measurements, ... – Fast data ingestion – Predictable read performance
• Partitionable data – "1,000's of little databases in one"
• Elastic scalability – Expand/upgrade/repair without downtime
• Not good for: – Blob store – Persistent queue – OLTP transactions
![Page 10: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/10.jpg)
CQL Static Table
CREATE TABLE songs ( id uuid PRIMARY KEY, title text, album text, artist text, data blob ); CREATE INDEX ON songs (artist);
Row Key Columns: "<column name>"="<column value>"
62c36... "album"="90125" "artist"="Yes" "data"=<audio> "title"="Changes"
837a2... "album"="Crystal Ball" "artist"="Styx" "data"=<audio> "title"="Put Me On"
2de83... "album"="Nevermind" "artist"="Nirvana" "data"=<audio> "title"="Breed"
...
![Page 11: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/11.jpg)
CQL Clustered Table
CREATE TABLE playlists ( id uuid, song_order int, song_id uuid, // copied from songs.id title text, // copied from songs.title album text, // copied from songs.album artist text, // copied from songs.artist PRIMARY KEY (id, song_order) // compound key );
Row Key Columns: "<song_order>:<column name>"="<column value>"
28d23...
"1:"="" "1:album"="90125" "1:artist"="Yes" "1:song_id"="62c36..."
"1:title"="Changes" "2:"="" "2:album"="Nevermind" "2:artist"="Nirvana"
"2:song_id"="2de83..." "2:title"="Breed" "3:"="" ...
2ed91... "1:"="" "1:album"="Crystal Ball" "1:artist"="Styx" "1:song_id"="837a2..."
"1:title"="Put Me On" "2:"="" ...
...
![Page 12: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/12.jpg)
Row Key Columns: "<song_order>:<column name>"="<column value>"
28d23...
"1:"="" "1:album"="90125" "1:artist"="Yes" "1:song_id"="62c36..."
"1:title"="Changes" "2:"="" "2:album"="Nevermind" "2:artist"="Nirvana"
"2:song_id"="2de83..." "2:title"="Breed" "3:"="" ...
2ed91... "1:"="" "1:album"="Crystal Ball" "1:artist"="Styx" "1:song_id"="837a2..."
"1:title"="Put Me On" "2:"="" ...
...
CQL Clustered Table (cont.)
CQL "Rows"
CREATE TABLE playlists ( id uuid, song_order int, song_id uuid, // copied from songs.id title text, // copied from songs.title album text, // copied from songs.album artist text, // copied from songs.artist PRIMARY KEY (id, song_order) // compound key );
![Page 13: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/13.jpg)
Can we make Cassandra more appealing?
• Data Model – No direct support for relationships
• Indexing – Secondary indexes: single column only – Hash table only: no range searching
• Searching – No joins, embedded queries – No aggregate queries – Limited equalities (e.g., SELECT * WHERE <key> IN (<list>)) – No full text search – No OR clauses – ...
![Page 14: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/14.jpg)
What is Doradus?
• Java service that enhances Cassandra
• Adds features: – REST API (JSON and XML) – Multi-tenancy – Graph model – Multi-field/full text query language – Automatic data aging – OLAP and Spider storage services
• Compatible with NoSQL tenets such as idempotent updates
• Under development for ~3 years
• Open source: Apache 2.0 License
![Page 15: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/15.jpg)
Doradus Graph Model
• A cluster hosts one of more applications • An application own tables which store objects • An object consists of single- and multi-valued fields • A pair of link fields form a bi-directional relationship
Message {Size, SendDate}
Participant {ReceiptDate}
Address {Name}
Person {Name, Department}
Attachment {Size, Extension} Managerè
çEmployees
êPerson
Address é
êAttachments
Messageé
Recipientsè
çMessageAsRecipient
Addressè
çParticipants
Senderè
çMessageAsSender
![Page 16: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/16.jpg)
Example Object and Aggregate Queries
• Lucene full text query GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]
• Link path with filtering GET /Email/Message/_query?q=
Sender.WHERE(ReceiptDate>'2010-‐06-‐01').Address.Name="*.com"
• Quantifiers GET /Email/Message/_aggregate?m=COUNT(*)
&q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))
• Transitive links GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson
&f=DirectReports(Name,DirectReports(Name))
![Page 17: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/17.jpg)
Doradus: Architecture
Application
Doradus
Cassandra
REST API
Thrift or CQL
Data and Log files
![Page 18: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/18.jpg)
Doradus: Multi-Data Center Clusters
Cassandra
Doradus
Cassandra Cassandra
Doradus
Cassandra
Doradus
Cassandra Cassandra
Doradus
Node 1 Node 2 Node 3 Node 4 Node 5 Node 6
Rack 1, Data Center 1 Rack 1, Data Center 2
Applications Applications
DC=2, N=6, RF=3
![Page 19: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/19.jpg)
Doradus: Internal Architecture
App App App Monitor
App
Spider Storage Service
OLAP Storage Service
Cassandra Cluster
JMX
REST: Embedded Jetty Server
Cassandra Interface do
rad
us.
yam
l
REST
![Page 20: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/20.jpg)
Doradus OLAP Service
• Borrows from online analytical processing – Sharding as data "cubes" – Columnar storage
• Very dense storage – No indexes! – Value arrays are compressed
• Fast load time – Up to 500,000 objects/second/node – Small "data lag" time
• Very fast queries – Searches millions of objects/second – Full DQL object and aggregate query support
![Page 21: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/21.jpg)
OLAP Data Loading
Events Events Events
Events Events People
Events Events Computers
Events Events Domains
Sources
![Page 22: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/22.jpg)
OLAP Data Loading
T1 Events Events Events
Events Events People
Events Events Computers
Events Events Domains
T2
T3
T4
T4
Sources Segments
…
Changes in last n minutes
![Page 23: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/23.jpg)
OLAP Data Loading
T1 Events Events Events
Events Events People
Events Events Computers
Events Events Domains
T2
T3
T4
T4
2013-03-01
2013-02-28
2013-02-27
Sources Segments Shards
… …
Changes in last n minutes
Date-based shards
![Page 24: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/24.jpg)
OLAP Data Loading
T1 Events Events Events
Events Events People
Events Events Computers
Events Events Domains
T2
T3
T4
T4
2013-03-01
2013-02-28
2013-02-27
Sources Segments Shards OLAP Store
… …
Changes in last n minutes
Date-based shards
![Page 25: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/25.jpg)
OLAP Use Case
• Data: Windows Events – 115M events
• Test parameters – Server: Quad Xeon CPUs, 32GB memory, 3 disks – Cassandra memory: 1GB – Load app/embedded Doradus memory: 4GB – Load threads: 5 – Batch size: 5,000 events – Shard size: 1 day (860 shards total)
• Test results – Total objects loaded: ~1 billion – Total time: 32 minutes, 56 seconds – Load rate: 502,991 objects/second – Final database size: ~2GB
![Page 26: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/26.jpg)
Doradus Spider Service
• Analogous to Lucene + NoSQL
• Fully inverted field indexing – Configurable analyzers – Stored-only (non-indexed) fields
• Unique features: – Automatic table-level sharding – Statistics
– Pre-computed aggregate queries – Refreshed in background
– Object-level data aging
• Use case example: – Indexing a massive number of documents
![Page 27: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/27.jpg)
OLAP and Spider: When to Use
• Spider is best for: – Unstructured/variable-
structure data – Configurable indexing – Fine-grained updates with
immediate indexing – Document storage and
searching – Emphasis on full-text/multi-
field searching
• OLAP is best for: – High-volume data streams – High performance analytic
queries – Dense data storage – Immutable/semi-mutable
data – Data that can be loaded in
batches – Data that can be partitioned
(e.g., time-sharded)
![Page 28: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/28.jpg)
Summary
• What's cool about Doradus? – Bi-directional links with referential integrity – Link paths: simpler than joins – Idempotent updates – Partial object updates – Simple transitive searching – OLAP: dense storage and fast queries – It's free!
![Page 29: Overiew of Cassandra and Doradus](https://reader033.vdocuments.mx/reader033/viewer/2022052901/55702d0cd8b42aa8558b532d/html5/thumbnails/29.jpg)
Thank you ! Doradus is available at: https://github.com/dell-oss/Doradus Contact me: [email protected]