Satisfying the Public’s Demand for Cat Videos with Cassandra and Azure
Luke Tillman (@LukeTillman), Language Evangelist at DataStax

Upload: luke-tillman

Post on 05-Jul-2015


DESCRIPTION

Everyone wants to build applications that are scalable and highly available. But how do you build a site that’s capable of withstanding the public’s insatiable demand for sharing cat videos, even if your data center gets hit with a nuclear bomb? In this session we’ll take a look at KillrVideo, an open source video sharing application demo (similar to YouTube) built on Apache Cassandra and Microsoft Azure. You’ll get an introduction to Cassandra, a highly available distributed database, including data modeling (and how it differs from the relational world you probably have experience with), querying with CQL, and interacting with Cassandra from your code. We’ll also touch on using Azure Media Services for processing and streaming video content, as well as how to set up a Cassandra cluster in Azure. While the code samples in this session will be in C#, the same APIs are available and the same concepts apply in other languages (like Java and Python). If you’re interested in learning more about NoSQL solutions, Cassandra, or Azure, this talk will get you started. No kittens were harmed in the making of this talk.

TRANSCRIPT

Page 1: Satisfying the Public's Demand for Cat Videos with Cassandra and Azure (from Microsoft Denver Dev Day 2014)

Satisfying the Public’s Demand for Cat Videos with Cassandra and Azure

Luke Tillman (@LukeTillman)

Language Evangelist at DataStax

Page 2:

Who are you?!

• Evangelist with a focus on the .NET Community

• Long-time Developer

• Recently presented at Cassandra Summit 2014 with Microsoft

• Very Recent Denver Transplant

Page 3:

1 What is this KillrVideo thing you speak of?

2 Cassandra, the really short version

3 CQL: NoSQL, now with more SQL!

4 Breaking the Relational Mindset

5 Putting it all together: Cassandra, Azure, and .NET

Page 4:

What is this KillrVideo thing you speak of?

Page 5:

KillrVideo, a Video Sharing Site

• Think a YouTube competitor

– Users add videos, rate them, comment on them, etc.

– Can search for videos by tag

Page 6:

See the Live Demo, Get the Code

• Live demo available at http://www.killrvideo.com

– Written in C#

– Live Demo running in Azure

– Open source: https://github.com/luketillman/killrvideo-csharp

• Interesting use case because of different data modeling challenges and the scale of something like YouTube

– More than 1 billion unique users visit YouTube each month

– 100 hours of video are uploaded to YouTube every minute

Page 9:

Cassandra, the really short version

Page 10:

What is Cassandra?

• A Linearly Scaling and Fault Tolerant Distributed Database

• Fully Distributed

– Data spread over many nodes

– All nodes participate in a cluster

– All nodes are equal

– No single point of failure (SPOF); shared-nothing architecture

Page 11:

What is Cassandra?

Linearly Scaling

– Have More Data? Add more nodes.

– Need More Throughput? Add more nodes.

Fault Tolerant

– Nodes Down != Database Down

– Datacenter Down != Database Down

Page 12:

What is Cassandra?

• Fully replicated across multiple DCs

• Clients write local

• Data syncs across WAN

• Replication Factor per DC

[Diagram: two data centers, US and Europe, with a client.]

Page 13:

Cassandra and the CAP Theorem

• The CAP Theorem limits what distributed systems can do

– Consistency

– Availability

– Partition Tolerance

• Limits? “Pick 2 out of 3”

• Cassandra is an AP system that is Eventually Consistent

Page 14:

Two knobs control Cassandra fault tolerance

• Replication Factor (server side)

– How many copies of the data should exist?

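[Diagram: a client sends "Write A" to a four-node ring (A, B, C, D) with RF=3. Besides its own range, each node holds replicas of two other ranges: B holds A and D, C holds A and B, A holds C and D, D holds B and C.]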

Page 15:

Two knobs control Cassandra fault tolerance

• Consistency Level (client side)

– How many replicas do we need to hear from before we acknowledge?

[Diagrams: the same four-node ring receiving "Write A" from a client, once with CL=QUORUM and once with CL=ONE.]

Page 16:

Consistency Levels

• Applies to both Reads and Writes (i.e. is set on each query)

• ONE – one replica from any DC

• LOCAL_ONE – one replica from local DC

• QUORUM – a majority (more than half) of replicas from any DC

• LOCAL_QUORUM – a majority (more than half) of replicas from local DC

• ALL – all replicas

• TWO
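Since the consistency level is set on each query, client code can choose it statement by statement. A minimal sketch with the DataStax C# driver (NuGet package CassandraCSharpDriver), assuming a node is reachable at 127.0.0.1 and that the killrvideo keyspace and the users table shown later in the talk already exist; the user id is a made-up placeholder:

```csharp
using System;
using Cassandra;

public static class ConsistencyLevelExample
{
    public static void Main()
    {
        // Assumption: a Cassandra node is listening locally and the
        // "killrvideo" keyspace has already been created.
        Cluster cluster = Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .Build();
        ISession session = cluster.Connect("killrvideo");

        Guid someUserId = Guid.NewGuid(); // hypothetical id, for illustration only

        // Read at LOCAL_QUORUM: a majority of replicas in the local DC must answer.
        IStatement read = new SimpleStatement(
                "SELECT firstname, lastname FROM users WHERE userid = ?", someUserId)
            .SetConsistencyLevel(ConsistencyLevel.LocalQuorum);

        RowSet rows = session.Execute(read);
        foreach (Row row in rows)
        {
            Console.WriteLine(row.GetValue<string>("firstname"));
        }

        cluster.Shutdown();
    }
}
```

A different statement in the same session could use ConsistencyLevel.One instead, trading consistency for lower latency on that query only.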

Page 17:

Consistency Level and Availability

• Consistency Level choice affects availability

• For example, with RF=3, QUORUM can tolerate one replica being down and still be available

[Diagram: a client reads A at CL=QUORUM from the four-node ring (A, B, C, D); the replicas hold A=2.]
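The arithmetic behind that example can be sketched in a few lines. This assumes the usual definition of a quorum as a majority of replicas, floor(RF / 2) + 1; the class and method names are made up for illustration:

```csharp
using System;

public static class QuorumMath
{
    // A quorum is a majority of replicas: floor(RF / 2) + 1.
    public static int QuorumSize(int replicationFactor)
    {
        return replicationFactor / 2 + 1; // integer division floors
    }

    // How many replicas can be down while QUORUM requests still succeed.
    public static int TolerableFailures(int replicationFactor)
    {
        return replicationFactor - QuorumSize(replicationFactor);
    }

    public static void Main()
    {
        foreach (int rf in new[] { 1, 3, 5 })
        {
            Console.WriteLine("RF={0}: quorum={1}, replicas that may be down={2}",
                rf, QuorumSize(rf), TolerableFailures(rf));
        }
        // prints:
        // RF=1: quorum=1, replicas that may be down=0
        // RF=3: quorum=2, replicas that may be down=1
        // RF=5: quorum=3, replicas that may be down=2
    }
}
```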

Page 18:

Eventual Consistency

• Cassandra is an AP system that is Eventually Consistent, so replicas may disagree

• Column values are timestamped

• In Cassandra, Last Write Wins (LWW)

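[Diagram: a client reads A at CL=QUORUM from the four-node ring. One replica returns the older value A=1, another returns the newer value A=2, and the client receives A=2.]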
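Last Write Wins can be sketched as "take the response with the highest write timestamp". This is only an illustration of the idea, not Cassandra's actual read path; the types, replica names, and timestamps are invented, and ties between equal timestamps are ignored:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One replica's answer for a column: the value plus its write timestamp.
public sealed class ReplicaResponse
{
    public string Replica { get; set; }
    public int Value { get; set; }
    public long WriteTimestampMicros { get; set; }
}

public static class LastWriteWins
{
    // Pick the response carrying the most recent write timestamp.
    public static ReplicaResponse Resolve(IEnumerable<ReplicaResponse> responses)
    {
        return responses.OrderByDescending(r => r.WriteTimestampMicros).First();
    }

    public static void Main()
    {
        // Two replicas answered a QUORUM read (RF=3) and they disagree.
        var responses = new List<ReplicaResponse>
        {
            new ReplicaResponse { Replica = "B", Value = 1, WriteTimestampMicros = 1000 }, // older
            new ReplicaResponse { Replica = "C", Value = 2, WriteTimestampMicros = 2000 }  // newer
        };

        ReplicaResponse winner = Resolve(responses);
        Console.WriteLine("A={0} (from replica {1})", winner.Value, winner.Replica);
        // prints: A=2 (from replica C)
    }
}
```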

Page 19:

CQL: NoSQL, now with more SQL!

Page 20:

Schema Definition (DDL)

• Easy to define tables for storing data

• First part of Primary Key is the Partition Key

CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name text,
    description text,
    preview_image_location text,
    tags set<text>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);

Page 21:

Partition Key Determines Data Distribution

• Partition Key determines node placement

videoid (Partition Key) | name | description | ...
689d56e5-… | Keyboard Cat | Keyboard Cat is the ... | ...
93357d73-… | Nyan Cat | Check out Nyan cat ... | ...
d978b136-… | Original Grumpy Cat | Visit Grumpy Cat’s … | ...

Page 22:

Partition Key – Hashing

• The Partition Key is hashed using a consistent hashing function (Murmur3) and the output is used to place the data on a node

• The data is also replicated to RF-1 other nodes

[Diagram: the Partition Key videoid 689d56e5-... (the Keyboard Cat row) is hashed with Murmur3 to "A", so the row is placed on node A of the four-node ring (RF=3).]

Page 23:

Hashing – Back to Reality

• Back in reality, Murmur3 hashes Partition Keys to 128 bit numbers (Cassandra's Murmur3Partitioner keeps 64 bits of the hash as the token)

• Nodes in Cassandra own token ranges (i.e. hash ranges)

[Diagram: the four-node ring (A, B, C, D). The Partition Key videoid 689d56e5-... is hashed with Murmur3 to 0xadb95e99da887a8a4cb474db86eb5769.]

Range | Start | End
A | 0xC000000..1 | 0x0000000..0
B | 0x0000000..1 | 0x4000000..0
C | 0x4000000..1 | 0x8000000..0
D | 0x8000000..1 | 0xC000000..0
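Looking up the owner of a token is a search over those ranges. The sketch below is a simplification under several assumptions: tokens are shortened to unsigned 64-bit numbers, the ring has exactly the four ranges from the slide, and the extra replicas go to the next nodes clockwise (which is consistent with the replica sets drawn on the ring). All names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenRing
{
    // Nodes in clockwise ring order. Each node owns the range that ends at
    // (and includes) its end token, mirroring the slide's table.
    private static readonly string[] Nodes = { "A", "B", "C", "D" };
    private static readonly ulong[] EndTokens =
    {
        0x0000000000000000UL, // A: starts after D's end token and wraps around to 0
        0x4000000000000000UL, // B
        0x8000000000000000UL, // C
        0xC000000000000000UL  // D
    };

    // Index of the node whose range contains the token.
    public static int OwnerIndex(ulong token)
    {
        for (int i = 0; i < EndTokens.Length; i++)
        {
            if (token <= EndTokens[i]) return i;
        }
        return 0; // past D's end token: wraps around into A's range
    }

    // The owner plus the next (rf - 1) nodes clockwise.
    public static List<string> Replicas(ulong token, int rf)
    {
        int owner = OwnerIndex(token);
        return Enumerable.Range(0, rf)
            .Select(k => Nodes[(owner + k) % Nodes.Length])
            .ToList();
    }

    public static void Main()
    {
        // Illustrative token: the first 64 bits of the hash shown on the slide.
        ulong token = 0xadb95e99da887a8aUL;
        Console.WriteLine(string.Join(", ", Replicas(token, 3)));
        // prints: D, A, B
    }
}
```

With this table the example hash lands in range D (between 0x80... and 0xC0...), so with RF=3 the row would live on D, A, and B.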

Page 24:

Clustering Columns

• Second part of Primary Key is Clustering Column(s)

• Clustering columns affect ordering of data (on disk)

• Ascending/Descending order is possible

CREATE TABLE comments_by_video (
    videoid uuid,
    commentid timeuuid,
    userid uuid,
    comment text,
    PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);

Page 25:

Clustering Columns – Wide Rows

• Use of Clustering Columns (and the layout on disk) is where the term “Wide Rows” comes from

[Diagram: one partition laid out as a wide row, ordered by commentid descending.]

videoid='0fe6a...'
    commentid='82be1...' (10/1/2014 9:36AM): userid='ac346...', comment='Awesome!'
    commentid='765ac...' (9/17/2014 7:55AM): userid='f89d3...', comment='Garbage!'

CREATE TABLE comments_by_video (
    videoid uuid,
    commentid timeuuid,
    userid uuid,
    comment text,
    PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
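One way to picture a wide row is as a map from partition key to a sorted map of clustering key to row. The sketch below is an in-memory analogy only, not how Cassandra stores data; it assumes a plain DateTime stands in for the timeuuid clustering column, and every name in it is invented:

```csharp
using System;
using System.Collections.Generic;

public static class WideRowSketch
{
    // Sorts clustering keys newest first, like CLUSTERING ORDER BY (commentid DESC).
    private static readonly IComparer<DateTime> NewestFirst =
        Comparer<DateTime>.Create((a, b) => b.CompareTo(a));

    // Partition key (videoid) -> comments kept sorted by clustering key.
    private static readonly Dictionary<string, SortedDictionary<DateTime, string>> Partitions =
        new Dictionary<string, SortedDictionary<DateTime, string>>();

    public static void AddComment(string videoId, DateTime commentTime, string comment)
    {
        SortedDictionary<DateTime, string> partition;
        if (!Partitions.TryGetValue(videoId, out partition))
        {
            partition = new SortedDictionary<DateTime, string>(NewestFirst);
            Partitions[videoId] = partition;
        }
        partition[commentTime] = comment; // writing the same key again overwrites
    }

    // All comments of one partition, already in clustering order.
    public static List<string> Comments(string videoId)
    {
        return new List<string>(Partitions[videoId].Values);
    }

    public static void Main()
    {
        AddComment("0fe6a...", new DateTime(2014, 9, 17, 7, 55, 0), "Garbage!");
        AddComment("0fe6a...", new DateTime(2014, 10, 1, 9, 36, 0), "Awesome!");

        foreach (string comment in Comments("0fe6a..."))
        {
            Console.WriteLine(comment);
        }
        // prints:
        // Awesome!
        // Garbage!
    }
}
```

Reading a video's comments touches a single partition and returns them already sorted, which is why the table is keyed this way.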

Page 26:

Inserts and Updates

• Use INSERT or UPDATE to add and modify data

• Both will overwrite data (no constraints like RDBMS)

• INSERT and UPDATE are functionally equivalent

INSERT INTO comments_by_video (videoid, commentid, userid, comment)
VALUES ('0fe6a...', '82be1...', 'ac346...', 'Awesome!');

UPDATE comments_by_video
SET userid = 'ac346...', comment = 'Awesome!'
WHERE videoid = '0fe6a...' AND commentid = '82be1...';

Page 27:

TTL and Deletes

• Can specify a Time to Live (TTL) in seconds when doing an INSERT or UPDATE

• Use DELETE statement to remove data

• Can optionally specify columns to remove part of a row

INSERT INTO comments_by_video ( ... ) VALUES ( ... ) USING TTL 86400;

DELETE FROM comments_by_video
WHERE videoid = '0fe6a...' AND commentid = '82be1...';

Page 28:

Querying

• Use SELECT to get data from your tables

• Always include Partition Key and optionally Clustering Columns

• Can use ORDER BY (on Clustering Columns) and LIMIT

• Use range queries (for example, by date) to slice partitions

SELECT * FROM comments_by_video
WHERE videoid = 'a67cd...'
LIMIT 10;
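A range query that slices a partition by date might look like the following from C#. This is a sketch under assumptions: a node at 127.0.0.1, the killrvideo keyspace with the comments_by_video table from the earlier slides, a made-up video id, and an arbitrary cutoff date passed to CQL's minTimeuuid function:

```csharp
using System;
using Cassandra;

public static class CommentRangeQuery
{
    public static void Main()
    {
        // Assumption: a local node and an existing "killrvideo" keyspace.
        Cluster cluster = Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .Build();
        ISession session = cluster.Connect("killrvideo");

        Guid videoId = Guid.NewGuid(); // hypothetical id, for illustration only

        // The Partition Key (videoid) is fixed; the range on the clustering
        // column (commentid) slices the partition by time.
        PreparedStatement prepared = session.Prepare(
            "SELECT commentid, userid, comment FROM comments_by_video " +
            "WHERE videoid = ? " +
            "AND commentid > minTimeuuid('2014-09-01 00:00+0000') " +
            "LIMIT 10");

        RowSet rows = session.Execute(prepared.Bind(videoId));
        foreach (Row row in rows)
        {
            Console.WriteLine(row.GetValue<string>("comment"));
        }

        cluster.Shutdown();
    }
}
```

Because the table clusters by commentid descending, the ten rows returned are the newest comments after the cutoff.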

Page 29:

Breaking the Relational Mindset

Page 30:

Breaking the Relational Mindset

• How do we data model when we have to query by the Partition Key (and optionally Clustering Columns)?

• Denormalize all the things!

• Disk is cheap now and writes in Cassandra are FAST

• Data modeling is very much query driven

• Many times we end up with a “table per query”

Page 31:

Users – The Relational Way

• Single Users table with all user data and an Id Primary Key

• Add an index on email address to allow queries by email

[Diagram: two application workflows and the query each one needs. "User logs into site" needs "Find user by email address"; "Show basic information about user" needs "Find user by id".]

Page 32:

Users – The Cassandra Way

[Diagram: the same two workflows, each backed by its own table. "User logs into site" needs "Find user by email address" (user_credentials); "Show basic information about user" needs "Find user by id" (users).]

CREATE TABLE user_credentials (
    email text,
    password text,
    userid uuid,
    PRIMARY KEY (email)
);

CREATE TABLE users (
    userid uuid,
    firstname text,
    lastname text,
    email text,
    created_date timestamp,
    PRIMARY KEY (userid)
);
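With a table per query, each workflow becomes a single-partition lookup. A rough sketch of both lookups with the DataStax C# driver, assuming a node at 127.0.0.1, the killrvideo keyspace with the two tables above, and a made-up email address (password checking is left out):

```csharp
using System;
using System.Linq;
using Cassandra;

public static class UserLookups
{
    public static void Main()
    {
        // Assumption: a local node and an existing "killrvideo" keyspace.
        Cluster cluster = Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .Build();
        ISession session = cluster.Connect("killrvideo");

        PreparedStatement byEmail = session.Prepare(
            "SELECT userid, password FROM user_credentials WHERE email = ?");
        PreparedStatement byId = session.Prepare(
            "SELECT firstname, lastname, email FROM users WHERE userid = ?");

        // Workflow 1: user logs into site -> find user by email address.
        string email = "someone@example.com"; // hypothetical
        Row credentials = session.Execute(byEmail.Bind(email)).FirstOrDefault();
        if (credentials == null)
        {
            Console.WriteLine("No user with that email");
        }
        else
        {
            // Workflow 2: show basic information -> find user by id.
            Guid userId = credentials.GetValue<Guid>("userid");
            Row user = session.Execute(byId.Bind(userId)).FirstOrDefault();
            if (user != null)
            {
                Console.WriteLine("{0} {1}",
                    user.GetValue<string>("firstname"),
                    user.GetValue<string>("lastname"));
            }
        }

        cluster.Shutdown();
    }
}
```

Neither query needs a secondary index, because each table's Partition Key is exactly what its workflow searches by.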

Page 33:

Considerations When Duplicating Data

• Can the data change?

• How likely is it to change or how frequently will it change?

• Do I have all the information I need to update duplicates and maintain consistency?

• Just scratching the surface of data modeling examples here

Page 34:

Putting it all together: Cassandra, Azure, and .NET

Page 35:

KillrVideo on Azure

• Web UI: HTML5 / JavaScript (KnockoutJS, jQuery, Bootstrap, etc.)

• KillrVideo Web App: C# MVC Web Application, Azure Web Role

– Serves up UI, JSON Endpoints

• KillrVideo Upload Worker: C#, Azure Worker Role

– Monitors encoding job events, publishes completed uploads

• Cassandra Cluster (DSE)

– App data storage (video metadata, comments, users, ratings, etc.)

• OpsCenter

– Provisioning, monitoring, management

• Azure Media Services

– Uploaded video encoding, thumbnail generation, video access URI generation

• Azure Storage

– Queues: notifications on encoding job progress

– Blob: uploaded video storage

Page 36:

Deploying Cassandra in Azure

• Cassandra is a JVM application and should be deployed on Linux VMs (parity on Windows is coming – 3.0?)

• IOPS is super important (recommend A7 instances for production, A4 for testing and development)

• New SSD instances in Azure look promising

• In-depth documentation and scripts available to help

Page 37:

.NET and Cassandra

• The DataStax C# driver is Open Source (on GitHub), available via NuGet

• Bootstrap using the Builder and then reuse the ISession object

Cluster cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")
    .Build();
ISession session = cluster.Connect("killrvideo");

Page 38:

.NET and Cassandra

• Executing CQL

• Sync and Async API available

var statement = new SimpleStatement("SELECT * FROM users WHERE userid = ?");
statement = statement.Bind(someUserId); // someUserId is a Guid, because userid is a uuid column
RowSet rows = await session.ExecuteAsync(statement);

Page 39:

.NET and Cassandra

• Getting values from a RowSet is easy

• RowSet is a collection of Row (IEnumerable<Row>)

RowSet rows = await session.ExecuteAsync(statement);
foreach (Row row in rows)
{
    var videoId = row.GetValue<Guid>("videoid");
    var addedDate = row.GetValue<DateTimeOffset>("added_date");
    var name = row.GetValue<string>("name");
}

Page 40:

.NET and Cassandra

• Mapping results to DTOs: if you like using CQL, try the CqlPoco package

• Note: This package may be pulled into the official driver soon.

public class User
{
    public Guid UserId { get; set; }
    public string Name { get; set; }
}

// Get a user by id from Cassandra or null if not found
var user = client.SingleOrDefault<User>(
    "SELECT userid, name FROM users WHERE userid = ?", someUserId);

Page 41:

.NET and Cassandra

• Mapping results to DTOs: if you like LINQ, use the built-in LINQ provider

[Table("users")]
public class User
{
    [Column("userid"), PartitionKey]
    public Guid UserId { get; set; }

    [Column("name")]
    public string Name { get; set; }
}

var user = session.GetTable<User>()
    .SingleOrDefault(u => u.UserId == someUserId)
    .Execute();

Page 42:

Some Tips for .NET and Cassandra

• Look at Prepared Statements in the documentation for an easy performance optimization

• Take advantage of the async API to run queries in parallel

• Don’t write boilerplate mapping code—use LINQ or CqlPoco
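The first two tips combine naturally: prepare a statement once, then start several queries before awaiting any of them. A hedged sketch with the DataStax C# driver, assuming a node at 127.0.0.1, the killrvideo keyspace with the videos table from earlier, and made-up video ids; the helper name GetVideosAsync is invented:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Cassandra;

public static class ParallelQueries
{
    // Starts one query per id, then waits for all of them together.
    public static async Task<RowSet[]> GetVideosAsync(
        ISession session, PreparedStatement prepared, Guid[] videoIds)
    {
        Task<RowSet>[] pending = videoIds
            .Select(id => session.ExecuteAsync(prepared.Bind(id)))
            .ToArray(); // ToArray forces every query to start now
        return await Task.WhenAll(pending);
    }

    public static void Main()
    {
        // Assumption: a local node and an existing "killrvideo" keyspace.
        Cluster cluster = Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .Build();
        ISession session = cluster.Connect("killrvideo");

        // Prepare once and reuse: the server parses the CQL a single time.
        PreparedStatement prepared = session.Prepare(
            "SELECT videoid, name FROM videos WHERE videoid = ?");

        Guid[] videoIds = { Guid.NewGuid(), Guid.NewGuid() }; // hypothetical ids

        RowSet[] results = GetVideosAsync(session, prepared, videoIds)
            .GetAwaiter().GetResult();
        foreach (RowSet rows in results)
        {
            foreach (Row row in rows)
            {
                Console.WriteLine(row.GetValue<string>("name"));
            }
        }

        cluster.Shutdown();
    }
}
```

In a web application the calling code would normally await GetVideosAsync instead of blocking; the blocking call here only keeps the console example short.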

Page 44:

Questions?

Follow me on Twitter for updates or to ask questions later: @LukeTillman