CFS: Cassandra-backed storage for Hadoop

By Nick Bailey (@nickmbailey)

Upload: nickmbailey

Post on 29-Nov-2014


Page 1: CFS: Cassandra backed storage for Hadoop

CFS
Cassandra-backed storage for Hadoop
Nick Bailey
@nickmbailey

Page 2

©2012 DataStax

Motivation

Page 3

Help me Cassandra, you’re my only hope

Page 4

Cassandra

• Distributed architecture
• No SPOF
• Scalable
• Real-time data
• No ad-hoc query support

Page 5

Cassandra, why can’t you...

Page 6

...do the things Hadoop was built for.

Page 7

Cassandra + Hadoop = <3

Page 8

The Solution

• InputFormat/OutputFormat
• Unfortunately, still need a DFS
• Run tasktrackers/datanodes locally
  • Data Locality FTW!
• Run namenode/jobtracker somewhere
• Since Cassandra 0.6 (the dark ages)

Page 9

Ok, but what about these parts that suck...

Page 10

Do not want...

• Multiple Hadoop stacks?
• SPOF?
• 3 JVMs?

Page 11

CFS

Page 12

Cassandra Data model in 1 minute

Page 13

Column Families

• Column Family ~= Table
• Row Key + columns
• Columns are sparse

Page 14

Static - Users Column Family

Row Key      Columns
nickmbailey  password: *   name: Nick
zznate       password: *   name: Nate   phone: 512-7777
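The sparse-column idea on the two slides above can be sketched as a plain mapping from row key to columns. This is only an in-memory illustration: the dict-of-dicts layout and the `get_column` helper are assumptions made for the example, not a Cassandra API.

```python
# Hypothetical in-memory model of the Users column family (not a Cassandra API).
# Each row key maps to its own set of columns; a row simply omits columns it lacks.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}


def get_column(column_family, row_key, column, default=None):
    """Return one column value, or `default` if the row lacks that column."""
    return column_family.get(row_key, {}).get(column, default)


phone_nate = get_column(users, "zznate", "phone")
phone_nick = get_column(users, "nickmbailey", "phone")
print(phone_nate)  # 512-7777
print(phone_nick)  # None
```

Both rows live in the same column family even though only one of them has a phone column.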

Page 15

Secondary Indexes

SELECT * FROM Users WHERE name = 'Nick';
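Conceptually, a secondary index maps a column value back to the row keys that hold it, so a query by name does not have to scan every row. The sketch below illustrates that idea only; `build_index` is a hypothetical helper, and Cassandra maintains its own indexes internally in a different way.

```python
from collections import defaultdict

# Hypothetical in-memory copy of the Users column family from the earlier slide.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}


def build_index(column_family, column):
    """Map each value of `column` to the set of row keys that contain it."""
    index = defaultdict(set)
    for row_key, columns in column_family.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index


# Build the index once, then answer "where name = 'Nick'" with a lookup.
name_index = build_index(users, "name")
matching_keys = sorted(name_index["Nick"])
matches = [users[key] for key in matching_keys]
print(matching_keys)
```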

Page 16

Dynamic - Friends

Row Key      Columns
nickmbailey  zznate:   thobbs:
zznate       jbeiber:  thobbs:  steve_watt:
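In a dynamic column family like Friends, the column names themselves carry the data and the values are empty. A minimal sketch of that pattern follows; the helper names are hypothetical, and sorting on read stands in for Cassandra keeping the columns of a row in sorted order.

```python
# Hypothetical in-memory model of the Friends column family.
# Column names are the friends; column values are left empty.
friends = {
    "nickmbailey": {"zznate": "", "thobbs": ""},
    "zznate": {"jbeiber": "", "thobbs": "", "steve_watt": ""},
}


def add_friend(column_family, user, friend):
    """Adding a friend is just adding one more column to the user's row."""
    column_family.setdefault(user, {})[friend] = ""


def list_friends(column_family, user):
    """Return the user's column names in sorted order."""
    return sorted(column_family.get(user, {}))


add_friend(friends, "nickmbailey", "steve_watt")
nick_friends = list_friends(friends, "nickmbailey")
print(nick_friends)
```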

Page 17

So what about CFS...

Page 18

Simple...

Page 19

Page 20

CF: inode

• Essentially, namenode replacement
• File metadata

Page 21

Page 22

CF: inode

• Row Key = UUID
• Allows for file renames
• Secondary indexes for file browsing
• Columns:

Column       Value
filename     /home/nick/data.txt
parent_path  /home/nick/
attributes   nick:nick:777
TimeUUID1    <block metadata>
TimeUUID2    <block metadata>
TimeUUID3    <block metadata>
...
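Because the row key is a UUID rather than the path, a rename only has to rewrite the filename and parent_path columns. The sketch below illustrates that with an in-memory dict; the function names are hypothetical, and the linear scan in `list_dir` is only a stand-in for the secondary-index lookup on parent_path.

```python
import posixpath
import uuid

inodes = {}  # row key (UUID) -> columns; hypothetical stand-in for the inode CF


def parent_of(path):
    """Return the parent directory of `path`, with a trailing slash."""
    parent = posixpath.dirname(path)
    return parent if parent.endswith("/") else parent + "/"


def create_file(path, attributes):
    """Insert a new inode row keyed by a fresh UUID."""
    key = uuid.uuid4()
    inodes[key] = {
        "filename": path,
        "parent_path": parent_of(path),
        "attributes": attributes,
    }
    return key


def rename(key, new_path):
    """Rename rewrites two columns; the row key stays the same."""
    inodes[key]["filename"] = new_path
    inodes[key]["parent_path"] = parent_of(new_path)


def list_dir(directory):
    """Stand-in for a secondary-index query on parent_path."""
    return sorted(
        cols["filename"] for cols in inodes.values()
        if cols["parent_path"] == directory
    )


key = create_file("/home/nick/data.txt", "nick:nick:777")
before = list_dir("/home/nick/")
rename(key, "/home/nick/archive/data.txt")
after = list_dir("/home/nick/archive/")
print(before, after)
```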

Page 23

Page 24

CF: sblocks

• Essentially, datanode replacement
• Stores actual contents of files
• Each row is an hdfs block
• Row Key = Block ID

Column       Value
TimeUUID1    <compressed file data>
TimeUUID2    <compressed file data>
TimeUUID3    <compressed file data>
...
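A rough sketch of one sblocks row: the block's data is stored as a sequence of compressed sub-blocks under ordered column names. Several things here are assumptions made for illustration: an integer counter stands in for the TimeUUID column names, `zlib` stands in for whatever compression CFS uses, and the function names are hypothetical.

```python
import itertools
import zlib

_next_id = itertools.count(1)  # stand-in for time-ordered TimeUUID column names

sblocks = {}  # block id -> {sub-block column name: compressed bytes}


def store_block(block_id, data, subblock_size):
    """Split one block into sub-blocks and store each one compressed."""
    row = sblocks.setdefault(block_id, {})
    for offset in range(0, len(data), subblock_size):
        row[next(_next_id)] = zlib.compress(data[offset:offset + subblock_size])


def load_block(block_id):
    """Reassemble a block by reading its sub-block columns in order."""
    row = sblocks[block_id]
    return b"".join(zlib.decompress(row[column]) for column in sorted(row))


payload = b"hello cassandra file system"
store_block("block-1", payload, subblock_size=8)
restored = load_block("block-1")
print(len(sblocks["block-1"]), restored == payload)
```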

Page 25

Page 26

Writes

• Write file metadata
• Split into blocks
  • Still controlled by ‘dfs.block.size’
  • Also ‘cfs.local.subblock.size’
• Read in a block
  • Split into sub blocks
• Update inode, sblocks
• Rinse, repeat
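The write steps above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the CFS implementation: `block_size` and `subblock_size` stand in for ‘dfs.block.size’ and ‘cfs.local.subblock.size’, the two dicts stand in for the inode and sblocks column families, and all function names are hypothetical.

```python
import uuid


def chunks(data, size):
    """Yield consecutive slices of `data`, each at most `size` bytes long."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def write_file(path, data, block_size, subblock_size, inode, sblocks):
    """Write metadata first, then store the file block by block."""
    file_id = uuid.uuid4()
    inode[file_id] = {"filename": path, "blocks": []}       # write file metadata
    for block in chunks(data, block_size):                   # split into blocks
        block_id = uuid.uuid4()
        sblocks[block_id] = list(chunks(block, subblock_size))    # sub-blocks
        inode[file_id]["blocks"].append((block_id, len(block)))   # update inode
    return file_id


inode, sblocks = {}, {}
data = bytes(range(10))
file_id = write_file("/home/nick/data.txt", data, 4, 2, inode, sblocks)

# Reassemble the file from the block list recorded in the inode row.
block_list = inode[file_id]["blocks"]
reassembled = b"".join(b"".join(sblocks[block_id]) for block_id, _ in block_list)
print([length for _, length in block_list], reassembled == data)
```

The tiny sizes are only there to make the splitting visible; real block sizes are orders of magnitude larger.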

Page 27

Page 28

Reads

• Check for file in inode
• Determine appropriate blocks
• Request blocks via Thrift
• If data is local...
  • ...get location on local filesystem
• If data is remote...
  • ...get actual file content via Thrift
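The read steps above can be sketched as a lookup followed by a per-block choice between a local read and a remote fetch. Everything here is a simplified assumption: `local_blocks` models data that can be read directly from the local filesystem, `fetch_remote` is a hypothetical callable standing in for the Thrift request, and the inode is reduced to a path-to-block-list mapping.

```python
def read_file(path, inode, local_blocks, fetch_remote):
    """Return file contents, preferring local data over remote fetches.

    inode:        path -> ordered list of block ids
    local_blocks: block id -> bytes readable on this node
    fetch_remote: callable(block_id) -> bytes (stand-in for a Thrift call)
    """
    if path not in inode:                 # check for file in inode
        raise FileNotFoundError(path)
    parts = []
    for block_id in inode[path]:          # determine appropriate blocks
        if block_id in local_blocks:      # data is local: read it directly
            parts.append(local_blocks[block_id])
        else:                             # data is remote: fetch the content
            parts.append(fetch_remote(block_id))
    return b"".join(parts)


inode = {"/home/nick/data.txt": ["b1", "b2"]}
local_blocks = {"b1": b"hello "}
remote_store = {"b2": b"world"}
remote_calls = []


def fetch_remote(block_id):
    """Record the call so the local/remote split is visible."""
    remote_calls.append(block_id)
    return remote_store[block_id]


content = read_file("/home/nick/data.txt", inode, local_blocks, fetch_remote)
print(content, remote_calls)
```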

Page 29

What Else?

• Current Implementation: 1.0.4
  • <property>
      <name>fs.cfs.impl</name>
      <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
    </property>
• Supports HDFS append()
• Immutability makes things easy
• See the first incarnation
  • https://github.com/riptano/brisk

Page 31

Questions?