Hydroelectric Dam - Carnegie Mellon University


Page 1: Hydroelectric Dam - Carnegie Mellon University

FAWN: A Fast Array of Wimpy Nodes

!"#$%&'(%)*+)(,&-"+.(&/*"(01$(,&2$34")1&5"6$(+078,

'6"*&94"($+4"7)),&:";*)(3)&<"(,&!"#$%&!$'()*+$,

="*()>$)&2)11.(&?($#)*+$@7&&&&&&&&8A(@)1&:"B+&9$C+BD*>4

Energy in computing

• Power is a significant burden on computing

• 3-year TCO soon to be dominated by power

2

Hydroelectric Dam

Can we reduce energy use by a factor of ten?

-.//&'*0+*&12*&'$3*&4506/5$)'&

7+5")&",80*$'",9&8$:"1$/&85'13

FAWN/"+@&'**"7&.K&L$6F7&M.%)+

A6F*.#)&3.6FD@"G.("1&)N3$)(37&.K&

%"@"I$(@)(+$#)&3.6FDG(>&D+$(>&"(&"**"7&

.K&;)11IB"1"(3)%&1.;IF.;)*&+7+@)6+O

Understanding Data-Intensive Workloads on FAWN

Iulian Moraru, Lawrence Tan, Vijay Vasudevan (15-849 Low Power Project)

Abstract

In this paper, we explore the use of the Fast Array of Wimpy Nodes (FAWN) architecture for a wide class of data-intensive workloads. While the benefits of FAWN are highest for I/O-bound workloads where the CPU cycles of traditional machines are often wasted, we find that CPU-bound workloads run on FAWN can be up to six times more efficient in work done per Joule of energy than traditional machines.

1 Introduction

Power has become a dominating factor in the cost of provisioning and operation of large datacenters. This work focuses on one promising approach to reduce both average and peak power using a Fast Array of Wimpy Nodes (FAWN) architecture [2], which proposes using a large cluster of low-power nodes instead of a cluster of traditional, high-power nodes. FAWN (Figure 1) was originally designed to target mostly I/O-bound workloads, where the additional processing capabilities of high-speed processors were often wasted. While the FAWN architecture has been shown to be significantly more energy efficient than traditional architectures for seek-bound workloads [16], an open question is whether this architecture is well-suited for other data-intensive workloads common in cluster-based computing.

Recent work has shown that the FAWN architecture benefits from fundamental trends in computing and power: running at a lower speed saves energy, while the low-power processors used in FAWN are significantly more efficient in work done per joule [16]. Combined with the inherent parallelism afforded by popular computing frameworks

!"#$

%&'

()*

()* %&' +,-#.

()* %&' +,-#.

()* %&' +,-#.

()* %&' +,-#.

()*

()* ()*()* %&' +,-#.

()* %&' +,-#.

()* %&' +,-#.

()* %&' +,-#.

/001 201

34-5"6"78-,

9&4:&4+;1<

Figure 1: FAWN architecture

such as Hadoop [1] and Dryad [9], a FAWN system's improvement in both CPU-I/O balance and instruction efficiency allows for an increase in overall energy efficiency for a much larger class of datacenter workloads.

In this work, we present a taxonomy and analysis of some primitive data-intensive workloads and operations to understand when FAWN can perform as well as a traditional cluster architecture and reduce energy consumption for data centers.

To understand where FAWN improves energy efficiency for data-intensive computing, we investigate a wide range of benchmarks common in frameworks such as Hadoop [1], finding that FAWN is between three to ten times more efficient than a traditional machine in performing operations such as distributed grep and sort. For more CPU-bound operations, such as encryption and compression, FAWN architectures are still between three to six times more energy efficient. We believe these two categories of workloads, CPU-bound and I/O-bound, encompass a large enough range of com-


'2!&P).%)

QRS2T&!U'2

VPT&=.6F"3@/1"+4

4

Page 2: Hydroelectric Dam - Carnegie Mellon University

Goal: reduce peak power

5

<*"%$G.("1&!"@"3)(@)*

9.;)*

=..1$(>

!$+@*$BDG.(QWX

QWX&)()*>7&1.++

Y>..%Z

{[WWWL

\RWL

100%

FAWN

[WWL

][WWL

^)*#)*+

Overview

! T"30>*.D(%

! ;7<=&>0",8":/*'

! /'LMI5_&!)+$>(

! `#"1D"G.(

!2.*)&L.*01."%+

! =.(31D+$.(

6

Towards balanced systems

7

[Chart: CPU cycle time versus disk seek time (and DRAM access time), 1980-2005, in nanoseconds on a log scale from 1E-01 to 1E+08; the widening gap is labeled "wasted resources".]

U)B"1"(3$(>&JFG.(+

<.%"7e+&=9?+

'**"7&.K

/"+@)+@&!$+0+

^1.;)*&=9?+

/"+@&^@.*">)

^1.;&=9?+

<.%"7e+&!$+0+

?B7C&788*''

^F))%&#+O&`N3$)(37/"+@)+@&F*.3)++.*+

)f4$B$@&+DF)*1$()"*

F.;)*&D+">)

/$f)%&F.;)*&3.+@+&3"(

%.6$("@)&)N3$)(37

K.*&+1.;&F*.3)++.*+

/'LM&@"*>)@+&+;))@&+F.@

$(&+7+@)6&)N3$)(37&;4)(&

$(31D%$(>&Ef)%&3.+@+

8

YA(31D%)+&WO[L&F.;)*&.#)*4)"%Z

Targeting the sweet-spot in efficiency

Page 3: Hydroelectric Dam - Carnegie Mellon University

9

<.%"7e+&=9?

'**"7&.K

/"+@)+@&!$+0+

^1.;)*&=9?

/"+@&^@.*">)^1.;&=9?

<.%"7e+&!$+0

FAWN

2.*)&)N3$)(@

Targeting the sweet-spot in efficiency

Overview

• Background

• FAWN Principles

• FAWN-KV Design

• Architecture

• Constraints

• Evaluation

• More Workloads

• Conclusion

10

Data-intensive key value

! =*$G3"1&$(K*"+@*D3@D*)&+)*#$3)

! ^)*#$3)&1)#)1&">*))6)(@+&K.*&F)*K.*6"(3)g1"@)(37

! U"(%.6I"33)++,&*)"%I6.+@17,&4"*%&@.&3"34)

11

FAWN-KV: Key Value System

! `()*>7I)N3$)(@&31D+@)*&0)7I#"1D)&+@.*)

! P."1h&$6F*.#)&F(*0"*'GH5(/*

! 9*.@.@7F)h&'1$fH3Q&(.%)+&;$@4&i"+4&+@.*">)

! RWW2jk&=9?,&QRS2T&!U'2,&VPT&=.6F"3@/1"+4

12

Page 4: Hydroelectric Dam - Carnegie Mellon University

FAWN-KV: Key Value System

• Energy-efficient cluster key-value store

• Goal: improve Queries/Joule

• Prototype: Alix3c2 nodes with flash storage

• 500MHz CPU, 256MB DRAM, 4GB CompactFlash

13

Unique Challenges:

• Efficient and fast failover

• Wimpy CPUs, limited DRAM

• Flash poor at small random writes

FAWN-KV architecture

14

[Diagram: a front-end (X) connects through a switch to FAWN back-ends running FAWN-DS, arranged on a consistent-hashing KV ring (virtual IDs A1, A2, B1, B2, D1, D2, E1, E2, F1, F2). The front-end manages back-ends, acts as gateway, and routes requests; a sketch of the consistent hashing step follows.]

FAWN-KV architecture

15

[Diagram: the same front-end/back-end ring, annotated with the design constraints: FAWN-DS must avoid random writes on limited resources, and FAWN-KV must provide efficient failover while avoiding random writes.]

From key to value

16

[Figure 2 artwork: the 160-bit key, the in-memory Hash Index (KeyFrag, Valid, Offset), and the Data Log entries (KeyFrag, Key, Len, Data); panels (a), (b), (c) show appended inserts, Scan and Split with concurrent inserts, and the atomic update of the datastore list.]

Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transferring out-of-range entries to the new store. (c) After scan is complete, the datastore list is atomically updated to add the new store. Compaction of the original store will clean up out-of-range entries.
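As a small illustration of the Split and Compact steps described in the caption, the sketch below walks an in-memory list standing in for the Data Log; the helper names and the toy range predicate are hypothetical, not FAWN-DS code.

import hashlib

# Illustrative Split/Compact over in-memory lists standing in for Data Logs.
def key_hash(key: bytes) -> int:
    return int.from_bytes(hashlib.sha1(key).digest(), "big")

def split(old_log, in_new_range):
    """Sequential scan: copy entries that now belong to the new store's range."""
    return [(k, v) for k, v in old_log if in_new_range(key_hash(k))]

def compact(old_log, in_new_range):
    """Clean up the now out-of-range entries left behind in the original store."""
    return [(k, v) for k, v in old_log if not in_new_range(key_hash(k))]

log = [(b"a", b"1"), (b"b", b"2"), (b"c", b"3")]
in_new_range = lambda h: h % 2 == 0      # toy stand-in for a key-range test
new_log = split(log, in_new_range)       # (b) transfer entries to the new store
log = compact(log, in_new_range)         # (c) after the datastore list is updated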

3.2 Understanding Flash Storage

Flash provides a non-volatile memory store with several significant benefits over typical magnetic hard disks for random-access, read-intensive workloads, but it also introduces several challenges.

Three characteristics of flash underlie the design of the FAWN-KV system described throughout this section:

1. Fast random reads: (≪ 1 ms), up to 175 times faster than random reads on magnetic disk [35, 40].

2. Efficient I/O: Flash devices consume less than one Watt even under heavy load, whereas mechanical disks can consume over 10 W at load. Flash is over two orders of magnitude more efficient than mechanical disks in terms of queries/Joule.

3. Slow random writes: Small writes on flash are very expensive. Updating a single page requires first erasing an entire erase block (128 KB–256 KB) of pages, and then writing the modified block in its entirety. As a result, updating a single byte of data is as expensive as writing an entire block of pages [37].

Modern devices improve random write performance using write buffering and preemptive block erasure. These techniques improve performance for short bursts of writes, but recent studies show that sustained random writes still perform poorly on these devices [40].

These performance problems motivate log-structured techniques for flash filesystems and data structures [36, 37, 23]. These same considerations inform the design of FAWN's node storage management system, described next.
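A quick back-of-the-envelope check of the queries-per-Joule claim, using only the figures quoted above (flash at under one Watt with roughly millisecond random reads; disk at roughly 10 W and about 175 times slower); the exact numbers are illustrative.

# Rough queries/Joule comparison from the figures quoted in the text.
flash_watts, flash_read_s = 1.0, 0.001        # <1 W, ~1 ms random read
disk_watts, disk_read_s = 10.0, 0.001 * 175   # ~10 W, ~175x slower reads

flash_qpj = (1.0 / flash_read_s) / flash_watts   # ~1000 queries/Joule
disk_qpj = (1.0 / disk_read_s) / disk_watts      # well under 1 query/Joule
print(flash_qpj / disk_qpj)                      # over two orders of magnitude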

3.3 The FAWN Data Store

FAWN-DS is a log-structured key-value store. Each store contains values for the key range associated with one virtual ID. It acts to clients like a disk-based hash table that supports Store, Lookup, and Delete.¹

FAWN-DS is designed specifically to perform well on flash storage and to operate within the constrained DRAM available on wimpy nodes: all writes to the datastore are sequential, and reads require a single random access. To provide this property, FAWN-DS maintains an in-DRAM hash table (Hash Index) that maps keys to an offset in the append-only Data Log on flash (Figure 2a). This log-structured design is similar to several append-only filesystems [42, 15], which avoid random seeks on magnetic disks for writes.

¹ We differentiate datastore from database to emphasize that we do not provide a transactional or relational interface.
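Before the paper's pseudocode, here is a minimal in-memory analogue of the store/lookup/delete interface, with a Python dict standing in for the Hash Index and a list standing in for the on-flash Data Log; it is a sketch of the idea, not the FAWN-DS implementation.

# Append-only log plus in-memory index (illustrative analogue of FAWN-DS).
class TinyLogStore:
    def __init__(self):
        self.log = []       # stands in for the append-only Data Log on flash
        self.index = {}     # stands in for the in-DRAM Hash Index

    def store(self, key, value):
        self.index[key] = len(self.log)   # index points at the newest entry
        self.log.append((key, value))     # every write is a sequential append

    def lookup(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        stored_key, value = self.log[offset]   # one "random" read of the log
        return value if stored_key == key else None

    def delete(self, key):
        self.log.append((key, None))      # deletes append a tombstone entry
        self.index.pop(key, None)

db = TinyLogStore()
db.store("k1", "v1")
assert db.lookup("k1") == "v1"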

/* KEY = 0x93df7317294b99e3e049, 16 index bits */
INDEX = KEY & 0xffff;              /* = 0xe049; */
KEYFRAG = (KEY >> 16) & 0x7fff;    /* = 0x19e3; */
for i = 0 to NUM_HASHES do
    bucket = hash[i](INDEX);
    if bucket.valid && bucket.keyfrag == KEYFRAG &&
       readKey(bucket.offset) == KEY then
        return bucket;
    end if
    {Check next chain element...}
end for
return NOT_FOUND;

Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

Mapping a Key to a Value. FAWN-DS uses an in-memory (DRAM) Hash Index to map 160-bit keys to a value stored in the Data Log. It stores only a fragment of the actual key in memory to find a location in the log; it then reads the full key (and the value) from the log and verifies that the key it read was, in fact, the correct key. This design trades a small and configurable chance of requiring two reads from flash (we set it to roughly 1 in 32,768 accesses) for drastically reduced memory requirements (only six bytes of DRAM per key-value pair).

Figure 3 shows the pseudocode that implements this design for Lookup. FAWN-DS extracts two fields from the 160-bit key: the i low order bits of the key (the index bits) and the next 15 low order bits (the key fragment). FAWN-DS uses the index bits to select a bucket from the Hash Index, which contains 2^i hash buckets. Each bucket is only six bytes: a 15-bit key fragment, a valid bit, and a 4-byte pointer to the location in the Data Log where the full entry is stored.

Lookup proceeds, then, by locating a bucket using the index bits and comparing the key against the key fragment. If the fragments do not match, FAWN-DS uses hash chaining to continue searching the hash table. Once it finds a matching key fragment, FAWN-DS reads the record off of the flash. If the stored full key in the on-flash record matches the desired lookup key, the operation is complete. Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record.

The constants involved (15 bits of key fragment, 4 bytes of log pointer) target the prototype FAWN nodes described in Section 4.
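The six-byte bucket and the 1-in-32,768 figure follow directly from the bit layout described above; the packing below is one possible encoding (the field order and the use of Python's struct module are assumptions for illustration).

import struct

# One way to pack the six-byte bucket: 15-bit key fragment plus a valid bit
# in two bytes, followed by a 4-byte offset into the Data Log.
def pack_bucket(keyfrag15: int, valid: bool, offset: int) -> bytes:
    word = (keyfrag15 & 0x7FFF) | (0x8000 if valid else 0)
    return struct.pack("<HI", word, offset)        # 2 + 4 = 6 bytes

assert len(pack_bucket(0x19E3, True, 123456)) == 6

# A non-matching key collides with a 15-bit fragment with probability 2^-15:
print(1 / 2 ** 15)   # 1/32768, the "1 in 32,768" figure in the text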

[SWIB$@&0)7

Data Log In-memoryHash Index

Log Entry

KeyFrag Valid Offset

160-bit Key

KeyFrag

Key Len Data

Inserted valuesare appended

Scan and Split

ConcurrentInserts

Datastore List Datastore ListData in new rangeData in original range Atomic Update

of Datastore List

(a) (b) (c)

Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transfer-

ring out-of-range entries to the new store. (c) After scan is complete, the datastore list is atomically updated to add the new store.

Compaction of the original store will clean up out-of-range entries.

3.2 Understanding Flash Storage

Flash provides a non-volatile memory store with several signifi-cant benefits over typical magnetic hard disks for random-access,read-intensive workloads—but it also introduces several challenges.

Three characteristics of flash underlie the design of the FAWN-KVsystem described throughout this section:

1. Fast random reads: (� 1 ms), up to 175 times faster thanrandom reads on magnetic disk [35, 40].

2. Efficient I/O: Flash devices consume less than one Watt evenunder heavy load, whereas mechanical disks can consume over10 W at load. Flash is over two orders of magnitude moreefficient than mechanical disks in terms of queries/Joule.

3. Slow random writes: Small writes on flash are very expen-sive. Updating a single page requires first erasing an entireerase block (128 KB–256 KB) of pages, and then writing themodified block in its entirety. As a result, updating a single byteof data is as expensive as writing an entire block of pages [37].

Modern devices improve random write performance using write

buffering and preemptive block erasure. These techniques improveperformance for short bursts of writes, but recent studies show thatsustained random writes still perform poorly on these devices [40].

These performance problems motivate log-structured techniquesfor flash filesystems and data structures [36, 37, 23]. These sameconsiderations inform the design of FAWN’s node storage manage-ment system, described next.

3.3 The FAWN Data Store

FAWN-DS is a log-structured key-value store. Each store containsvalues for the key range associated with one virtual ID. It acts toclients like a disk-based hash table that supports Store, Lookup,and Delete.1

FAWN-DS is designed specifically to perform well on flash stor-age and to operate within the constrained DRAM available on wimpynodes: all writes to the datastore are sequential, and reads require asingle random access. To provide this property, FAWN-DS maintainsan in-DRAM hash table (Hash Index) that maps keys to an offset in

the append-only Data Log on flash (Figure 2a). This log-structureddesign is similar to several append-only filesystems [42, 15], which

avoid random seeks on magnetic disks for writes.

1We differentiate datastore from database to emphasize that we do not provide atransactional or relational interface.

/* KEY = 0x93df7317294b99e3e049, 16 index bits */INDEX = KEY & 0xffff; /* = 0xe049; */KEYFRAG = (KEY >> 16) & 0x7fff; /* = 0x19e3; */for i = 0 to NUM HASHES do

bucket = hash[i](INDEX);if bucket.valid && bucket.keyfrag==KEYFRAG &&

readKey(bucket.offset)==KEY then

return bucket;end if

{Check next chain element...}end for

return NOT FOUND;

Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

Mapping a Key to a Value. FAWN-DS uses an in-memory(DRAM) Hash Index to map 160-bit keys to a value stored in theData Log. It stores only a fragment of the actual key in memory to

find a location in the log; it then reads the full key (and the value)from the log and verifies that the key it read was, in fact, the correctkey. This design trades a small and configurable chance of requiringtwo reads from flash (we set it to roughly 1 in 32,768 accesses) for

drastically reduced memory requirements (only six bytes of DRAMper key-value pair).

Figure 3 shows the pseudocode that implements this design forLookup. FAWN-DS extracts two fields from the 160-bit key: the ilow order bits of the key (the index bits) and the next 15 low orderbits (the key fragment). FAWN-DS uses the index bits to select abucket from the Hash Index, which contains 2i

hash buckets. Each

bucket is only six bytes: a 15-bit key fragment, a valid bit, and a4-byte pointer to the location in the Data Log where the full entry isstored.

Lookup proceeds, then, by locating a bucket using the index bitsand comparing the key against the key fragment. If the fragmentsdo not match, FAWN-DS uses hash chaining to continue searchingthe hash table. Once it finds a matching key fragment, FAWN-DSreads the record off of the flash. If the stored full key in the on-flashrecord matches the desired lookup key, the operation is complete.Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit

key fragment, only 1 in 32,768 retrievals from the flash will beincorrect and require fetching an additional record.

The constants involved (15 bits of key fragment, 4 bytes of logpointer) target the prototype FAWN nodes described in Section 4.

5)7/*">&&&&_"1$%&&&&&&&&&O

Data Log In-memoryHash Index

Log Entry

KeyFrag Valid Offset

160-bit Key

KeyFrag

Key Len Data

Inserted valuesare appended

Scan and Split

ConcurrentInserts

Datastore List Datastore ListData in new rangeData in original range Atomic Update

of Datastore List

(a) (b) (c)

Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transfer-

ring out-of-range entries to the new store. (c) After scan is complete, the datastore list is atomically updated to add the new store.

Compaction of the original store will clean up out-of-range entries.

3.2 Understanding Flash Storage

Flash provides a non-volatile memory store with several signifi-cant benefits over typical magnetic hard disks for random-access,read-intensive workloads—but it also introduces several challenges.

Three characteristics of flash underlie the design of the FAWN-KVsystem described throughout this section:

1. Fast random reads: (� 1 ms), up to 175 times faster thanrandom reads on magnetic disk [35, 40].

2. Efficient I/O: Flash devices consume less than one Watt evenunder heavy load, whereas mechanical disks can consume over10 W at load. Flash is over two orders of magnitude moreefficient than mechanical disks in terms of queries/Joule.

3. Slow random writes: Small writes on flash are very expen-sive. Updating a single page requires first erasing an entireerase block (128 KB–256 KB) of pages, and then writing themodified block in its entirety. As a result, updating a single byteof data is as expensive as writing an entire block of pages [37].

Modern devices improve random write performance using write

buffering and preemptive block erasure. These techniques improveperformance for short bursts of writes, but recent studies show thatsustained random writes still perform poorly on these devices [40].

These performance problems motivate log-structured techniquesfor flash filesystems and data structures [36, 37, 23]. These sameconsiderations inform the design of FAWN’s node storage manage-ment system, described next.

3.3 The FAWN Data Store

FAWN-DS is a log-structured key-value store. Each store containsvalues for the key range associated with one virtual ID. It acts toclients like a disk-based hash table that supports Store, Lookup,and Delete.1

FAWN-DS is designed specifically to perform well on flash stor-age and to operate within the constrained DRAM available on wimpynodes: all writes to the datastore are sequential, and reads require asingle random access. To provide this property, FAWN-DS maintainsan in-DRAM hash table (Hash Index) that maps keys to an offset in

the append-only Data Log on flash (Figure 2a). This log-structureddesign is similar to several append-only filesystems [42, 15], which

avoid random seeks on magnetic disks for writes.

1We differentiate datastore from database to emphasize that we do not provide atransactional or relational interface.

/* KEY = 0x93df7317294b99e3e049, 16 index bits */INDEX = KEY & 0xffff; /* = 0xe049; */KEYFRAG = (KEY >> 16) & 0x7fff; /* = 0x19e3; */for i = 0 to NUM HASHES do

bucket = hash[i](INDEX);if bucket.valid && bucket.keyfrag==KEYFRAG &&

readKey(bucket.offset)==KEY then

return bucket;end if

{Check next chain element...}end for

return NOT FOUND;

Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

Mapping a Key to a Value. FAWN-DS uses an in-memory(DRAM) Hash Index to map 160-bit keys to a value stored in theData Log. It stores only a fragment of the actual key in memory to

find a location in the log; it then reads the full key (and the value)from the log and verifies that the key it read was, in fact, the correctkey. This design trades a small and configurable chance of requiringtwo reads from flash (we set it to roughly 1 in 32,768 accesses) for

drastically reduced memory requirements (only six bytes of DRAMper key-value pair).

Figure 3 shows the pseudocode that implements this design forLookup. FAWN-DS extracts two fields from the 160-bit key: the ilow order bits of the key (the index bits) and the next 15 low orderbits (the key fragment). FAWN-DS uses the index bits to select abucket from the Hash Index, which contains 2i

hash buckets. Each

bucket is only six bytes: a 15-bit key fragment, a valid bit, and a4-byte pointer to the location in the Data Log where the full entry isstored.

Lookup proceeds, then, by locating a bucket using the index bitsand comparing the key against the key fragment. If the fragmentsdo not match, FAWN-DS uses hash chaining to continue searchingthe hash table. Once it finds a matching key fragment, FAWN-DSreads the record off of the flash. If the stored full key in the on-flashrecord matches the desired lookup key, the operation is complete.Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit

key fragment, only 1 in 32,768 retrievals from the flash will beincorrect and require fetching an additional record.

The constants involved (15 bits of key fragment, 4 bytes of logpointer) target the prototype FAWN nodes described in Section 4.

Data Log In-memoryHash Index

Log Entry

KeyFrag Valid Offset

160-bit Key

KeyFrag

Key Len Data

Inserted valuesare appended

Scan and Split

ConcurrentInserts

Datastore List Datastore ListData in new rangeData in original range Atomic Update

of Datastore List

(a) (b) (c)

Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transfer-

ring out-of-range entries to the new store. (c) After scan is complete, the datastore list is atomically updated to add the new store.

Compaction of the original store will clean up out-of-range entries.

3.2 Understanding Flash Storage

Flash provides a non-volatile memory store with several signifi-cant benefits over typical magnetic hard disks for random-access,read-intensive workloads—but it also introduces several challenges.

Three characteristics of flash underlie the design of the FAWN-KVsystem described throughout this section:

1. Fast random reads: (� 1 ms), up to 175 times faster thanrandom reads on magnetic disk [35, 40].

2. Efficient I/O: Flash devices consume less than one Watt evenunder heavy load, whereas mechanical disks can consume over10 W at load. Flash is over two orders of magnitude moreefficient than mechanical disks in terms of queries/Joule.

3. Slow random writes: Small writes on flash are very expen-sive. Updating a single page requires first erasing an entireerase block (128 KB–256 KB) of pages, and then writing themodified block in its entirety. As a result, updating a single byteof data is as expensive as writing an entire block of pages [37].

Modern devices improve random write performance using write

buffering and preemptive block erasure. These techniques improveperformance for short bursts of writes, but recent studies show thatsustained random writes still perform poorly on these devices [40].

These performance problems motivate log-structured techniquesfor flash filesystems and data structures [36, 37, 23]. These sameconsiderations inform the design of FAWN’s node storage manage-ment system, described next.

3.3 The FAWN Data Store

FAWN-DS is a log-structured key-value store. Each store containsvalues for the key range associated with one virtual ID. It acts toclients like a disk-based hash table that supports Store, Lookup,and Delete.1

FAWN-DS is designed specifically to perform well on flash stor-age and to operate within the constrained DRAM available on wimpynodes: all writes to the datastore are sequential, and reads require asingle random access. To provide this property, FAWN-DS maintainsan in-DRAM hash table (Hash Index) that maps keys to an offset in

the append-only Data Log on flash (Figure 2a). This log-structureddesign is similar to several append-only filesystems [42, 15], which

avoid random seeks on magnetic disks for writes.

1We differentiate datastore from database to emphasize that we do not provide atransactional or relational interface.

/* KEY = 0x93df7317294b99e3e049, 16 index bits */INDEX = KEY & 0xffff; /* = 0xe049; */KEYFRAG = (KEY >> 16) & 0x7fff; /* = 0x19e3; */for i = 0 to NUM HASHES do

bucket = hash[i](INDEX);if bucket.valid && bucket.keyfrag==KEYFRAG &&

readKey(bucket.offset)==KEY then

return bucket;end if

{Check next chain element...}end for

return NOT FOUND;

Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

Mapping a Key to a Value. FAWN-DS uses an in-memory(DRAM) Hash Index to map 160-bit keys to a value stored in theData Log. It stores only a fragment of the actual key in memory to

find a location in the log; it then reads the full key (and the value)from the log and verifies that the key it read was, in fact, the correctkey. This design trades a small and configurable chance of requiringtwo reads from flash (we set it to roughly 1 in 32,768 accesses) for

drastically reduced memory requirements (only six bytes of DRAMper key-value pair).

Figure 3 shows the pseudocode that implements this design forLookup. FAWN-DS extracts two fields from the 160-bit key: the ilow order bits of the key (the index bits) and the next 15 low orderbits (the key fragment). FAWN-DS uses the index bits to select abucket from the Hash Index, which contains 2i

hash buckets. Each

bucket is only six bytes: a 15-bit key fragment, a valid bit, and a4-byte pointer to the location in the Data Log where the full entry isstored.

Lookup proceeds, then, by locating a bucket using the index bitsand comparing the key against the key fragment. If the fragmentsdo not match, FAWN-DS uses hash chaining to continue searchingthe hash table. Once it finds a matching key fragment, FAWN-DSreads the record off of the flash. If the stored full key in the on-flashrecord matches the desired lookup key, the operation is complete.Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit

key fragment, only 1 in 32,768 retrievals from the flash will beincorrect and require fetching an additional record.

The constants involved (15 bits of key fragment, 4 bytes of logpointer) target the prototype FAWN nodes described in Section 4.

[Slide annotations on the figure: Hashtable and Data region; 12 bytes per entry; KeyFrag != Key means potential collisions, but a low probability of multiple Flash reads. Sidebar: FAWN-DS avoids random writes with limited resources; FAWN-KV needs efficient failover while avoiding random writes.]

Page 5: Hydroelectric Dam - Carnegie Mellon University

Log-structured datastore

! :.>I+@*D3@D*$(>&"#.$%+&+6"11&*"(%.6&;*$@)+

17

[Diagram: Get, Put, and Delete requests map onto a Random Read of the in-memory Hash Index and an Append to the Data Log (example entries A through H). Sidebar: FAWN-KV needs efficient failover and must avoid random writes; FAWN-DS has limited resources and must avoid random writes.]

On a node addition

18

M.%)&"%%$G.(+,&K"$1D*)+&*)lD$*)&@*"(+K)*&.K&0)7I*"(>)+

Nodes stream data range

19

[Figure 2 artwork repeated: the Data Log, Hash Index, and Scan-and-Split panels, here showing node B streaming its departing data range to node A. Sidebar repeats the FAWN-KV/FAWN-DS design constraints.]

• Background operations sequential

• Continue to meet SLA

Stream from B to A

Concurrent Inserts, minimizes locking

Compact Datastore

FAWN-KV take-aways

! :.>I+@*D3@D*)%&%"@"+@.*)&

! '#.$%+&*"(%.6&;*$@)+&"@&"11&1)#)1+

! 2$($6$k)+&1.30$(>&%D*$(>&K"$1.#)*

! ="*)KD1&*)+.D*3)&D+)&BD@&4$>4&F)*K.*6$(>

! U)F1$3"G.(&"(%&+@*.(>&3.(+$+@)(37&

! _"*$"(@&.K&34"$(&*)F1$3"G.(

20
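For readers unfamiliar with chain replication, the miniature sketch below shows the textbook scheme: writes enter at the head and propagate to the tail, and reads are served by the tail. FAWN-KV uses a variant of this; the code is illustrative, not its actual protocol.

# Textbook chain replication in miniature (FAWN-KV uses a variant of this).
class ChainNode:
    def __init__(self, name):
        self.name, self.store, self.next = name, {}, None

    def put(self, key, value):
        self.store[key] = value         # apply locally, then pass down the chain
        if self.next:
            self.next.put(key, value)   # the update is complete once the tail has it

    def get(self, key):
        return self.store.get(key)      # reads are served by the tail replica

head, mid, tail = ChainNode("head"), ChainNode("mid"), ChainNode("tail")
head.next, mid.next = mid, tail
head.put("k", "v")      # client writes enter at the head
print(tail.get("k"))    # and are read from the tail -> "v"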

Page 6: Hydroelectric Dam - Carnegie Mellon University

Overview

! T"30>*.D(%

! /'LM&F*$(3$F1)+

! /'LMI5_&!)+$>(

! I+$/($.5,

!2.*)&L.*01."%+

! =.(31D+$.(

21

Evaluation roadmap

! 5)7I#"1D)&1..0DF&)N3$)(37&3.6F"*$+.(

! A6F"3@&.K&B"30>*.D(%&.F)*"G.(+

! J(&F)*K.*6"(3)&"(%&1"@)(37

! <=J&"("17+$+&K.*&*"(%.6&*)"%&;.*01."%+

22

FAWN-DS lookups

! JD*&/'LMIB"+)%&+7+@)6&.#)*&Sf&6.*)&)N3$)(@&

@4"(&QWWbI)*"&@*"%$G.("1&+7+@)6+

23

System                 QPS    Watts   QPS/Watt

Alix3c2/Sandisk (CF)   1298   3.75    346
Desktop/Mobi (SSD)     4289   83      51.7
MacbookPro / HD        66     29      2.3
Desktop / HD           171    87      1.96
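The last column is simply QPS divided by Watts; a quick check of the rows above (table values as given, rounding is mine):

# Queries per Joule = queries per second / Watts, for the rows above.
rows = {
    "Alix3c2/Sandisk (CF)": (1298, 3.75),
    "Desktop/Mobi (SSD)":   (4289, 83),
    "MacbookPro / HD":      (66, 29),
    "Desktop / HD":         (171, 87),
}
for name, (qps, watts) in rows.items():
    print(f"{name:22s} {qps / watts:8.1f} queries/Joule")
# 1298 / 3.75 is about 346, versus 51.7 for the SSD desktop: over a 6x gap.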

Modern system speculation

24

VQW,WWW&AJ9^

QRWL

VQW,WWW&AJ9^

[WWL

#+O

<*"%$G.("1 ;7<=

Page 7: Hydroelectric Dam - Carnegie Mellon University

Impact of background ops

25

[Charts: queries per second (0-1600) during normal operation (Peak) and during Compact, Split, and Merge, measured at peak query load and at 30% of peak query load.]

T"30>*.D(%&.F)*"G.(+&4"#)h&

!C5)*0$1*&$6F"3@&"@&F)"0&1."%

! =*9/"9"P/*&$6F"3@&"@&HWX&1."%

Impact on query latency

[Chart: CDF of query latency (µs, log scale from 100 µs to 1 s) for gets at max load with no split, and during splits at low and high load.]

                                 Median    99.9%
Get query latency (max load)     891 µs    26.3 ms
During split (low load)          863 µs    491 ms
During split (high load)         873 µs    611 ms

:.;&6)%$"(&1"@)(37

Y$(31D%)+&()@;.*0Z

j$>4&ccOcX&1"@)(37

%D*$(>&6"$(@)("(3)

YBD@&+G11&.0"7Z

26

When to use FAWN?

27

<=J&n&="F$@"1&=.+@&a&9.;)*&=.+@&YrWO[Wg0L4Z

/'LM&Y[WL&)"34Z

Q&<T&%$+0

SVPT&^'<'&/1"+4&^^!

QPT&!U'2&F)*&(.%)

<*"%$G.("1&YQWWLZ

/$#)&Q&<T&%$+0+

[SWPT&9=AI)&/1"+4&^^!

SVPT&/T!A22&F)*&(.%)

qrQRWIRWW&F)*&(.%)qrQWWWIbWWW&F)*&(.%)

Architecture with lowest TCOK.*&*"(%.6&"33)++&;.*01."%+

/'LMIB"+)%&+7+@)6+&3"(&F*.#$%)&

1.;)*&3.+@&F)*&sPT,&pD)*7U"@)t

U"G.&.K&lD)*7&*"@)&@.&

%"@"+)@&+$k)&%)@)*6$()+&

+@.*">)&@)34(.1.>7

P*"F4&$>(.*)+&

6"(">)6)(@,&3..1$(>,&

()@;.*0$(>OOO

System            Cost   W    QPS   Queries/Joule  GB/Watt  TCO/GB  TCO/QPS

Traditionals:
5-2TB HD          $2K    250  1500  6              40       0.26    1.77
160GB PCIe SSD    $8K    220  200K  909            0.72     53      0.04
64GB DRAM         $3K    280  1M    3.5K           0.23     59      0.004

FAWNs:
2TB Disk          $350   20   250   12.5           100      0.20    1.61
32GB SSD          $500   15   35K   2.3K           2.1      16.9    0.015
2GB DRAM          $250   15   100K  6.6K           0.13     134     0.003

Table 4: Traditional and FAWN node statistics
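The TCO/GB column can be reproduced from the slide's definition (TCO = capital cost plus power cost at $0.10/kWh over three years); small rounding differences from the table are expected.

# 3-year TCO = capital + energy at $0.10/kWh (definition from the slide above).
HOURS_3Y = 24 * 365 * 3

def tco_usd(capital_usd, watts):
    return capital_usd + watts * HOURS_3Y / 1000 * 0.10   # Wh -> kWh -> dollars

print(tco_usd(2000, 250) / 10_000)   # Traditional, five 2TB disks: ~0.27 $/GB
print(tco_usd(350, 20) / 2_000)      # FAWN + one 2TB disk:         ~0.20 $/GB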

!"#$

!$

!$"

!$""

!$"""

!$""""

!"#$ !$ !$" !$"" !$"""

%&'&()'!*+,)!+-!./

01)23!4&')!56+77+8-(9():;

!"#$%&%'(#

)*+*,-./

0.12*+*,%34

0.12*+*0)#35

0.12*+*,-./

Figure 16: Solution space for lowest 3-year TCO as a function of dataset size and query rate.

second. The dividing lines represent a boundary across which one system becomes more favorable than another.

Large Datasets, Low Query Rates: FAWN+Disk has the lowest total cost per GB. While not shown on our graph, a traditional system wins for exabyte-sized workloads if it can be configured with sufficient disks per node (over 50), though packing 50 disks per machine poses reliability challenges.

Small Datasets, High Query Rates: FAWN+DRAM costs the fewest dollars per queries/second, keeping in mind that we do not examine workloads that fit entirely in L2 cache on a traditional node. This somewhat counterintuitive result is similar to that made by the intelligent RAM project, which coupled processors and DRAM to achieve similar benefits [5] by avoiding the memory wall. We assume the FAWN nodes can only accept 2 GB of DRAM per node, so for larger datasets, a traditional DRAM system provides a high query rate and requires fewer nodes to store the same amount of data (64 GB vs 2 GB per node).

Middle Range: FAWN+SSDs provide the best balance of storage capacity, query rate, and total cost. As SSD capacity improves, this combination is likely to continue expanding into the range served by FAWN+Disk; as SSD performance improves, so will it reach into DRAM territory. It is therefore conceivable that FAWN+SSD could become the dominant architecture for a wide range of random-access workloads.

Are traditional systems obsolete? We emphasize that this analysis applies only to small, random access workloads. Sequential-read workloads are similar, but the constants depend strongly on the per-byte processing required. Traditional cluster architectures retain a place for CPU-bound workloads, but we do note that architectures such as IBM's BlueGene successfully apply large numbers of low-power, efficient processors to many supercomputing applications [14], though they augment their wimpy processors with custom floating point units to do so.

Our definition of "total cost of ownership" also ignores several notable costs: In comparison to traditional architectures, FAWN should reduce power and cooling infrastructure, but may increase network-related hardware and power costs due to the need for more switches. Our current hardware prototype improves work done per volume, thus reducing costs associated with datacenter rack or floor space. Finally, of course, our analysis assumes that cluster software developers can engineer away the human costs of management, an optimistic assumption for all architectures. We similarly discard issues such as ease of programming, though we ourselves selected an x86-based wimpy platform precisely for ease of development.

6. RELATED WORK

FAWN follows in a long tradition of ensuring that systems are balanced in the presence of scaling challenges and of designing systems to cope with the performance challenges imposed by hardware architectures.

System Architectures: JouleSort [44] is a recent energy-efficiency benchmark; its authors developed a SATA disk-based "balanced" system coupled with a low-power (34 W) CPU that significantly out-performed prior systems in terms of records sorted per joule. A major difference with our work is that the sort workload can be handled with large, bulk I/O reads using radix or merge sort. FAWN targets even more seek-intensive workloads for which even the efficient CPUs used for JouleSort are excessive, and for which disk is inadvisable.

More recently, several projects have begun using low-power processors for datacenter workloads to reduce energy consumption [6, 34, 11, 50, 20, 30]. The Gordon [6] hardware architecture argues for pairing an array of flash chips and DRAM with low-power CPUs for low-power data intensive computing. A primary focus of their work is on developing a Flash Translation Layer suitable for pairing a single CPU with several raw flash chips. Simulations on general system traces indicate that this pairing can provide improved energy-efficiency. Our work leverages commodity embedded low-power CPUs and flash storage for cluster key-value applications, enabling good performance on flash regardless of FTL implementation. CEMS [20], AmdahlBlades [50], and Microblades [30] also leverage low-cost, low-power commodity components as a building block for datacenter systems, similarly arguing that this architecture can provide the highest work done per dollar and work done per joule. Microsoft has recently begun exploring the use of a large cluster of low-power systems called Marlowe [34]. This work focuses on taking advantage of the very low-power sleep states provided by this chipset (between 2–4 W) to turn off machines and migrate workloads during idle periods and low utilization, initially targeting the Hotmail service. We believe these advantages would also translate well to FAWN, where a lull in the use of a FAWN cluster would provide the opportunity to significantly reduce average energy consumption in addition to the already-reduced peak energy consumption that FAWN provides. Dell recently designed and has begun shipping VIA Nano-based servers consuming 20–30 W each for large webhosting services [11].

Considerable prior work has examined ways to tackle the "memory wall." The Intelligent RAM (IRAM) project combined CPUs and memory into a single unit, with a particular focus on energy efficiency [5]. An IRAM-based CPU could use a quarter of the power

28

Page 8: Hydroelectric Dam - Carnegie Mellon University

Overview

! T"30>*.D(%

! /'LM&F*$(3$F1)+

! /'LMI5_&!)+$>(

! `#"1D"G.(

!C50*&<506/5$)'

! =.(31D+$.(

29

Sorting small records

30

[Charts: sort speed (MBps) and sort efficiency (MB per Joule) for Hadoop Sort and NSort, comparing FAWN (5W), Traditional + HD (87W), and Atom+SSD FAWN (30W) configurations.]

[Charts: AES encryption speed in MB/s (values 51.15 and 3.65) and encryption efficiency in MB/J (values 0.365 and 0.73); legend: FAWN (5W), Traditional (87W).]

AES encryption/decryption of a 512MB file with a 256-bit key

31

/'LM&$+&Qf&6.*)&)N3$)(@&K.*&=9?IB.D(%&.F)*"G.(+m

9)*K.*6"(3) `N3$)(37

CPU-bound encryption Future directions

!L)&())%&B)C)*&6)@*$3+m

! `()*>7I)N3$)(37&#+O&s:"@)(37,&=.+@t

! 2"(">)"B$1$@7,&*)1$"B$1$@7,&F*.>*"66"B$1$@7&OOO

! U)"1&;.*01."%+&.(&/'LM

! 2"FU)%D3),&L)B&^)*#)*+,&J:<9u

! /'LMI.FG6$k)%&%"@"&F*.3)++$(>&E1@)*+

! ="(&;)&6"0)&^";k"11g:AMpg9$>&)N3$)(@&.(&/'LMu

32

Page 9: Hydroelectric Dam - Carnegie Mellon University

Conclusion

• FAWN architecture reduces energy consumption of cluster computing

• FAWN-KV addresses challenges of wimpy nodes for key-value storage

• Initial results generalize to other workloads

"Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code" - Kathy Yelick

33 34

What about the network?

'>>*)>"G.(

^)*#)*

=.*)

^)*#)*

'>>*)>"G.(

L$6F$)+

=.*)

/'LM&^;$@34

L$6F$)+ L$6F$)+

/'LM&^;$@34

L$6F$)+

"%%+&[&1.;&F.;)*&/'LM&+;$@34&F)*&vF.%w