Scaling Data Servers via Cooperative Caching
DESCRIPTION
In this thesis, we present design techniques -- and systems that illustrate and validate these techniques -- for building data-intensive applications over the Internet. We enable the use of a traditional bandwidth-limited server in these applications. A large number of cooperating users contribute resources such as disk space and network bandwidth, and form the backbone of such applications. The applications we consider fall into one of two categories. The first type provides user-perceived utility in proportion to the data download rates of the participants; bulk data distribution systems are a typical example. The second type is usable only when the participants have data download rates above a certain threshold; video streaming is a prime example.
TRANSCRIPT
Scaling Data Servers via Cooperative Caching
Siddhartha Annapureddy
New York University
Committee: David Mazieres, Lakshminarayanan Subramanian, Jinyang Li,
Dennis Shasha, Zvi Kedem
Motivation
New functionality over the Internet
- Large number of users consume many (large) files
- YouTube, Azureus, etc. for movies – DVDs no more
- Data dissemination to large distributed farms
Current approaches
Server
Client Client Client
- Scalability ↓ – Server's uplink saturates
- Operation takes a long time to finish
Current approaches (contd.)
- Server farms replace the central server
  - Scalability problem persists
  - Notoriously expensive to maintain
- Content Distribution Networks (CDNs)
  - Still quite expensive
- Peer-to-peer (P2P) systems
  - A new development
Peer-to-peer (P2P) systems
- Low-end central server
- Large client population
- File usually divided into chunks/blocks
- Use client resources to disseminate the file
  - Bandwidth, disk space, etc.
- Clients organized into a connected overlay
  - Each client knows only a few other clients

How to design P2P systems to scale data servers?
P2P scalability – Design principles
To achieve scalability, we must avoid hotspots:
- Keep bandwidth load on the server to a minimum
  - Server gives data only to a few users
  - Leverage participants to distribute to other users
- Keep maintenance overhead on users low
- Keep bandwidth load on users to a minimum
  - Each node interacts with only a small subset of users
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
  - File systems, user-level apps, etc.
- Security issues in such wide-area P2P systems
  - Providing data integrity, privacy, authentication
- Cross file(system) sharing – Use other groups
  - Make use of all available sources
- Topology management – Arrange nodes efficiently
  - Underlying network (locality properties)
  - Node capacities (bandwidth, disk space, etc.)
  - Application-specific (certain nodes work well together)
- Scheduling dissemination of chunks efficiently
  - Nodes only have partial information
Shark – Motivation
Scenario – Want to test a new application on a hundred nodes
[Diagram: many workstations spread across the network]
Problem – Need to push software to all nodes and collect results
Current approaches
Distribute from a central server
[Diagram: central server serving clients in Network A and Network B]
rsync, unison, scp
– Server’s network uplink saturates
– Wastes bandwidth along bottleneck links
Current approaches
File distribution mechanisms
BitTorrent, Bullet
+ Scales by offloading burden from the server
– Client downloads from half-way across the world
Inherent problems with copying files
Users have to decide a priori what to ship
- Ship too much – Wastes bandwidth, takes a long time
- Ship too little – Hassle to work in a poor environment
Idle environments consume disk space
- Users are loath to clean up ⇒ Redistribution
- Need a low-cost solution for refetching files
Manual management of software versions
- Point /usr/local to a central server
Illusion of having a development environment
- Programs fetched transparently on demand
Networked file systems
+ Know how to deploy these systems
+ Know how to administer such systems
+ Simple accountability mechanisms
E.g., NFS, AFS, SFS, etc.
Problem: Scalability
[Graph: time to finish a 40 MB read (sec) vs. number of unique nodes, for SFS and Shark]
- SFS: very slow, ≈ 775 s
- Shark: much better, ≈ 88 s (9x better)
Let’s design this new filesystem
[Diagram: client and central server]
Vanilla SFS
- Central server model
Scalability – Client-side caching
[Diagram: client with a large cache, talking to the server]
Large client cache a la AFS
- Whole-file caching
- Scales by avoiding redundant data transfers
- Leases to ensure NFS-style cache consistency
Scalability – Cooperative caching
Clients fetch data from each other and offload burden from the server
[Diagram: clients exchanging chunks among themselves and with the server]
- With multiple clients, must address bandwidth concerns
- Shark clients maintain a distributed index
- Fetch a file from multiple other clients in parallel
- Read-heavy workloads – Writes still go to the server
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Cross file(system) sharing – Application
Linux distribution with LiveCDs
[Diagram: Sharkcd clients drawing from Knoppix servers, a Knoppix mirror, and Slax servers]
- LiveCD – Run an entire OS without using the hard disk
- But all your programs must fit on a CD-ROM
- Download dynamically from a server, but scalability problems
- Knoppix and Slax users form a global cache – Relieves the servers
Cross file(system) sharing
Global cooperative cache regardless of origin servers
[Diagram: one group of clients accessing the Slax server, another accessing the Knoppix server]
- Two groups of clients accessing Slax/Knoppix servers
- Client groups share a large amount of software
- Such clients automatically form a global cache
- BitTorrent, in contrast, supports only per-file groups
Content-based chunking
[Diagram: clients and server exchanging content-based chunks]
- Single overlay for all filesystems, and content-based chunks
  - Chunks – Variable-sized blocks based on content
- Chunks are better – Exploit commonalities across files
  - LBFS-style chunks preserved across versions, concatenations (sketched below)
[Muthitacharoen et al. A Low-Bandwidth Network File System. SOSP '01]
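To make the chunking idea concrete, here is a minimal sketch of LBFS-style content-based chunking (illustrative only: LBFS uses Rabin fingerprints over a sliding window, and the toy rolling hash and parameters here are assumptions, not Shark's actual code). A boundary is declared wherever the low bits of the hash are zero, so boundaries depend only on local content and survive insertions elsewhere in the file:

    # Minimal sketch: content-based chunking with a toy rolling hash.
    MASK = (1 << 13) - 1                # ~8 KB expected chunk size
    MIN_SIZE, MAX_SIZE = 2 << 10, 64 << 10

    def chunk(data: bytes):
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) ^ byte) & 0xFFFFFFFF      # toy rolling hash
            size = i + 1 - start
            if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
                chunks.append(data[start:i + 1])    # boundary chosen by content
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])             # trailing chunk
        return chunks

Because boundaries are content-defined, two files sharing a region (say, two LiveCD releases) produce identical chunks there, which then hash to the same token and can be fetched from either group's clients.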
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Security issues – Client authentication
[Diagram: server and clients; traditional uid-based authentication vs. an authenticator derived from a token]
- Traditionally, the server authenticated read requests using uids
- Challenge – Interaction between mutually distrustful clients
Security issues – Client communication
- Client should be able to check the integrity of a downloaded chunk
- Client should not send chunks to unauthorized clients
- An eavesdropper shouldn't be able to obtain chunk contents
- Accomplish this without a complex PKI
- Keep the number of rounds to a minimum
File metadata
[Diagram: client issues GET_TOKEN(fh, offset, count); server returns the file metadata]
Chunk tokens
- Possession of a token implies server permission to read
- Tokens are a shared secret between authorized clients
- Tokens can be used to check the integrity of fetched data
Given a chunk B... (see the sketch below)
- Chunk token T_B = H(B)
- H is a collision-resistant hash function
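A minimal sketch of the token computation (the specific hash function is an assumption; the design only requires H to be collision resistant):

    import hashlib

    def chunk_token(chunk: bytes) -> bytes:
        # T_B = H(B): proves knowledge of the chunk's contents, and
        # doubles as an integrity check on fetched data
        return hashlib.sha1(chunk).digest()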
Security protocol
[Diagram: receiver C sends R_C; source P replies R_P; C sends I_B, Auth_C; P replies E_K(B)]
- R_C, R_P – Random nonces to ensure freshness
- Auth_C – Authenticator to prove the receiver has the token
  - Auth_C = MAC(T_B, "Auth C", C, P, R_C, R_P)
- K – Key to encrypt chunk contents
  - K = MAC(T_B, "Encryption", C, P, R_C, R_P)
A sketch of these derivations follows.
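Here is a minimal sketch of the derivations above (a sketch under assumptions: HMAC-SHA-1 as the MAC and the simple field encoding are illustrative choices, not the thesis's wire format):

    import hmac, hashlib, os

    def mac(token: bytes, *fields: bytes) -> bytes:
        # MAC keyed by the chunk token T_B over the named fields
        return hmac.new(token, b"|".join(fields), hashlib.sha1).digest()

    # Receiver C and source P first exchange fresh nonces
    R_C, R_P = os.urandom(20), os.urandom(20)
    C, P = b"receiver-id", b"source-id"             # illustrative identities
    T_B = chunk_token(b"...chunk B...")             # from the sketch above

    auth_C = mac(T_B, b"Auth C", C, P, R_C, R_P)    # proves C holds T_B
    K = mac(T_B, b"Encryption", C, P, R_C, R_P)     # keys E_K(B)

Because both values are bound to the nonces, replaying an old Auth_C or ciphertext fails the freshness check.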
Security properties
[Diagram: the protocol messages R_C, R_P, then I_B with Auth_C, then E_K(B)]
Client can check the integrity of a downloaded chunk
- Client checks H(downloaded chunk) ?= T_B
Source does not send chunks to unauthorized clients
- Malicious clients cannot produce a correct Auth_C
Eavesdropper shouldn't get chunk contents
- All communication encrypted with K
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Topology management
[Diagram: client fetches metadata from the server, then looks up clients caching the needed chunks in the overlay]
- Fetch metadata from the server
- Look up clients caching needed chunks in the overlay
- Overlay is common to all file systems
- Overlay organized like a DHT – Get and Put
Overlay – Get operation
[Diagram: client discovers other clients caching chunk B]
- For every chunk B, there is an indexing key I_B
  - I_B is used to index clients caching B
  - Cannot set I_B = T_B, as T_B is a secret
  - I_B = MAC(T_B, "Indexing Key")
Overlay – Put operation
Register as a source
- Client now becomes a source for the downloaded chunk
- Client registers in the distributed index – PUT(I_B, Addr)
Chunk reconciliation
- Reuse the TCP connection to download more chunks
- Exchange mutually needed chunks without indexing overhead
(A sketch of the Get/Put indexing operations follows.)
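A minimal sketch of the indexing operations over a generic put/get overlay interface (the dht object and its API are assumptions standing in for Shark's Coral-based overlay):

    import hmac, hashlib

    def index_key(token: bytes) -> bytes:
        # I_B = MAC(T_B, "Indexing Key"): safe to publish, unlike T_B itself
        return hmac.new(token, b"Indexing Key", hashlib.sha1).digest()

    def register_source(dht, token: bytes, addr: str) -> None:
        dht.put(index_key(token), addr)      # PUT(I_B, Addr): become a source for B

    def find_sources(dht, token: bytes) -> list:
        return dht.get(index_key(token))     # GET(I_B): clients caching B

Deriving I_B from T_B with a MAC lets anyone holding the token locate sources, while revealing nothing useful to clients that only see the index.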
Overlay – Locality awareness
[Diagram: clients clustered into Network A and Network B]
- Overlay organized as clusters based on latency
- Preferentially returns sources in the same cluster as the client
- Hence, chunks are usually transferred from nearby clients
[Freedman et al. Democratizing content publication with Coral. NSDI '04]
Topology management – Shark vs. BitTorrent
BitTorrent
- Neighbours "fixed"
- Makes cross-file sharing difficult
Shark
- Chunk-based overlay
- Neighbours based on data
- Enables cross-file sharing
[Diagram: server and clients connected by a data-driven overlay]
Evaluation
- How does Shark compare with SFS? With NFS?
- How scalable is the server?
- How fair is Shark across clients?
- Which fetch order is better – random or sequential?
Emulab – 100 Nodes on LAN
[Graph: percentage of nodes completing a 40 MB read vs. time since initialization (sec), for Shark (random, negotiation), NFS, and SFS]
- Shark – 88 s
- SFS – 775 s (Shark ≈ 9x better); NFS – 350 s (≈ 4x better)
- SFS less fair because of TCP backoffs
PlanetLab – 185 Nodes
[Graph: percentage of nodes completing a 40 MB read vs. time since initialization (sec), Shark vs. SFS]
- Shark ≈ 7 min at the 95th percentile
- SFS ≈ 39 min at the 95th percentile (Shark ≈ 5x better)
- NFS – Triggered kernel panics in the server
Data pushed by server
[Bar chart: data pushed by the server, normalized – SFS 7400 vs. Shark 954]
- Shark vs. SFS – 23 copies vs. 185 copies (8x better)
- Overlay not responsive enough
- Torn connections, timeouts force going to the server
Data served by Clients
[Graph: bandwidth transmitted (MB) by each unique Shark proxy, 185 PlanetLab hosts, 40 MB file]
- Maximum contribution ≈ 3.5 copies
- Median contribution ≈ 0.75 copies
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Fetching Chunks – Order Matters
- Scheduling problem – In what order should the chunks of a file be fetched?
- Natural choices – Random or sequential
Intuitively, when many clients start simultaneously:
- Random
  - All clients fetch independent chunks
  - More chunks become available in the cooperative cache
- Sequential
  - Better disk I/O scheduling on the server
  - Only the client that has downloaded the most chunks adds to the cache
Emulab – 100 Nodes on LAN
[Graph: percentage of nodes completing a 40 MB read vs. time since initialization (sec), Shark random vs. sequential]
- Random – 133 s
- Sequential – 203 s
- Random wins – 35% better
Performance characteristics of workloads
- For Shark, what matters is throughput
  - Operation should finish as soon as possible
  - [Plot: utility grows steadily with throughput]
- Another interesting class of applications
  - Video streaming – RedCarpet
  - [Plot: utility jumps only once throughput crosses a threshold]
- Study the scheduling problem for Video-on-Demand (VoD)
  - Press a button, wait a little while, and watch playback
Challenges for VoD
- Near VoD
  - Small setup time before the video starts playing
- High sustainable goodput
  - Largest slope of a line osculating the block arrival curve (see the sketch below)
  - Highest video encoding rate the system can support
Goal: Do block scheduling to achieve low setup time and high goodput
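A minimal sketch of this goodput definition (indices and units are assumptions consistent with the round-based simulator described later): playback at rate r starting after setup time s consumes block i at time s + i/r, so every block's arrival time bounds the sustainable rate:

    def sustainable_goodput(arrivals, setup):
        # arrivals[i] = round at which block i (in playback order) arrived;
        # we need arrivals[i] <= setup + i/r for all i, hence:
        rates = [i / (t - setup) for i, t in enumerate(arrivals) if t > setup]
        return min(rates) if rates else float("inf")

If block 0 arrives after the setup time, the function returns 0 (no rate is sustainable); a longer setup time can only raise the achievable rate, which is exactly the setup-time/goodput trade-off plotted on the following slides.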
System design – Outline
- Components based on BitTorrent
  - Central server (tracker + seed)
  - Users interested in the data
- Central server
  - Maintains a list of nodes
- Each node finds neighbours
  - Contacts the server for this
  - Another option – Use gossip protocols
- Exchange data with neighbours
  - Connections are bi-directional
Simulator – Outline
- All nodes have unit bandwidth capacity
- 500 nodes in most of our experiments
- Each node has 4-6 neighbours
- Simulator is round-based
  - Block exchanges done at each round
  - System moves as a whole to the next round
- File divided into 250 blocks
  - A segment is a set of consecutive blocks
  - Segment size = 10
- Maximum possible goodput – 1 block/round
(A skeleton of such a simulator is sketched below.)
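A minimal sketch of a round-based simulator in this spirit (the names, random topology, and policy(mine, theirs) interface are illustrative assumptions, not the thesis's C# prototype):

    import random

    N_NODES, N_BLOCKS, DEGREE = 500, 250, 5

    def simulate(policy, rounds=500):
        # node N_NODES acts as the server/seed holding the full file
        have = [set() for _ in range(N_NODES)] + [set(range(N_BLOCKS))]
        nbrs = [random.sample(range(N_NODES + 1), DEGREE) for _ in range(N_NODES)]
        arrivals = [dict() for _ in range(N_NODES)]
        for r in range(rounds):
            for n in range(N_NODES):            # unit capacity: one block per round
                src = random.choice(nbrs[n])
                b = policy(have[n], have[src])  # policy picks the next block
                if b is not None:
                    have[n].add(b)
                    arrivals[n][b] = r
        return arrivals                         # feed into sustainable_goodput

The per-node arrival times returned here are what the goodput sketch above consumes.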
Why 95th percentile?
[Graph: number of nodes (out of 500) achieving at least a given goodput (blocks/round); a line marks the 95th percentile]
- A point (x, y) means y nodes have a goodput of at least x
- The 95th percentile is a good indicator of goodput
  - Most nodes have at least that goodput
Feasibility of VoD – What does block/round mean?
- 0.6 blocks/round with 35 rounds of setup time
- Two independent parameters
  - 1 round ≈ 10 seconds
  - 1 block/round ≈ 512 kbps
- Consider 35 rounds a good setup time
  - 35 rounds ≈ 6 min
- Goodput = 0.6 blocks/round
  - Max encoding rate = 0.6 × 512 > 300 kbps
- Size of video = 250/0.6 rounds ≈ 70 min
(The arithmetic is checked in the snippet below.)
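A quick check of the arithmetic above, using the slide's unit assumptions:

    ROUND_SEC, BLOCK_KBPS, N_BLOCKS = 10, 512, 250
    goodput = 0.6                                    # blocks/round
    setup_min = 35 * ROUND_SEC / 60                  # ≈ 5.8 min of setup
    max_rate_kbps = goodput * BLOCK_KBPS             # ≈ 307 kbps encoding rate
    video_min = N_BLOCKS / goodput * ROUND_SEC / 60  # ≈ 69 min of video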
Naïve scheduling policies
- Segment-Random policy (sketched below)
  - Divide the file into segments
  - Fetch blocks in random order within the same segment
  - Download segments in order
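Minimal sketches of the three policies, written against the policy(mine, theirs) interface of the simulator sketch above (illustrative implementations, not the prototype's code):

    import random

    def random_policy(mine, theirs):
        want = list(theirs - mine)
        return random.choice(want) if want else None

    def sequential_policy(mine, theirs):
        want = sorted(theirs - mine)                 # earliest block in playback order
        return want[0] if want else None

    def segment_random_policy(mine, theirs, n_blocks=250, seg=10):
        missing = set(range(n_blocks)) - mine
        if not missing:
            return None
        s = min(missing) // seg                      # segment of the earliest needed block
        want = [b for b in theirs - mine if b // seg == s]
        return random.choice(want) if want else None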
Naïve approaches – Random
[Graph: goodput (blocks/round) vs. setup time (rounds); random stays at 0 blocks/round]
- Each node fetches a block at random
- Throughput – High, as nodes fetch disjoint blocks
  - More opportunities for block exchanges
- Goodput – Low, as nodes do not get blocks in order
  - 0 blocks/round
Naïve approaches – Sequential
[Graph: goodput vs. setup time, sequential vs. random; sequential reaches 0.2 blocks/round]
- Each node fetches blocks in order of playback
- Throughput – Low, as there are fewer opportunities for exchange
  - Need to increase this
Naïve approaches – Segment-Random policy
[Graph: goodput vs. setup time for segment-random, sequential, and random; segment-random reaches 0.4 blocks/round]
- Segment-Random
  - Fetch a random block from the segment the earliest needed block falls in
- Increases the number of blocks propagating in the system
Network coding – How it helps
[Diagram: A holds blocks 1 and 2; B and C download from A in parallel]
- A has blocks 1 and 2
- B gets 1 or 2 with equal probability from A
- C gets 1 in parallel
- If B downloaded 1, the link B-C becomes useless
- With network coding, A routinely sends 1 ⊕ 2, which is useful to C either way
Network coding – Mechanics
- Coding over blocks B_1, B_2, ..., B_n
- Choose coefficients c_1, c_2, ..., c_n from a finite field
- Generate an encoded block: E_1 = Σ_{i=1}^{n} c_i × B_i (see the sketch below)
- Without n coded blocks, decoding cannot be done
  - Setup time is at least the time to fetch a segment
[Gkantsidis et al. Network Coding for Large Scale Content Distribution. InfoCom '05]
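A minimal sketch of generating one encoded block by random linear coding over GF(2^8) (the field choice and reduction polynomial are illustrative assumptions; addition in this field is XOR):

    import random

    def gf_mul(a: int, b: int) -> int:
        # multiply two bytes in GF(2^8), reduction polynomial x^8+x^4+x^3+x+1
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= 0x11B
        return p

    def encode(blocks):
        # E = sum_i c_i * B_i, applied byte-wise; ship (c_1..c_n, E) together
        coeffs = [random.randrange(256) for _ in blocks]
        out = bytearray(len(blocks[0]))
        for c, block in zip(coeffs, blocks):
            for j, byte in enumerate(block):
                out[j] ^= gf_mul(c, byte)
        return coeffs, bytes(out)

A node that collects n encoded blocks with linearly independent coefficient vectors recovers B_1..B_n by Gaussian elimination over the same field, which is why decoding needs a full segment's worth of coded blocks.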
Benefits of network coding
[Graph: goodput vs. setup time for segment-netcoding, segment-random, sequential, and random]
- Goodput
  - ≈ 0.6 blocks/round with netcoding
  - 0.4 blocks/round with seg-rand
Segment scheduling
- Netcoding is good for fetching blocks within a segment, but cannot be used across segments
  - Because decoding requires all of a segment's encoded blocks
- Problem: How to schedule the fetching of segments?
  - Until now, sequential segment scheduling
  - Results in rare segments when nodes arrive at different times
- Segment scheduling algorithm to avoid rare segments
  - Evaluation with a C# prototype
Segment scheduling
- A has 75% of the file
- A flash crowd of 20 Bs joins with empty caches
- The server is shared between A and the Bs
- Throughput to A decreases, as the server is busy serving the Bs
- Initial segments are served repeatedly
- Throughput of the Bs is low because the same segments are fetched from the server
Segment scheduling – Problem
- 8 segments in the video; A has 75% = 6 segments, the Bs have none
- Green denotes popularity 2, red popularity 1
  - Popularity = number of full copies in the system
- Instead of serving A, the server gives segments to the Bs
Segment scheduling – Problem
[Graph: number of data blocks downloaded vs. time (sec) under the naive policy, for A and for B1..n]
- A's goodput plummets
- Get the best of both worlds – Improve both A and the Bs
- Idea: Each node serves only the rarest segments in the system
Segment scheduling – Algorithm
- When a dst node connects to the src node... (the rule is sketched below)
  - Here: dst is a B, src is the server
- src sorts segments in order of popularity
  - Segments 7, 8 are least popular, at 1
  - Segments 1-6 are equally popular, at 2
- src considers segments in sorted order one by one, and serves dst a segment that is either:
  - Completely available at src, or
  - The first segment required by dst
- The server thus injects blocks from the segment needed by A into the system
  - Avoids wasting bandwidth serving the initial segments multiple times
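A minimal sketch of this selection rule (illustrative; the prototype's details may differ). Here popularity[s] is the number of full copies of segment s in the system:

    def pick_segment(src_has, dst_has, popularity, n_segments):
        missing = [s for s in range(n_segments) if s not in dst_has]
        if not missing:
            return None
        first_needed = min(missing)        # dst's next segment in playback order
        # consider the rarest segments first
        for s in sorted(missing, key=lambda s: popularity[s]):
            if s in src_has or s == first_needed:
                return s                   # serve (coded blocks of) segment s
        return None

In the flash-crowd example, the server picks the rare segments 7 and 8 for the Bs, injecting them into the system, while the widely replicated initial segments can come from A instead of the server.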
Segment scheduling – Popularities
- How does src figure out popularities?
- Centrally available at the server
  - Our implementation uses this technique
- Each node maintains popularities for segments
  - Could use a gossiping protocol for aggregation
Segment scheduling
[Graph: number of data blocks downloaded vs. time (sec), naive policy vs. the new algorithm, for A and for B1..n]
- Note that the goodput of both A and the Bs improves
Conclusions
- Problem: Designing scalable data-intensive applications
  - Enabled low-end servers to scale to large numbers of users
  - Shark: Scales a file server for read-heavy workloads
  - RedCarpet: Scales a video server for near-Video-on-Demand
- Exploited P2P for file systems, solved security issues
- Techniques to improve scalability
  - Cross file(system) sharing
  - Topology management – Data-based, locality awareness
  - Scheduling of chunks – Network coding