Post on 24-Feb-2016
Nectar: Efficient Management of Computation and Data in Data Centers
Lenin Ravindranath
Pradeep Kumar Gunda, Chandu Thekkath, Yuan Yu, Li Zhuang
Motivation
Resources are poorly managed in a data center
Computation: Redundant computations
– Wasting resources
Storage: Manually managed
– Unused files occupying space
– Redundant output files
Goal
Efficiently manage resources in a cluster
Computation Storage
Nectar
Key Insight
Single query interface for computation and data access
[Diagram: User, Query Interface (DryadLINQ), Data Center (Computation, Storage)]
Computation
PROBLEM: Redundant computation
– Programs share sub-queries: X.Select(…) and X.Select(…).Where(…)
– Programs share partial data sets: X.Select(…) and (X+X’).Select(…), e.g. inputs 1 2 3 4 5 6 7 and 2 3 4 5 6 7 8
SOLUTION: Caching
– Cache results of popular sub-queries
– Automatically rewrite user queries to use the cache
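The shared sub-query case can be illustrated with a minimal Python sketch. It is not Nectar or DryadLINQ code: the in-memory `cache` dictionary, the `cached_select` helper, and the `square` and `is_even` functions are all hypothetical names chosen for illustration, and keying the cache on a function name is a simplification.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory cache: (operator, function name, input) -> result.
cache: Dict[Tuple[str, str, Tuple[int, ...]], List[int]] = {}
executions = 0  # counts how many times a Select is actually computed


def cached_select(data: List[int], fn: Callable[[int], int]) -> List[int]:
    """Run Select (a map) over data, reusing a cached result when one exists."""
    global executions
    key = ("Select", fn.__name__, tuple(data))
    if key not in cache:
        executions += 1
        cache[key] = [fn(x) for x in data]
    return cache[key]


def square(x: int) -> int:
    return x * x


def is_even(x: int) -> bool:
    return x % 2 == 0


X = [1, 2, 3, 4, 5, 6, 7]

# Program 1: X.Select(square)
result1 = cached_select(X, square)

# Program 2: X.Select(square).Where(is_even) reuses the cached Select result.
result2 = [y for y in cached_select(X, square) if is_even(y)]
```

In this sketch the Select runs once even though two programs use it; only the Where in the second program is new work.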
Does caching help?
• Analyzed logs from production clusters
• Logs of 3 months (Oct – Dec 2008)
• 33 virtual clusters, 36000 jobs
• Parsed SCOPE programs, extracted sub-queries
• Simulated caching
Caching helps
[Figure: bar chart; x-axis: Cluster; y-axis: Programs helped by caching, 0 to 1]
• About 50% cache hit on 10 clusters
• More than 30% cache hit on 20 clusters
• 35% on average
Storage
PROBLEM: Manually managed
– Unused files occupying space
[Figure: CDF of data by last access; x-axis: Last accessed (days before), 0 to 500; y-axis: CDF, 0 to 1]
Total Size: 190 TB
50% of the data was never accessed in the last 275 days
Storage
SOLUTION: Automatically manage data
– Track usage and delete infrequently used files
– Store the programs that recompute the data
Query Interface
[Diagram: User, Query Interface (DryadLINQ), Data Center (Computation, Storage)]
Nectar
[Diagram: User, Query Interface (DryadLINQ), Nectar, Data Center (Computation, Storage)]
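Nectar Architecture
[Diagram: a DryadLINQ program P enters the Nectar Client, where the Query Rewriter sends a Query to the Cache Server and receives Cache entries. The rewritten program P’ runs on DryadLINQ and Dryad in the Cluster. Labels on the result paths: "Add R to cache" and "Add T to cache".]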
Nectar Architecture
Query Rewriter
Nectar Client
Cache Server
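Query Rewriter
[Diagram: the Cache holds R, the result of Select over X. A query that runs Select over X and X’ is rewritten to run Select only over X’ and to Concat the cached R with the new result, giving (R+R’).]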
Query Rewriter
[Diagram: the same rewrite with an Order by after each Select. The cached R and the new result over X’ are combined with a Merge Sort, giving (R+R’).]
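The two rewrites in these diagrams can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `select`, `double`, and the sample inputs are hypothetical, and `heapq.merge` from the standard library stands in for the Merge Sort step.

```python
import heapq
from typing import Callable, List


def select(data: List[int], fn: Callable[[int], int]) -> List[int]:
    """Select: apply fn to every element."""
    return [fn(x) for x in data]


def double(x: int) -> int:
    return 2 * x


X = [5, 1, 4]      # input seen by an earlier program
X_new = [3, 2]     # newly arrived input (X' in the slides)

# Results cached by the earlier program.
R = select(X, double)
R_sorted = sorted(R)

# Rewrite 1: (X + X').Select(double) becomes Concat(R, X'.Select(double)).
R_new = select(X_new, double)
rewritten = R + R_new

# Rewrite 2: with an Order by, the two sorted results are merged
# instead of re-sorting everything.
rewritten_sorted = list(heapq.merge(R_sorted, sorted(R_new)))

# Both rewrites give the same answer as running the full query from scratch.
full = select(X + X_new, double)
```

Only `X_new` is processed by Select in the rewritten plans; the work over `X` comes from the cache.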
Query Rewriter
• Generates multiple plans
– Using multiple cache entries
• Selects the best plan
– Based on benefit
  • Execution time
  • Output size
  • Whether the pipeline is broken
• Operators supported
– Select, Where, Order by, Group by, Join
Example: X.Select(…) and X.Select(…).Where(…)
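Plan selection could look roughly like the sketch below. The slides name the three factors but not how they are combined, so the ordering used here (prefer plans that keep the pipeline intact, then more execution time saved, then less cached output to read) is purely an assumption, as are the `Plan` fields and the `choose_best` helper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Plan:
    name: str
    saved_execution_time: float  # seconds of work avoided by using the cache
    cached_output_size: int      # bytes that must be read from the cache
    breaks_pipeline: bool        # whether using the cache breaks the pipeline


def benefit_key(plan: Plan) -> Tuple[bool, float, int]:
    # Assumed ranking, not Nectar's actual rule: unbroken pipeline first,
    # then larger time savings, then smaller cached output.
    return (not plan.breaks_pipeline,
            plan.saved_execution_time,
            -plan.cached_output_size)


def choose_best(plans: List[Plan]) -> Plan:
    """Pick the candidate plan with the highest assumed benefit."""
    return max(plans, key=benefit_key)


candidates = [
    Plan("A", saved_execution_time=120.0, cached_output_size=10, breaks_pipeline=True),
    Plan("B", saved_execution_time=90.0, cached_output_size=500, breaks_pipeline=False),
    Plan("C", saved_execution_time=90.0, cached_output_size=200, breaks_pipeline=False),
]
best = choose_best(candidates)
```

A weighted score would be an equally plausible design; the tuple ordering is used here only because it keeps the example short.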
Nectar Architecture
Query Rewriter
Nectar Client
Cache Server
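Cache Server
[Diagram: the Cache Server runs on SQL Server and contains a Garbage Collector and a Cache Policy. Each cache entry records: URI, Query Fingerprint, Query + Data Fingerprint, Execution Time, Output Size. Labels on the client interface: Inquire, Stats, Usage Stats, Fingerprints.]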
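A cache entry is looked up by fingerprint. The sketch below shows one way to derive a query fingerprint and a query-plus-data fingerprint. The slides do not say which fingerprint function Nectar uses, so SHA-256 from Python's `hashlib` is a stand-in, and both function names are hypothetical.

```python
import hashlib
from typing import Iterable


def query_fingerprint(query_text: str) -> str:
    """Fingerprint of the query alone (assumed: hash of its text)."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


def query_data_fingerprint(query_text: str, input_chunks: Iterable[bytes]) -> str:
    """Fingerprint of the query combined with the data it reads."""
    h = hashlib.sha256()
    h.update(query_fingerprint(query_text).encode("ascii"))
    for chunk in input_chunks:
        # Hash each input chunk separately, then fold it into the result.
        h.update(hashlib.sha256(chunk).digest())
    return h.hexdigest()


q = "X.Select(f)"
fp_query = query_fingerprint(q)
fp_old = query_data_fingerprint(q, [b"part-0", b"part-1"])
fp_new = query_data_fingerprint(q, [b"part-0", b"part-2"])
```

The same query over different inputs keeps its query fingerprint but gets a different query-plus-data fingerprint, so a result computed over one input is never returned for another.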
Cache policy
• Insertion policy
– Always add program output to cache
– Sub-query outputs are added to cache
  • Popularity exceeds a threshold
  • Savings exceeds a threshold
Savings = Sum( (Execution Time / Output Size) × (1 / Elapsed Time) )
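The insertion policy for sub-query outputs can be sketched as follows. Several things here are assumptions rather than facts from the slides: the savings formula is read as the sum over uses of (execution time / output size) × (1 / elapsed time), popularity is taken to be the number of recorded uses, and both thresholds are assumed to be required together. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubQueryStats:
    execution_time: float        # seconds to compute the sub-query
    output_size: float           # size of its output
    elapsed_times: List[float]   # one elapsed-time value per recorded use (assumed)


def savings(stats: SubQueryStats) -> float:
    """Assumed reading of the slide's formula: sum over recorded uses."""
    return sum((stats.execution_time / stats.output_size) * (1.0 / t)
               for t in stats.elapsed_times)


def should_cache(stats: SubQueryStats,
                 popularity_threshold: int,
                 savings_threshold: float) -> bool:
    """Cache a sub-query output only if both thresholds are exceeded (assumed AND)."""
    popularity = len(stats.elapsed_times)
    return popularity > popularity_threshold and savings(stats) > savings_threshold


stats = SubQueryStats(execution_time=100.0, output_size=50.0, elapsed_times=[1.0, 2.0])
```

Under this reading, a sub-query that is expensive to compute, small to store, and recently used scores highest.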
Garbage Collector
• Storage pressure
– Delete infrequently used files
• Deletion policy
– Based on savings
– Cache type
• Mark and sweep algorithm
– Delete cache entry
– Reachability analysis
  • Delete files
[Diagram: Cache Server entries 1, 2, 3 and Distributed FS files 1, 2]
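The mark-and-sweep steps listed above can be sketched as a small Python function. The data model is an assumption made for illustration: each cache entry references a single file path, and the `collect` helper and its argument names are hypothetical.

```python
from typing import Dict, Set, Tuple


def collect(cache_entries: Dict[str, str],
            files: Set[str],
            to_evict: Set[str]) -> Tuple[Dict[str, str], Set[str], Set[str]]:
    """cache_entries maps a cache entry id to the file it references."""
    # Step 1: delete the cache entries chosen by the deletion policy.
    remaining = {entry: path for entry, path in cache_entries.items()
                 if entry not in to_evict}

    # Step 2 (mark): a file is reachable if some remaining entry references it.
    reachable = set(remaining.values())

    # Step 3 (sweep): files that no entry references can be deleted.
    kept = {path for path in files if path in reachable}
    deleted = files - kept
    return remaining, kept, deleted


entries = {"e1": "f1", "e2": "f2", "e3": "f2"}
all_files = {"f1", "f2", "f3"}
remaining, kept, deleted = collect(entries, all_files, to_evict={"e1"})
```

Deleting the entry first and the file second means a file is removed only once nothing in the cache can still hand it out.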
What if I try to access a garbage collected file?
Nectar Architecture
Query Rewriter
Nectar Client
Cache Server
Program store
Program Store
• Store executed programs in the cluster• Output file is tied to its corresponding
program that generates the output• If a file is deleted, the program is executed to
regenerate the output
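The regenerate-on-access behaviour can be sketched with two small classes. This is an in-memory illustration, not Nectar's interface: `ProgramStore`, `ManagedFS`, and their methods are hypothetical, and a "program" is modelled as a plain Python callable that returns the file contents.

```python
from typing import Callable, Dict


class ProgramStore:
    """Keeps, for each output path, the program that produced it."""

    def __init__(self) -> None:
        self._programs: Dict[str, Callable[[], bytes]] = {}

    def register(self, path: str, program: Callable[[], bytes]) -> None:
        self._programs[path] = program

    def program_for(self, path: str) -> Callable[[], bytes]:
        return self._programs[path]


class ManagedFS:
    """File store that re-runs the stored program when a file is missing."""

    def __init__(self, store: ProgramStore) -> None:
        self.files: Dict[str, bytes] = {}
        self.store = store

    def write(self, path: str, program: Callable[[], bytes]) -> None:
        self.store.register(path, program)
        self.files[path] = program()

    def delete(self, path: str) -> None:
        # Simulates garbage collection of the data file.
        self.files.pop(path, None)

    def read(self, path: str) -> bytes:
        if path not in self.files:
            # File was deleted: execute the program to regenerate it.
            self.files[path] = self.store.program_for(path)()
        return self.files[path]


runs = []


def make_report() -> bytes:
    runs.append(1)
    return b"report-contents"


fs = ManagedFS(ProgramStore())
fs.write("foo.pt", make_report)
fs.delete("foo.pt")
data = fs.read("foo.pt")
```

Here the read after deletion still succeeds, at the cost of one extra program run.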
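Managing Data
[Diagram: a program P that calls ToPartitionedTable (lenin\foo.pt) passes through the Nectar Client and runs as P’ on DryadLINQ and Dryad. In the Distributed FS, the user (usr) file foo.pt holds a fingerprint (FP), while the data is stored in the Nectar area as A31E4.pt. The Cache Server maps FP to the Program, which is kept in the Program Store.]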
Managing Data
[Diagram: a program P that calls FromPartitionedTable (lenin\foo.pt) passes through the Nectar Client; the FP in the user file foo.pt leads to the data file A31E4.pt in the Nectar area of the Distributed FS.]
Managing Data
[Diagram: the same read through FromPartitionedTable (lenin\foo.pt) when the data file is missing: the Program is fetched from the Program Store and executed again, producing KJ1LM.pt in place of A31E4.pt.]
Goal
Efficiently manage resources in a cluster
Computation Storage
Nectar
Computation Storage
Unified computation and data
Distributed cache servers
[Diagram: several Cache Servers, each backed by SQL Server and paired with a Program store, partitioned by query fingerprint. The Nectar Client picks a server with a hash based on the query fingerprint. A centralized garbage collector serves all of them.]
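Routing a lookup to a cache server by hashing the query fingerprint can be sketched as below. The modulo scheme, the `server_for` helper, and the server names are assumptions for illustration; the slides only state that the partitioning is hash based.

```python
import hashlib
from typing import List


def server_for(query_fingerprint: str, servers: List[str]) -> str:
    """Pick the cache server responsible for a query fingerprint."""
    digest = hashlib.sha256(query_fingerprint.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it
    # modulo the number of servers.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["cache-0", "cache-1", "cache-2"]  # hypothetical server names
chosen = server_for("fingerprint-of-some-query", servers)
```

Every client computes the same server for the same fingerprint, so no central directory is needed for lookups. Simple modulo hashing remaps most keys when the number of servers changes; consistent hashing would be one alternative if that matters.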
Summary
• We built Nectar
– Automatically manage data
– Efficiently manage computation
Components
• Query Rewriter
– Automatically rewrite queries to use the cache
• Cache server
– Popular sub-queries are cached
– Garbage collected based on usage
• Program store
– Store the programs that regenerate the output
Status
• Almost done with development
– Query Rewriter
  • Including other operators
– Fingerprinter
  • Program static analysis
– Cache Server
– Program Store
• In the process of deploying
Can we do better?
Cluster Utilization
[Figure: bar chart; x-axis: Clusters; y-axis: Idle Percent, 0 to 1]
• Most clusters have more than 40% idle time
• Even the busiest clusters have 10-20% idle time
Exploiting idle time
• Do speculative caching
– Cache popular data before the query is issued
– Run programs on new streams when they become available
• No side effects
– Executed only when the cluster is idle
– Low priority jobs
– Output garbage collected with high priority
– Higher electricity bill? Not really!
Questions
Backup
Caching Results
[Figure: bar chart of Cache Hit (0 to 0.9) per cluster]