In-situ MapReduce for Log Processing
DESCRIPTION
Paper presentation in DB reading
TRANSCRIPT
In-situ MapReduce for Log Processing
Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
2
Log analytics
• Data centers with 1000s of servers
• Data-intensive computing: store and analyze TBs of logs
Examples:
• Click logs – ad targeting, personalization
• Social media feeds – brand monitoring
• Purchase logs – fraud detection
• System logs – anomaly detection, debugging
3
Log analytics today
• “Store first, query later”
Problems:
• Scale – stress network and disks
• Failures – delay analysis or process incomplete data
• Timeliness – hinder real-time apps
[Figure: servers ship logs to a dedicated MapReduce cluster: store first, then query later]
4
In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers
• MapReduce for continuous data
• Ability to trade fidelity for latency
Optimized for:
• Highly selective workloads – e.g., up to 80% of data filtered or summarized!
• Online analytics – e.g., ad re-targeting based on most recent clicks
[Figure: analysis runs on the log servers themselves instead of on a dedicated MapReduce cluster]
5
An iMR query
The same:
• MapReduce API
  – map(r) → {k,v} : extract/filter data
  – reduce({k,v[]}) → v’ : data aggregation
  – combine({k,v[]}) → v’ : early, partial aggregation
The new:
• Provides continuous results – because logs are continuous
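The API on this slide can be sketched as a small word-count query. The three function roles (map, combine, reduce) follow the slide's signatures; the plain-text record format and the single-process driver loop are assumptions for illustration only.

```python
def map_fn(record):
    """map(r) -> {k,v}: extract one (word, 1) pair per word in a log line."""
    return [(word, 1) for word in record.split()]

def combine_fn(key, values):
    """combine({k,v[]}) -> v': early, partial aggregation at a server."""
    return sum(values)

def reduce_fn(key, values):
    """reduce({k,v[]}) -> v': final aggregation at the root."""
    return sum(values)

def run_query(records):
    # Group mapped pairs by key, combine locally, then reduce the partials.
    groups = {}
    for r in records:
        for k, v in map_fn(r):
            groups.setdefault(k, []).append(v)
    partials = {k: combine_fn(k, vs) for k, vs in groups.items()}
    return {k: reduce_fn(k, [v]) for k, v in partials.items()}

print(run_query(["error disk full", "error net down"]))
# 'error' appears twice, every other word once
```

Because combine produces the same value type that reduce consumes, partial aggregates can be merged anywhere along the path from the servers to the root, which is what makes the continuous, in-network processing on the following slides possible.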
6
Continuous MapReduce
• Input – an infinite stream of logs
• Bound input with sliding windows
  – Range of data (R)
  – Update frequency (S)
• Output – stream of results, one for each window
[Figure: timeline 0''–90''; log entries flow through Map, Combine, and Reduce]
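The sliding-window bound described above can be sketched as follows. The parameter names R (range) and S (update frequency) come from the slide; representing the stream as a finite list of timestamped entries with an explicit end time is an assumption made so the example terminates.

```python
def windows(entries, R, S, t_end):
    """Slice timestamped (t, value) entries into sliding windows of
    range R seconds that advance every S seconds."""
    results = []
    start = 0
    while start + R <= t_end:
        # One result per window: all entries with start <= t < start + R.
        window = [v for t, v in entries if start <= t < start + R]
        results.append((start, start + R, window))
        start += S
    return results

entries = [(5, "a"), (35, "b"), (65, "c")]
# 60-second windows (R=60) that slide every 30 seconds (S=30)
for lo, hi, data in windows(entries, R=60, S=30, t_end=90):
    print(lo, hi, data)
```

With S smaller than R, consecutive windows overlap: entry "b" above falls into both the [0, 60) and the [30, 90) window, which is exactly the redundancy the pane-based processing on a later slide eliminates.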
7
Processing windows in-network
[Figure: overlapping windows of data along the timeline; the user’s reduce function runs at the root of an aggregation tree for efficiency]
8
Efficient processing with panes
• Divide window into panes (sub-windows)
  – Each pane is processed and sent only once
  – Root combines panes to produce window
• Eliminate redundant work – save CPU & network resources, faster analysis
[Figure: panes P1–P5 along the timeline; overlapping windows reuse already-processed panes]
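The pane mechanism above can be sketched in two steps: aggregate each pane exactly once, then let the root combine the panes covering a window. Using a Counter as the partial-aggregate type and integer pane ids are assumptions for illustration.

```python
from collections import Counter

def pane_partials(entries, pane_len):
    """Aggregate each pane (sub-window) exactly once.
    entries are (timestamp, word) pairs; pane id = t // pane_len."""
    panes = {}
    for t, word in entries:
        pane_id = t // pane_len
        panes.setdefault(pane_id, Counter())[word] += 1
    return panes

def window_from_panes(panes, first_pane, panes_per_window):
    """Root-side combine: merge the consecutive panes covering one window."""
    total = Counter()
    for p in range(first_pane, first_pane + panes_per_window):
        total += panes.get(p, Counter())
    return total

entries = [(2, "err"), (20, "err"), (40, "ok")]
panes = pane_partials(entries, pane_len=15)
# A 30-second window [0, 30) covers panes 0 and 1; the next window
# [15, 45) reuses pane 1 instead of reprocessing its raw entries.
print(window_from_panes(panes, first_pane=0, panes_per_window=2))
```

The saving is that when the window slides, only the newly arrived pane is processed and shipped; the overlapping panes are merged again from their cached partial aggregates.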
9
Impact of data loss on analysis
• Servers may get overloaded or fail
Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency
[Figure: a lost pane leaves a gap, marked X?, in the window of panes P1–P5]
10
Quantifying data fidelity
• Data are naturally distributed
  – Space (server nodes)
  – Time (processing window)
• C2 metric – annotates result windows with a “scoreboard”
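One way to picture the C2 scoreboard is a grid of (node, pane) cells marking which parts of the window actually arrived. The dict-of-sets representation and the fidelity-as-fraction computation below are simplified stand-ins for the paper's annotation, used only to make the idea concrete.

```python
def fidelity(scoreboard):
    """Fraction of (node, pane) cells present in a result window.
    scoreboard maps node name -> set of pane ids received from it;
    panes are assumed to be numbered 0..max observed."""
    nodes = list(scoreboard)
    panes = {p for received in scoreboard.values() for p in received}
    total = len(nodes) * (max(panes) + 1)      # cells the full window needs
    got = sum(len(received) for received in scoreboard.values())
    return got / total

# Two servers, three panes; server "s2" lost pane 1.
board = {"s1": {0, 1, 2}, "s2": {0, 2}}
print(fidelity(board))  # 5 of 6 cells present
```

Annotating each result with such a scoreboard is what lets users see not just *how much* data is missing but *where* it is missing in space and time, which the next slides exploit.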
11
Trading fidelity for latency
• Use C2 to trade fidelity for latency
  – Maximum latency requirement
  – Minimum fidelity requirement
• Different ways to meet minimum fidelity – 4 useful classes of C2 specifications
12
Minimizing result latency
• Minimum fidelity with earliest results
• Gives freedom to decrease latency – return the earliest data available
• Appropriate for uniformly distributed events
13
Sampling non-uniform events
• Minimum fidelity with random sampling
• Less freedom to decrease latency – included data may not be the first available
• Appropriate even for non-uniform data
14
Correlating events across time and space
• Temporal completeness – include all data from a node, or no data at all
• Spatial completeness – each pane contains data from all nodes
• Leverage knowledge about data distribution
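The two completeness classes above can be expressed as simple predicates over the scoreboard idea from the C2 slide. The dict-of-sets scoreboard shape is an assumption carried over for illustration.

```python
def temporally_complete(scoreboard, num_panes):
    """Temporal completeness: a node counts only if it contributed
    ALL panes in the window (all-or-nothing per node)."""
    return [n for n, got in scoreboard.items() if len(got) == num_panes]

def spatially_complete(scoreboard, num_panes):
    """Spatial completeness: a pane counts only if EVERY node
    contributed data to it."""
    return [p for p in range(num_panes)
            if all(p in got for got in scoreboard.values())]

# Two servers, three panes; server "s2" lost pane 1.
board = {"s1": {0, 1, 2}, "s2": {0, 2}}
print(temporally_complete(board, 3))  # only s1 contributed every pane
print(spatially_complete(board, 3))   # only panes 0 and 2 cover all nodes
```

Temporal completeness suits per-node correlations (e.g., following one user's session), while spatial completeness suits cross-node correlations within a time slice; which one to require depends on how the events of interest are distributed.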
15
Prototype
• Built upon Mortar
  – Sliding windows
  – In-network aggregation trees
• Extended to support:
  – MapReduce API
  – Pane-based processing
  – Fault tolerance mechanisms
16
Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently
• Load shedding mechanism
  – Nodes monitor local processing rate
  – Shed panes that cannot be processed on time
• Increase result fidelity under time and resource constraints
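The load-shedding mechanism can be sketched as a per-node decision: given an estimated processing cost for each pending pane and the time remaining before the result deadline, keep the panes that fit and shed the rest. The greedy in-order policy and the cost model here are assumptions for illustration, not the paper's exact algorithm.

```python
def shed_panes(pane_costs, deadline):
    """Decide which pending panes to process before the deadline.
    pane_costs is a list of (pane_id, estimated seconds) in arrival
    order; returns (kept, shed) pane-id lists."""
    kept, shed, used = [], [], 0.0
    for pane_id, cost in pane_costs:
        if used + cost <= deadline:
            kept.append(pane_id)   # fits in the remaining budget
            used += cost
        else:
            shed.append(pane_id)   # would miss the deadline: drop it
    return kept, shed

# Four pending panes with estimated costs, 5 seconds until the deadline.
kept, shed = shed_panes([(1, 2.0), (2, 2.0), (3, 2.0), (4, 1.0)], deadline=5.0)
print(kept, shed)  # keeps panes 1, 2, 4; sheds pane 3
```

Shedding a whole pane, rather than falling arbitrarily behind, is what keeps the fidelity loss explicit: the dropped pane shows up as a hole in the C2 scoreboard instead of silently delaying every later window.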
17
Evaluation
• System scalability
• Usefulness of the C2 metric
  – Understanding incomplete results
  – Trading fidelity for latency
• Processing data in-situ
  – Improving fidelity under load with load shedding
  – Minimizing impact on services
18
Scaling
• Synthetic input data, word-count reducer
• 3 reducers provide sufficient processing to handle the 30 map tasks
19
Exploring fidelity-latency trade-offs
• Data loss affects the accuracy of the distribution
• Temporal completeness
• Spatial completeness and random sampling
• C2 allows trading fidelity for lower latency
[Chart annotations: 100% accuracy; >25% decrease]
20
In-situ performance
• iMR runs side-by-side with a real service (Hadoop)
• Vary the CPU allocated to iMR, measuring:
  – Result fidelity
  – Hadoop performance (job throughput)
[Chart annotations: 560%; <11% overhead]
21
Conclusion
• The in-situ architecture processes logs at the sources, avoids bulk data transfers, and reduces analysis time
• The model allows incomplete data under failures or server load, providing timely analysis
• The C2 metric helps understand incomplete data and trade fidelity for latency
• Pro-actively sheds load, improving data fidelity under resource and time constraints