In-situ MapReduce for Log Processing
DESCRIPTION
Paper presentation in DB reading
TRANSCRIPT
In-situ MapReduce for Log Processing
Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
2
Log analytics
• Data centers with 1000s of servers
• Data-intensive computing: store and analyze TBs of logs
Examples:
• Click logs – ad targeting, personalization
• Social media feeds – brand monitoring
• Purchase logs – fraud detection
• System logs – anomaly detection, debugging
3
Log analytics today
• “Store first, query later”
Problems:
• Scale – stress network and disks
• Failures – delay analysis or process incomplete data
• Timeliness – hinder real-time apps
[Figure: servers ship logs to a dedicated MapReduce cluster: store first, then query later]
4
In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers
• MapReduce for continuous data
• Ability to trade fidelity for latency
Optimized for:
• Highly selective workloads – e.g., up to 80% of data filtered or summarized!
• Online analytics – e.g., ad re-targeting based on most recent clicks
[Figure: analysis runs on the log servers themselves instead of on a dedicated MapReduce cluster]
5
An iMR query
The same:
• MapReduce API
  – map(r) → {k,v} : extract/filter data
  – reduce({k,v[]}) → v’ : data aggregation
  – combine({k,v[]}) → v’ : early, partial aggregation
The new:
• Provides continuous results – because logs are continuous
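The API on this slide can be sketched as a small word-count query. The three function roles (map, combine, reduce) follow the slide's signatures; the plain-text record format and the single-process driver loop are assumptions for illustration only.

```python
def map_fn(record):
    """map(r) -> {k,v}: extract one (word, 1) pair per word in a log line."""
    return [(word, 1) for word in record.split()]

def combine_fn(key, values):
    """combine({k,v[]}) -> v': early, partial aggregation at a server."""
    return sum(values)

def reduce_fn(key, values):
    """reduce({k,v[]}) -> v': final aggregation at the root."""
    return sum(values)

def run_query(records):
    # Group mapped pairs by key, combine locally, then reduce the partials.
    groups = {}
    for r in records:
        for k, v in map_fn(r):
            groups.setdefault(k, []).append(v)
    partials = {k: combine_fn(k, vs) for k, vs in groups.items()}
    return {k: reduce_fn(k, [v]) for k, v in partials.items()}

print(run_query(["error disk full", "error net down"]))
# 'error' appears twice, every other word once
```

Because combine produces the same value type that reduce consumes, partial aggregates can be merged anywhere along the path from the servers to the root, which is what makes the continuous, in-network processing on the following slides possible.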
6
Continuous MapReduce
• Input – an infinite stream of logs
• Bound input with sliding windows
  – Range of data (R)
  – Update frequency (S)
• Output – stream of results, one for each window
[Figure: timeline 0''–90''; log entries flow through Map, Combine, and Reduce]
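The sliding-window bound described above can be sketched as follows. The parameter names R (range) and S (update frequency) come from the slide; representing the stream as a finite list of timestamped entries with an explicit end time is an assumption made so the example terminates.

```python
def windows(entries, R, S, t_end):
    """Slice timestamped (t, value) entries into sliding windows of
    range R seconds that advance every S seconds."""
    results = []
    start = 0
    while start + R <= t_end:
        # One result per window: all entries with start <= t < start + R.
        window = [v for t, v in entries if start <= t < start + R]
        results.append((start, start + R, window))
        start += S
    return results

entries = [(5, "a"), (35, "b"), (65, "c")]
# 60-second windows (R=60) that slide every 30 seconds (S=30)
for lo, hi, data in windows(entries, R=60, S=30, t_end=90):
    print(lo, hi, data)
```

With S smaller than R, consecutive windows overlap: entry "b" above falls into both the [0, 60) and the [30, 90) window, which is exactly the redundancy the pane-based processing on a later slide eliminates.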
7
Processing windows in-network
[Figure: overlapping windows of data along the timeline; the user’s reduce function runs at the root of an aggregation tree for efficiency]
8
Efficient processing with panes
• Divide window into panes (sub-windows)
  – Each pane is processed and sent only once
  – Root combines panes to produce window
• Eliminate redundant work – save CPU & network resources, faster analysis
[Figure: panes P1–P5 along the timeline; overlapping windows reuse already-processed panes]
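The pane mechanism above can be sketched in two steps: aggregate each pane exactly once, then let the root combine the panes covering a window. Using a Counter as the partial-aggregate type and integer pane ids are assumptions for illustration.

```python
from collections import Counter

def pane_partials(entries, pane_len):
    """Aggregate each pane (sub-window) exactly once.
    entries are (timestamp, word) pairs; pane id = t // pane_len."""
    panes = {}
    for t, word in entries:
        pane_id = t // pane_len
        panes.setdefault(pane_id, Counter())[word] += 1
    return panes

def window_from_panes(panes, first_pane, panes_per_window):
    """Root-side combine: merge the consecutive panes covering one window."""
    total = Counter()
    for p in range(first_pane, first_pane + panes_per_window):
        total += panes.get(p, Counter())
    return total

entries = [(2, "err"), (20, "err"), (40, "ok")]
panes = pane_partials(entries, pane_len=15)
# A 30-second window [0, 30) covers panes 0 and 1; the next window
# [15, 45) reuses pane 1 instead of reprocessing its raw entries.
print(window_from_panes(panes, first_pane=0, panes_per_window=2))
```

The saving is that when the window slides, only the newly arrived pane is processed and shipped; the overlapping panes are merged again from their cached partial aggregates.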
9
Impact of data loss on analysis
• Servers may get overloaded or fail
Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency
[Figure: a lost pane leaves a gap, marked X?, in the window of panes P1–P5]
10
Quantifying data fidelity
• Data are naturally distributed
  – Space (server nodes)
  – Time (processing window)
• C2 metric – annotates result windows with a “scoreboard”
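One way to picture the C2 scoreboard is a grid of (node, pane) cells marking which parts of the window actually arrived. The dict-of-sets representation and the fidelity-as-fraction computation below are simplified stand-ins for the paper's annotation, used only to make the idea concrete.

```python
def fidelity(scoreboard):
    """Fraction of (node, pane) cells present in a result window.
    scoreboard maps node name -> set of pane ids received from it;
    panes are assumed to be numbered 0..max observed."""
    nodes = list(scoreboard)
    panes = {p for received in scoreboard.values() for p in received}
    total = len(nodes) * (max(panes) + 1)      # cells the full window needs
    got = sum(len(received) for received in scoreboard.values())
    return got / total

# Two servers, three panes; server "s2" lost pane 1.
board = {"s1": {0, 1, 2}, "s2": {0, 2}}
print(fidelity(board))  # 5 of 6 cells present
```

Annotating each result with such a scoreboard is what lets users see not just *how much* data is missing but *where* it is missing in space and time, which the next slides exploit.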
11
Trading fidelity for latency
• Use C2 to trade fidelity for latency
  – Maximum latency requirement
  – Minimum fidelity requirement
• Different ways to meet minimum fidelity – 4 useful classes of C2 specifications
12
Minimizing result latency
• Minimum fidelity with earliest results
• Gives freedom to decrease latency – return the earliest data available
• Appropriate for uniformly distributed events
13
Sampling non-uniform events
• Minimum fidelity with random sampling
• Less freedom to decrease latency – included data may not be the first available
• Appropriate even for non-uniform data
14
Correlating events across time and space
• Temporal completeness – include all data from a node, or no data at all
• Spatial completeness – each pane contains data from all nodes
• Leverage knowledge about data distribution
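The two completeness classes above can be expressed as simple predicates over the scoreboard idea from the C2 slide. The dict-of-sets scoreboard shape is an assumption carried over for illustration.

```python
def temporally_complete(scoreboard, num_panes):
    """Temporal completeness: a node counts only if it contributed
    ALL panes in the window (all-or-nothing per node)."""
    return [n for n, got in scoreboard.items() if len(got) == num_panes]

def spatially_complete(scoreboard, num_panes):
    """Spatial completeness: a pane counts only if EVERY node
    contributed data to it."""
    return [p for p in range(num_panes)
            if all(p in got for got in scoreboard.values())]

# Two servers, three panes; server "s2" lost pane 1.
board = {"s1": {0, 1, 2}, "s2": {0, 2}}
print(temporally_complete(board, 3))  # only s1 contributed every pane
print(spatially_complete(board, 3))   # only panes 0 and 2 cover all nodes
```

Temporal completeness suits per-node correlations (e.g., following one user's session), while spatial completeness suits cross-node correlations within a time slice; which one to require depends on how the events of interest are distributed.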
15
Prototype
• Built upon Mortar
  – Sliding windows
  – In-network aggregation trees
• Extended to support:
  – MapReduce API
  – Pane-based processing
  – Fault tolerance mechanisms
16
Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently
• Load shedding mechanism
  – Nodes monitor local processing rate
  – Shed panes that cannot be processed on time
• Increase result fidelity under time and resource constraints
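The load-shedding mechanism can be sketched as a per-node decision: given an estimated processing cost for each pending pane and the time remaining before the result deadline, keep the panes that fit and shed the rest. The greedy in-order policy and the cost model here are assumptions for illustration, not the paper's exact algorithm.

```python
def shed_panes(pane_costs, deadline):
    """Decide which pending panes to process before the deadline.
    pane_costs is a list of (pane_id, estimated seconds) in arrival
    order; returns (kept, shed) pane-id lists."""
    kept, shed, used = [], [], 0.0
    for pane_id, cost in pane_costs:
        if used + cost <= deadline:
            kept.append(pane_id)   # fits in the remaining budget
            used += cost
        else:
            shed.append(pane_id)   # would miss the deadline: drop it
    return kept, shed

# Four pending panes with estimated costs, 5 seconds until the deadline.
kept, shed = shed_panes([(1, 2.0), (2, 2.0), (3, 2.0), (4, 1.0)], deadline=5.0)
print(kept, shed)  # keeps panes 1, 2, 4; sheds pane 3
```

Shedding a whole pane, rather than falling arbitrarily behind, is what keeps the fidelity loss explicit: the dropped pane shows up as a hole in the C2 scoreboard instead of silently delaying every later window.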
17
Evaluation
• System scalability
• Usefulness of the C2 metric
  – Understanding incomplete results
  – Trading fidelity for latency
• Processing data in-situ
  – Improving fidelity under load with load shedding
  – Minimizing impact on services
18
Scaling
• Synthetic input data, word-count reducer
• 3 reducers provide sufficient processing to handle the 30 map tasks
19
Exploring fidelity-latency trade-offs
• Data loss affects the accuracy of the distribution
• Temporal completeness
• Spatial completeness and random sampling
• C2 allows trading fidelity for lower latency
[Chart annotations: 100% accuracy; >25% decrease]
20
In-situ performance
• iMR runs side-by-side with a real service (Hadoop)
• Vary the CPU allocated to iMR, measuring:
  – Result fidelity
  – Hadoop performance (job throughput)
[Chart annotations: 560%; <11% overhead]
21
Conclusion
• The in-situ architecture processes logs at the sources, avoids bulk data transfers, and reduces analysis time
• The model allows incomplete data under failures or server load, providing timely analysis
• The C2 metric helps understand incomplete data and trade fidelity for latency
• Pro-actively sheds load, improving data fidelity under resource and time constraints