Real Time Analytics for Big Data - A Twitter Inspired Case Study

Post on 15-Jan-2015

Real Time Analytics for Big Data: A Twitter Case Study

Uri Cohen, Head of Product

@uri1803

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

We’re Living in a Real Time World…

• Google RT Web Analytics
• Google Real Time Search
• Facebook Real Time Social Analytics
• Twitter paid tweet analytics
• Real Time User Engagement
• New Real Time Analytics Startups

3 Flavors of Analytics

• Counting
• Correlating
• Research

Analytics @ Twitter – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics: countries and cities, gender, age groups, device types, …

Analytics @ Twitter – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?

Analytics @ Twitter – Research

Sentiment analysis: “Obama is popular”

Trends: “People like to tweet after watching American Idol”

Spam patterns: How can you tell when a user spams?

It’s All about Timing

“Real time” (< few seconds)
• Event driven / stream processing
• High resolution: every tweet gets counted

Reasonably quick (seconds to minutes)
• Ad-hoc querying
• Medium resolution (aggregations)

Batch (hours/days)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)

This is what we’re here to discuss

Twitter in Numbers (March 2011)

• It takes a week for users to send 1 billion Tweets.
• On average, 140 million tweets get sent every day.
• The highest throughput to date is 6,939 tweets/sec.
• 460,000 new accounts are created daily.

Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers

• 5% of the users generate 75% of the content.

Source: http://www.sysomos.com/insidetwitter/
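As a rough sanity check on the March 2011 figures, the daily and weekly totals can be converted to per-second rates and compared with the quoted peak. This is a back-of-the-envelope sketch: only the three input figures come from the slides, and the variable names are illustrative.

```python
# Figures quoted on the slides (Twitter, March 2011).
TWEETS_PER_DAY = 140_000_000
TWEETS_PER_WEEK = 1_000_000_000
PEAK_TWEETS_PER_SEC = 6_939

SECONDS_PER_DAY = 24 * 60 * 60

# Average rates implied by the daily and the weekly totals.
avg_from_daily = TWEETS_PER_DAY / SECONDS_PER_DAY
avg_from_weekly = TWEETS_PER_WEEK / (7 * SECONDS_PER_DAY)

# How far the quoted peak sits above the daily average.
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_from_daily

print(f"average from daily figure:  {avg_from_daily:,.0f} tweets/sec")
print(f"average from weekly figure: {avg_from_weekly:,.0f} tweets/sec")
print(f"peak is about {peak_to_avg:.1f}x the daily average")
```

Both averages land between 1,600 and 1,700 tweets/sec, so a system sized only for the average would need several times that headroom to absorb the quoted peak.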

Challenge 1 – Computing Reach

Tweets → Followers → Distinct Followers → Count → Reach
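The reach computation can be sketched in a few lines. The slide does not define reach, so this assumes it means the number of distinct followers of everyone who tweeted about a given item. The function name `compute_reach` and the sample follower data are hypothetical.

```python
from typing import Dict, Iterable, Set


def compute_reach(tweeters: Iterable[str], followers_of: Dict[str, Set[str]]) -> int:
    """Count the distinct followers of everyone in `tweeters`."""
    distinct: Set[str] = set()
    for user in tweeters:
        # A user with no recorded followers contributes nothing.
        distinct |= followers_of.get(user, set())
    return len(distinct)


# Hypothetical data: user -> set of that user's followers.
followers_of = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"carol", "erin"},
}

# "carol" follows both tweeters but is counted once.
reach = compute_reach(["alice", "bob"], followers_of)
print(reach)
```

The set union is what makes the count "distinct". At Twitter scale the follower sets would not fit on one machine, which is part of what makes this a challenge.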

Challenge 2 – Word Count

Tweets → Count? → Word:Count

• Hottest topics
• URL mentions
• etc.

URL Mentions – Here’s One Use Case

Analyze the Problem

• (Tens of) thousands of tweets per second to process
• Assumption: Need to process in near real time
• Aggregate counters for each word
• A few 10s of thousands of words (or hundreds of thousands if we include URLs)
• System needs to linearly scale

It’s Really Just an Online Map/Reduce
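A minimal sketch of what "online map/reduce" could look like for word counting: the map step runs on each tweet as it arrives, and the reduce step folds its output into running totals rather than waiting for a complete batch. The function names and the whitespace tokenization are assumptions made for illustration, not the presenter's implementation.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_tweet(text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for each whitespace-separated token."""
    for word in text.lower().split():
        yield word, 1


def reduce_into(counts: Counter, pairs: Iterable[Tuple[str, int]]) -> None:
    """Reduce step: fold the pairs into the running totals."""
    for word, n in pairs:
        counts[word] += n


counts: Counter = Counter()
# Stand-in for a live stream; each tweet is processed as soon as it arrives.
for tweet in ["Big data is big", "real time data"]:
    reduce_into(counts, map_tweet(tweet))

print(counts.most_common(2))
```

The totals are usable after every tweet, which is the difference from a batch job that only produces results at the end of a run.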

Event Driven Flow

Tokenize

Store Raw

Filter

Count

Store Counters

Research

It’s Not That Simple…

• (Tens of) thousands of tweets to tokenize every second, even more words to filter: CPU bottleneck
• Tens/hundreds of thousands of counters to update: counters contention
• Tens/hundreds of thousands of counters to persist: database bottleneck
• (Tens of) thousands of tweets to store every second in the database: database bottleneck

Implementation Challenges

Twitter is growing by the day. How can this scale?

Fault tolerance. What happens if a server fails?

Consistency. Counters should be accurate!

Solutions, Solutions

Dealing with the Massive Tweet Stream

• We need a system that can scale linearly
• Event partitioning to the rescue!
• What to partition by?

Tokenizer 1 → Filterer 1
Tokenizer 2 → Filterer 2
Tokenizer 3 → Filterer 3
Tokenizer n → Filterer n
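Event partitioning can be sketched as a stable hash of a routing key, taken modulo the number of Tokenizer/Filterer pairs. The slide leaves "what to partition by?" open, so using the tweet id as the key is an assumption here, as are the `partition_for` helper and the sample tweets.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a routing key to a partition index in a stable way."""
    # zlib.crc32 is deterministic across processes, unlike the built-in hash()
    # for strings, which is randomized per interpreter run by default.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4
tweets = [
    {"id": "1001", "text": "real time analytics"},
    {"id": "1002", "text": "big data"},
    {"id": "1003", "text": "map reduce"},
]

# One inbox per Tokenizer/Filterer pair.
inboxes = [[] for _ in range(NUM_PARTITIONS)]
for tweet in tweets:
    inboxes[partition_for(tweet["id"], NUM_PARTITIONS)].append(tweet)

for index, inbox in enumerate(inboxes):
    print(index, [t["id"] for t in inbox])
```

Because each tweet can be tokenized and filtered independently, adding partitions adds capacity without any coordination between them at this stage.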

Counters Persistence & Contention

• Why not keep things in memory (data grid)? Treat it as a SoR (system of record)
• Question: How to partition counters?
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Tokenizers 1 … n → Filterers 1 … n → Counter Updaters 1 … n

IMDG
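One possible answer to the counter partitioning question is to partition by the word itself, so that every update for a given word lands on the same partition. The sketch below assumes that choice; the `PartitionedCounters` class is hypothetical and keeps all partitions in one process, whereas a data grid would spread them across machines.

```python
import zlib
from collections import Counter
from typing import List


class PartitionedCounters:
    """Word counters split across partitions, with one owning partition per word."""

    def __init__(self, num_partitions: int) -> None:
        self._partitions: List[Counter] = [Counter() for _ in range(num_partitions)]

    def _owner(self, word: str) -> Counter:
        # The same word always hashes to the same partition.
        index = zlib.crc32(word.encode("utf-8")) % len(self._partitions)
        return self._partitions[index]

    def increment(self, word: str, amount: int = 1) -> None:
        self._owner(word)[word] += amount

    def get(self, word: str) -> int:
        # Counter returns 0 for a missing key without inserting it.
        return self._owner(word)[word]

    def partition_sizes(self) -> List[int]:
        return [len(p) for p in self._partitions]


counters = PartitionedCounters(num_partitions=4)
for word in "big data real time big data big".split():
    counters.increment(word)

print(counters.get("big"), counters.get("data"), counters.partition_sizes())
```

With this layout, updates to different words never touch the same partition's state unless the words happen to share a partition, so contention is limited to genuinely hot words.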

Database Bottleneck – Storing Raw Tweets

• RDBMS won’t cut it (unless you have $$$$$)
• Hadoop / NoSQL to the rescue
• Use the data grid for batching and as a reliable transaction log

Persister 1, Persister 2, … Persister n
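The batching idea can be illustrated with a small write-behind buffer: tweets accumulate in memory and go to the backing store in bulk. This is a sketch under stated assumptions. `BatchingPersister` and its `write_batch` hook are hypothetical names, a plain list stands in for the big data store, and the reliability that the slide attributes to the data grid (surviving a crash before a flush) is not modeled.

```python
from typing import Callable, List


class BatchingPersister:
    """Buffer raw tweets in memory and write them out in batches."""

    def __init__(self, write_batch: Callable[[List[str]], None], batch_size: int) -> None:
        self._write_batch = write_batch  # hypothetical bulk-write hook
        self._batch_size = batch_size
        self._buffer: List[str] = []

    def add(self, tweet: str) -> None:
        self._buffer.append(tweet)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        # Swap the buffer out first so new tweets start a fresh batch.
        batch, self._buffer = self._buffer, []
        self._write_batch(batch)


written: List[List[str]] = []  # stand-in for the big data store
persister = BatchingPersister(written.append, batch_size=3)
for i in range(7):
    persister.add(f"tweet {i}")
persister.flush()  # push out the final partial batch

print([len(batch) for batch in written])
```

Seven single-row writes become three bulk writes here; the same ratio at thousands of tweets per second is what takes pressure off the database.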

Counter Consistency & Resiliency

• Primary/hot backup setup, sync replication
• Transactional, atomic updates
• “On Demand” backups

IMDG: four primary partitions (P), each paired with a backup (B)
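A toy, single-process illustration of the primary/hot backup idea with synchronous replication and atomic updates. Everything here is an assumption made for the sake of the sketch: the `ReplicatedCounterPartition` class is hypothetical, a lock stands in for the grid's transactional update, and both copies live in the same process, whereas a real IMDG keeps the backup on another machine.

```python
import threading
from collections import Counter


class ReplicatedCounterPartition:
    """One partition holding a primary copy and a hot backup of its counters."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._primary: Counter = Counter()
        self._backup: Counter = Counter()

    def increment(self, word: str, amount: int = 1) -> int:
        # The lock makes the read-modify-write atomic, and the backup is
        # updated before the call returns (synchronous replication).
        with self._lock:
            self._primary[word] += amount
            self._backup[word] = self._primary[word]
            return self._primary[word]

    def fail_over(self) -> None:
        """Simulate losing the primary: promote the backup and create a new backup."""
        with self._lock:
            self._primary = self._backup
            self._backup = Counter(self._primary)

    def get(self, word: str) -> int:
        with self._lock:
            return self._primary[word]


partition = ReplicatedCounterPartition()


def worker() -> None:
    for _ in range(1000):
        partition.increment("data")


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

partition.fail_over()  # the promoted backup must hold every acknowledged update
print(partition.get("data"))
```

Because the backup is written before `increment` returns, no acknowledged update is lost when the backup is promoted, which is the property the slide asks of the counters.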

Implementing Event Flows

• Use the data grid as event bus
• Stateful objects as transactional events

Diagram: within a primary (P) / backup (B) partition, events move Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
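The "stateful objects as events" idea can be sketched as objects that carry a state field, with each handler picking up objects in the state it cares about and moving them to the next one. This is a simplified, single-threaded illustration: a Python list stands in for the data grid, transactions are not modeled, and the `TweetEvent` class, the `STOP_WORDS` list and the final `COUNTED` state are assumptions that do not appear on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

STOP_WORDS = {"a", "is", "the"}  # hypothetical filter list


@dataclass
class TweetEvent:
    text: str
    state: str = "RAW"
    words: List[str] = field(default_factory=list)


def tokenize(event: TweetEvent) -> None:
    event.words = event.text.lower().split()
    event.state = "TOKENIZED"


def filter_words(event: TweetEvent) -> None:
    event.words = [w for w in event.words if w not in STOP_WORDS]
    event.state = "FILTERED"


def count(event: TweetEvent, counters: Dict[str, int]) -> None:
    for word in event.words:
        counters[word] = counters.get(word, 0) + 1
    event.state = "COUNTED"  # assumed terminal state


def run_until_idle(space: List[TweetEvent], counters: Dict[str, int]) -> None:
    """Keep dispatching events by state until no handler has work left."""
    handlers = {
        "RAW": tokenize,
        "TOKENIZED": filter_words,
        "FILTERED": lambda e: count(e, counters),
    }
    pending = True
    while pending:
        pending = False
        for event in space:
            handler = handlers.get(event.state)
            if handler is not None:
                handler(event)
                pending = True


space = [TweetEvent("The data is big"), TweetEvent("Big data")]  # stand-in for the grid
counters: Dict[str, int] = {}
run_until_idle(space, counters)
print(counters)
```

Each handler only needs to know which state it consumes and which it produces, so stages can be added or scaled without the others being aware of them.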

Putting It All Together

• Use an IMDG
• Partition events, counters
• Collocate event handlers with data for low latency and simple scaling
• Use atomic updates to update counters in memory
• Persist asynchronously to a big data store

IMDG: primary partitions (P) with backups (B)

Reducing Counter Contention - Optimization

• Store processed tweets locally in each partition
• Use periodic batch writes to update counters
• Process batches as a whole

Diagram: word1, word2, word3 batched over a 1 sec window
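The optimization can be sketched as local pre-aggregation: word counts accumulate in a private structure, and the shared counters are touched only when a batch is flushed. The `BatchingCounterUpdater` class and the `shared_writes` metric are hypothetical, a plain dict stands in for the shared counters, and the flush is called by hand where a real deployment would presumably run it on a timer.

```python
from collections import Counter
from typing import Dict, List


class BatchingCounterUpdater:
    """Accumulate word counts locally and apply them to shared counters in one go."""

    def __init__(self, shared_counters: Dict[str, int]) -> None:
        self._shared = shared_counters
        self._local: Counter = Counter()
        self.shared_writes = 0  # individual counter updates applied to the shared store

    def process(self, words: List[str]) -> None:
        # No shared state is touched here, so there is nothing to contend on.
        self._local.update(words)

    def flush(self) -> None:
        # One write per distinct word in the batch, however often it occurred.
        for word, delta in self._local.items():
            self._shared[word] = self._shared.get(word, 0) + delta
            self.shared_writes += 1
        self._local.clear()


shared: Dict[str, int] = {}
updater = BatchingCounterUpdater(shared)
for tweet in ["big data", "big big data", "real time data"]:
    updater.process(tweet.split())
updater.flush()  # would run periodically, for example once per second

print(shared, updater.shared_writes)
```

Eight word occurrences turn into four shared-counter writes in this example; the saving grows with how often popular words repeat within a batch.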

Keep Your Eyes Open…

References

• Learn and fork the code on GitHub: https://github.com/Gigaspaces/rt-analytics
• Detailed blog post: http://bit.ly/gs-bigdata-analytics
• Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
• Twitter Storm: http://bit.ly/twitter-storm
• Apache S4: http://incubator.apache.org/s4/
