Real Time Analytics for Big Data: A Twitter Case Study
DESCRIPTION
Learn how to build a Twitter-like analytics system, designed to meet real-time needs, in a simple way, using frameworks such as Spring Social, an Active In-Memory Data Grid for Big Data event processing, and a NoSQL database. Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets whose conditions change in real time, such as finance and advertising. In the same way that Hadoop was born out of large-scale web applications, a new class of scalable frameworks and platforms for streaming or real-time analysis and processing has emerged to handle the needs of large-scale location-aware mobile, social and sensor use. Do we want to limit ourselves to just these use cases? Facebook, Twitter and Google have been pioneers in that arena and recently launched new analytics services designed to meet real-time needs. In this session we will review the common patterns and architectures that drive these platforms and learn how to build a Twitter-like analytics system in a simple way using frameworks such as Spring Social, an Active In-Memory Data Grid for Big Data event processing, and a NoSQL database such as Cassandra or HBase for managing the historical data.
TRANSCRIPT
Real Time Analytics for Big Data
A Twitter Inspired Case Study
@natishalom
Big Data Predictions
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
The Flavors of Big Data Analytics
Counting Correlating Research
Analytics @ Twitter – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics: countries and cities, gender, age groups, device types, …
Analytics @ Twitter – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
Analytics @ Twitter – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: How can you tell when a user spams?
It’s All about Timing
“Real time” (< a few seconds)
• Event driven / stream processing
• High resolution: every tweet gets counted
Reasonably quick (seconds to minutes)
• Ad-hoc querying
• Medium resolution (aggregations)
Batch (hours/days)
• Long-running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss
Challenge – Word Count
[Diagram: a stream of Tweets feeds a “Count ?” step that produces Word:Count pairs]
• Hottest topics
• URL mentions
• etc.
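The word-count challenge above can be sketched on a single machine as follows. This is a minimal illustration, not the implementation from the talk: the regular expressions, the function names `tokenize` and `count_tweets`, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions): URLs are counted separately from words.
URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[#@]?\w+")


def tokenize(tweet):
    """Return (words, urls) for one tweet; words are lowercased."""
    urls = URL_RE.findall(tweet)
    text = URL_RE.sub(" ", tweet)  # remove URLs before extracting words
    words = [w.lower() for w in WORD_RE.findall(text)]
    return words, urls


def count_tweets(tweets):
    """Aggregate one counter per word and one per URL over a tweet stream."""
    word_counts, url_counts = Counter(), Counter()
    for tweet in tweets:
        words, urls = tokenize(tweet)
        word_counts.update(words)
        url_counts.update(urls)
    return word_counts, url_counts


# Made-up sample data, for illustration only.
sample_tweets = [
    "Big data is big http://example.com/a",
    "Real time big data #analytics http://example.com/a",
]
word_counts, url_counts = count_tweets(sample_tweets)
print(word_counts.most_common(3))  # candidate "hottest topics"
print(url_counts.most_common(1))   # most mentioned URL
```

A single in-process `Counter` is enough to show the idea; the rest of the deck is about what changes when the stream is too fast for one process.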
URL Mentions – Here’s One Use Case
Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
• It takes a week for users to send 1 billion Tweets.
• On average, 140 million tweets get sent every day.
• The highest throughput to date is 6,939 tweets/sec.
• 460,000 new accounts are created daily.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
• 5% of the users generate 75% of the content.
Analyze the Problem
• (Tens of) thousands of tweets per second to process
• Assumption: need to process in near real time
• Aggregate counters for each word
• A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant
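A quick back-of-the-envelope calculation shows how the published figures relate to these requirements. The daily volume and peak rate come from the Twitter numbers quoted in this deck; `words_per_tweet` is an assumed value chosen only to illustrate the scale of counter updates.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000  # Twitter blog, March 2011
peak_tweets_per_sec = 6_939   # Twitter blog, March 2011
words_per_tweet = 15          # assumption, for illustration only

# Average rate implied by the daily volume (roughly 1,620 tweets/sec).
avg_tweets_per_sec = tweets_per_day / SECONDS_PER_DAY

# The system must be sized for the peak, not for the average.
peak_to_avg = peak_tweets_per_sec / avg_tweets_per_sec

# Every word in every tweet is one counter update.
peak_counter_updates_per_sec = peak_tweets_per_sec * words_per_tweet

print(f"average tweets/sec:       {avg_tweets_per_sec:,.0f}")
print(f"peak / average:           {peak_to_avg:.1f}x")
print(f"peak counter updates/sec: {peak_counter_updates_per_sec:,}")
```

Under these assumptions the peak is several times the average, and the counter-update rate is an order of magnitude above the tweet rate, which is why linear scaling is listed as a requirement.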
Key Elements in Real Time Big Data Analytics
Sharding (Partitioning)
[Diagram: parallel partitions of Tokenizer 1…n and Filterer 1…n, feeding Counter Updater 1…n]
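The partitioned pipeline can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the architecture from the talk: `NUM_PARTITIONS`, `STOP_WORDS`, `partition_for`, and the `CounterUpdater` class are hypothetical names, and each list entry stands in for what would be a separate node.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4                     # hypothetical partition count
STOP_WORDS = {"a", "the", "is", "to"}  # hypothetical filter list


def tokenize(tweet):
    """Tokenizer stage: split a tweet into lowercase tokens."""
    return tweet.lower().split()


def keep(word):
    """Filterer stage: drop words that are not worth counting."""
    return word not in STOP_WORDS


def partition_for(word, n=NUM_PARTITIONS):
    """Route a word to a partition. crc32 is used because, unlike the
    built-in hash() on strings, it is stable across processes."""
    return zlib.crc32(word.encode("utf-8")) % n


class CounterUpdater:
    """Counter-updater stage: owns the counters of exactly one partition."""

    def __init__(self):
        self.counts = Counter()

    def update(self, word):
        self.counts[word] += 1


updaters = [CounterUpdater() for _ in range(NUM_PARTITIONS)]


def process(tweet):
    for word in filter(keep, tokenize(tweet)):
        updaters[partition_for(word)].update(word)


def total(word):
    """A word lives in one partition only, so a read touches one updater."""
    return updaters[partition_for(word)].counts[word]


for t in ["The data is big", "Big data to the rescue"]:
    process(t)
print(total("data"), total("big"))
```

Because the same word always hashes to the same updater, no counter is shared between partitions, and adding partitions spreads both the data and the update load.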
Keep Things In Memory
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
Use EDA (Event Driven Architecture)
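As a rough illustration of the event-driven style, the sketch below wires tokenizing, filtering, and counting together as handlers that react to events instead of calling each other directly. The `EventBus` class, the event names, and the stop-word list are assumptions made for this example; in a real deployment the bus would typically be a data grid or messaging layer rather than a synchronous in-process dictionary.

```python
from collections import Counter, defaultdict


class EventBus:
    """Minimal synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
counts = Counter()
STOP_WORDS = {"the", "is", "a"}  # hypothetical filter list


def on_tweet(tweet):
    # Tokenizer: reacts to raw tweets, emits one event per token.
    for token in tweet.lower().split():
        bus.publish("token", token)


def on_token(token):
    # Filterer: reacts to tokens, forwards only the interesting ones.
    if token not in STOP_WORDS:
        bus.publish("word", token)


def on_word(word):
    # Counter updater: reacts to accepted words.
    counts[word] += 1


bus.subscribe("tweet", on_tweet)
bus.subscribe("token", on_token)
bus.subscribe("word", on_word)

bus.publish("tweet", "The data is big data")
print(counts)
```

Each stage knows only the event type it consumes and the one it emits, so stages can be added, replaced, or scaled independently.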
Putting it all together
Know Your Toolset
References
• Writing your own Twitter analytics: http://ht.ly/d8j4I
• Detailed blog post: http://bit.ly/gs-bigdata-analytics
• Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
• Twitter Storm: http://bit.ly/twitter-storm
• Apache S4: http://incubator.apache.org/s4/