Twitter stream processing
DESCRIPTION
Process the Twitter stream using Storm & RedStorm with Ruby & JRuby. Full working demo: code on GitHub at https://github.com/colinsurprenant/tweitgeist and live demo at http://tweitgeist.needium.com/
TRANSCRIPT
![Page 1: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/1.jpg)
Twitter stream processing
Colin Surprenant
@colinsurprenant
Lead ninja
Big Data Montreal
April 2012
![Page 2: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/2.jpg)
Twitter - Spring 2012
● 350 million tweets/day
● 140 million active users
● >1 million applications using the API
![Page 3: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/3.jpg)
Daily Twitter @ Needium
● 50 000 000 processed tweets
● 5 000 opportunities
● 500 messages sent
● 100 GB data
![Page 4: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/4.jpg)
Anatomy of a tweet
What's in a tweet?
● avatar
● user
● timestamp
● message
Anything else?
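To make the "anything else?" question concrete, here is a minimal Ruby sketch that pulls the visible parts of a tweet out of its JSON form, plus the hashtags. The payload is hand-written and hypothetical, and the field names (`created_at`, `text`, `user.screen_name`, `user.profile_image_url`, `entities.hashtags`) are assumed to follow the Twitter JSON format of the time; real tweets carry many more fields.

```ruby
require 'json'
require 'time'

# Hypothetical, hand-written payload for illustration only.
SAMPLE_TWEET = <<~JSON
  {
    "id": 123456789,
    "created_at": "Mon Apr 16 14:05:00 +0000 2012",
    "text": "Talking about #storm and #jruby at Big Data Montreal",
    "user": {
      "screen_name": "example_user",
      "profile_image_url": "http://example.com/avatar.png"
    },
    "entities": { "hashtags": [ { "text": "storm" }, { "text": "jruby" } ] }
  }
JSON

# Returns the parts of a tweet shown on screen, plus its hashtags.
def tweet_summary(json)
  tweet = JSON.parse(json)
  {
    user: tweet.dig('user', 'screen_name'),
    avatar: tweet.dig('user', 'profile_image_url'),
    timestamp: Time.strptime(tweet['created_at'], '%a %b %d %H:%M:%S %z %Y').utc,
    message: tweet['text'],
    # A tweet without entities yields an empty list.
    hashtags: tweet.dig('entities', 'hashtags').to_a.map { |h| h['text'] }
  }
end
```

Calling `tweet_summary(SAMPLE_TWEET)[:hashtags]` on the sample above gives `["storm", "jruby"]`.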
![Page 5: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/5.jpg)
![Page 6: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/6.jpg)
How to get the tweets?
● Streaming API
○ Subscribe to realtime feeds moving forward
● REST Search API
○ Search request on past data (1 week)
![Page 7: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/7.jpg)
Streaming API
public statuses from all users
● statuses/filter
○ track/location/follow
○ 5000 follow user ids
○ 400 track keywords
○ 25 location boxes
○ rate limited
● statuses/sample
○ 1% of all public statuses (message id mod 100)
○ two statuses/sample streams will result in same data
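The reason two sample streams carry the same data is that the sampling is deterministic, not random. A minimal sketch, assuming a status is selected when its id falls in one fixed residue class modulo 100; the residue actually used (`SAMPLE_RESIDUE` here) is a made-up value, not something stated in the talk:

```ruby
# Assumption: the sample keeps ids in one fixed residue class modulo 100.
# The real residue is unknown; 0 is an arbitrary choice for illustration.
SAMPLE_RESIDUE = 0

def sampled?(status_id, residue = SAMPLE_RESIDUE)
  status_id % 100 == residue
end

# Two independent consumers of the same public timeline.
ids = (1..1_000).to_a
stream_a = ids.select { |id| sampled?(id) }
stream_b = ids.select { |id| sampled?(id) }
```

Both consumers apply the same rule to the same ids, so `stream_a == stream_b` holds and each contains 10 of the 1000 ids (1%). Opening a second sample connection therefore adds no new tweets.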
![Page 8: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/8.jpg)
Streaming API
per user streams
● User Streams
○ all data required to update a user's display
○ requires user's OAuth token
○ statuses from followings, direct messages, mentions
○ cannot open large number of user streams from same host
● Site Streams
○ multiplexing of multiple User Streams
![Page 9: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/9.jpg)
Streaming API
Firehose
Need more/full data? Only through partners
● gnip.com
● datasift.com
● filtering/tracking
● partial to full Firehose
What's the catch? $$$ lots of it
![Page 10: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/10.jpg)
Search API
● REST API (http request/response)
○ search query
○ geocode (lat, long, radius)
○ result type (mixed/recent/popular)
○ since id
● max 100 results per page (rpp) and 1500 results
● rate limited (~1 request/sec)
![Page 11: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/11.jpg)
Storm
Distributed and fault-tolerant realtime computation
https://github.com/nathanmarz/storm
![Page 12: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/12.jpg)
Storm
The promise
● Guaranteed data processing
● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers
● Higher level abstraction than message passing
● Just works
![Page 13: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/13.jpg)
RedStorm
Storm: Java + Clojure
RedStorm: JRuby integration & DSL for Storm
Ruby + Storm on JVM
https://github.com/colinsurprenant/redstorm
![Page 14: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/14.jpg)
Storm
Typical use cases
● Stream processing
● Continuous computation
● Distributed RPC
![Page 15: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/15.jpg)
Storm
Concepts
Streams
Unbounded sequence of tuples
[Diagram: a stream drawn as a sequence of tuples]
![Page 16: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/16.jpg)
Storm
Concepts
Spouts
Source of streams
[Diagram: a spout emitting streams of tuples]
![Page 17: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/17.jpg)
Storm
Concepts
Bolts
Process input streams and produce new streams
[Diagram: bolts consuming and emitting streams of tuples]
![Page 18: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/18.jpg)
Storm
Concepts
Topology
Network of spouts and bolts
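The stream, spout and bolt concepts can be mimicked in a few lines of plain Ruby with lazy enumerators. This is only an in-process analogy under stated assumptions: it does not use the Storm or RedStorm API, the canned messages are invented, and nothing here is distributed or fault-tolerant.

```ruby
# Spout: an unbounded source of tuples (an endless cycle of made-up messages).
spout = ['#storm rocks', 'no tags here', '#jruby and #storm'].cycle.lazy

# Bolt 1: consumes message tuples and emits zero or more hashtag tuples each.
extract = spout.flat_map { |text| text.scan(/#(\w+)/).flatten }

# Bolt 2: consumes hashtag tuples and emits normalized hashtag tuples.
normalize = extract.map(&:downcase)

# The "topology" is the chain spout -> extract -> normalize. Because the
# stream never ends, we only ever take a finite prefix of it.
first_five = normalize.first(5)
```

Here `first_five` is `["storm", "jruby", "storm", "storm", "jruby"]`. Laziness is what lets the pipeline be defined over an unbounded stream without ever materializing it.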
![Page 19: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/19.jpg)
Storm
Concepts
Grouping
● Shuffle: random+fair distribution across tasks
● Fields: partition on specified field
● All: replication to all tasks
● Global: entire stream to single task
[Diagram: bolt A feeding bolt B under each grouping; the Fields example partitions on field X and field Y]
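The four groupings can be read as routing functions that decide which of bolt B's tasks receive a tuple. The sketch below is plain Ruby, not Storm code; the function names are hypothetical, the CRC32 hash is an arbitrary stand-in for whatever hash Storm uses, and the shuffle here is uniformly random without the balancing that "fair" implies.

```ruby
require 'zlib'

# Each function returns the indices of the tasks that receive the tuple.

# Shuffle: one task picked at random (no load balancing in this sketch).
def shuffle_grouping(_tuple, task_count, rng = Random.new)
  [rng.rand(task_count)]
end

# Fields: the same field value always lands on the same task.
def fields_grouping(tuple, task_count, field)
  [Zlib.crc32(tuple.fetch(field).to_s) % task_count]
end

# All: every task gets a copy of the tuple.
def all_grouping(_tuple, task_count)
  (0...task_count).to_a
end

# Global: the entire stream goes to a single task (task 0 in this sketch).
def global_grouping(_tuple, _task_count)
  [0]
end
```

The property that matters for counting is the one `fields_grouping` provides: two tuples with the same hashtag always reach the same task, so each task can keep a local count that is complete for its keys.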
![Page 20: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/20.jpg)
Storm
What Storm does
● Distributes code
● Robust process management
● Monitors topologies and reassigns failed tasks
● Provides reliability by tracking tuple trees
● Routing and partitioning of stream
![Page 21: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/21.jpg)
Tweitgeist
Live top 10 trending hashtags on Twitter
DEMO
https://github.com/colinsurprenant/tweitgeist
Live demo: http://tweitgeist.needium.com
![Page 22: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/22.jpg)
Tweitgeist
Architecture
[Diagram: Twitter stream → Queue → Parallel extraction → Partitioned counting (N partitions) → Partitioned ranking (N partitions) → Merging → Queue → Frontend]
![Page 23: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/23.jpg)
Tweitgeist
Architecture
● Parallel computation across partitions of the stream
● Scalable architecture
● Works with the full Twitter firehose with very few modifications
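Partitioned counting, partitioned ranking and a final merge can be sketched in plain Ruby as follows. This is an illustration of the idea only, not the Tweitgeist code: the partition count, the sample hashtags, the CRC32 partitioning and the `top` helper are all assumptions made for the example.

```ruby
require 'zlib'

PARTITIONS = 3
TOP_N = 2

# Made-up input; in the real system these come from the extraction stage.
hashtags = %w[storm jruby storm ruby storm jruby redis]

# Partitioned counting: a given hashtag is always counted in the same partition.
counts = Array.new(PARTITIONS) { Hash.new(0) }
hashtags.each { |tag| counts[Zlib.crc32(tag) % PARTITIONS][tag] += 1 }

# Sorts [tag, count] pairs by descending count (ties broken by tag), keeps n.
def top(pairs, n)
  pairs.sort_by { |tag, count| [-count, tag] }.first(n)
end

# Partitioned ranking: each partition keeps only its local top N.
local_ranks = counts.map { |partition| top(partition.to_a, TOP_N) }

# Merging: a single step combines the local rankings into the global top N.
global_top = top(local_ranks.flatten(1), TOP_N)
```

Here `global_top` is `[["storm", 3], ["jruby", 2]]`. Because each hashtag lives in exactly one partition, any hashtag in the global top N is also in its partition's local top N, so the merge step only has to look at N entries per partition.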
![Page 24: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/24.jpg)
Tweitgeist
Storm Topology
[Diagram: Twitter → Redis queue → TwitterStreamSpout (stream reader) → shuffle → ExtractMessageBolt (message extract) → shuffle → ExtractHashtagBolt (hashtag extract) → fields: hashtag → RollingCountBolt (rolling counter) → fields: hashtag → RankBolt (ranking) → global → MergeBolt (merging) → Redis queue → UI]
● Shuffle grouping: Tuples are randomly distributed across the bolt's tasks
● Fields grouping: The stream is partitioned by the fields specified in the grouping
● Global grouping: The entire stream goes to a single one of the bolt's tasks
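The rolling counter stage needs counts over a recent time window, not counts since startup. One common way to get that is to keep one bucket per second and drop buckets older than the window. The class below is a hedged sketch of that technique in plain Ruby; `RollingCounter` and its methods are hypothetical names, and the actual RollingCountBolt may be implemented differently.

```ruby
# Counts occurrences of keys over the last `window` seconds,
# using one bucket of counts per second.
class RollingCounter
  def initialize(window)
    @window = window
    @buckets = Hash.new { |h, second| h[second] = Hash.new(0) }
  end

  # Records one occurrence of key at time `now` (seconds, Integer or Time).
  def add(key, now)
    @buckets[now.to_i][key] += 1
    expire(now)
  end

  # Returns the number of occurrences of key still inside the window.
  def count(key, now)
    expire(now)
    @buckets.values.sum { |bucket| bucket.fetch(key, 0) }
  end

  private

  # Drops every bucket that is `window` seconds old or older.
  def expire(now)
    cutoff = now.to_i - @window
    @buckets.delete_if { |second, _| second <= cutoff }
  end
end
```

With a 10 second window, adding `'storm'` at t=0 and t=5 gives a count of 2 at t=9, 1 at t=12 and 0 at t=20.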
![Page 25: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/25.jpg)
Tweitgeist
Topology definition
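The code shown on this slide did not survive in the transcript. As a stand-in, the wiring from the topology diagram can be written down as plain Ruby data. This is a hypothetical description for illustration, not RedStorm DSL syntax; the component names and groupings are taken from the diagram, while `TOPOLOGY`, `valid_topology?` and `edges` are made-up names. See the Tweitgeist repository for the real definition.

```ruby
# Hypothetical plain-Ruby description of the wiring; not RedStorm DSL.
TOPOLOGY = [
  { component: 'TwitterStreamSpout' },
  { component: 'ExtractMessageBolt', source: 'TwitterStreamSpout', grouping: :shuffle },
  { component: 'ExtractHashtagBolt', source: 'ExtractMessageBolt', grouping: :shuffle },
  { component: 'RollingCountBolt', source: 'ExtractHashtagBolt', grouping: { fields: ['hashtag'] } },
  { component: 'RankBolt', source: 'RollingCountBolt', grouping: { fields: ['hashtag'] } },
  { component: 'MergeBolt', source: 'RankBolt', grouping: :global }
].freeze

# True when every component with a source reads from one declared before it.
def valid_topology?(topology)
  declared = []
  topology.all? do |node|
    ok = node[:source].nil? || declared.include?(node[:source])
    declared << node[:component]
    ok
  end
end

# Lists the [source, target, grouping] edges of the topology.
def edges(topology)
  topology.select { |node| node[:source] }
          .map { |node| [node[:source], node[:component], node[:grouping]] }
end
```

The two fields groupings on `hashtag` are what make the counting and ranking stages partitionable, and the final global grouping funnels all local rankings into the single merging task.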
![Page 26: Twitter Stream Processing](https://reader030.vdocuments.mx/reader030/viewer/2022020207/554e8c0ab4c90526358b4af2/html5/thumbnails/26.jpg)
Tweitgeist
Where
Code: https://github.com/colinsurprenant/tweitgeist
Live demo: http://tweitgeist.needium.com