Storm
@danklynn
As deep into real-time data processing as you can get in 30 minutes.
Keeps Contact Information Current and Complete
Based in Denver, Colorado
CTO & Founder
[email protected]
Turn Partial Contacts Into Full Contacts
Storm
Distributed and fault-tolerant real-time computation
THE HARD WAY
Queues
Workers
Key Concepts
Tuples
Ordered list of elements
("search-01384", "e:[email protected]")
Streams
Unbounded sequence of tuples
Tuple → Tuple → Tuple → Tuple → Tuple → Tuple
Spouts
Source of streams
Spouts can talk with:
• Queues
• Web logs
• API calls
• Event data
some images from http://commons.wikimedia.org
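In code, a spout's whole job is to hand Storm one tuple at a time via `nextTuple()`. A toy Python stand-in that drains a queue (illustrative names only; a real Storm spout implements the Java `IRichSpout` interface or the multilang protocol):

```python
from collections import deque

class QueueSpout:
    """Toy spout: next_tuple() pulls one message off a queue and 'emits' it.
    Illustrative only -- not Storm's actual spout API."""
    def __init__(self, messages):
        self.queue = deque(messages)
        self.emitted = []          # stands in for collector.emit(...)

    def next_tuple(self):
        if self.queue:
            self.emitted.append((self.queue.popleft(),))

spout = QueueSpout(["e:[email protected]", "e:[email protected]"])
spout.next_tuple()   # emits the first message as a one-element tuple
```

Storm calls `nextTuple()` in a loop; an empty queue simply means nothing is emitted on that call.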
Bolts
Process tuples and create new streams
• Apply functions / transforms
• Filter
• Aggregation
• Streaming joins
• Access DBs, APIs, etc...
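A bolt boils down to `execute(tuple)` → emit zero or more new tuples. A toy filter-and-transform bolt in Python (illustrative names, not Storm's API; real bolts implement `IRichBolt`/`BaseBasicBolt`):

```python
class FilterBolt:
    """Toy bolt: keeps tuples matching a predicate, transforms the rest.
    Illustrative only -- not Storm's actual bolt API."""
    def __init__(self, predicate, transform):
        self.predicate = predicate
        self.transform = transform
        self.emitted = []          # stands in for collector.emit(...)

    def execute(self, tup):
        if self.predicate(tup):
            self.emitted.append(self.transform(tup))

# Keep only email-type contact fragments, strip the "e:" prefix.
bolt = FilterBolt(lambda t: t[0].startswith("e:"),
                  lambda t: (t[0][2:].lower(),))
bolt.execute(("e:[email protected]",))
bolt.execute(("t:@danklynn",))
# bolt.emitted is now [("[email protected]",)]
```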
Topologies
A directed graph of Spouts and Bolts
This is a Topology
This is also a topology
Tasks
Execute Spouts or Bolts
Running a Topology
$ storm jar my-code.jar com.example.MyTopology arg1 arg2
Storm Cluster
If this were Hadoop...
Nimbus would be the Job Tracker
The Supervisors would be the Task Trackers
Nimbus coordinates everything
But it’s not Hadoop
Example: Streaming Word Count

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));
Streaming Word Count
SplitSentence.java

public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
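SplitSentence delegates to splitsentence.py over Storm's multilang protocol, and the deck doesn't show that file. Its core is just splitting on spaces; a testable sketch, with the multilang wrapper (class and module names as they appear in the storm-starter project, included here as a comment since it needs Storm's bundled `storm` module to run):

```python
def split(sentence):
    # Core of splitsentence.py: one sentence in, one tuple per word out.
    return [[word] for word in sentence.split(" ")]

# In the real file this is wrapped in a multilang bolt, roughly:
#
#   import storm
#   class SplitSentenceBolt(storm.BasicBolt):
#       def process(self, tup):
#           for values in split(tup.values[0]):
#               storm.emit(values)
#   SplitSentenceBolt().run()
```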
Streaming Word Count
WordCount.java

public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
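Stripped of the cluster machinery, the dataflow above is just split → running count. A plain-Python sketch of what the topology computes (no Storm involved; names are illustrative):

```python
from collections import defaultdict

def run_pipeline(sentences):
    """Simulate the word-count topology: sentences -> words -> running counts."""
    # "split" bolt: sentence tuples become word tuples
    words = [w for s in sentences for w in s.split(" ")]
    # "count" bolt: fields grouping guarantees each word always reaches the
    # same task, so a per-word running count like this one stays correct
    counts = defaultdict(int)
    emitted = []
    for w in words:
        counts[w] += 1
        emitted.append((w, counts[w]))
    return emitted

run_pipeline(["the cow jumped", "the moon"])
# each word is emitted with its count so far: ("the", 1) ... ("the", 2) ...
```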
Groupings control how tuples are routed
Shuffle grouping
Tuples are randomly distributed across all of the tasks running the bolt

Fields grouping
Groups tuples by specific named fields and routes them to the same task
Analogous to Hadoop’s partitioning behavior
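The fields grouping guarantee comes from hashing: Storm hashes the grouping fields and takes the result mod the number of tasks, so equal field values always land on the same task. A sketch of the idea (using `zlib.crc32` as a stand-in hash, not Storm's actual one):

```python
import zlib

def fields_grouping_task(field_value, num_tasks):
    """Route a tuple by hashing its grouping field, mod the task count.
    Same value in, same task out -- which is what makes per-key state
    like WordCount's counts map safe."""
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks
```

A shuffle grouping, by contrast, would pick the task at random, so no task could keep an authoritative count for any one word.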
Trending Topics

Twitter Trending Topics
(tweets) → TwitterStreamingTopicSpout, parallelism = 1 (unless you use Gnip)
(word) → RollingCountsBolt, parallelism = n
(word, count) → IntermediateRankingsBolt, parallelism = n
(rankings) → TotalRankingsBolt, parallelism = 1
(rankings) → RankingsReportBolt, parallelism = 1 → (JSON rankings)
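The interesting bolt here is the rolling count: trending is about recent activity, so old counts must age out. A toy sliding-window counter in the spirit of RollingCountsBolt (illustrative, not Storm's implementation):

```python
from collections import defaultdict, deque

class RollingCounter:
    """Toy sliding-window counter: per-slot counts, summed over the last
    num_slots slots, so stale activity drops out of the window."""
    def __init__(self, num_slots):
        self.slots = deque([defaultdict(int)], maxlen=num_slots)

    def count(self, word):
        self.slots[-1][word] += 1          # count into the current slot

    def advance_window(self):
        self.slots.append(defaultdict(int))  # new slot; oldest slot expires

    def totals(self):
        merged = defaultdict(int)
        for slot in self.slots:
            for word, n in slot.items():
                merged[word] += n
        return dict(merged)
```

In the real topology a tick mechanism advances the window on a timer; here `advance_window()` is called by hand.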
Live Coding!
Tips
Use a log aggregator
• loggly.com
• Graylog2
• logstash
"$topologyName-$buildNumber"
Rolling Deploys
1. Launch new topology
2. Wait for it to be healthy
3. Kill the old one
Rolling Deploys
These are under active development
Rolling Deploys
Tune your parallelism

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

see: https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
Supervisor
  Worker Process (JVM)
    Executor (thread)
      Task
      Task
    Executor (thread)
      Task
      Task
  Worker Process (JVM)
    Executor (thread)
      Task
      Task
    Executor (thread)
      Task
      Task

Parallelism hints control the number of Executors
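Concretely: the hints in the builder above (5 + 8 + 12) ask for 25 executors in total, and Storm's default scheduler spreads executors as evenly as it can across the configured worker processes. A toy sketch of that distribution rule (illustrative only, not Storm's scheduler):

```python
def executors_per_worker(parallelism_hints, num_workers):
    """Total executors = sum of parallelism hints, spread as evenly as
    possible across workers. Returns executor count per worker."""
    total = sum(parallelism_hints.values())
    base, extra = divmod(total, num_workers)
    return [base + (1 if i < extra else 0) for i in range(num_workers)]

# The word-count topology's hints across 5 workers:
executors_per_worker({"sentences": 5, "split": 8, "count": 12}, 5)
# -> [5, 5, 5, 5, 5]
```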
Anchor your tuples (or not)

unanchored:
    collector.emit(new Values(word, count));

anchored:
    collector.emit(tuple, new Values(word, count));
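The difference matters for reliability: an anchored emit ties the new tuple to the spout tuple it came from, so a downstream failure can be traced back and the original message replayed; an unanchored emit breaks that chain. A toy illustration (not Storm's actual acker algorithm):

```python
class Tracker:
    """Toy tuple-tree tracker: anchored children remember their root,
    so fail() knows which spout message to replay."""
    def __init__(self):
        self.failed_roots = set()

    def emit(self, values, anchor=None):
        root = anchor["root"] if anchor else None
        return {"values": values, "root": root}

    def fail(self, tup):
        if tup["root"] is not None:        # anchored: replay from the spout
            self.failed_roots.add(tup["root"])
        # unanchored: the failure is silently dropped

t = Tracker()
root_tuple = {"values": ["some sentence"], "root": "msg-1"}
anchored = t.emit(["word", 1], anchor=root_tuple)
unanchored = t.emit(["word", 1])
t.fail(anchored)     # "msg-1" is marked for replay
t.fail(unanchored)   # nothing to replay
```

Skipping the anchor is a deliberate trade: cheaper, but at-most-once delivery for that branch of the tuple tree.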
But Dan, you left out Trident!
if (storm == hadoop) { trident = pig / cascading }
A little taste of Trident

TridentState urlToTweeters = topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers = topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .shuffle()
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .parallelismHint(200)
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    .groupBy(new Fields("follower"))
    .aggregate(new One(), new Fields("one"))
    .parallelismHint(20)
    .aggregate(new Count(), new Fields("reach"));
https://github.com/nathanmarz/storm/wiki/Trident-tutorial
Thanks:
Nathan Marz - @nathanmarz / @stormprocessor
http://github.com/nathanmarz/storm

Michael Noll - @miguno
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

Michael Rose - @xorlev
http://github.com/xorlev
https://github.com/danklynn/storm-starter/tree/gluecon2013