Scalable stream processing with Storm

Post on 25-Feb-2016

©2012 Networked Insights. Proprietary and confidential.

Scalable stream processing with Storm

Brian Johnson
Luke Forehand

Our Ambition: Marketing Decision Platform

[Diagram: marketing decision areas arranged under five headings: Brand, Advertising, Consumer, Pricing, Channel. Labels recovered from the slide:]

• Competitive Positioning: Position, Loyalty, Core Benefit & Differentiation
• Product Development: Features & Functions, Design, Cost, Structure, Packaging, Quality
• Advertising Content: Message, Naming & Taglines
• Relationship Marketing: CRM, Engagement, Owned Social Engagement
• Consumer Promotion: Direct, Coupon
• Price/Value Perception: Price Justification, Price Change Response
• Price Management: Competitive Pricing, Price Optimization
• E-Commerce: Online
• Sales Management: Demand Planning, Sales Analysis
• Global & Local; Market Management; Channel Management
• Public Relations: Buzz Generation, Damage Control
• Own Stores: Owned Stores
• Retailer Management: Distribution, Loyalty Program, Price & Costs, Feature Promotion, In-Store Promotion
• Digital Marketing & Advertising: Owned Media, Search, Email, Social Ad/Display Ad, Mobile, Tracking & Attribution
• Influencing: Influence & Advocacy, Endorsers & Spokespeople, Partnerships, Sponsorships
• Budgets: Marketing & Media Mix, Budget Planning
• Segmentation & Targeting: Brand & Category, Demos & Geos, Lifestyles, Behavioral & Attitudinal, Lifestages, Trends
• Traditional Advertising: TV, Radio, Out of Home, Print
• Brand & Category Environment: Substitutes, Complements, Category Trends, Unmet Needs, Product Lifecycle, Roles & Portfolio, Value Chain, Laws & Regulations, External Forces (e.g. economy)
• Category Management: Assortment, Price, Promotion & Co-marketing
• Brand Health: Equity, Image & Personality, Perceptions & Associations
• Choice & Experience: Choice, Engagement, Experience & Usage, Purchase Funnel

Big Data Analytics

• What is “Big Data” to Networked Insights?
  • Almost exclusively social media posts and metadata
  • Twitter (~67%), forums, blogs, Facebook, etc.
• Total index: ~60 billion documents, ~500 TB in production
  • New documents: 2 billion/month, increasing
  • Historical data going back to 2009

[Diagram labels: Data, Thematic Clustering (Doppler), Information]
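To put the volumes above in per-second terms, here is a rough back-of-the-envelope calculation. It assumes a 30-day month and decimal terabytes (1 TB = 10^12 bytes); neither assumption comes from the slides, and the class name is only illustrative.

```java
// Back-of-the-envelope rates derived from the slide's figures.
// Assumptions (not from the slides): 30-day month, 1 TB = 10^12 bytes.
class IngestRateSketch {
    // Average number of new documents arriving per second.
    static double docsPerSecond(double docsPerMonth, int daysPerMonth) {
        return docsPerMonth / (daysPerMonth * 24.0 * 3600.0);
    }

    // Average stored size of one document in bytes.
    static double avgBytesPerDoc(double totalBytes, double totalDocs) {
        return totalBytes / totalDocs;
    }

    public static void main(String[] args) {
        System.out.printf("%.0f docs/s%n", docsPerSecond(2e9, 30));
        System.out.printf("%.0f bytes/doc%n", avgBytesPerDoc(500e12, 60e9));
    }
}
```

Under these assumptions the pipeline has to sustain roughly 770 new documents per second on average, at about 8.3 KB per indexed document.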

Utilizing Social Media Data

We do two things: 1) filter data; 2) analyze data.
Our filtering technology must accommodate two scenarios:

“I don’t know what I am looking for”: Discovery Technologies
• Doppler
• Word/Phrase Clouds

“I know what I want to find”: Search Technologies
• Elasticsearch
• Named Entity Recognition
• Supervised Machine Learning
• Computational Linguistics

We analyze two types of information: explicit and implicit

Explicit Information                    | Implicit Information
Post content (words and phrases used)   | Language
Author                                  | Topical themes / categories (sometimes)
Day/Time                                | Tone and sentiment (sometimes)
URL                                     | Gender (sometimes)*
Followers/Following (sometimes)         | Location (sometimes)*
Likes (sometimes)                       | Relative age (sometimes)*

Implicit Information Mining Example

Gender classification: methods and features

1. Author name / author ID analysis: compare both fields against a list of first names from the US Census
2. Twitter summary field analysis
3. Post content features: analyze the content for clues or characteristics that are more common for one gender than the other
   1. Text formality: posts by males tend to be more formal than posts by females
   2. Suffix preferences: many suffixes show up more in female posts than in male posts
   3. Word classes: 23 different groups of words reflecting topics or emotions that skew towards one gender
   4. Lexical words & phrases: certain words/phrases that are giveaways, like “my husband”
   5. POS sequences: certain part-of-speech patterns for unigram, bigram, trigram, and quadgram phrases

Lots of data, lots of routing

[Diagram: document routing pipeline. Labels recovered from the slide:]

• Processing stages: Spam Classifiers, Gender Analysis, Sentiment, Age Classification, Topical Categorization
• Example topic filters:
  • iPhone? Samsung? BlackBerry? Etc.
  • Taco Bell? McDonald’s? Subway?
  • World War Z? Monsters U? White House Down?
  • Timberlake? Bieber? Jay Z?
• Data and layers: Original Documents, Meta Data, Reporting Layer, SocialSense Application Layer
• Legend: “= Storm”

Storm Agenda

• Overview
• Architecture
• Working Example
• Spout API / Reliability
• Bolt API / Scalability
• Topology Demo
• Monitoring

Overview

• Storm is a realtime distributed processing system
  • Think of Hadoop, but in realtime
  • Data can be transformed and grouped in complex ways using simple constructs
• Storm is reliable and fault tolerant
  • Message delivery is guaranteed (tuples that fail or time out are replayed)
• Storm is easy to configure and scale
  • Each component can be scaled independently
• Components can be written in any language
• Storm itself is written in Clojure (a functional language), with ZeroMQ as the messaging layer

Architecture

Components:

• Nimbus
  • “Master”
  • Uses ZooKeeper to communicate with Supervisors
  • Responsible for assigning work to Supervisors
• Supervisor
  • Manages a set of workers (JVMs) on each Storm node
  • Receives work assignments from Nimbus
• Worker
  • Managed by a Supervisor
  • Responsible for receiving, executing, and emitting data inside a Storm topology

Working Example

• Topology
  • Defines the logical components of a data flow
  • Composed of Spouts, Bolts, and Streams
• A Spout is a special component that emits data tuples into a topology
• A Bolt processes tuples emitted from upstream components and produces zero or more output tuples
• A Stream is a flow of tuples from one component to another; a topology can have many streams
• A Tuple is a single record containing a named list of values
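The vocabulary above can be illustrated with a tiny in-process model. This is a sketch in plain Java, not the Storm API: the class and method names (`TopologySketch`, `sentenceSpout`, `splitBolt`) are hypothetical, and everything runs in a single thread rather than across workers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of spout -> bolt -> bolt; none of this is Storm's API.
class TopologySketch {
    // A tuple: one record holding a named list of values.
    static Map<String, Object> tuple(String field, Object value) {
        Map<String, Object> t = new LinkedHashMap<>();
        t.put(field, value);
        return t;
    }

    // A "spout": the source that emits tuples into the topology.
    static List<Map<String, Object>> sentenceSpout(String... sentences) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String s : sentences) out.add(tuple("sentence", s));
        return out;
    }

    // A "bolt": consumes one tuple and emits zero or more output tuples.
    static List<Map<String, Object>> splitBolt(Map<String, Object> in) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String w : ((String) in.get("sentence")).trim().split("\\s+")) {
            if (!w.isEmpty()) out.add(tuple("word", w.toLowerCase()));
        }
        return out;
    }

    // Wires the components together; the final "bolt" keeps a count per word.
    static Map<String, Integer> run(String... sentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, Object> s : sentenceSpout(sentences)) {   // stream: spout -> split
            for (Map<String, Object> w : splitBolt(s)) {           // stream: split -> count
                counts.merge((String) w.get("word"), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run("storm is fast", "Storm scales"));
    }
}
```

In a real topology each of these components would run as multiple parallel instances, and the streams between them would cross thread, process, and machine boundaries.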

Spout API

ISpout
    void declareOutputFields(OutputFieldsDeclarer declarer)    (declared on IComponent; IRichSpout combines both)
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
    void nextTuple()
    void close()

ISpoutOutputCollector
    List<Integer> emit(String streamId, List<Object> tuple, Object messageId)

Reliability

• Each Storm component acknowledges that a tuple has been processed
  • An ACK is sent to the upstream component, eventually propagating back to the emitting spout
  • The emitting spout will replay the tuple if an ACK is not received within a configured timeout
• Spouts can control the number of “pending” tuples that are in memory in the topology
• Spouts should commit to the upstream data source only once a tuple is fully acknowledged

Reliability

ISpout
    void ack(Object msgId)
    void fail(Object msgId)
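The ack/fail contract can be sketched with a small stand-alone model of a spout's bookkeeping. This is plain Java, not Storm code: `ReplayingSpoutSketch` and its methods are hypothetical names, the in-memory queue stands in for an upstream data source, and the timeout check that Storm performs for you is folded into an explicit `expire` method so the example stays self-contained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a reliable spout's pending-tuple bookkeeping.
class ReplayingSpoutSketch {
    private final Deque<String> source = new ArrayDeque<>();    // stand-in for an upstream queue
    private final Map<Long, String> pending = new HashMap<>();  // msgId -> payload awaiting an ack
    private final Map<Long, Long> emittedAt = new HashMap<>();  // msgId -> emit time in ms
    private final int maxPending;
    private final long timeoutMs;
    private long nextId = 0;

    ReplayingSpoutSketch(List<String> records, int maxPending, long timeoutMs) {
        source.addAll(records);
        this.maxPending = maxPending;
        this.timeoutMs = timeoutMs;
    }

    // Emits the next record unless the pending cap is reached; returns its msgId, or -1.
    long nextTuple(long nowMs) {
        if (pending.size() >= maxPending || source.isEmpty()) return -1;
        long id = nextId++;
        pending.put(id, source.poll());
        emittedAt.put(id, nowMs);
        return id;
    }

    // Fully processed downstream: safe to forget, and to commit upstream.
    void ack(long msgId) {
        pending.remove(msgId);
        emittedAt.remove(msgId);
    }

    // Failed: put the payload back at the head of the queue so it is replayed.
    void fail(long msgId) {
        String payload = pending.remove(msgId);
        emittedAt.remove(msgId);
        if (payload != null) source.addFirst(payload);
    }

    // Treats every tuple that has gone un-acked longer than the timeout as failed.
    int expire(long nowMs) {
        List<Long> late = new ArrayList<>();
        for (Map.Entry<Long, Long> e : emittedAt.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) late.add(e.getKey());
        }
        for (long id : late) fail(id);
        return late.size();
    }

    int pendingCount() { return pending.size(); }
    int remaining() { return source.size(); }
}
```

With a cap of 2 pending tuples, a third call to `nextTuple` returns -1 until one of the first two is acked or failed, which is the back-pressure effect the pending limit provides.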

Reliability

• MAX_SPOUT_PENDING (topology.max.spout.pending) is the parameter that controls how many pending tuples a spout can emit into a topology
  • Be careful not to artificially decrease throughput!
  • Batching operations with reliability turned on can also create issues

Reliability

• If MAX_SPOUT_PENDING is smaller than the batch size, the topology will collapse: the spout stops emitting before the batch can fill, so the batch is never committed or acked
• If there is an interruption in the tuple flow, a batch may never fill

Reliability

• Solution: time-based batching with tick tuples (TickTuple)
  • A tick tuple exercises the component on a specified interval, prompting a batch commit even when the batch is not full
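The batching fix can be sketched as a buffer that commits either when it is full or when a tick arrives. This is a plain-Java model with hypothetical names (`TickBatcherSketch`, `onTuple`, `onTick`); in Storm the tick would arrive as a system tuple, typically requested through the `topology.tick.tuple.freq.secs` setting, and the bolt would ack the buffered tuples at commit time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of size-or-time batching inside a bolt.
class TickBatcherSketch {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> committed = new ArrayList<>();

    TickBatcherSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    // Called for each data tuple: commit as soon as the batch is full.
    void onTuple(String value) {
        buffer.add(value);
        if (buffer.size() >= batchSize) flush();
    }

    // Called on each tick: commit whatever has accumulated, even a partial batch.
    void onTick() {
        if (!buffer.isEmpty()) flush();
    }

    private void flush() {
        committed.add(new ArrayList<>(buffer));
        buffer.clear(); // a real bolt would ack the buffered tuples here
    }

    List<List<String>> committed() {
        return committed;
    }

    public static void main(String[] args) {
        TickBatcherSketch b = new TickBatcherSketch(3);
        b.onTuple("a");
        b.onTuple("b");
        b.onTick(); // partial batch is committed by the tick
        System.out.println(b.committed());
    }
}
```

Because a partial batch is committed on every tick, tuples are acked within roughly one tick interval even when the flow stalls, so the pending limit can no longer starve the batch.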

Reliability

• Questions?


Bolt API

• Stream groupings define how bolts receive streams as input. We’ll talk about the two basic types.
  • Shuffle grouping: tuples are randomly distributed across the instances of a bolt
  • Fields grouping: the stream is partitioned by the fields specified in the grouping, so tuples with the same value for those fields always flow to the same bolt instance
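The difference between the two groupings comes down to how the target bolt instance is chosen for each tuple. The sketch below is a simplified plain-Java illustration with hypothetical names (`GroupingSketch`, `shuffle`, `fields`); Storm's own implementation differs in detail, for example in how it balances shuffle targets.

```java
import java.util.Objects;
import java.util.Random;

// Simplified illustration of how a grouping picks a target task index.
class GroupingSketch {
    // Shuffle grouping: any task may receive the tuple.
    static int shuffle(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    // Fields grouping: hash the grouping value so equal values map to the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.println("shuffle -> task " + shuffle(rng, 4));
        System.out.println("fields(iphone) -> task " + fields("iphone", 4));
        System.out.println("fields(iphone) -> task " + fields("iphone", 4));
    }
}
```

Fields grouping is what makes per-key state (such as a running count per brand) safe to keep inside a single bolt instance, while shuffle grouping is the usual choice for stateless work.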

Bolt API

IBolt
    void declareOutputFields(OutputFieldsDeclarer declarer)    (declared on IComponent; IRichBolt combines both)
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector)
    void execute(Tuple input)
    void cleanup()

IOutputCollector
    List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple)
    void ack(Tuple input)
    void fail(Tuple input)

Bolt API

• You can also build the components of your topology in other languages

    public class MyPythonBolt extends ShellBolt {
        public MyPythonBolt() {
            super("python", "mybolt.py");
        }
        ...
    }

Scalability

• The goal should be to scale each component as needed to keep up with the realtime data flow
• Scaling is easy and can happen in several ways
  • Increase the number of executors (threads) that work within a component (bolt or spout)
  • Increase the number of workers assigned to a topology
  • Increase the total number of workers available in the cluster

Scalability

[Figure: example topology, increasing the number of executors per component]

Scalability

Example topology: increasing the number of workers in the topology

• 2 workers, MySpout with 2 executors, MyBolt with 4 executors
• 4 workers, MySpout with 2 executors, MyBolt with 4 executors
• Work is always spread evenly across the workers when possible
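To make "spread evenly" concrete, the sketch below assigns the six executors from the example to workers in round-robin order. It is an illustration only: `WorkerSpreadSketch` is a hypothetical name, round-robin is an assumed strategy rather than Storm's actual scheduler logic, and system executors such as ackers are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative round-robin placement of executors onto workers.
class WorkerSpreadSketch {
    // Returns, for each worker, the names of the executors placed on it.
    static List<List<String>> spread(int workers, int spoutExecutors, int boltExecutors) {
        List<String> executors = new ArrayList<>();
        for (int i = 0; i < spoutExecutors; i++) executors.add("MySpout-" + i);
        for (int i = 0; i < boltExecutors; i++) executors.add("MyBolt-" + i);

        List<List<String>> perWorker = new ArrayList<>();
        for (int w = 0; w < workers; w++) perWorker.add(new ArrayList<>());
        for (int i = 0; i < executors.size(); i++) {
            perWorker.get(i % workers).add(executors.get(i));
        }
        return perWorker;
    }

    public static void main(String[] args) {
        System.out.println(spread(2, 2, 4)); // 2 workers
        System.out.println(spread(4, 2, 4)); // 4 workers
    }
}
```

With 2 workers each one hosts 3 executors; with 4 workers the same 6 executors end up as 2, 2, 1, and 1, so adding workers beyond the executor count would leave some workers idle.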

Scalability

• Questions?


Topology Demo

• Demonstrate Topology


Monitoring

• Monitoring is important to verify that data throughput is keeping up with the realtime data flow
• Storm provides excellent monitoring via a UI
  • For each topology component, the UI shows:
    • Tuples transferred
    • Tuples acked and tuples failed (timeout)
    • Execute latency in ms (self time)
    • Process latency in ms (total time)
  • Nimbus also provides this interface via a Thrift service, so one can flexibly collect and aggregate stats (Graphite?)

Monitoring

• Another key indicator of problems is the capacity of a component: if it is at 1.0 or greater, the component is a bottleneck
• If you trend the standard deviation of the throughput of your components (using either average execute or process latency), you can quickly respond to changes in typical data flow
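Both indicators can be computed from the stats listed earlier. The sketch below assumes capacity is the fraction of a measurement window that an executor spends executing tuples (executed count times average execute latency, divided by the window length), which is how the Storm UI metric is commonly described; the class name, the threshold `k`, and the sample values are hypothetical.

```java
// Sketch of the two monitoring indicators: capacity and a standard-deviation check.
class CapacitySketch {
    // Fraction of the window spent executing; at or above 1.0 the component cannot keep up.
    static double capacity(long executed, double avgExecuteLatencyMs, long windowMs) {
        return executed * avgExecuteLatencyMs / windowMs;
    }

    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    // Population standard deviation of a series of latency samples.
    static double stdDev(double[] samples) {
        double m = mean(samples);
        double sq = 0;
        for (double s : samples) sq += (s - m) * (s - m);
        return Math.sqrt(sq / samples.length);
    }

    // Flags a new sample that lies more than k standard deviations from the historical mean.
    static boolean isAnomalous(double[] history, double sample, double k) {
        return Math.abs(sample - mean(history)) > k * stdDev(history);
    }

    public static void main(String[] args) {
        // 120,000 tuples at 4 ms each over a 10-minute (600,000 ms) window.
        System.out.println(capacity(120_000, 4.0, 600_000));
        double[] latenciesMs = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(isAnomalous(latenciesMs, 12, 3));
    }
}
```

In this example the component is busy for about 80% of the window; at 6 ms per tuple the same load would push capacity to 1.2, signalling that the component needs more executors.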

Monitoring

• Questions?


THANK YOU
