C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by DeWayne Filppi
Post on 01-Nov-2014
Real Time Big Data With Storm, Cassandra, and In-Memory Computing
DeWayne Filppi @dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
2 ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
The Flavors of Big Data Analytics
Counting, Correlating, Research
Analytics @ Twitter – Counting
§ How many signups, tweets, retweets for a topic?
§ What’s the average latency?
§ Demographics
  § Countries and cities
  § Gender
  § Age groups
  § Device types
  § …
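The counting questions above come down to incrementing one counter per dimension as each event arrives. A minimal sketch, assuming a hypothetical event shape (a dict with `type`, `country` and `device` keys) that the slides do not specify:

```python
from collections import Counter, defaultdict

# Hypothetical event records; the field names are assumptions for illustration.
events = [
    {"type": "tweet", "country": "US", "device": "ios"},
    {"type": "retweet", "country": "US", "device": "android"},
    {"type": "tweet", "country": "IL", "device": "ios"},
    {"type": "signup", "country": "IL", "device": "web"},
]


def count_events(stream, dimensions=("type", "country", "device")):
    """Update one counter per dimension for every event in the stream."""
    counters = defaultdict(Counter)
    for event in stream:
        for dim in dimensions:
            counters[dim][event[dim]] += 1
    return counters


counts = count_events(events)
print(counts["type"]["tweet"])   # tweets seen so far
print(counts["country"]["US"])   # events from one country
```

Because every event touches only a handful of counters, this style of analytics fits the high resolution, event driven tier where every tweet gets counted.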
Analytics @ Twitter – Correlating
§ What devices fail at the same time?
§ What features get users hooked?
§ What places on the globe are “happening”?
Analytics @ Twitter – Research
§ Sentiment analysis
  § “Obama is popular”
§ Trends
  § “People like to tweet after watching American Idol”
§ Spam patterns
  § How can you tell when a user spams?
It’s All about Timing
“Real time” (< few seconds)
• Event driven / stream processing
• High resolution – every tweet gets counted
Reasonably quick (seconds - minutes)
• Ad-hoc querying
• Medium resolution (aggregations)
Batch (hours/days)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss :)
VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
In Memory Data Grid Review
§ RAM is the new disk
§ Data partitioned across a cluster
§ Large “virtual” memory space
§ Transactional
§ Highly available
§ Code collocated with data.
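Two of the points above, partitioning and collocating code with data, can be illustrated with a toy grid. This is only a sketch under stated assumptions: the `DataGrid` and `Partition` classes and the `execute` method are hypothetical names, not the XAP API, and the partitions live in one process rather than on separate nodes.

```python
from zlib import crc32


class Partition:
    """One node's slice of the data, held entirely in memory."""

    def __init__(self):
        self.data = {}


class DataGrid:
    """Toy grid that routes each key to a partition by hashing it."""

    def __init__(self, num_partitions=4):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # crc32 is stable across runs, unlike Python's salted hash() for str.
        index = crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        self._partition_for(key).data[key] = value

    def get(self, key):
        return self._partition_for(key).data.get(key)

    def execute(self, key, func):
        """Run func on the partition that owns key, so the code goes to the data."""
        return func(self._partition_for(key).data, key)


def increment(data, key):
    data[key] = data.get(key, 0) + 1
    return data[key]


grid = DataGrid()
grid.put("user:42", 10)
new_value = grid.execute("user:42", increment)
```

Routing the function to the owning partition avoids moving the value across the network for a read-modify-write, which is the practical benefit of collocation.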
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster async to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
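The flow listed above resembles a write-behind pattern: a write is filtered, enriched, made readable in memory at once, and persisted later. A minimal single-process sketch, assuming a plain dict as a stand-in for Cassandra and an explicit `flush()` call in place of a background writer; every class and parameter name here is hypothetical.

```python
from collections import deque


class WriteBehindGrid:
    """Toy in-memory store that queues writes for a slower backing store."""

    def __init__(self, backing_store, accept=None, enrich=None):
        self.memory = {}
        self.pending = deque()              # writes not yet persisted
        self.backing_store = backing_store  # stand-in for Cassandra (a dict)
        self.accept = accept or (lambda key, value: True)
        self.enrich = enrich or (lambda key, value: value)
        self.listeners = []

    def write(self, key, value):
        if not self.accept(key, value):     # optional filtering
            return False
        value = self.enrich(key, value)     # optional enrichment
        self.memory[key] = value            # readable immediately
        self.pending.append((key, value))   # persisted later, off the write path
        for listener in self.listeners:     # notify event listeners
            listener(key, value)
        return True

    def flush(self):
        """Drain queued writes; a real grid would do this asynchronously."""
        flushed = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.backing_store[key] = value
            flushed += 1
        return flushed


cassandra_stand_in = {}
seen = []
grid = WriteBehindGrid(
    cassandra_stand_in,
    accept=lambda k, v: v.get("lang") == "en",
    enrich=lambda k, v: {**v, "length": len(v["text"])},
)
grid.listeners.append(lambda k, v: seen.append(k))
grid.write("t1", {"text": "hello storm", "lang": "en"})
grid.write("t2", {"text": "bonjour", "lang": "fr"})  # filtered out
available_before_flush = "t1" in grid.memory and "t1" not in cassandra_stand_in
grid.flush()
```

The sketch leaves out durability of the pending queue; in a real deployment that is where the grid's replication and high availability would matter.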
Simplified Event Flow
Grid – Cassandra Interface
§ Hector and CQL based interface
§ In memory data must be mapped to column families.
  § Configurable class to column family mapping
§ Must serialize individual fields
  § Fixed fields can use defined types
  § Variable fields (for schemaless in-memory mode) need serializers
§ Object model flattening
  § By default, nested fields are flattened.
  § Can be overridden by custom serializer.
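Object model flattening can be pictured as turning a nested object into a single level of column names. The sketch below is illustrative only: the dotted naming scheme and the `flatten` helper are assumptions, and the slides do not say how XAP actually names flattened columns.

```python
def flatten(obj, prefix="", separator="."):
    """Flatten nested dicts into one level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{separator}{name}" if prefix else name
        if isinstance(value, dict):
            # Recurse so that nested fields become prefixed columns.
            columns.update(flatten(value, column, separator))
        else:
            columns[column] = value
    return columns


user = {
    "id": 42,
    "name": "ann",
    "address": {"city": "Austin", "geo": {"lat": 30.27, "lon": -97.74}},
}
row = flatten(user)
```

A custom serializer, as mentioned above, would replace this default by storing a nested field as a single serialized column instead.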
Virtues and Limitations
§ Could be faster: high availability has a cost
§ Complex flows not easy to assemble or understand with simple event handlers
§ Complete stack, not just two tools of many
§ Fast.
  § Microsecond latencies for in memory operations
  § Fast enough for almost anybody
§ Highly available/self healing
§ Elastic
Storm Background
§ Popular open source, real time, in-memory, streaming computation platform.
§ Includes distributed runtime and intuitive API for defining distributed processing flows.
§ Scalable and fault tolerant.
§ Developed at BackType, and open sourced by Twitter
Storm Abstractions
§ Streams
  § Unbounded sequence of tuples
§ Spouts
  § Source of streams (Queues)
§ Bolts
  § Functions, Filters, Joins, Aggregations
§ Topologies
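The four abstractions can be mimicked with plain Python generators. This is a conceptual sketch, not Storm code: the function names are hypothetical, the input is a finite list rather than an unbounded stream, and everything runs in one process.

```python
def sentence_spout(sentences):
    """Spout: a source of tuples. Real streams are unbounded; this list is not."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt acting as a function: one sentence tuple in, many word tuples out."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def filter_bolt(stream, stop_words):
    """Bolt acting as a filter: drops tuples whose word is a stop word."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)


def topology(sentences):
    """Topology: spouts and bolts wired into a processing graph."""
    return filter_bolt(split_bolt(sentence_spout(sentences)), stop_words={"the", "a"})


words = [w for (w,) in topology(["The storm is coming", "A storm passed"])]
```

In Storm itself each stage would run as parallel tasks across a cluster; the generator chain only shows how tuples flow from a spout through successive bolts.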
Streaming word count with Storm
§ Storm has a simple builder interface for creating stream processing topologies
§ Storm delegates persistence to external providers
  § Cassandra, because of its write performance, is commonly used
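To give a feel for a builder-style word count with persistence delegated to an external store, here is a small sketch. The `TopologyBuilder` class below is a hypothetical stand-in written for this example and only mimics the shape of a builder interface; it is not Storm's API, and a `Counter` stands in for a Cassandra counter table.

```python
from collections import Counter


class TopologyBuilder:
    """Hypothetical mini-builder that chains one spout and several bolts."""

    def __init__(self):
        self.spout = None
        self.bolts = []

    def set_spout(self, spout):
        self.spout = spout
        return self

    def add_bolt(self, bolt):
        self.bolts.append(bolt)
        return self

    def run(self):
        stream = self.spout()
        for bolt in self.bolts:
            stream = bolt(stream)
        for _ in stream:  # drain the pipeline so every bolt sees every tuple
            pass


def tweet_spout():
    yield from ["storm and cassandra", "storm and memory"]


def split_words(stream):
    for tweet in stream:
        yield from tweet.split()


def make_count_bolt(store):
    """Counting bolt that keeps running totals in an external store."""
    def count(stream):
        for word in stream:
            store[word] += 1
            yield word, store[word]
    return count


word_counts = Counter()  # stand-in for a Cassandra counter table
(TopologyBuilder()
    .set_spout(tweet_spout)
    .add_bolt(split_words)
    .add_bolt(make_count_bolt(word_counts))
    .run())
```

The counting bolt holds no state of its own; all totals live in the store passed to it, which mirrors the point that persistence is delegated to an external provider.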
Storm: Optimistic Processing
§ Storm (quite rationally) assumes success is normal
§ Storm uses batching and pipelining for performance
§ Therefore the spout must be able to replay tuples on demand in case of error.
§ Any kind of quasi-queue like data source can be fashioned into a spout.
§ No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
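The replay requirement can be sketched as a spout that remembers each emitted tuple until it is acknowledged. The method names `next_tuple`, `ack` and `fail` echo the callbacks a Storm spout receives, but the class below is a simplified, hypothetical illustration that runs in one thread and ignores batching.

```python
from collections import deque


class ReplayableSpout:
    """Toy spout that keeps emitted tuples until they are acknowledged."""

    def __init__(self, items):
        self.queue = deque(enumerate(items))  # (message id, tuple) pairs
        self.in_flight = {}

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, item = self.queue.popleft()
        self.in_flight[msg_id] = item         # remember until acked
        return msg_id, item

    def ack(self, msg_id):
        self.in_flight.pop(msg_id, None)      # processed, safe to forget

    def fail(self, msg_id):
        # Put the tuple back at the head of the queue so it is replayed next.
        self.queue.appendleft((msg_id, self.in_flight.pop(msg_id)))


def run(spout, process):
    """Assume success; on an exception, ask the spout to replay the tuple."""
    while (emitted := spout.next_tuple()) is not None:
        msg_id, item = emitted
        try:
            process(item)
            spout.ack(msg_id)
        except Exception:
            spout.fail(msg_id)


attempts = []


def flaky(item):
    """Fails the first time it sees "b", then succeeds."""
    attempts.append(item)
    if item == "b" and attempts.count("b") == 1:
        raise RuntimeError("transient failure")


spout = ReplayableSpout(["a", "b", "c"])
run(spout, flaky)
```

Nothing is written to disk on the happy path; the cost of a failure is paid only when it happens, which is the optimistic trade-off described above.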
Fast. Want to go faster?
§ Eliminate non-memory components
§ Replace the disk based queue with a reliable in-memory queue
§ Replace disk based state persistence with in-memory persistence
§ Asynchronously update disk based state (C*)
Sample Architecture
References
§ Try the Cloudify recipe
  § Download Cloudify: http://www.cloudifysource.org/
  § Download the Recipe (apps/xapstream, services/xapstream):
    – https://github.com/CloudifySource/cloudify-recipes
§ XAP – Cassandra Interface Details:
  § http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
§ Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github:
  § https://github.com/Gigaspaces/storm-integration
§ For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  § http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  § Part 3 coming soon.
Twitter Storm With Cassandra
Challenge – Word Count
Word:Count
Tweets
Count ?
• Hottest topics
• URL mentions
• etc.
Supercharging Storm
§ Storm doesn’t supply persistence, but provides for it
§ Storm optimizes IO to slow persistence (e.g. databases) using batching.
§ Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
Tweets, events, whatever…
XAP Real Time Analytics
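Batching IO to a slow store means buffering updates and sending them in groups, so the number of round trips drops. A minimal sketch, assuming a hypothetical `BatchingWriter` class and a list that records each simulated round trip; it is not how Storm implements batching internally.

```python
class BatchingWriter:
    """Buffers updates and writes them to a slow store in groups."""

    def __init__(self, write_batch, batch_size=3):
        self.write_batch = write_batch  # callable taking a list of (key, value)
        self.batch_size = batch_size
        self.buffer = []

    def write(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []  # start a fresh buffer; the sent batch is untouched


round_trips = []  # each entry stands for one call to the slow store
writer = BatchingWriter(round_trips.append, batch_size=3)
for i in range(7):
    writer.write(f"word{i}", i)
writer.flush()    # push the final partial batch
```

Seven updates reach the store in three calls instead of seven. The trade-off is that buffered updates are lost if the process dies before a flush, which is why the stream provider has to support replay.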
Two Layer Approach
§ Advantage: Minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
§ Grid layer serves as an in memory cache for interactive requests.
§ Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
Diagram labels: Raw Event Stream (×3), In Memory Compute Cluster, Real Time Events, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE
Simplified Architecture
§ Flowing event streams through memory for side effects
§ Event driven architecture executing in-memory
§ Raw events flushed, aggregations/derivations retained
§ All layers horizontally scalable
§ All layers highly available
§ Real-time analytics & cached batch analytics on same scalable layer
§ Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
Key Concepts
Keep Things In Memory
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
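Taking the two latency figures quoted above at face value, the ratio between them can be worked out directly. The variable names below are illustrative; the numbers are the slide's own, converted to microseconds so the arithmetic stays in integers.

```python
# Latencies quoted on the slide, expressed in microseconds.
disk_seek_us = (5_000, 10_000)  # 5-10 ms
ram_access_us = 1               # ~0.001 ms

# Ratio of disk seek time to RAM access time for the low and high disk figures.
speedup = tuple(d // ram_access_us for d in disk_seek_us)
print(speedup)
```

On these figures a random disk seek is roughly 5,000 to 10,000 times slower than a RAM access, a wider gap than the 100-1000x range in the headline; the slide does not say which measurements that range is based on.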
Take Aways
§ A data grid can serve different needs for big data analytics:
  § Supercharge a dedicated stream processing cluster like Storm.
    – Provide fast, reliable, transactional tuple streams and state
  § Provide a general purpose analytics platform
    – Roll your own
  § Simplify overall architecture while enhancing scalability
    – Ultra high performance/low latency
    – Dynamically scalable processing and in-memory storage
    – Eliminate messaging tier
    – Eliminate or minimize need for RDBMS
References
§ Realtime Analytics with Storm and Hadoop
  § http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
§ Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
§ Twitter Storm: http://storm-project.net
§ XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/