real time analytics for big data a twitter case study

28
SINGLE PLATFORM. COMPLETE SCALABILITY. Real Time Analytics Real Time Analytics for Big Data for Big Data Lessons from Twitter.. Lessons from Twitter..

Upload: nati-shalom

Post on 15-May-2015

2.992 views

Category:

Technology


1 download

DESCRIPTION

Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising. In the same way that Hadoop was born out of large-scale web applications, a new class of scalable frameworks and platforms for handling real time streaming processing or real time analysis is born to handle the needs of large-scale location-aware mobile, social and sensor use. Facebook, Twitter and Google have been pioneers in that arena and recently launched new analytics services designed to meet the real time needs. In this session we will review the common patterns and architectures that drive these platforms and learn how to build a Twitter-like analytics system in a simple way using frameworks such as Spring Social, Active In-Memory Data Grid for Big Data event processing, and NoSQL database such as Cassandra or HBase for handling the managing the historical data. Participants in this session will also receive a hands-on tutorial for trying out these patterns on their own environment. A detailed post covering the topic including a reference to a code example illustrating the reference architecture is available below: http://horovits.wordpress.com/2012/01/27/analytics-for-big-data-venturing-with-the-twitter-use-case/

TRANSCRIPT

Page 1: Real Time Analytics for Big Data a Twitter Case Study

SINGLE PLATFORM. COMPLETE SCALABILITY.

Real Time Analytics for Big DataReal Time Analytics for Big DataLessons from Twitter..Lessons from Twitter..

Page 2: Real Time Analytics for Big Data a Twitter Case Study

The Real Time Boom..

2® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Google Real Time Web Analytics

Google Real Time Search

Facebook Real Time Social Analytics

Twitter paid tweet analytics

SaaS Real Time User Tracking

New Real Time Analytics Startups..

Page 3: Real Time Analytics for Big Data a Twitter Case Study

Analytics @ Twitter

Page 4: Real Time Analytics for Big Data a Twitter Case Study

Note the Time dimension

Page 5: Real Time Analytics for Big Data a Twitter Case Study

The data resolution & processing models

Page 6: Real Time Analytics for Big Data a Twitter Case Study

Twitter Real-timeAnalytics System

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

6

Page 7: Real Time Analytics for Big Data a Twitter Case Study

Twitter by the numbers

7® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 8: Real Time Analytics for Big Data a Twitter Case Study

Real-time Analytics for Twitter Reach

8® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Reach is the number of unique people exposed to a URL on Twitter

Page 9: Real Time Analytics for Big Data a Twitter Case Study

Computing Reach

9® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Count

Tweets Followers Distinct Followers

Reach

Page 10: Real Time Analytics for Big Data a Twitter Case Study

Challenge – Word Count

10® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Count

Tweets

Word:Count

Page 11: Real Time Analytics for Big Data a Twitter Case Study

Real-time Analytics System With GigaSpaces

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

11

Page 12: Real Time Analytics for Big Data a Twitter Case Study

Performance/Latency

•Instead of treating memory as a cache, why not treat it as a primary data store?

– Facebook keeps 80% of its data in Memory (Stanford research)

– RAM is 100-1000x faster than Disk (Random seek)

• Disk - 5 -10ms

• RAM – x0.001msec

12® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Data Feeds

Page 13: Real Time Analytics for Big Data a Twitter Case Study

ONE Data any API’s:

13® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Any API

• Document – Storing tweets

• Key/Value– Atomic-counters

• JPA– Complex query

• Executors– Real Time

Map/Reduce– Aggregated join

query

Data Feeds

The right API for the JOB

Page 14: Real Time Analytics for Big Data a Twitter Case Study

Availability

•Backup node per partition•Synchronous replication to ensure consistency and no data loss •On-demand backup to minimizing over provisioning overhead (cost, performance, capacity)

14® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Data Feeds

Page 15: Real Time Analytics for Big Data a Twitter Case Study

Write Scalability/Performance

•Partition incoming tweets based on tweet id

•Partition word/count index based on word hash key – all updates to the same index are routed to the same partition.

15® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

word1

word2

word3

Page 16: Real Time Analytics for Big Data a Twitter Case Study

Global index update - optimization

•Use batch write for updating the word/count index•Use atomic update to ensure consistency with minimum performance overhead on concurrent updates

16® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

word1

word2

word3

Local batch update

1 sec

Page 17: Real Time Analytics for Big Data a Twitter Case Study

Processing the data

• Use event handlers to process the data as its coming.

• Use shared state to control the flow (order)

• Use FIFO to ensure order

• Use local TX to recover from failure

1) Parse

2) Update Global index

Page 18: Real Time Analytics for Big Data a Twitter Case Study

Collocate

• Group event handler and the data into processing-units– Minimize latency– Simple scaling (less

moving parts)– Better reliability

1) Parse

2) Update Global index

Page 19: Real Time Analytics for Big Data a Twitter Case Study

BigData Database for long term data

• Write-behind – Batch update to the DB to

Minimize disk performance and latency overhead

– Logs are backed up to avoid data loss between batches

• Plug-in to any DB– Use plug-in approach to

enable flexibility for choosing the right DB for the JOB

Page 20: Real Time Analytics for Big Data a Twitter Case Study

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved20

Automation & Cloud enablement

My Recipe

2 Tomcats

10 Data PU’s

10 Cassandra

Page 21: Real Time Analytics for Big Data a Twitter Case Study

application { name="simple app"

service { name = "mysql-service”} service { name = "jboss-service" dependsOn = [“mysql-service”}}

Application description through RECIPES

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

21

Recipe DSLLifecycle scriptsCustom plug-ins (optional)Service binaries (optional)

service {name "jboss-service"icon "jboss.jpg"type "APP_SERVER“numInstances 2[recipe body]

}

lifecycle{ init "mysql_install.groovy” start "mysql_start.groovy” stop "mysql_stop.groovy"}

..

Page 22: Real Time Analytics for Big Data a Twitter Case Study

® Copyri

ght 2011

Gigaspaces

Ltd. All Rights

Reserved

22

Cloudify Application Management

Page 23: Real Time Analytics for Big Data a Twitter Case Study

Putting it all together

Analytic Application

EventSources

Write behind

- In Memory Data Grid

- RT Processing Grid• Light Event Processing• Map-reduce• Event driven• Execute code with data• Transactional• Secured• Elastic

NoSQL DB• Low cost storage• Write/Read

scalability• Dynamic scaling• Raw Data and

aggregated Data

Generate Patterns

23

RScript script = new StaticScritpt(“groovy ”,”println hi; return 0”)

Query q = em.createNativeQuery(“execute ?”);q.setParamter(1, script);

Integer result = query.getSingleResult();

Real Time Map/Reduce

Page 24: Real Time Analytics for Big Data a Twitter Case Study

Economic Data Scaling

• Combine memory and disk– Memory is x10, x100 lower

than disk for high data access rate (Stanford research)

– Disk is lower at cost for high capacity lower access rate.

– Solution: • Memory - short-term data, • Disk - long term. data

– Only ~16G required to store the log in memory ( 500b messages at 10k/h ) at a cost of ~32$ month per server.

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

24

Memory Disk

Page 25: Real Time Analytics for Big Data a Twitter Case Study

Economic Scaling

• Automation - reduce operational cost• Elastic Scaling – reduce over provisioning cost• Cloud portability (JClouds) – choose the right cloud for the job• Cloud bursting – scavenge extra capacity when needed

25® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 26: Real Time Analytics for Big Data a Twitter Case Study

Big Data Predictions

26® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 27: Real Time Analytics for Big Data a Twitter Case Study

Summary

Big Data Development Made Simple: Focus on your business logic, Use Big Data platform for dealing scalability, performance, continues availability,..

Its Open: Use Any Stack : Avoid LockinAny database (RDBMS or NoSQL); Any Cloud, Use common API’s & Frameworks.

All While Minimizing CostUse Memory & Disk for optimum cost/performance . Built-in Automation and management - Reduces operational costsElasticity – reduce over provisioning cost

Page 28: Real Time Analytics for Big Data a Twitter Case Study

Thank YOU!

@natishalomhttp://blog.gigaspaces.com

28