Scaling Up WSO2 BAM for Billions of Requests and Terabytes of Data

Post on 21-Jun-2015



Buddhika Chamith, Software Engineer – WSO2 BAM

Business Activity Monitoring

“The aggregation, analysis, and presentation of real-time information about activities inside organizations and involving customers and partners.” - Gartner

Aggregation

● Capturing data
● Data storage
● What data to capture?

Analysis

● Data operations
● Building KPIs
● Operating on large amounts of historic or new data
● Building BI

Presentation

● Visualizing KPIs/BI
● Custom dashboards
● Visualization tools
● Not just dashboards!

Need for Scalability

BAM 2.x - Component Architecture

Data Agents

● Push data to BAM
● Collecting
  ● Service data
  ● Mediation data
  ● Logs, etc.
● Various interceptors used
  ● Axis2 Handlers
  ● Synapse Mediators
  ● Tomcat Valves
  ● Log4j Appenders

Performance Considerations

● Should be asynchronous
● Event batching
● SOAP?
● Apache Thrift (binary protocol)
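
The real data agents live in WSO2's Java SDK; the asynchronous-plus-batching idea itself is simple. A minimal Python sketch with hypothetical names (`BatchingPublisher`, `send` are invented for illustration): the caller just enqueues events, and a background thread ships them downstream in batches.

```python
import queue
import threading

class BatchingPublisher:
    """Sketch of an asynchronous, batching data agent (hypothetical API)."""

    def __init__(self, send, batch_size=100):
        self._send = send                # callable that ships one batch downstream
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def publish(self, event):
        # Non-blocking from the caller's perspective: just enqueue.
        self._queue.put(event)

    def _drain(self):
        batch = []
        while True:
            event = self._queue.get()
            if event is None:            # sentinel enqueued by close()
                break
            batch.append(event)
            if len(batch) >= self._batch_size:
                self._send(batch)
                batch = []
        if batch:                        # flush the tail on shutdown
            self._send(batch)

    def close(self):
        self._queue.put(None)
        self._worker.join()
```

With `batch_size=3`, publishing seven events and closing yields batches of 3, 3, and 1; the publishing thread never blocks on the network, which is the point of doing this instead of one SOAP call per event.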

Apache Thrift

● An RPC framework
● Pluggable architecture for mixing different transports with different protocols
● Multiple language bindings (Java, C++, Python, Perl, C#, etc.)
● We mainly use the Java binding

Not Just Performance...

● Load balancing
● Failover
● All available within a Java SDK library
● You can use it too
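
The SDK's actual behaviour is in Java; the pattern it implements is round-robin load balancing with failover across a group of receiver URLs. A Python sketch under those assumptions (`ReceiverGroup` and the `transport` callable are invented names, not the SDK API):

```python
import itertools

class ReceiverGroup:
    """Sketch of client-side load balancing with failover across receivers."""

    def __init__(self, endpoints):
        self._endpoints = list(endpoints)
        # Round-robin index generator over the endpoint list.
        self._rr = itertools.cycle(range(len(self._endpoints)))

    def send(self, event, transport):
        """Pick the next receiver round-robin; on failure, fail over to the
        following ones until every endpoint has been tried once."""
        start = next(self._rr)
        n = len(self._endpoints)
        last_error = None
        for offset in range(n):
            endpoint = self._endpoints[(start + offset) % n]
            try:
                return transport(endpoint, event)
            except ConnectionError as err:
                last_error = err         # receiver down: try the next one
        raise last_error
```

If the first receiver is down, the event still arrives via the second; only when all receivers fail does the client see an error.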

Data Receiver

● Captures and transfers data to subscribed sinks (not just the database)
● Can be clustered
● Load balancing is handled on the client side

Data Bridge

Data Storage

● Apache Cassandra
● NoSQL column-family implementation
● Scalable, highly available, no single point of failure (SPOF)
● Very high write throughput and good read throughput
● Tunable consistency with data replication
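
"Tunable consistency" is a quorum rule: with replication factor N, a read consistency level of R replicas and a write level of W replicas, a read is guaranteed to see the latest write exactly when R + W > N, because every read set then overlaps every write set. A one-line check:

```python
def overlap_guaranteed(n, r, w):
    """With replication factor n, read quorum r and write quorum w, every
    read set intersects every write set exactly when r + w > n."""
    return r + w > n

# Replication factor 3 with QUORUM (2) reads and writes is strongly
# consistent; ONE/ONE trades that guarantee for latency and availability.
```

This is why BAM can dial Cassandra between write throughput and read-your-writes consistency per workload.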

Deployment – Storage Cluster

Receiver Cluster

Results

With a single receiver node allocated a 2 GB heap on a quad-core machine running RHEL.

Disk Growth

Analyzer Engine

● Idea: distribute processing to multiple nodes to run in parallel
● Obvious choice: Hadoop
● Uses the MapReduce programming paradigm

Map Reduce

● Process multiple data chunks in parallel at the Mappers.

● Aggregate map outputs with matching keys at the Reducers and store the result.

● Let's think of a useful example..
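
The classic useful example is word count. A minimal single-process simulation of the two phases (plain Python standing in for Hadoop; the shuffle step is collapsed into a list concatenation):

```python
from collections import defaultdict

def map_phase(chunk):
    # Mapper: emit a (word, 1) pair for every word in its chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reducer: sum the counts for each distinct key.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(chunks):
    pairs = []
    for chunk in chunks:          # in Hadoop these mappers run in parallel
        pairs.extend(map_phase(chunk))
    return reduce_phase(pairs)
```

For the chunks `["to be or", "not to be"]` this yields `{"to": 2, "be": 2, "or": 1, "not": 1}`. In real Hadoop the chunks live on different Data Nodes and the framework routes equal keys to the same Reducer; the data flow is the same.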

Hadoop Components

● Job Tracker
● Name Node
● Secondary Name Node
● Task Trackers
● Data Nodes

It's Cool, But...

● Do we need to have a Hadoop cluster in order to try out BAM?

● Are we supposed to code Hadoop jobs to get BAM to summarize something?

● Answers

1) No

2) No. OK, maybe very rarely at best.


Apache Hive

● You write SQL (almost)
● Hive converts it to MapReduce jobs
● So Hive does two things
  ● Provides an abstraction over Hadoop MapReduce
  ● Submits the analytic jobs to Hadoop
● Hive may spawn a Hadoop JVM locally or delegate to a Hadoop cluster

A Typical Hive Script
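
The script shown on this slide is not reproduced in the transcript. A hypothetical HiveQL script of the kind BAM schedules, with invented table and column names, mapping a Cassandra column family into Hive and then building an hourly KPI:

```sql
-- Illustrative only: table/column names are made up.
CREATE EXTERNAL TABLE IF NOT EXISTS ServiceRequests
    (service_name STRING, request_count INT, event_time BIGINT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

CREATE TABLE IF NOT EXISTS RequestsPerHour
    (service_name STRING, total_requests BIGINT, hour_of_day INT);

INSERT OVERWRITE TABLE RequestsPerHour
SELECT service_name,
       sum(request_count),
       hour(from_unixtime(event_time))
FROM ServiceRequests
GROUP BY service_name, hour(from_unixtime(event_time));
```

The point of the abstraction: this reads like SQL, but Hive compiles the GROUP BY into a MapReduce job that runs across the cluster.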

Results

Task Framework

● Runs Hive scripts periodically
● Schedules can be specified as cron expressions or predefined templates
● Handles task failover in case of node failure
● Uses ZooKeeper for coordination

ZooKeeper

● Can be run separately or embedded within BAM

Analyzer Cluster

Dashboard

● Making the dashboard scale

Deployment Patterns

Single Node

High Availability

Fully Distributed Setup

Summary

● BAM
● Need for scalability
● Scaling BAM components
● Results
● BAM deployment patterns
