TRANSCRIPT
Scaling Up WSO2 BAM for Billions of Requests and Terabytes of Data
Buddhika Chamith, Software Engineer – WSO2 BAM
Business Activity Monitoring
“The aggregation, analysis, and presentation of real-time information about activities inside organizations and involving customers and partners.” - Gartner
Aggregation
● Capturing data
● Data storage
● What data to capture?
Analysis
● Data operations
● Building KPIs
● Operate on large amounts of historic data or new data
● Building BI
Presentation
● Visualizing KPIs/BI
● Custom dashboards
● Visualization tools
● Not just dashboards!
Need for Scalability
BAM 2.x - Component Architecture
Data Agents
● Push data to BAM
● Collecting
  ● Service data
  ● Mediation data
  ● Logs etc.
● Various interceptors used
  ● Axis2 Handlers
  ● Synapse Mediators
  ● Tomcat Valves
  ● Log4j Appenders
Performance Considerations
● Should be asynchronous
● Event batching
● SOAP?
● Apache Thrift (binary protocol)
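The asynchronous, batched publishing model can be sketched as below. This is a minimal illustration with hypothetical names, not the actual BAM agent code: a bounded queue decouples the application thread from a background sender thread, which drains events in batches so there is one network round trip (e.g. one Thrift call) per batch instead of per event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of an asynchronous, batching data agent (hypothetical names;
// the real agent ships with WSO2's data bridge SDK).
public class BatchingAgent {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final int batchSize;
    final List<List<String>> sentBatches = new ArrayList<>();

    public BatchingAgent(int batchSize) { this.batchSize = batchSize; }

    // Called on the application thread: never blocks the caller on I/O.
    public boolean publish(String event) {
        return queue.offer(event); // drop (or count) the event if the queue is full
    }

    // Called on a background sender thread: drains events in batches.
    public void flush() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sentBatches.add(batch); // stand-in for one network call per batch
        }
    }
}
```

The bounded queue is the key design choice: under load the agent sheds events rather than stalling the monitored application.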
Apache Thrift
● An RPC framework
● With a pluggable architecture for mixing different transports with different protocols
● Has multiple language bindings (Java, C++, Python, Perl, C#, etc.)
● We mainly use the Java binding
Not Just Performance...
● Load balancing
● Failover
● All available within a Java SDK library.
● You can use it too.
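Client-side load balancing with failover can be sketched roughly as follows (a hypothetical, simplified illustration, not the SDK's actual code): rotate round-robin through the configured receiver endpoints, skipping any endpoint currently marked as failed.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of client-side load balancing with failover:
// round-robin over receiver endpoints, skipping unhealthy ones.
public class ReceiverSelector {
    private final List<String> endpoints;
    private int next = 0;

    public ReceiverSelector(List<String> endpoints) { this.endpoints = endpoints; }

    // Returns the next live endpoint, failing over past failed ones.
    public String pick(Set<String> failed) {
        for (int i = 0; i < endpoints.size(); i++) {
            String candidate = endpoints.get(next % endpoints.size());
            next++;
            if (!failed.contains(candidate)) return candidate;
        }
        throw new IllegalStateException("no live receiver");
    }
}
```

With two receivers, successive events alternate between them; if one is marked failed, all traffic flows to the survivor until it recovers.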
Data Receiver
● Capture and transfer data to subscribed sinks – not just the database.
● Can be clustered.
● Load balancing is handled from the client side.
Data Bridge
Data Storage
● Apache Cassandra
● NoSQL column-family implementation
● Scalable, highly available, and no SPOF
● Very high write throughput and good read throughput
● Tunable consistency with data replication
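Tunable consistency means choosing how many replicas must acknowledge each read (R) and each write (W) out of the replication factor (N). Whenever R + W > N, the read and write replica sets overlap, so every read sees the latest write; the standard QUORUM level, floor(N/2) + 1, achieves this. A quick check of the arithmetic:

```java
// Cassandra-style consistency arithmetic: with replication factor N,
// QUORUM means floor(N/2) + 1 replicas. If R + W > N, at least one
// replica in every read quorum holds the most recent write.
public class ConsistencyMath {
    public static int quorum(int n) { return n / 2 + 1; }

    public static boolean stronglyConsistent(int r, int w, int n) {
        return r + w > n;
    }
}
```

For a write-heavy workload like BAM's, lowering W (and accepting weaker consistency) trades freshness for throughput.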
Deployment – Storage Cluster
Receiver Cluster
Results
With a single receiver node allocated a 2 GB heap on a quad-core machine running RHEL.
Disk Growth
Analyzer Engine
● Idea: distribute processing to multiple nodes to run in parallel
● Obvious choice: Hadoop
● Uses the MapReduce programming paradigm
Map Reduce
● Process multiple data chunks in parallel at Mappers.
● Aggregate map outputs having similar keys at Reducers and store the result.
● Let's think of a useful example...
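The classic example is word counting, sketched here in plain in-memory Java rather than the Hadoop API: the map step emits a (word, 1) pair for every word in its chunk, and the reduce step sums the values that share a key.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the MapReduce word-count example (not the Hadoop
// API): map emits (word, 1) pairs; reduce sums the values per key.
public class WordCount {
    public static Map<String, Integer> count(String... chunks) {
        Map<String, Integer> counts = new HashMap<>();
        for (String chunk : chunks) {                // each chunk would go to a Mapper
            for (String word : chunk.toLowerCase().split("\\s+")) {
                counts.merge(word, 1, Integer::sum); // the Reducer's aggregation
            }
        }
        return counts;
    }
}
```

In real Hadoop, the chunks live on different Data Nodes and the framework shuffles same-key pairs to the same Reducer, but the map/reduce logic is exactly this.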
Hadoop Components
● Job Tracker
● Name Node
● Secondary Name Node
● Task Trackers
● Data Nodes
It's Cool, But...
● Do we need to have a Hadoop cluster in order to try out BAM?
● Are we supposed to code Hadoop jobs to get BAM to summarize something?
● Answers
  1) No
  2) No. OK, maybe very rarely at best.
Courtesy: http://goo.gl/QEnpN
Apache Hive
● You write SQL (almost).
● Let Hive convert it to MapReduce jobs.
● So Hive does two things
  ● Provide an abstraction for Hadoop MapReduce
  ● Submit the analytic jobs to Hadoop
● Hive may spawn a Hadoop JVM locally or delegate to a Hadoop cluster
A Typical Hive Script
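The script shown on this slide is not reproduced in the transcript. As a representative illustration (table and column names here are made up), a BAM-style Hive script maps an incoming Cassandra column family onto a Hive table and then writes a summary:

```sql
-- Illustrative only: table and column names are hypothetical.
-- Map a Cassandra column family to a Hive table ...
CREATE EXTERNAL TABLE IF NOT EXISTS ServiceStats
    (service_name STRING, response_time BIGINT, request_count INT)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

-- ... then summarize per-service request counts into a summary table.
INSERT OVERWRITE TABLE ServiceSummary
SELECT service_name,
       SUM(request_count) AS total_requests,
       AVG(response_time) AS avg_response_time
FROM ServiceStats
GROUP BY service_name;
```

Hive compiles the GROUP BY into a MapReduce job: the map phase emits per-service pairs, and reducers compute the sums and averages.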
Results
Task Framework
● Run Hive scripts periodically
● Schedules can be specified as cron expressions or predefined templates
● Handles task failover in case of node failure
● Uses ZooKeeper for coordination
ZooKeeper
● Can be run separately or embedded within BAM
Analyzer Cluster
Dashboard
● Making the dashboard scale.
Deployment Patterns
Single Node
High Availability
Fully Distributed Setup
Summary
● BAM
● Need for scalability
● Scaling BAM components
● Results
● BAM deployment patterns