TRANSCRIPT
Scaling Up WSO2 BAM for Billions of Requests and Terabytes of Data
Buddhika Chamith, Software Engineer – WSO2 BAM
Business Activity Monitoring
“The aggregation, analysis, and presentation of real-time information about activities inside organizations and involving customers and partners.” - Gartner
Aggregation
● Capturing data
● Data storage
● What data to capture?
Analysis
● Data operations
● Building KPIs
● Operate on large amounts of historic data or new data
● Building BI
Presentation
● Visualizing KPIs/BI
● Custom dashboards
● Visualization tools
● Not just dashboards!
Need for Scalability
BAM 2.x - Component Architecture
Data Agents
● Push data to BAM
● Collecting
  ● Service data
  ● Mediation data
  ● Logs etc.
● Various interceptors used
  ● Axis2 Handlers
  ● Synapse Mediators
  ● Tomcat Valves
  ● Log4j Appenders
Performance Considerations
● Should be asynchronous
● Event batching
● SOAP?
● Apache Thrift (binary protocol)
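The asynchronous, batched publishing model can be sketched as below. This is a minimal illustration with hypothetical names, not the actual BAM agent code: a bounded queue decouples the application thread from a background sender thread, which drains events in batches so there is one network round trip (e.g. one Thrift call) per batch instead of per event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of an asynchronous, batching data agent (hypothetical names;
// the real agent ships with WSO2's data bridge SDK).
public class BatchingAgent {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final int batchSize;
    final List<List<String>> sentBatches = new ArrayList<>();

    public BatchingAgent(int batchSize) { this.batchSize = batchSize; }

    // Called on the application thread: never blocks the caller on I/O.
    public boolean publish(String event) {
        return queue.offer(event); // drop (or count) the event if the queue is full
    }

    // Called on a background sender thread: drains events in batches.
    public void flush() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sentBatches.add(batch); // stand-in for one network call per batch
        }
    }
}
```

The bounded queue is the key design choice: under load the agent sheds events rather than stalling the monitored application.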
Apache Thrift
● An RPC framework
● With a pluggable architecture for mixing different transports with different protocols
● Has multiple language bindings (Java, C++, Python, Perl, C#, etc.)
● We mainly use the Java binding
Not Just Performance...
● Load balancing
● Failover
● All available within a Java SDK library.
● You can use it too.
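Client-side load balancing with failover can be sketched roughly as follows (a hypothetical, simplified illustration, not the SDK's actual code): rotate round-robin through the configured receiver endpoints, skipping any endpoint currently marked as failed.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of client-side load balancing with failover:
// round-robin over receiver endpoints, skipping unhealthy ones.
public class ReceiverSelector {
    private final List<String> endpoints;
    private int next = 0;

    public ReceiverSelector(List<String> endpoints) { this.endpoints = endpoints; }

    // Returns the next live endpoint, failing over past failed ones.
    public String pick(Set<String> failed) {
        for (int i = 0; i < endpoints.size(); i++) {
            String candidate = endpoints.get(next % endpoints.size());
            next++;
            if (!failed.contains(candidate)) return candidate;
        }
        throw new IllegalStateException("no live receiver");
    }
}
```

With two receivers, successive events alternate between them; if one is marked failed, all traffic flows to the survivor until it recovers.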
Data Receiver
● Capture and transfer data to subscribed sinks – not just the database.
● Can be clustered.
● Load balancing is handled from the client side.
Data Bridge
Data Storage
● Apache Cassandra
● NoSQL column-family implementation
● Scalable, highly available, and no SPOF
● Very high write throughput and good read throughput
● Tunable consistency with data replication
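Tunable consistency means choosing how many replicas must acknowledge each read (R) and each write (W) out of the replication factor (N). Whenever R + W > N, the read and write replica sets overlap, so every read sees the latest write; the standard QUORUM level, floor(N/2) + 1, achieves this. A quick check of the arithmetic:

```java
// Cassandra-style consistency arithmetic: with replication factor N,
// QUORUM means floor(N/2) + 1 replicas. If R + W > N, at least one
// replica in every read quorum holds the most recent write.
public class ConsistencyMath {
    public static int quorum(int n) { return n / 2 + 1; }

    public static boolean stronglyConsistent(int r, int w, int n) {
        return r + w > n;
    }
}
```

For a write-heavy workload like BAM's, lowering W (and accepting weaker consistency) trades freshness for throughput.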
Deployment – Storage Cluster
Receiver Cluster
Results
With a single receiver node allocated a 2 GB heap on a quad-core machine running RHEL.
Disk Growth
Analyzer Engine
● Idea: distribute processing to multiple nodes to run in parallel
● Obvious choice: Hadoop
● Uses the MapReduce programming paradigm
Map Reduce
● Process multiple data chunks in parallel at Mappers.
● Aggregate map outputs having similar keys at Reducers and store the result.
● Let's think of a useful example...
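The classic example is word counting, sketched here in plain in-memory Java rather than the Hadoop API: the map step emits a (word, 1) pair for every word in its chunk, and the reduce step sums the values that share a key.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the MapReduce word-count example (not the Hadoop
// API): map emits (word, 1) pairs; reduce sums the values per key.
public class WordCount {
    public static Map<String, Integer> count(String... chunks) {
        Map<String, Integer> counts = new HashMap<>();
        for (String chunk : chunks) {                // each chunk would go to a Mapper
            for (String word : chunk.toLowerCase().split("\\s+")) {
                counts.merge(word, 1, Integer::sum); // the Reducer's aggregation
            }
        }
        return counts;
    }
}
```

In real Hadoop, the chunks live on different Data Nodes and the framework shuffles same-key pairs to the same Reducer, but the map/reduce logic is exactly this.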
Hadoop Components
● Job Tracker
● Name Node
● Secondary Name Node
● Task Trackers
● Data Nodes
It's Cool, But...
● Do we need to have a Hadoop cluster in order to try out BAM?
● Are we supposed to code Hadoop jobs to get BAM to summarize something?
● Answers
  1) No
  2) No. OK, maybe very rarely at best.
Courtesy: http://goo.gl/QEnpN
Apache Hive
● You write SQL (almost).
● Let Hive convert it to MapReduce jobs.
● So Hive does two things
  ● Provide an abstraction for Hadoop MapReduce
  ● Submit the analytic jobs to Hadoop
● Hive may spawn a Hadoop JVM locally or delegate to a Hadoop cluster
A Typical Hive Script
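The script shown on this slide is not reproduced in the transcript. As a representative illustration (table and column names here are made up), a BAM-style Hive script maps an incoming Cassandra column family onto a Hive table and then writes a summary:

```sql
-- Illustrative only: table and column names are hypothetical.
-- Map a Cassandra column family to a Hive table ...
CREATE EXTERNAL TABLE IF NOT EXISTS ServiceStats
    (service_name STRING, response_time BIGINT, request_count INT)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

-- ... then summarize per-service request counts into a summary table.
INSERT OVERWRITE TABLE ServiceSummary
SELECT service_name,
       SUM(request_count) AS total_requests,
       AVG(response_time) AS avg_response_time
FROM ServiceStats
GROUP BY service_name;
```

Hive compiles the GROUP BY into a MapReduce job: the map phase emits per-service pairs, and reducers compute the sums and averages.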
Results
Task Framework
● Run Hive scripts periodically
● Schedules can be specified as cron expressions or predefined templates
● Handles task failover in case of node failure
● Uses ZooKeeper for coordination
ZooKeeper
● Can be run separately or embedded within BAM
Analyzer Cluster
Dashboard
● Making the dashboard scale.
Deployment Patterns
Single Node
High Availability
Fully Distributed Setup
Summary
● BAM
● Need for scalability
● Scaling BAM components
● Results
● BAM deployment patterns