A Big Data Lake Based on Spark for BBVA Bank (Oscar Mendez, Stratio)
TRANSCRIPT
STARTING POSITION
• Absence of software capable of processing the data
• Isolated data silos
• Multiple structured & unstructured data sources
• Multiple log management software
• Applications just writing to disk (no network logging)
DRIVERS
Countless applications & benefits:
FRAUD · SECURITY · DATA ANALYSIS · MONITORING · SIEM · AUDIT · E-COMMERCE · USER-TRACKING · DEVELOPMENT · DEBUGGING · REGULATORY COMPLIANCE
HIGH-LEVEL SOLUTION
• Multiple source ingestion to a common bus
• Normalization and transformation to a unified log (hard work!)
• Multiple data sinks depending on the clients and/or use cases:
  - Analytics
  - Regulatory compliance
  - Indexing engine
  - …
(Diagram: raw log → Big Data Lake → normalized log)
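The normalization step can be sketched in a few lines of Python. This is a toy illustration only: the regex, the field names and the unified-log layout below are assumptions, not BBVA's actual format.

```python
import re
from datetime import datetime, timezone

# Hypothetical raw syslog-style line: "<priority>Mon DD HH:MM:SS host app: message"
RAW_LINE = re.compile(r"<(\d+)>(\w{3} [ \d]\d \d\d:\d\d:\d\d) (\S+) (\S+): (.*)")

def normalize(raw, year=2015):
    """Map one raw log line onto a unified, sink-agnostic record."""
    m = RAW_LINE.match(raw)
    if m is None:
        return {"parse_error": True, "raw": raw}  # keep unparseable lines too
    pri, ts, host, app, msg = m.groups()
    when = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return {
        "timestamp": when.replace(tzinfo=timezone.utc).isoformat(),
        "severity": int(pri) % 8,  # syslog severity is priority mod 8
        "host": host,
        "application": app,
        "message": msg,
        "raw": raw,                # raw copy preserved for the raw-log sink
    }

rec = normalize("<13>Apr  9 10:01:02 web01 nginx: GET /login 200")
assert rec["host"] == "web01" and rec["severity"] == 5
```

The real "hard work" is that every source needs its own parser; the unified record is what lets all downstream sinks stay source-agnostic.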
SOFTWARE PIECES
1. LOGS SENT FROM SYSLOG-NG: devices that don't support installation of syslog-ng send logs via syslog to a syslog-ng relay.
2. LOGS SENT FROM SYSLOG-NG TO KAFKA, USED AS A DISTRIBUTION HUB: a topic per consumer/client.
3. NEW APPLICATIONS WRITE DIRECTLY TO KAFKA.
4. MULTIPLE DESTINATIONS: Sparkta, ELK.
(Diagram: the Spark stack: RDD-based matrices; batch, interactive [SQL], streaming and machine learning workloads.)
WHY SPARK
1. ONE STACK TO RULE THEM ALL
• Learn just one system
• Develop within one framework
• Deploy/manage just one system
Interactive, batch processing and stream processing all run on Spark.
Databricks co-founder & CTO Matei Zaharia (source)
LOG COLLECTION
• Syslog-ng is log collection software capable of processing logs in near real-time and delivering them to a wide variety of destinations.
• Syslog-ng provides reliable log management for environments ranging from a few to thousands of hosts, with an extreme message collection rate.
• Supported on more than 50 server platforms (including legacy ones!)
• Syslog-ng can natively collect and process log messages from a wide variety of enterprise software and custom applications.
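As a sketch of the relay role described above, a minimal syslog-ng configuration could look like the following. The host name, ports and version line are illustrative assumptions, not the actual deployment:

```
@version: 3.5

# Accept plain syslog from devices that cannot run syslog-ng themselves
source s_remote {
    network(transport("udp") port(514));
};

# Relay everything on to the central distribution tier
destination d_central {
    network("relay.bank.internal" transport("tcp") port(601));
};

log { source(s_remote); destination(d_central); };
```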
LOG DISTRIBUTION
• Kafka is a distributed, partitioned, replicated commit log service, originally developed by LinkedIn.
• It is designed to optimize performance, offer strong durability guarantees and scale easily.
• Kafka has huge throughput, built-in partitioning, replication and fault-tolerance, which makes it a good solution for large-scale message processing applications.
• It is normally used to consume raw data from topics, which is then aggregated, enriched and transformed into new Kafka topics for further processing.
(Diagram: producers → Kafka cluster → consumers)
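To make the "partitioned commit log" idea concrete, here is a toy in-memory model in Python. It is not the Kafka API; it only mimics key-based partitioning and offset-based consumption:

```python
import hashlib

class ToyTopic:
    """Toy model of a Kafka-style topic: an append-only, key-partitioned log."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Messages with the same key always land in the same partition,
        # preserving per-key ordering (like Kafka's default partitioner).
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[idx].append((key, value))
        return idx, len(self.partitions[idx]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers track their own offsets and read sequentially.
        return self.partitions[partition][offset]

topic = ToyTopic("unified-log")
p1, o1 = topic.produce("host-a", "login ok")
p2, o2 = topic.produce("host-a", "logout")
assert p1 == p2  # same key, same partition, ordering preserved
assert topic.consume(p1, o1) == ("host-a", "login ok")
```

With "a topic per consumer/client", each sink (analytics, compliance, indexing) reads its own topic at its own pace without affecting the others.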
LOG STORAGE
• HDFS is a distributed file system that provides high-performance access to data stored in a cluster.
• It is the 'de facto' clustered-storage solution in the Hadoop ecosystem, supported by the vast majority of Big Data software. HDFS is a key technology when you are required to process data, especially when it is static data.
• It is designed to achieve high availability, high performance and easy scalability.
• Parquet is an efficient columnar storage format, built to support very efficient compression and encoding schemes.
• Apache Avro is a data serialization system with rich data structures and a compact, fast, binary data format.
(Hadoop infobox: Developer: Apache Software Foundation · Stable release: 2.7.0, April 2015 · Operating system: cross-platform · Type: distributed filesystem · License: Apache License 2.0 · Website: hadoop.apache.org)
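For instance, a unified-log record could be described with an Avro schema like the one below. The record and field names are hypothetical, not the actual BBVA schema:

```json
{
  "type": "record",
  "name": "UnifiedLogEvent",
  "namespace": "example.datalake",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "host", "type": "string"},
    {"name": "application", "type": "string"},
    {"name": "severity", "type": "string"},
    {"name": "message", "type": "string"},
    {"name": "raw", "type": ["null", "string"], "default": null}
  ]
}
```

Avro suits the write-heavy ingestion path, while Parquet's columnar layout suits the analytical reads downstream.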
FIGURES (not yet fully deployed; objectives / estimation)
• Streamed: 11 TB/day
• >2,000 applications/devices
• …in approx. 200 servers
• Approx. 2 PB of stored data
CONSIDERATIONS
• Replication
• Compression
• Bottlenecks
• Failures
Towards a generic real-time aggregation platform
At Stratio, we have implemented several real-time analytic projects based on Apache Spark, Kafka, Flume, Cassandra, or MongoDB. These technologies were always a perfect fit, but we soon found ourselves writing the same pieces of integration code over and over again.
Some initiatives have tried to solve this problem, but until now most of them were complex or obsolete while others were not open source. For this reason, Stratio created SPARKTA: an open source and full-featured platform for real-time analytics, based on Apache Spark.
Distributed, high-volume & pluggable analytics framework

"Since Aryabhatta invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we designed an aggregation engine to deliver real-time service." (a nice intro from Countandra)

Our goals:
• No need for coding, only declarative aggregation workflows
• Data continuously streamed in & processed in near real-time
• Ready to use out of the box
• Plug & play: flexible workflows (inputs, outputs, parsers, etc.)
• High performance
• Scalable and fault tolerant
A first look
(Diagram: AGGREGATION POLICY → DRIVER-SUPERVISOR → AGGREGATION WORKFLOW → QUERY SERVICES, others)
• The aggregation policy definition is sent to the engine.
• The driver-supervisor allows multiple applications to be defined, each of which is bound to a context executing the aggregation workflow.
• You can deploy any number of real-time aggregation policies, and start, stop or monitor several workflows at any time.
Key Technologies
• INPUTS: RabbitMQ, ZeroMQ, Twitter, Flume, Kafka, … any Spark Streaming receiver :)
• PROCESSING: Spark + Apache Kite SDK. Use the Spark DataFrames API or RDDs to integrate any datasource.
• OUTPUTS: …
Define your real-time needs
AGGREGATION POLICY
Remember: no need to code anything. Define your workflow in a JSON document, including:
• INPUT: Where is the data coming from?
• OUTPUT(s): Where should aggregated data be stored?
• DIMENSION(s): Which fields will you need for your real-time needs?
• ROLLUP(s): How do you want to aggregate the dimensions?
• TRANSFORMATION(s): Which functions should be applied before aggregation?
• SAVE RAW DATA: Do you want to save raw events?
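Putting those pieces together, a policy document might look roughly like this. The exact keys in Sparkta's schema may differ; treat every field name below as illustrative:

```json
{
  "name": "page-visits",
  "input": {"type": "kafka", "topic": "raw-logs"},
  "transformations": [
    {"type": "morphline", "config": "parse-apache-log"}
  ],
  "dimensions": ["page", "country"],
  "rollups": [
    {
      "dimensions": ["page"],
      "timeGranularity": "minute",
      "operators": ["count", "sum(bytes)"]
    }
  ],
  "outputs": [{"type": "mongodb", "collection": "visits_per_minute"}],
  "saveRawData": true
}
```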
Key Technologies
ROLLUPS
• Pass-through
• Time-based: secondly, minutely, hourly, daily, monthly, yearly…
• Hierarchical
• GeoRange: areas with different sizes (rectangles)
OPERATORS
• Max, min, count, sum
• Average, median
• Stdev, variance, count distinct
• Last value
• Full-text search
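As a plain-Python sketch of what a time-based rollup with count/sum/avg operators computes (this is not Sparkta code; the field names are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timezone

def minutely_rollup(events, dimension, value_field):
    """Group events into (minute, dimension) buckets and apply the
    count / sum / avg operators, as a minutely time-based rollup would."""
    buckets = defaultdict(list)
    for ev in events:
        # Truncate the timestamp to minute granularity
        minute = datetime.fromtimestamp(ev["ts"], tz=timezone.utc)
        minute = minute.replace(second=0, microsecond=0)
        buckets[(minute.isoformat(), ev[dimension])].append(ev[value_field])
    return {
        key: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals)}
        for key, vals in buckets.items()
    }

events = [
    {"ts": 0,  "page": "/home", "bytes": 100},
    {"ts": 30, "page": "/home", "bytes": 300},
    {"ts": 70, "page": "/home", "bytes": 50},
]
agg = minutely_rollup(events, "page", "bytes")
# The first two events fall in the same minute bucket
assert agg[("1970-01-01T00:00:00+00:00", "/home")] == {"count": 2, "sum": 400, "avg": 200.0}
```

Coarser granularities (hourly, daily, …) work the same way, just with a coarser timestamp truncation; a hierarchical rollup would additionally aggregate over dimension prefixes.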
SDK
Sparkta has been conceived as an SDK. You can extend several points of the platform to fulfill your needs, such as adding new inputs, outputs, operators and dimension types (INPUT, OUTPUT(s), DIMENSION(s), OPERATORS, TRANSFORMATION(s)). You can also add new functions to Apache Kite (KiteSDK) in order to extend the data cleaning, enrichment and normalization capabilities.
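The extension idea can be illustrated with a small registry pattern. Sparkta itself runs on the JVM, so this Python sketch only mirrors the concept of pluggable operators, not its actual SDK interfaces:

```python
# Registry of aggregation operators, keyed by the name used in a policy
OPERATORS = {}

def operator(name):
    """Decorator that registers a function as a named operator, plug-in style."""
    def register(fn):
        OPERATORS[name] = fn
        return fn
    return register

@operator("count")
def count_op(values):
    return len(values)

@operator("sum")
def sum_op(values):
    return sum(values)

# A new operator can be plugged in without touching the engine:
@operator("last")
def last_op(values):
    return values[-1]

assert OPERATORS["last"]([1, 5, 9]) == 9
```

The engine only ever looks operators up by name, which is what lets a policy document reference user-supplied extensions declaratively.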