spark seattle meetup - breaking etl barrier with spark streaming
TRANSCRIPT
Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming
Santosh Sahoo, Architect at Concur
About us
Concur (now part of SAP) provides travel and expense management services to businesses.
The Data Insights team is building solutions that give customers access to data, visualization, and reporting.
Stack so far..
[Diagram: App -> OLTP -> ETL -> OLAP -> Report]
Numbers
7K OLTP database sources
14K OLAP reporting DBs
28K ETL jobs
300M rows (compacted), 2B row changes
Only ~20 failures a night
Batch ETL challenges
Scheduled (high latency)
Processing time
Hard to scale
Not fault tolerant
Monolithic
High maintenance
Moving forward
Scheduled (high latency) -> Streaming, real time
Hard to scale -> Scalable
Monolithic -> Modular
Not fault tolerant -> Fault tolerant
ACID consistent, normalized -> Eventual consistency
High maintenance (single tenant) -> Reduced maintenance overhead (multi tenant)
Streaming Data Pipeline
Source -> Flow Manager -> Streaming Processor -> Storage -> Reporting
Applications
Mobile Devices
Sensors
IOT - Internet of things
Database log scraping
Alert
Message Queues
Kafka
Flume
Azure Event hub
AWS Kinesis
HDFS
Storm
Spark Streaming
Azure Stream analytics
Samza
Flink
RDBMS
NoSQL
HDFS
Redshift
Custom App D3
Tableau
Cognos
Excel
Spark Streaming
What? A data processing framework to build scalable fault-tolerant streaming applications.
Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
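The "reuse the same code for batch processing" point can be illustrated without Spark at all. The sketch below is plain Python, not the Spark API: `count_by_type` and `micro_batches` are hypothetical names, and the event records are made up for illustration. It only shows the idea that one transformation function can serve both a batch job and a stream that arrives in small batches.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_by_type(events: Iterable[dict]) -> Counter:
    """Shared business logic: count events per 'type' field."""
    return Counter(e["type"] for e in events)


def micro_batches(stream: Iterable, size: int) -> Iterator[List]:
    """Cut an incoming stream into small batches of at most `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the last, possibly shorter, batch
        yield batch


# Illustrative data only (not from the talk).
events = [
    {"type": "expense"},
    {"type": "travel"},
    {"type": "expense"},
    {"type": "expense"},
    {"type": "travel"},
]

# Batch mode: run the logic once over the full data set.
batch_result = count_by_type(events)

# Streaming mode: run the same function per micro-batch and fold into state.
stream_state = Counter()
for batch in micro_batches(iter(events), size=2):
    stream_state.update(count_by_type(batch))
```

Both paths end with the same counts, which is the property that makes sharing code between batch and streaming jobs attractive.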
Demo….
Kafka - Flow Management
No-nonsense logging
100K/s throughput vs 20K for RabbitMQ
Log compaction
Durable persistence
Partition tolerance, replication
Best-in-class integration with Spark
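Log compaction is what lets a long history of row changes collapse to the latest value per key. The following is a minimal sketch of that idea in plain Python, not Kafka's implementation: `compact` is a hypothetical helper, the keys and values are invented, and a `None` value stands in for a tombstone (delete marker). Real Kafka compacts in the background and retains tombstones for a configurable period, which this sketch ignores.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); value None = tombstone


def compact(log: List[Record]) -> List[Record]:
    """Keep only the most recent record per key, in original offset order.

    Keys whose latest record is a tombstone are dropped entirely.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda kv: kv[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]


# Illustrative change log for three rows (assumed data).
change_log = [
    ("row1", "v1"),
    ("row2", "v1"),
    ("row1", "v2"),   # supersedes row1/v1
    ("row3", "v1"),
    ("row2", None),   # row2 deleted
]

compacted = compact(change_log)
# compacted == [("row1", "v2"), ("row3", "v1")]
```

A consumer that replays the compacted log rebuilds the current table state without reading every intermediate change.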
Spark Streaming Architecture
[Diagram labels: Driver, Master, Worker (x3), Executor (x3), Receiver, Source, data blocks D1 D2 / D3 D4, WAL, Replication (D1 D2), DataStore, Task]
DStream - Discretized Stream of RDDs
RDD - Resilient Distributed Dataset
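To make "discretized stream" concrete: a continuous stream is cut into fixed batch intervals, and each interval's records form one small data set (an RDD in Spark). The sketch below is plain Python rather than Spark code; `discretize`, the timestamps, and the one-second interval are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Any, List, Tuple


def discretize(records: List[Tuple[float, Any]], batch_interval: float) -> List[List[Any]]:
    """Group (timestamp, value) records into consecutive fixed-length intervals.

    Returns one list per interval, including empty lists for intervals
    in which no record arrived.
    """
    if not records:
        return []
    buckets = defaultdict(list)
    for timestamp, value in records:
        buckets[int(timestamp // batch_interval)].append(value)
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


# Assumed sample stream: (arrival time in seconds, payload).
stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]

batches = discretize(stream, batch_interval=1.0)
# batches == [["a", "b"], ["c"], [], ["d"]]
```

Each inner list plays the role of one batch; the empty list is the interval between 2 s and 3 s in which nothing arrived.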
Optimized Direct Kafka API
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
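The direct approach described in the linked post has the driver decide, for each batch, which range of offsets to read from every Kafka partition, instead of relying on a long-running receiver. Below is a rough sketch of that planning step in plain Python. It does not call Spark or Kafka: `OffsetRange` here is a hypothetical stand-in modeled on the concept, and `plan_batch`, the topic name, and the offsets are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(
    topic: str,
    processed: Dict[int, int],
    latest: Dict[int, int],
    max_per_partition: Optional[int] = None,
) -> List[OffsetRange]:
    """Compute the offset range each partition contributes to the next batch.

    `processed` maps partition -> next offset to read; `latest` maps
    partition -> current end of the log. Partitions with nothing new are skipped.
    """
    ranges = []
    for partition in sorted(latest):
        start = processed.get(partition, 0)
        end = latest[partition]
        if max_per_partition is not None:
            end = min(end, start + max_per_partition)  # simple rate limit
        if end > start:
            ranges.append(OffsetRange(topic, partition, start, end))
    return ranges


# Assumed state: partition 1 has no new data, partition 2 is newly seen.
plan = plan_batch(
    "expenses",
    processed={0: 100, 1: 50},
    latest={0: 130, 1: 50, 2: 5},
    max_per_partition=20,
)
# plan == [OffsetRange("expenses", 0, 100, 120), OffsetRange("expenses", 2, 0, 5)]
```

Because each batch is fully described by its offset ranges, a failed batch can be re-read from Kafka with exactly the same input, which is the basis for the stronger delivery guarantees discussed in the post.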
Architecture
[Diagram labels: OLTP, Expense, Travel, TTX, API, Import, FTP, HTTP, SMTP, P, Protobuf, JSON, Broker, Kafka, Service bus, Stream Processor (Spark), HDFS, Hive/Spark SQL, Normalization, Extract, Compensate, Data {Quality, Correction, Analytics}, Migrate method, API/SQL, OLAP, HANA, HANA OLAP, Replication, Load balance, Failover, Reporting, Cognos, Tableau ?]
Reporting Next Gen Architecture
C
Tachyon
Can Spark Streaming survive Chaos Monkey?
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Q&A
concur.com/en-us/careers
We are hiring
Thank you!