Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise Data
TRANSCRIPT
Confidential. Not to be copied, distributed, or reproduced without prior approval.
Integrating Big Data and Streaming Data with Enterprise Data
October 16, 2017
Multiple Mighty Businesses
Capital
Aviation
Power
Healthcare
Oil & Gas
Transportation
Lighting
Global Ops
Digital
Additive
Renewables
Machines, Chips, Sensors & Data everywhere
Need for Data Integration & Data Pipelines
Integrating Streaming Data with Enterprise Data
Integration needs will depend on the use cases
Real-Time Data & Data Pipeline
Data orchestration strategies will vary depending on use cases
A Look at a Few Data Collection & Processing Tools
Akka
Kafka
Spark Streaming
Akka
Akka is a toolkit for building highly concurrent, distributed, and resilient message-driven applications for Java and Scala. It uses a reactive streaming model in which message flow is controlled by back-pressure.
Reactive Streams: pull-based back-pressure
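To make the pull-based back-pressure idea concrete, here is a minimal sketch in plain Python. It is not the Akka Streams API; the `Publisher` and `CollectingSubscriber` classes and their method names are hypothetical, loosely modelled on the Reactive Streams vocabulary. The point it illustrates is that nothing is delivered until the subscriber signals demand.

```python
from collections import deque


class Publisher:
    """Hypothetical publisher that emits only as many items as were requested."""

    def __init__(self, items):
        self._pending = deque(items)
        self._demand = 0        # requested by the subscriber but not yet delivered
        self._draining = False  # guards against re-entrant request() calls
        self._subscriber = None

    def subscribe(self, subscriber):
        self._subscriber = subscriber

    def request(self, n):
        # The subscriber pulls: it raises demand by n, and only then may items flow.
        if n <= 0:
            raise ValueError("demand must be positive")
        self._demand += n
        if self._draining:
            return
        self._draining = True
        try:
            while self._demand > 0 and self._pending:
                self._demand -= 1
                self._subscriber.on_next(self._pending.popleft())
        finally:
            self._draining = False


class CollectingSubscriber:
    """Hypothetical subscriber that simply records what it receives."""

    def __init__(self):
        self.received = []

    def on_next(self, item):
        self.received.append(item)


publisher = Publisher(range(10))
subscriber = CollectingSubscriber()
publisher.subscribe(subscriber)
publisher.request(3)  # the subscriber is ready for three items, so three are delivered
```

Because the publisher never pushes beyond the outstanding demand, a slow consumer cannot be flooded; it asks for more only when it has capacity.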
Kafka
Kafka is a distributed publish-subscribe messaging system designed to be fast, scalable, and durable. Like many publish-subscribe messaging systems, Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from topics. Because Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.
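The topic, partition, producer, and consumer roles described above can be sketched with a toy in-memory model. This is not the Kafka client API and it omits replication and networking entirely; the `Topic` and `Consumer` classes, the CRC32 partitioner, and the sample keys are assumptions made for illustration.

```python
import zlib


class Topic:
    """Toy in-memory topic: a fixed number of append-only partition logs."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Producer side: append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


class Consumer:
    """Consumer side: reads each partition in order and tracks its own offsets."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        records = []
        for p, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[p]:])  # only records not yet seen
            self.offsets[p] = len(log)
        return records


topic = Topic("sensor-readings")
topic.append("turbine-7", {"temp_c": 71.5})
topic.append("turbine-7", {"temp_c": 72.1})
consumer = Consumer(topic)
first_poll = consumer.poll()   # both records, in the order they were written
second_poll = consumer.poll()  # empty: the consumer's offsets have advanced
```

Records with the same key stay in one partition, which preserves their relative order, while each consumer keeps its own position in every log.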
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
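Spark Streaming processes a live stream as a sequence of small batches. The sketch below imitates that micro-batch idea in plain Python; it does not use the Spark API, and the function names, the `(timestamp, value)` event shape, and the sample data are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width time batches.

    Intervals that contain no events produce no batch in this sketch.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // batch_seconds)].append(value)
    return [batches[k] for k in sorted(batches)]


def count_per_batch(events, batch_seconds):
    """Run the same small computation (a value count) on every batch."""
    return [Counter(batch) for batch in micro_batches(events, batch_seconds)]


# Hypothetical sensor status events: (seconds since start, status)
events = [(0.2, "ok"), (0.9, "alarm"), (1.1, "ok"), (2.5, "ok"), (2.7, "ok")]
counts = count_per_batch(events, batch_seconds=1)
```

Each batch is processed independently with ordinary batch logic, which is what lets a batch engine be reused for streaming workloads.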
Multiple Data Integration Techniques
Which option would we choose?
Data Integration using DV
Virtual data stitching can integrate streaming data with enterprise datasets regardless of data velocity, variety, and volume.
Data Warehouse(s) & Data Mart(s)
Spark Streaming
Data Analytics
SparkSQL
Data Lake(s)
NoSQL Database(s)
Data Virtualization layer to connect big data based data sources
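As a rough illustration of virtual data stitching, the sketch below joins a streaming source with an enterprise reference dataset at query time, without copying either into a new store. It is not a Denodo API; the `VirtualView` class, the callable-based sources, and the sample `asset_id` records are all hypothetical.

```python
class VirtualView:
    """Joins two sources on a key at query time; nothing is copied up front."""

    def __init__(self, stream_source, enterprise_source, key):
        self._stream_source = stream_source          # callable returning an iterable of dicts
        self._enterprise_source = enterprise_source  # callable returning an iterable of dicts
        self._key = key

    def query(self):
        # Fetch reference rows on demand and index them by the join key.
        lookup = {row[self._key]: row for row in self._enterprise_source()}
        for event in self._stream_source():
            match = lookup.get(event[self._key])
            if match is not None:
                # Stitch the enterprise attributes onto the streaming event.
                yield {**match, **event}


# Hypothetical sources: an asset master table and recent sensor readings.
assets = [
    {"asset_id": "T1", "site": "Plant A"},
    {"asset_id": "T2", "site": "Plant B"},
]
readings = [
    {"asset_id": "T1", "temp_c": 71.5},
    {"asset_id": "T3", "temp_c": 64.0},  # no matching asset, so it is dropped
]

view = VirtualView(lambda: readings, lambda: assets, key="asset_id")
stitched = list(view.query())
```

Because the sources are read on every query, a new reading appended to `readings` appears in the next call to `view.query()` with no reload step.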
If you have these scenarios…
Streaming datasets
Disparate data sources within enterprise
Structured or Unstructured datasets
Need for stitching historical and new datasets
Real-Time Analytical solutions
Need for an agile solution