big data and fast data – big and fast combined, is it possible?
DESCRIPTION
TRANSCRIPT
2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
WELCOME Big Data and Fast Data – Big and Fast Combined, is it Possible?
Guido Schmutz
UKOUG Tech 2013
2.12.2013
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
1
2013 © Trivadis
Guido Schmutz
• Working for Trivadis for more than 16 years
• Oracle ACE Director for Fusion Middleware and SOA • Co-Author of different books • Consultant, Trainer Software Architect for Java, Oracle, SOA
and EDA • Member of Trivadis Architecture Board • Technology Manager @ Trivadis
• More than 20 years of software development experience
• Contact: [email protected] • Blog: http://guidoschmutz.wordpress.com • Twitter: gschmutz
2.12.2013
2 Big Data and Fast Data – Big and Fast Combined, is it Possible?
2013 © Trivadis
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria.
We offer our services in the following strategic business fields: Trivadis Services takes over the interacting operation of your IT systems.
Our company
O P E R A T I O N
02/12/13 Trivadis – the company
2013 © Trivadis
With over 600 specialists and IT experts in your region
4
12 Trivadis branches and more than 600 employees 200 Service Level Agreements Over 4,000 training participants Research and development budget: CHF 5.0 / EUR 4 million Financially self-supporting and sustainably profitable Experience from more than 1,900 projects per year at over 800 customers
Trivadis – the company 02/12/13
Hamburg
Düsseldorf
Frankfurt
Freiburg München
Wien
Basel Zurich Bern
Lausanne
Stuttgart
Brugg
2013 © Trivadis
Agenda
1. Big Data, what is it?
2. Motivation
3. The Lambda Architecture
4. Implementing the Lambda Architecture
5. Summary
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
5
2013 © Trivadis
Big Data Definition (Gartner et al)
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
6
Velocity
Tera-, Peta-, Exa-, Zetta-, Yota- bytes and constantly growing
“Traditional” computing in RDBMS is not scalable enough.
We search for “linear scalability”
“Only … structured information is not enough” – “95% of produced data in
unstructured”
Characteristics of Big Data: Its Volume, Velocity and Variety in combination
+ Veracity (IBM) - information uncertainty + Time to action ? – Big Data + Event Processing = Fast Data
2013 © Trivadis
Big Data Definition (4 Vs)
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
7
+ Time to action ? – Big Data + Event Processing = Fast Data
Characteristics of Big Data: Its Volume, Velocity and Variety in combination
2013 © Trivadis
Volume Development
0
20
40
60
80
100
0
2000
4000
6000
8000
2005 2007 2009 2011 2013 2015
Agg
rega
te U
ncer
tain
ty %
Glo
bal D
ata
Volu
me
in E
xaby
tes
Year
Sensors: “internet of things”
Social Media: video, audio, text
VoIP: Skype, MSN, ICQ, ...
Enterprise Data: data dictionary, ERD, ...
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
8
2013 © Trivadis
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
9
2013 © Trivadis
Internet Of Things
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
10
There are more devices tapping into the internet than people on earth
How do we prepare our systems/architecture for the future?
Source: Cisco Source: The Economist
2013 © Trivadis
Big Data in Context
NoSQL databases • The storage for Big Data à Polyglot Persistence
Complex Event Processing (CEP) • An architectural style for Fast Data
Lots of new terms § HDFS, Hive, Hadoop, MapReduce, HBase, Pig, Cascading, Flume, Oozie
Not only Open Source • Oracle Big Data Appliance & Microsoft HD Insight
No longer a clear distinction between Software Development and Business Intelligence !?
• Java, Python, Clojure, R, … know how needed • Data Scientists: Natural Language Processing, Statistics, Network Analysis
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
11
2013 © Trivadis
Big Data Use Cases / Scenarios
General
• Analyzing social media data for service optimization, sentiment analysis, ...
Retail
§ Personalized travel- and shopping guidance depending on location detection (mobile, tablets, previous purchases)
Automotive
§ Analyzing telemetric data (e.g. for insurance: „Pay how you drive“, warranty, recall, warnings etc.)
Finance
§ Fraud detection for payments (real time)
Telco
§ Mobile user location analytics for „behavior mining“
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
12
2013 © Trivadis
Velocity
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
13
§ Velocity requirement examples: § Recommendation Engine § Predictive Analytics § Marketing Campaign Analysis § Customer Retention and Churn Analysis § Social Graph Analysis § Capital Markets Analysis § Risk Management § Rogue Trading § Fraud Detection § Retail Banking § Network Monitoring § Research and Development
2013 © Trivadis
Agenda
1. Big Data, what is it?
2. Motivation
3. The Lambda Architecture
4. Implementing the Lambda Architecture
5. Summary
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
14
2013 © Trivadis
What is a data system?
• A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure and every human mistake ever made.
• A data system answers questions based on information that was acquired in the past
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
15
2013 © Trivadis
Desired Properties of a (Big) Data System
Robust and fault-tolerant
Low latency reads and updates
Scalable
General
Extensible
Allows ad hoc queries
Minimal maintenance
Debug-able
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
16
2013 © Trivadis
Complexity in today‘s architecture/systems
24. April 2013 Big Data und Fast Data
17
Lack of Human Fault Tolerance
Same structure for write/query
Schemas done wrong
2013 © Trivadis
Typical problem in today’s architecture/systems
Bugs will be deployed to production over the lifetime of a data system
Operational mistakes will be made
Humans are part of the overall system • Just like hard disks, CPUs, memory, software • design for human error like you design for any other fault
Examples of human error • Deploy a bug that increments counters by two instead of by one • Accidentally delete data from database • Accidental DOS on important internal service
Worst two consequences: data loss or data corruption
As long as an error doesn‘t lose or corrupt good data, you can fix what went wrong
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
18
Lack of Human Fault Tolerance
2013 © Trivadis
Mutability
The U and D in CRUD
A mutable system updates the current state of the world
Mutable systems inherently lack human fault-tolerance
Easy to corrupt or lose data
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
19
Capturing change traditionally
Lack of Human Fault Tolerance
Name City Guido Berne Albert Zurich
Name City Guido Basel Albert Zurich
2013 © Trivadis
Immutability
An immutable system captures historical records of events
Each event happens at a particular time and is always true
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
20
Capturing change by storing events
Lack of Human Fault Tolerance
Name City Timestamp Guido Berne 1.8.1999 Albert Zurich 10.5.1988
Name City Timestamp Guido Berne 1.8.1999 Albert Zurich 10.5.1988 Guido Basel 1.4.2013
2013 © Trivadis
Immutability
Immutability greatly restricts the range of errors that can cause data loss or data corruption
Vastly more human fault-tolerant
Much easier to reason about systems based on immutability
Conclusion: Your source of truth should always be immutable
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
21
Lack of Human Fault Tolerance
2013 © Trivadis
What about traditional/today’s architectures ?
Source of Truth is mutable!
Rather than build systems like this ….
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
22
Mutable Database
Application (Query)
RDBMS NoSQL
NewSQL
Mobile Web RIA
Rich Client
Source of Truth
Source of Truth
2013 © Trivadis
A different kind of architecture with immutable source of truth
… why not building them like this
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
23
HDFS NoSQL
NewSQL RDBMS
View on Data
Mobile Web RIA
Rich Client
Source of Truth
Immutable data
View on Data
Application (Query)
Source of Truth
2013 © Trivadis
How to create the views on the Immutable data?
On the fly ?
Materialized, i.e. Pre-computed ?
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
24
Immutable data View
Immutable data
Pre- Computed
Views
Query
Query
2013 © Trivadis
Data = the most raw information
Data is information which is not derived from anywhere else • The most raw form of information • from which everything else is derived
Questions on data can be answered by running functions that take data as input
The most general purpose data system can answer questions by running functions that take the entire dataset as input
query = function (all data)
The lambda architecture provides a general purpose approach for implementing arbitrary functions on an arbitrary datasets
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
25
2013 © Trivadis
Data = the most raw information
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
26
1.2.13 Add iPAD 64GB 10.3.13 Add Sony RX-100 11..3.13 Add Canon GX-10 11.3.13 Remove Sony RX-100 12.3.13 Add Nikon S-100 14.4.13 Add BoseQC-15 15.4.13 Add MacBook Pro 15 20.4.13 Remove Canon GX10
iPAD 64GB Nikon S-100 BoseQC-15 MacBook Pro 15
4 derive derive
Favorite Product List Changes Current Favorite
Product List Current Product Count
Raw information => data Information => derived
2013 © Trivadis
Big Data and Batch Processing
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
27
Immutable data
Batch View Query ? ? Incoming
Data
How to compute the batch views ?
How to compute queries from the views ?
2013 © Trivadis
Big Data and Batch Processing
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
28
Fully processed data Last full batch period
Time forbatch job
time
now non-processed data
time
now
batch-processed data
§ Using only batch processing, leaves you always with a portion of non-processed data.
Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20
But we are not done yet …
2013 © Trivadis
Big Data and Batch Processing
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
29
HDFS
Data Store optimized for appending large
results
Queries
Stream 1
Stream 2
Event
Hadoop cluster Map/Reduce in Pig
Hadoop Distributed File System
2013 © Trivadis
Adding Real-Time Processing
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
30
Immutable data
Batch Views
Query
? Data Stream
Realtime Views
Incoming Data
How to compute queries from the views ? How to compute real-time views
2013 © Trivadis
Adding Real-Time Processing
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
31
1.2.13 Add iPAD 64GB 10.3.13 Add Sony RX-100 11..3.13 Add Canon GX-10 11.3.13 Remove Sony RX-100 12.3.13 Add Nikon S-100 14.4.13 Add BoseQC-15 15.4.13 Add MacBook Pro 15 20.4.13 Remove Canon GX10 Now Add Canon Scanner
iPAD 64GB Nikon S-100 BoseQC-15 MacBook Pro 15
5
compute
Favorite Product List Changes Current Favorite
Product List
Current Product Count
Now Canon Scanner compute Add Canon Scanner
Stream of Favorite Product List Changes
Immutable data
Views
Data Stream
Query
incoming
2013 © Trivadis
Big Data and Real Time Processing
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
32
time
Fully processed data Last full batch period
now
Time forbatch job
batch processingworked fine here
(e.g. Hadoop)
real time processingworks here
blended view for end user
Adapted from Ted Dunning (March 2012): http://www.youtube.com/watch?v=7PcmbI5aC20
2013 © Trivadis
Agenda
1. Big Data, what is it?
2. Motivation
3. The Lambda Architecture
4. Implementing the Lambda Architecture
5. Summary
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
33
2013 © Trivadis
Lambda Architecture
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
34
Immutable data
Batch View
Query
Data Stream
Realtime View
Incoming Data
Serving Layer
Speed Layer
Batch Layer
A
B C D
E F
G
2013 © Trivadis
Lambda Architecture
A. All data is sent to both the batch and speed layer
B. Master data set is an immutable, append-only set of data
C. Batch layer pre-computes query functions from scratch, result is called Batch Views. Batch layer constantly re-computes the batch views.
D. Batch views are indexed and stored in a scalable database to get particular values very quickly. Swaps in new batch views when they are available
E. Speed layer compensates for the high latency of updates to the Batch Views
F. Uses fast incremental algorithms and read/write databases to produce real-time views
G. Queries are resolved by getting results from both batch and real-time views
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
35
2013 © Trivadis
Layered Architecture
Stores the immutable constantly growing dataset Computes arbitrary views from this dataset using BigData technologies (can take hours) Can be always recreated Computes the views from the constant stream of data it receives Needed to compensate for the high latency of the batch layer Incremental model and views are transient Responsible for indexing and exposing the pre-computed batch views so that they can be queried Exposes the incremented real-time views Merges the batch and the real-time views into a consistent result
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
36
Serving Layer
Batch Layer
Speed Layer
2013 © Trivadis
Agenda
1. Big Data, what is it?
2. Motivation
3. The Lambda Architecture
4. Implementing the Lambda Architecture
5. Summary
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
37
2013 © Trivadis
Lambda Architecture
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
38
Speed Layer
Precompute Views
query
Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
Batch Layer
Precomputed information All data
Incremented information Process stream
Incoming Data
Batch recompute
Realtime increment
Serving Layer
batch view
batch view
real time view
real time view
Mer
ge
2013 © Trivadis
Lambda Architecture in Action
Implementation in ongoing Proof-of-concept (after completion of phase 1)
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
39
Speed Layer
Precompute Views
query
Batch Layer
Precomputed information All data
Incremented information Process stream
Incoming Data
Batch recompute
Realtime increment
Serving Layer
batch view
batch view
real time view
real time view
Mer
ge
2013 © Trivadis
Lambda Architecture in Action
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
40
Cloudera Distribution
• Distribution of Apache Hadoop: HDFS, MapReduce, Hive, Flume, Pig, Impala
Cloudera Impala
• distributed query execution engine that runs against data stored in HDFS and HBase
Apache Zookeeper
• Distributed, highly available coordination service. Provides primitives such as distributed locks
Apache Storm & Trident
• distributed, fault-tolerant realtime computation system
Apache Cassandra
• distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
Twitter Horsebird Client (hbc)
• Twitter Java API over Streaming API
Spring Framework
• Popular Java Framework used to modularize part of the logic (sensor and serving layer)
Apache Kafka
• Simple messaging framework based on file system to distribute information to both batch and speed layer
Apache Avro
• Serialization system for efficient cross-language RPC and persistent data storage
JSON
• open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.
2013 © Trivadis
Lambda Architecture with Oracle Product Stack
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
41
Possible implementation with Oracle Product stack
Speed Layer
Precompute Views
query
Batch Layer
Precomputed information All data
Incremented information Process stream
Incoming Data
Batch recompute
Serving Layer
batch view
batch view
real time view
real time view
Mer
ge
Oracle NoSQL
Oracle RDBMS Oracle Coherence
Oracle BigData Appliance
Oracle NoSQL Oracle Coherence
Oracle Event Processing Oracle GoldenGate
Oracle Data Integrator
Oracle GoldenGate
Oracle Event Processing
Oracle Service Bus
Ora
cle
Web
Log
ic S
erve
r O
racl
e A
DF
OBI
EE
Ora
cle
Ende
ca
Ora
cle
Big
Dat
a C
onne
ctor
s
2013 © Trivadis
Agenda
1. Big Data, what is it?
2. Motivation
3. The Lambda Architecture
4. Implementing the Lambda Architecture
5. Summary
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
42
2013 © Trivadis
Summary – The lambda architecture
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
43
§ The Lambda Architecture § Can discard batch views and real-time views and recreate everything from
scratch § Mistakes corrected via re-computation § Data storage layer optimized independently from query resolution layer § Still in a very early …. But a very interesting idea!
- Today a zoo of technologies are needed => Operations won‘t like it
§ The technology/implementation § Different query language for batch and real time § An abstraction over batch and speed layer needed
- Cascading and Trident are already similar § Not everything works out-of-the-box and together § Industry standards needed!
2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
THANK YOU. Trivadis AG
Guido Schmutz
Europa-Strasse 5CH-8095 Glattbrugg
[email protected] www.trivadis.com
2.12.2013 Big Data and Fast Data – Big and Fast Combined, is it Possible?
44