project nautilus unified data stack for iot industry needs a new storage paradigm that solves for...
TRANSCRIPT
© Copyright 2017 Dell Inc.2
Market Drivers For Streaming Analytics
Open SourceCommunity
MassiveData Growth
Emergence ofReal-Time Apps
Infrastructure Commoditizationand Scale-Out
Rapid Disseminationof Data to Apps
Monetize Datawith Analytics
Data Velocityand Variety
© Copyright 2017 Dell Inc.3
Streaming Use Cases Across VerticalsMarket Snapshot by Verticals
Source:MarketsandMarkets, Stream Analytics Forecast
Algorithmic Trading
Real-time Customer Engagement
Location Intelligence
Operations Management
Supply Chain Optimization
Vehicle & Route Tracking
Network Monitoring
Real-time Patient Monitoring
Real-Time Call-Center Analysis
• Use Cases Across Different Verticals
• BFSI Use Cases– Algorithmic Trading– Fraud Detection– Real-time Transaction Analysis– Transaction Cost Analysis– Smart Order Routing
© Copyright 2017 Dell Inc.4
Today’s challenges with streaming analytics
Deriving actionable real-time business
insights
The industry needs a new storage paradigm that solves for extreme stream velocity and unified real-time analytics
Complex app ecosystem
>300
Siloed real-time and batch
Extreme stream and
storage scale
DIY architecture is complex, siloed, and expensive
© Copyright 2017 Dell Inc.6
Today’s “Accidental Architecture”Batch
Real-Time
Interactive exploration by Data Scientists
Real-time intelligence at the NOC
Sensors
MirrorMaker
DR Site
Mobile Devices
App Logs
© Copyright 2017 Dell Inc.7
A New Architecture Emerges: Streaming
• A new class of streaming systems is emerging to address the accidental architecture’s problems and enable new applications not possible before
• Some of the unique characteristics of streaming applications– Treat data as continuous and infinite– Compute correct results in real-time with stateful, exactly-once processing
• These systems are applicable for real-time applications, batch applications, and interactive applications
• Web-scale companies (Google, Twitter) are beginning to demonstrate the disruptive value of streaming systems
• What are the implications for storage in a streaming world?
© Copyright 2017 Dell Inc.8
Let’s Rewind A Bit: The Importance of Log Storage
Traditional Apps/Middleware Streaming Apps/Middleware
BLOCKS• Structured Data• Relational DBs
FILES• Unstructured Data• Pub/Sub• NoSQL DBs
OBJECTS• Unstructured Data• Internet Friendly (REST)• Scale over Semantics• Geo
LOGS• Append-only • Low-latency• Tail Read/Write
© Copyright 2017 Dell Inc.9
The Importance of Log StorageThe Fundamental Data Structure for Scale-out Distributed Systems
APPEND-ONLY LOG
x=5 z=6 y=2 x=4 a=7 y=5… …
older newer
High Throughput
catch-up reads
Low Latency
tailing reads writes
© Copyright 2017 Dell Inc.10
Our Goal: Refactor the “Accidental Storage Stack”
Ingest Buffer & Pub/Sub
“Pravega Streams”
Scale-out SDS
NoSQL DB Search Analytics Engines
Using Logs as a Shared Storage Primitive
Ingest Buffer & Pub/Sub
ProprietaryLog Storage
Local Files
DAS
Kafka
NoSQL DBProprietary
Log Storage
Local Files
DAS
Cassandra et al
© Copyright 2017 Dell Inc.11
Introducing Pravega StreamsA new log primitive designed for streaming architectures
• Pravega is an open source distributed storage service offering a new storage abstraction called a stream
• A stream is the foundation for building reliable streaming systems: a high-performance, durable, elastic, and infinite append-only log with strict ordering and consistency
• A stream is as lightweight as a file – you can create millions of them in a single cluster
• Streams greatly simplify the development and operation of a variety of distributed systems: messaging, databases, analytic engines, search engines, and so on
© Copyright 2017 Dell Inc.12
Streaming Storage System
Pravega Architecture
StreamAbstraction
Pravega Streaming Service
Cloud Scale Storage(Isilon or ECS)
• High-Throughput• High-Scale, Low-Cost
Low-Latency Storage
Apache Bookkeeper
Auto-Tiering
Cache(Rocks)
Messaging Apps
Real-Time / Batch / Interactive Predictive Analytics
Stream Processors: Spark, Flink, …Other Apps & Middleware
Pravega Design Innovations1. Zero-Touch Dynamic Scaling
- Automatically scale read/write parallelism based on load and SLO
- No service interruptions- No manual reconfiguration of clients- No manual reconfiguration of
service resources2. Smart Workload Distribution
- No need to over-provision servers for peak load
3. I/O Path Isolation- For tail writes- For tail reads- For catch-up reads
4. Tiering for “Infinite Streams”5. Transactions For “Exactly Once”
© Copyright 2017 Dell Inc.14
Nautilus VisionA Unified Platform For Big & Fast Data For The Enterprise
Enable Real-time Ingestion
• Support for low latency ingestion of continuous as well as trickle streams
• In-place analytics via Multi-protocol access: Kafka, HDFS & S3
• Multi-tenancy & Role Based access for streaming data
Unify Stream & Batch Processing
• Access to storage at memory speeds
• Fast checkpoint from streaming engine to storage
• Support for data science workflows
Reduce Storage & Operational costs
• Single cluster for storage (streaming, long-term) & analytics
• Seamless disaster recovery via geo-enabled storage
• Simplified architecture results in lower operational cost
© Copyright 2017 Dell Inc.15
Nautilus: A Unified Storage Architecture• An extensible collection of elastic storage micro-services
Open Source HDFS
N/A
IsilonIsilon
ECSECS
DeepStorage Layer
StorageAccess Layer
StreamingStorage Layer
Ingest Buffer Pub/Sub Search Persistent Data Structures …
ApplicationsInfrastructure Monitoring
Patient SurveillanceIntrusion Detection Systems
Fleet Tracking Supply Chain ControlIntelligent Shopping Applications
Smart GridSnow Level Monitoring
Traffic Congestion Analytics
Wearable
Commercial Deployment Options
Pravega Streams
Object or FileData Access
© Copyright 2017 Dell Inc.16
Nautilus: A Unified Data PipelineStrongly Consistent Storage Exactly Once Processing Unified Analytics
Unified AnalyticsReal-Time, Batch, Interactive
Interactive exploration by Data Scientists
Real-time intelligence at the NOC
Sensors
Mobile Devices
App Logs
Isilon / ECS
Ingest Buffer Pub/Sub Search Persistent Data Structures
Pravega Streams
Unified Storage
© Copyright 2017 Dell Inc.17
Nautilus PlatformSecure | Integrated | Efficient | Elastic | Scalable
Nautilus: A Turn-Key Streaming Data Platform
PravegaStreaming Storage
FlinkStateful Stream Processor
Streaming SQL API
ZeppelinNotebook Experience
Notebook API
Mesos + DC/OS
Commodity Servers or Cloud
Secu
rity
Serviceability
Streaming Storage API
Digital WorldReal-Time/Batch Analytics
Frameworks and Apps Interactive Exploration
© Copyright 2017 Dell Inc.18
Operational Scenario: TodayHow to test & deploy new version of analytics business logic?
Archived DataHDFS or NFS or Object
Recent Data: Kafka Logs
x=5 z=6 y=2 x=4 a=7 y=5… …
older newer
HDFS API
ETL
StreamingApplicationBusiness Logic
Requirements• Run new business logic against historical data sets• Validate correct results for problematic scenarios• Ensure no regression• Deploy new business logic in production• Ensure new version is functioning properly before switching
users/applications• Revert to prior if something fails
© Copyright 2017 Dell Inc.19
Operational Scenario: TodayHow to test & deploy new version of an analytics business logic?
Archived DataHDFS or NFS or Object
Recent Data: Kafka Logs
x=5 z=6 y=2 x=4 a=7 y=5… …
older newer
HDFS API
ETL
StreamingApplicationBusiness Logic
StreamingApplicationBusiness Logic’
New version of app deployed with different data access methods
Challenges• Custom scripting and deployment required because historical
data is located in different storage system and accessed via different data type (e.g. files vs. logs)
• Test run is not exactly like production due to mismatches between log/file access and deployment differences – requires more time, is error prone, and leads to inaccurate test results
• Often requires downtime if upgrading in place• Upgrade “next to existing” requires complex workflow
Historical data accessed via files(not logs!) from HDFS archive
© Copyright 2017 Dell Inc.20
Operational Scenario: With NautilusHow to test & deploy new version of a analytics business logic?
StreamingApplicationBusiness Logic
Streams
Tiering to/from ECS handled automatically by the Streaming Storage Subsystem
StreamingApplicationBusiness Logic’
New version of app deployed exactly like production
1
Historical data accessed via same stream as production – just rewind the stream!
2
Once history is consumed, seamlessly start reading real-
time data!
3
Once you are confident things are working, turn off old version and redirect NOC consoles
4
© Copyright 2017 Dell Inc.22
Summary
1. “Streaming Architecture” replaces “Accidental Architecture”– Data: infinite/continuous vs. static/finite– Correctness in real-time: Exactly once processing + consistent storage
2. Pravega Streaming Storage Enables Storage Refactoring– Infinite, durable, scalable, re-playable, elastic append-only log– Open source project
3. Nautilus = Unified Storage + Unified Data Pipeline– The New Data Stack!