data architecture for data-driven enterprises · 2019-12-21 · metadata enrichment: quality, type,...
TRANSCRIPT
Data Architecture forData-driven Enterprises
© 2018 NetApp, Inc. All rights reserved.
Deepti Aggarwal, Rukma TalwadkerATG, NetAppMay 25, 2018
Big Data and AI/ML for insights
© 2017 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---2
Data lake
Local vs. Cloudcompute
ingest
Data Filtration/ cleaning
Data Processing(real / near real / batch)
Data AnalyticsMachine LearningDeep Learning
Data Visualization
Change in the Data LifecycleWarrants the need for a re-look at Data Architecture
© 2018 NetApp, Inc. All rights reserved.
~160 ZB by 2025
Kind of data
Generation of Data
Processing of data
Extracting value from
data
Evolving Data Pipeline
© 2018 NetApp, Inc. All rights reserved
Data Ingest
- Data collection and aggregation
- Real time model inference
Data Processing
- Dedicated hardware
- Model Building
Data Archival
- Hosted as-a-service solutions
- Long term data retention
Edge Core Cloud
Edge Cloud
Data Lake
Evolving Data Pipeline
Data Ingest
- Data collection and aggregation
- Real time model inference
Data Processing
- Dedicated hardware
- Model Building
Data Archival
- Hosted as-a-service solutions
- Long term data retention
Edge Core Cloud
Edge Cloud
Data Lake
PERFORMANCE
SECURITY
COMPLIANCE
EFFICIENT STORAGE
DATA QUALITY
Edge Real Time InsightsData Ingestion & Transfer
6 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Storage Efficiency over the Wire
• WAN transfer from the edge to the core/cloud
• Two views:• Transfer limited data
• ‘Useful’ data• Deviations from ‘normal’
• Transfer all data • Keep everything • Quantifiable data loss/approximation
© 2018 NetApp, Inc. All rights reserved.
Edge
Matrix Profiles: An anytime algorithm for time series analysis
• Anytime Algorithm• Consistent result at any iteration
• Scales with hardware
• Helps identify recurring sub-sequences, anomalies, change points, etc.
• Time series represented as a sequence of repeating patterns à data reduction
© 2018 NetApp, Inc. All rights reserved.
Edge
http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
Metadata Enrichment: Quality, type, content, behavior….
• First point of data entry into the pipeline
• What tags to add?• Data Quality
• Completeness, accuracy, timeliness
• Kind of data• File extension based, content based
• Behavior• Anomalous, normal
• Lineage, auditing
• Aids data workflows down the pipeline
• Kafka added support for enriching streams with custom metadata
© 2018 NetApp, Inc. All rights reserved.
Edge
Security is a major concern at the edge
© 2018 NetApp, Inc. All rights reserved.
• Data privacy
• Access controls and perimeter security (firewalls)
• Too many end-points
• What if device gets hacked/stolen?• Means of authentication at the edge node
• Establish secure session based public key sharing• SSL Certificates
Edge
CorePrivate Cloud, DatacenterSecure Domain
11 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Parallel compute demands high performance from storage
12 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Core
• More advanced Neural Network variants
• RNN, CNN etc
• Easily available platforms
• Tensorflow, Keras, Theano…
• More and more data!!
• Operationalizing AI/ML workflows and prevalence of GPUs
• High, predictable performance from the underlying storage
• Model training stretching to days
• A small factor slowdown in storage impacts cost heavily.
• Changing workloads
• New analytic applications
• Dynamic indexing based on query patterns
• Which columns to index?
• When to update the indices?
• Cost of updating indexes is compute heavy
• Which indices are obsolete?
‘Smart’ Data Indexing to meet performance requirements
§ Traditional indices do not take advantage of
the data distribution or patterns in the data.
• ’Learned’ Indexes
• https://arxiv.org/abs/1712.01208
• Key idea: Structure of keys or sort order learnt, used to
predict location of record
13 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Core
Futuristic scenario: AI/ML models replacing Core Data Management components
§ Effective aggregation to present holistic view of data to applications§ Raw data from millions of sensors§ Continuous streams – transformations by analytic
apps§ Structured, unstructured, text, multimedia
§ Data Anonymization§ Data goes out of private domain
§ Data preparation for transfer § Bundle which data together? § Encrypt and maintain Key Metadata Store§ Compression
Aggregation and Preparation for transfer to Cloud
14 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Core
Cloud Data archivalUntrusted domain
15 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
• Compliance regulations dictate lifecycle of data
• Right to be forgotten GDPR• Mandates need for efficient indexing
• Long term data retention• Cost • Unpredictable growth in data
• Provenance/traceability• Metadata about every small modification• Data auditing
Compliance governs data archival
16
Cloud
• Reducing archival cost due to exponential growth in data• Tiering• Erasure Coding over replication
• Archival Tiering• AWS archive tiers: S3, Glacier Expedite, Glacier Bulk• Tradeoff between retrieval times and $/GB
• Another way to minimize cost further: Choose different Erasure coding Schemes for different tiers• Tradeoff between storage overhead, read times, bandwidth and compute performance. • Coldest tier: minimal storage overhead, active archive: lower read times.
Archival Tiering
17
Cloud
Cloud goes hand-in-hand with Data Security
§ Store encrypted data in cloud with key metadata on-prem in protected environment
§ VPNs – one mechanism to provide restricted access
§ Enterprises’ looking to run analytics on archives?§ In presence of encryption?
§ Ongoing work: encryption schemes which compromise on some security to preserve a particular aspect of data§ Searchable encryption§ Order-preserving encryption
18 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Other Dimensions of interest
§ Storage infrastructure suitable for each layer§ Object stores v/s tradition NAS
§ Implementation of Compliance regulations
§ Global namespace across the pipeline
19 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Putting it all together…NetApp Data Fabric
20 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
§ Changing data lifecycle
§ Data management issues remain the same – need a re-think in context of
changing data workflows
§ Performance, Compliance, Data Quality, Security and Efficient Storage are the
key data management challenges that emerge
21 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Key Takeaways
22
Thank You
© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Backup
23 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Concluding Remarks
§ Everything about data is changing§ Types of data§ Data flow§ Data lifecycle
§ IOT – one of the most emerging upcoming trends
§ Key data management challenges § Performance§ Compliance
24 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Light: Title Slide 2Subtitle text placeholder
© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Light: Title Slide 3Subtitle text placeholder
© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Other Data Management Challenges at the Edge
© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Data Ingest
- Real-time processing/ Model Inference- Data preparation for transfer
Edge
• First point of data entry• Perfect place for metadata enrichment
• Quality assessment• Completeness, Accuracy, Validity, Consistency
• Storage efficiency• In the light of data reduction• Quantifiable data approximation to reduce size• WAN transfer
Storage efficiency and Metadata Enrichment
• WAN transfer from the edge to the core/cloud
• Storage efficiency over the wire§ ‘Useful’ data§ ‘New’ data§ ‘Interesting’ data
• All data - keep everything and let AI worry about too much data
• Efficient anytime algorithm for identifying recurring patterns, anomalies, etc. • http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
© 2018 NetApp, Inc. All rights reserved.
• First point of data entry into enterprise environment
• Perfect point for Metadata enrichment• Data Quality• Geo-location• Behavior identification
• Helps in various data workflows down the pipeline
Edge
Performance never been more critical!
© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Data Ingest
- Real time actionable insights required
- Feed into real-time monitoring
Data Processing
- Training of DL workloads is resource-heavy
- Sheer volume of data
Data Archival
- Fast retrieval
PERFORMANCE
Data Lake
Public Cloud
Private Cloud
GPUs read in parallel from storage; resource heavy AI/ML routines
Agenda Slide
1) First section/topic
2) Second section/topic
§ A brief description example
3) Third section/topic
4) Fourth section/topic
List major section of your presentation
30 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Agenda Slide
1) First section/topic
2) Second section/topic
§ A brief description example
3) Third section/topic
4) Fourth section/topic
List major section of your presentation
31 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
5) Fifth section/topic
6) Sixth section/topic
§ A brief description example
Lorem ipsum dolor sit amet, consecteturadipiscing elit. Fusce ultrices felis eget enimtincidunt faucibus. Praesent vitae orci vestibulum, ultricies arcu in, sagittis tortor. Suspendisseconvallis ligula a gravida tincidunt.
32 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Closing Thoughts
33 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —
Closing Thoughts
Lorem ipsum dolor sit amet, consecteturadipiscing elit. Fusce ultrices felis eget enimtincidunt faucibus. Praesent vitae orci vestibulum, ultricies arcu in, sagittis tortor. Suspendisseconvallis ligula a gravida tincidunt. Fusce posuereipsum at finibus euismod. Vestibulum interdumtincidunt ligula vel tincidunt.