data architecture for data-driven enterprises · 2019-12-21 · metadata enrichment: quality, type,...

33
Data Architecture for Data-driven Enterprises © 2018 NetApp, Inc. All rights reserved. Deepti Aggarwal, Rukma Talwadker ATG, NetApp May 25, 2018

Upload: others

Post on 17-Apr-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Data Architecture forData-driven Enterprises

© 2018 NetApp, Inc. All rights reserved.

Deepti Aggarwal, Rukma TalwadkerATG, NetAppMay 25, 2018

Page 2: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Big Data and AI/ML for insights

© 2017 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---2

Data lake

Local vs. Cloudcompute

ingest

Data Filtration/ cleaning

Data Processing(real / near real / batch)

Data AnalyticsMachine LearningDeep Learning

Data Visualization

Page 3: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Change in the Data LifecycleWarrants the need for a re-look at Data Architecture

© 2018 NetApp, Inc. All rights reserved.

~160 ZB by 2025

Kind of data

Generation of Data

Processing of data

Extracting value from

data

Page 4: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Evolving Data Pipeline

© 2018 NetApp, Inc. All rights reserved

Data Ingest

- Data collection and aggregation

- Real time model inference

Data Processing

- Dedicated hardware

- Model Building

Data Archival

- Hosted as-a-service solutions

- Long term data retention

Edge Core Cloud

Edge Cloud

Data Lake

Page 5: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Evolving Data Pipeline

Data Ingest

- Data collection and aggregation

- Real time model inference

Data Processing

- Dedicated hardware

- Model Building

Data Archival

- Hosted as-a-service solutions

- Long term data retention

Edge Core Cloud

Edge Cloud

Data Lake

PERFORMANCE

SECURITY

COMPLIANCE

EFFICIENT STORAGE

DATA QUALITY

Page 6: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Edge Real Time InsightsData Ingestion & Transfer

6 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 7: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Storage Efficiency over the Wire

• WAN transfer from the edge to the core/cloud

• Two views:• Transfer limited data

• ‘Useful’ data• Deviations from ‘normal’

• Transfer all data • Keep everything • Quantifiable data loss/approximation

© 2018 NetApp, Inc. All rights reserved.

Edge

Page 8: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Matrix Profiles: An anytime algorithm for time series analysis

• Anytime Algorithm• Consistent result at any iteration

• Scales with hardware

• Helps identify recurring sub-sequences, anomalies, change points, etc.

• Time series represented as a sequence of repeating patterns à data reduction

© 2018 NetApp, Inc. All rights reserved.

Edge

http://www.cs.ucr.edu/~eamonn/MatrixProfile.html

Page 9: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Metadata Enrichment: Quality, type, content, behavior….

• First point of data entry into the pipeline

• What tags to add?• Data Quality

• Completeness, accuracy, timeliness

• Kind of data• File extension based, content based

• Behavior• Anomalous, normal

• Lineage, auditing

• Aids data workflows down the pipeline

• Kafka added support for enriching streams with custom metadata

© 2018 NetApp, Inc. All rights reserved.

Edge

Page 10: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Security is a major concern at the edge

© 2018 NetApp, Inc. All rights reserved.

• Data privacy

• Access controls and perimeter security (firewalls)

• Too many end-points

• What if device gets hacked/stolen?• Means of authentication at the edge node

• Establish secure session based public key sharing• SSL Certificates

Edge

Page 11: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

CorePrivate Cloud, DatacenterSecure Domain

11 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 12: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Parallel compute demands high performance from storage

12 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Core

• More advanced Neural Network variants

• RNN, CNN etc

• Easily available platforms

• Tensorflow, Keras, Theano…

• More and more data!!

• Operationalizing AI/ML workflows and prevalence of GPUs

• High, predictable performance from the underlying storage

• Model training stretching to days

• A small factor slowdown in storage impacts cost heavily.

Page 13: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

• Changing workloads

• New analytic applications

• Dynamic indexing based on query patterns

• Which columns to index?

• When to update the indices?

• Cost of updating indexes is compute heavy

• Which indices are obsolete?

‘Smart’ Data Indexing to meet performance requirements

§ Traditional indices do not take advantage of

the data distribution or patterns in the data.

• ’Learned’ Indexes

• https://arxiv.org/abs/1712.01208

• Key idea: Structure of keys or sort order learnt, used to

predict location of record

13 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Core

Futuristic scenario: AI/ML models replacing Core Data Management components

Page 14: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

§ Effective aggregation to present holistic view of data to applications§ Raw data from millions of sensors§ Continuous streams – transformations by analytic

apps§ Structured, unstructured, text, multimedia

§ Data Anonymization§ Data goes out of private domain

§ Data preparation for transfer § Bundle which data together? § Encrypt and maintain Key Metadata Store§ Compression

Aggregation and Preparation for transfer to Cloud

14 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Core

Page 15: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Cloud Data archivalUntrusted domain

15 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 16: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

• Compliance regulations dictate lifecycle of data

• Right to be forgotten GDPR• Mandates need for efficient indexing

• Long term data retention• Cost • Unpredictable growth in data

• Provenance/traceability• Metadata about every small modification• Data auditing

Compliance governs data archival

16

Cloud

Page 17: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

• Reducing archival cost due to exponential growth in data• Tiering• Erasure Coding over replication

• Archival Tiering• AWS archive tiers: S3, Glacier Expedite, Glacier Bulk• Tradeoff between retrieval times and $/GB

• Another way to minimize cost further: Choose different Erasure coding Schemes for different tiers• Tradeoff between storage overhead, read times, bandwidth and compute performance. • Coldest tier: minimal storage overhead, active archive: lower read times.

Archival Tiering

17

Cloud

Page 18: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Cloud goes hand-in-hand with Data Security

§ Store encrypted data in cloud with key metadata on-prem in protected environment

§ VPNs – one mechanism to provide restricted access

§ Enterprises’ looking to run analytics on archives?§ In presence of encryption?

§ Ongoing work: encryption schemes which compromise on some security to preserve a particular aspect of data§ Searchable encryption§ Order-preserving encryption

18 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 19: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Other Dimensions of interest

§ Storage infrastructure suitable for each layer§ Object stores v/s tradition NAS

§ Implementation of Compliance regulations

§ Global namespace across the pipeline

19 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 20: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Putting it all together…NetApp Data Fabric

20 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 21: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

§ Changing data lifecycle

§ Data management issues remain the same – need a re-think in context of

changing data workflows

§ Performance, Compliance, Data Quality, Security and Efficient Storage are the

key data management challenges that emerge

21 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Key Takeaways

Page 22: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

22

Thank You

© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 23: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Backup

23 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 24: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Concluding Remarks

§ Everything about data is changing§ Types of data§ Data flow§ Data lifecycle

§ IOT – one of the most emerging upcoming trends

§ Key data management challenges § Performance§ Compliance

24 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 25: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Light: Title Slide 2Subtitle text placeholder

© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 26: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Light: Title Slide 3Subtitle text placeholder

© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 27: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Other Data Management Challenges at the Edge

© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Data Ingest

- Real-time processing/ Model Inference- Data preparation for transfer

Edge

• First point of data entry• Perfect place for metadata enrichment

• Quality assessment• Completeness, Accuracy, Validity, Consistency

• Storage efficiency• In the light of data reduction• Quantifiable data approximation to reduce size• WAN transfer

Page 28: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Storage efficiency and Metadata Enrichment

• WAN transfer from the edge to the core/cloud

• Storage efficiency over the wire§ ‘Useful’ data§ ‘New’ data§ ‘Interesting’ data

• All data - keep everything and let AI worry about too much data

• Efficient anytime algorithm for identifying recurring patterns, anomalies, etc. • http://www.cs.ucr.edu/~eamonn/MatrixProfile.html

© 2018 NetApp, Inc. All rights reserved.

• First point of data entry into enterprise environment

• Perfect point for Metadata enrichment• Data Quality• Geo-location• Behavior identification

• Helps in various data workflows down the pipeline

Edge

Page 29: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Performance never been more critical!

© 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Data Ingest

- Real time actionable insights required

- Feed into real-time monitoring

Data Processing

- Training of DL workloads is resource-heavy

- Sheer volume of data

Data Archival

- Fast retrieval

PERFORMANCE

Data Lake

Public Cloud

Private Cloud

GPUs read in parallel from storage; resource heavy AI/ML routines

Page 30: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Agenda Slide

1) First section/topic

2) Second section/topic

§ A brief description example

3) Third section/topic

4) Fourth section/topic

List major section of your presentation

30 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Page 31: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Agenda Slide

1) First section/topic

2) Second section/topic

§ A brief description example

3) Third section/topic

4) Fourth section/topic

List major section of your presentation

31 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

5) Fifth section/topic

6) Sixth section/topic

§ A brief description example

Page 32: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

Lorem ipsum dolor sit amet, consecteturadipiscing elit. Fusce ultrices felis eget enimtincidunt faucibus. Praesent vitae orci vestibulum, ultricies arcu in, sagittis tortor. Suspendisseconvallis ligula a gravida tincidunt.

32 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Closing Thoughts

Page 33: Data Architecture for Data-driven Enterprises · 2019-12-21 · Metadata Enrichment: Quality, type, content, behavior…. •First point of data entry into the pipeline •What tags

33 © 2018 NetApp, Inc. All rights reserved. — NETAPP CONFIDENTIAL —

Closing Thoughts

Lorem ipsum dolor sit amet, consecteturadipiscing elit. Fusce ultrices felis eget enimtincidunt faucibus. Praesent vitae orci vestibulum, ultricies arcu in, sagittis tortor. Suspendisseconvallis ligula a gravida tincidunt. Fusce posuereipsum at finibus euismod. Vestibulum interdumtincidunt ligula vel tincidunt.