Understanding Metadata: Why it’s essential to your big data solution and how to manage it well Tuesday, June 21, 2016 Ben Sharma | Vikram Sreekanti

Upload: zaloni

Post on 21-Jan-2017


Page 1: Understanding Metadata: Why it's essential to your big data solution and how to manage it well

Understanding Metadata: Why it’s essential to your big data solution and how to manage it well

Tuesday, June 21, 2016

Ben Sharma | Vikram Sreekanti

Page 2

Speakers

Ben Sharma, Co-Founder & CEO – Zaloni

Ben Sharma is a passionate technologist and thought leader in big data, analytics and enterprise infrastructure solutions. Having previously worked in technology leadership at NetApp, Fujitsu and others, Ben's expertise ranges from business development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization and storage. Ben is co-author of Architecting Data Lakes and Java in Telecommunications.

Vikram Sreekanti, Software Engineer – AMPLab, UC Berkeley

Vikram Sreekanti is a software engineer working on research in the AMPLab at UC Berkeley. A graduate of Berkeley's computer science department, he will begin his Ph.D. in Fall 2016, working with Joe Hellerstein.

Page 3

Metadata matters in a big data world

In today’s data environment, which mixes structured and unstructured data, metadata is more important than ever:

•  Metadata lets you keep track of what data is in the data lake, along with its source, format and lineage

•  Metadata allows for better change management through impact analysis

•  The result is data visibility, reliability and reduced time to insight for your analytics

Zaloni Proprietary 3

Page 4

Data architecture modernization

[Diagram: traditional versus new data architecture]

Traditional: Sources, ETL, EDW

New: Various Sources, Streaming, Unstructured Data, Data Lake, Derived (Transformed), Discovery Sandbox, EDW

Consumers shown: Reporting, BI Extracts; Data Science; Data Discovery

Page 5

Data lake reference architecture

[Diagram: data lake reference architecture]

Source System: File Data, DB Data, ETL Extracts, Streaming; OLTP or ODS, Enterprise Data Warehouse, Logs (or other unstructured data), Cloud Services

Data Lake: Transient Loading Zone, Raw Data, Refined Data, Trusted Data, Discovery Sandbox

Labels within the data lake: Original unaltered data attributes; Tokenized Data; Integrate to common format; Data Validation; Data Cleansing; Aggregations; Reference Data; Master Data; Data Wrangling; Data Discovery; Exploratory Analytics

Across the data lake: Metadata, Data Quality, Data Catalog, Security

Consumption Zone: APIs; Business Analysts, Researchers, Data Scientists

Page 6

Metadata improves data visibility and reliability

•  Reduced time to insight for analytics
•  Modern data architecture will require a holistic approach to metadata

Type of Metadata | Description | Example
Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
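As a rough illustration of the three metadata types on this slide, the sketch below models the metadata for one data set in Python. The class names, field names and sample values are hypothetical, chosen only to mirror the slide; they are not taken from Bedrock, Mica or any other product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TechnicalMetadata:
    """Form and structure of a data set."""
    data_format: str            # e.g. "text", "JSON", "Avro"
    fields: Dict[str, str]      # field name -> field type


@dataclass
class OperationalMetadata:
    """Lineage, quality, profile and provenance of a data set."""
    source_location: str
    target_location: str
    size_bytes: int
    record_count: int
    lineage: List[str] = field(default_factory=list)  # upstream entity names


@dataclass
class BusinessMetadata:
    """What the data means to the user."""
    business_name: str
    description: str
    tags: List[str] = field(default_factory=list)
    masking_rules: Dict[str, str] = field(default_factory=dict)  # field -> rule


@dataclass
class EntityMetadata:
    """Hypothetical container holding all three kinds for one entity."""
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Illustrative values only; none of these come from the talk.
orders = EntityMetadata(
    technical=TechnicalMetadata("JSON", {"order_id": "string", "amount": "double"}),
    operational=OperationalMetadata(
        source_location="sftp://edge-node/orders.json",
        target_location="/lake/raw/orders",
        size_bytes=1_048_576,
        record_count=5000,
        lineage=["web_logs"],
    ),
    business=BusinessMetadata(
        business_name="Customer Orders",
        description="Orders extracted from web logs",
        tags=["sales"],
        masking_rules={"order_id": "tokenize"},
    ),
)
```

Keeping the three kinds in separate structures reflects the slide's point that each answers a different question: what the data looks like, where it came from, and what it means.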

Page 7

Automated metadata registration

Considerations:
•  Integration with Enterprise Metadata Management Solutions
•  Automated process for new metadata to be registered in the Data Lake
•  Data follows the registered metadata

[Diagram: registration flow from START to END. Labels: API; retrieve metadata; Enterprise Metadata Repositories; metadata file; check-in copy to repository; Edge-node to Cluster (SFTP); Hadoop Cluster; add tags, origin info, timestamp, etc.; Metadata; operational metadata file]
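The "add tags, origin info, timestamp, etc." step in the diagram could look something like the following Python sketch. The function name, the record layout and the checksum field are assumptions made for illustration; the slide does not specify how the operational metadata file is built.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional


def build_operational_metadata(
    file_name: str,
    content: bytes,
    origin: str,
    tags: List[str],
    registered_at: Optional[datetime] = None,
) -> Dict[str, object]:
    """Describe an ingested file: origin info, tags, timestamp, size, checksum."""
    when = registered_at or datetime.now(timezone.utc)
    return {
        "file_name": file_name,
        "origin": origin,
        "tags": sorted(tags),
        "registered_at": when.isoformat(),
        "size_bytes": len(content),
        # The checksum is an added assumption, not something the slide mentions.
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Illustrative input; a fixed timestamp keeps the output repeatable.
record = build_operational_metadata(
    "orders.csv",
    b"id,amount\n1,9.99\n",
    origin="edge-node-01",
    tags=["sales", "raw"],
    registered_at=datetime(2016, 6, 21, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Generating the record at ingestion time, rather than afterwards, is one way to make sure that data only lands in the lake with metadata attached.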

Page 8

Data lineage example in Bedrock for impact analysis

Page 9

Metadata enhancing data quality and reliability

Page 10

Data profiling speeds up data discovery and time to insight

Business users can quickly answer questions such as:

•  How many records does an entity have? What is its total size?
•  What does the activity look like for a specific entity (streaming, updated monthly, untouched from a year ago)?
•  Is this entity a subset of another entity?
•  Does this entity likely contain duplicates?
•  Does this data apply to my target customers/market?
•  What is the min/max of a particular column?
•  Is this data reliable/does it have enough valid values?

Page 11

Data profiling example in Mica

Capture profiling metrics for every entity

•  Automatically collect profiling metrics at the:
   §  Entity level (e.g., size of data set)
   §  Field level (e.g., values, frequency of the field)
•  Visually display metrics with metadata
•  Allow data quality check rules to be created based on profiling information
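To make the entity-level and field-level metrics concrete, here is a minimal Python sketch of a profiler over in-memory rows. It is not how Mica computes its metrics; the function name, metric names and sample rows are all hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_entity(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute simple entity-level and field-level profiling metrics."""
    field_names = sorted({name for row in rows for name in row})
    fields: Dict[str, Dict[str, Any]] = {}
    for name in field_names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        counts = Counter(present)
        fields[name] = {
            "valid_count": len(present),
            "null_count": len(values) - len(present),
            "distinct_count": len(counts),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "most_common": counts.most_common(1)[0][0] if counts else None,
        }
    # Rows that exactly repeat an earlier row hint at duplicates.
    unique_rows = {tuple(sorted(row.items())) for row in rows}
    return {
        "record_count": len(rows),
        "duplicate_rows": len(rows) - len(unique_rows),
        "fields": fields,
    }


# Illustrative data only.
rows = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 25},
    {"customer": "a", "amount": 10},
    {"customer": "c", "amount": None},
]
profile = profile_entity(rows)
```

Metrics like these answer several of the business-user questions listed earlier (record counts, min/max, likely duplicates, share of valid values), and a data quality rule could then be expressed as a threshold on one of them.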

Page 12

Data catalog example in Mica

Page 13

Data lifecycle management powered by Metadata

•  Logical data lake that can include all tiers of storage:
   §  Files, HDFS, Object store in on-premises and cloud environments
•  Data lifecycle management across tiered storage environments
   §  Hot -> Warm -> Cold on an entity level based on policies/SLAs
   §  Across on-premises and cloud environments
   §  Take advantage of various storage technologies
   §  Provide data management features to automate scheduling and orchestration of data movement between heterogeneous storage environments
•  Elastic and on-demand compute for various analytical workloads
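The Hot -> Warm -> Cold movement can be sketched as a small policy check driven by metadata. The example below assumes the policy is expressed as idle-time thresholds on an entity's last access date; the slide only says "policies/SLAs", so the `TierPolicy` class, its fields and the threshold values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical per-entity policy: idle-day thresholds for each tier."""
    warm_after_days: int
    cold_after_days: int


def choose_tier(last_accessed: date, today: date, policy: TierPolicy) -> str:
    """Return "hot", "warm" or "cold" based on days since last access."""
    idle_days = (today - last_accessed).days
    if idle_days >= policy.cold_after_days:
        return "cold"
    if idle_days >= policy.warm_after_days:
        return "warm"
    return "hot"


# Illustrative policy: warm after 30 idle days, cold after 365.
policy = TierPolicy(warm_after_days=30, cold_after_days=365)
today = date(2016, 6, 21)
recent = choose_tier(date(2016, 6, 1), today, policy)
older = choose_tier(date(2016, 3, 1), today, policy)
oldest = choose_tier(date(2015, 1, 1), today, policy)
```

A scheduler could run a check like this per entity and trigger the actual data movement between storage tiers; that orchestration is outside the scope of the sketch.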

Page 14

Example: Metadata management in Financial Services

[Diagram: data acquisition automation. Labels: Source Systems (RDBMS/Mid Tier, Mainframe COBOL, Flat files, SAS files); Metadata repositories; Metadata Management solution; Register/update metadata; Extract/Read metadata; Data Ingestion; Data Quality and Validation; Layout Standardization; Operational Metadata Generation; Data Acquisition Automation]

•  Automated Data Acquisition Framework providing timeliness of data
•  Capture Metadata in all phases: Ingestion, Transformation
•  Integration with Enterprise Metadata Management
•  Integrated Data Quality Analysis

Page 15

DON’T GO IN THE LAKE WITHOUT US

Page 16

Grounding Big Data

Vikram Sreekanti
UC Berkeley

Page 17

REMEMBERING THE PAST

Data Warehouse: Single Source of Truth

Enterprise Information Architecture: Golden Master

Truth Truth

Pages 18-20

Big data took us to a new world

There were changes in volume, velocity and variety, which were challenging.

The real challenge now is the meaning and value of data, which depend critically on context.

Page 21

WHAT IS DIFFERENT?

Shift in technology: Data representations

Shift in behavior: Data-driven organizations

Page 22

Shift in behavior: Data-driven organizations

Page 23

Data in products

Started with the Internet. Now, the Internet of Things.

Page 24

Data in marketing

By 2017: marketing spends more on tech than IT does.

By 2020: 90% of tech budget controlled outside of IT.

Source: Gartner Group

Page 25

MANY USE CASES

MANY CONSTITUENCIES

MANY INCENTIVES

MANY CONTEXTS


Page 27

Shift in technology: Data representations

Page 28

Raw data in the data lake

Simplifies capture

Encourages exploration

What does it mean? It depends on the context.

Page 29

A LITTLE SCENARIO

HDFS

Page 30

BITS: All the web logs from last year

Page 31

VIEWS, MODELS, CODE: A script to extract orders. To be used for Market Basket analysis.

Page 32

VIEWS, MODELS, CODE: A Hive table of orders. To be used for Market Basket analysis.

Page 33

BITS: All the web logs from last year

Page 34

VIEWS, MODELS, CODE: Code to extract abandoned user sessions

Page 35

VIEWS, MODELS, CODE: A retargeting model

Page 36

VIEWS, MODELS, CODE: A Hive table of orders; a retargeting model
Page 38

MANY SCRIPTS

MANY MODELS

MANY APPLICATIONS

MANY CONTEXTS

Page 39

A broader context for big data

Ground

Page 40

THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT

Application Context: Views, models, code

Behavioral Context: Data lineage & usage

Historical Context: In and over time

Page 41

APPLICATION CONTEXT: Metadata

Models for interpreting the data for use
§ Data structures
§ Semantic structures
§ Statistical structures

Theme: An unopinionated model of context

Page 42

HISTORICAL CONTEXT: Versions

Web logs; code to extract user/movie rentals; recommender for movie licensing

Trends over time: How does a movie with these features fare over time?

Point in time: A promising new movie is similar to older hot movies at time of release!

Page 43

BEHAVIORAL CONTEXT: Lineage & Usage

Why Dora?!

Page 44

BEHAVIORAL CONTEXT: Lineage & Usage

Data Science Recommenders: “You should compare with book sales from last year.”

Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”

Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
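The proactive impact analysis example can be sketched as a walk over a lineage graph: when one artifact changes, everything downstream of it is a candidate for review. The graph contents, node names and the `downstream_of` function below are hypothetical and only echo the slide's example; they are not Ground's API.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage: each key feeds the artifacts in its list.
LINEAGE: Dict[str, List[str]] = {
    "twitter_feed": ["twitter_analysis_script"],
    "twitter_analysis_script": ["sentiment_table"],
    "sentiment_table": ["boss_dashboard", "weekly_report"],
    "weather_data": ["logistics_report"],
}


def downstream_of(changed: str, lineage: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search for every artifact reachable from `changed`."""
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


impacted = downstream_of("twitter_analysis_script", LINEAGE)
```

Here a change to the analysis script flags the dashboard without anyone having to remember the dependency, which is the kind of alert the slide describes.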

Page 45

THE BIG CONTEXT

A NEW WORLD NEEDS NEW SERVICES

Page 46

WHAT ARE WE BUILDING?

Grounding philosophy
§ Start useful, stay useful.
§ Stay general.
§ Design for scale.

Page 47

[Diagram: Ground architecture]

ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving

COMMON GROUND: CONTEXT MODEL

UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth

Page 48

[Diagram: the Ground architecture shown on page 47, with Pachyderm and Chronos added]

Page 49

COMMON GROUND: An unopinionated context model

Versions, Models, Usage

Page 50

COMMON GROUND: The metamodel

Model Graphs

[Diagram labels: Models, Versions, Usage]

Page 51

[Diagram: two examples of model graphs]

JSON DOCUMENT: Root; Object 2; member k1: string; member k2: number; member k11: string; member k12; element 1, element 2, element 3

RELATIONAL SCHEMA: Schema 1; Table 1 (Column 1 to Column c); Table t (Column 1 to Column d); foreign key

[Diagram labels: Models, Versions, Usage]
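One way to read the JSON DOCUMENT half of this diagram is that a nested document can be represented as a graph with one node per member or array element. The Python sketch below shows that idea under stated assumptions: the path-style node names and the `json_to_graph` function are hypothetical and are not Ground's model-graph representation.

```python
import json
from typing import Any, List, Tuple

Edge = Tuple[str, str]


def json_to_graph(value: Any, path: str = "root") -> Tuple[List[str], List[Edge]]:
    """Turn a parsed JSON value into (nodes, edges), one node per member or element."""
    nodes = [path]
    edges: List[Edge] = []
    if isinstance(value, dict):
        children = [(f"{path}.{key}", child) for key, child in value.items()]
    elif isinstance(value, list):
        children = [(f"{path}[{i}]", child) for i, child in enumerate(value)]
    else:
        children = []  # scalars are leaves
    for child_path, child in children:
        edges.append((path, child_path))
        child_nodes, child_edges = json_to_graph(child, child_path)
        nodes.extend(child_nodes)
        edges.extend(child_edges)
    return nodes, edges


# Illustrative document loosely shaped like the one in the diagram.
doc = json.loads('{"k1": "a", "k2": {"k11": "b", "k12": [1, 2, 3]}}')
nodes, edges = json_to_graph(doc)
```

A relational schema could be mapped the same way, with schema, table and column nodes and an extra edge for each foreign key.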

Page 52

COMMON GROUND: The versioning model

Model Graphs

Version Graphs

[Diagram labels: Models, Versions, Usage]


Page 54

VERSION GRAPHS

Directed Acyclic Graphs (partial orders)

[Diagram: version nodes identified by the hashes below, with the captions "In this order" and "In no particular order"]

a3eb4b765520b0d0ab90594dcf2373c1ce5dbb0b0
0e9233e8e99cccd6861d304968efa4c945a0b918
3e64220f08374629ad43ca652d4ce7cef0bdbbca
3e0bada008655fe32d7d136eac0a3f333d23ed80
fd75a4ba16f96d11f3f954854acc2d739054233

[Diagram labels: Models, Versions, Usage]
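A version graph that is a directed acyclic graph defines a partial order: some versions are ordered relative to each other, and some are not. The Python sketch below illustrates that with a small hypothetical history (the version names, the `PARENTS` mapping and both helper functions are assumptions, not Ground's implementation). It uses the standard library's `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set

# Hypothetical version graph: each version maps to its parent versions.
PARENTS: Dict[str, Set[str]] = {
    "v1": set(),
    "v2": {"v1"},
    "v3": {"v1"},          # v2 and v3 both branch from v1
    "v4": {"v2", "v3"},    # v4 merges the two branches
}


def ancestors(version: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All versions reachable by following parent links from `version`."""
    seen: Set[str] = set()
    stack = list(parents.get(version, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def is_ordered(a: str, b: str, parents: Dict[str, Set[str]]) -> bool:
    """True if one version is an ancestor of the other."""
    return a in ancestors(b, parents) or b in ancestors(a, parents)


# One linear order consistent with the graph; parents always come first.
order = list(TopologicalSorter(PARENTS).static_order())
```

In this example `v1` and `v4` are ordered, while the sibling branches `v2` and `v3` are not; that is one reading of the "in this order" and "in no particular order" captions in the diagram.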

Page 55

COMMON GROUND: The usage model

Model Graphs

Version Graphs

Usage Graphs: Lineage

[Diagram labels: Models, Versions, Usage]

Page 56

USAGE GRAPHS

Everything can participate in usage

[Diagram labels: Models, Versions, Usage]

Page 57

COMMON GROUND: The model

Versions, Models, Usage

Model Graphs

Version Graphs

Usage Graphs: Lineage

Page 58

INITIAL FOCUS AREAS

Page 59

INITIAL FOCUS AREAS

[Diagram: the Ground architecture shown on page 47, repeated on pages 59 to 62 with different labels called out]

Page 60

INITIAL FOCUS AREAS

Called out: Parsing & Featurization; Model Serving; Reproducibility

Page 61

INITIAL FOCUS AREAS

Called out: Versioned Storage

Page 62

Called out: ABOVEGROUND API TO APPLICATIONS; UNDERGROUND API TO SERVICES

Page 63

Learn more at: http://www.ground-context.org

@vsreekanti