JOSA TechTalk: Metadata Management in Big Data
TRANSCRIPT

Metadata Management in Big Data
Data Management Challenges
Tariq Ezzibdeh (@ezzibdeh)
Aim
• Outline some perspectives on metadata management principles that apply in the big data space and beyond
• Provide data governance foundations in the data space that will outlast the actual technologies and serve the needs of the future
• Discuss technologies and solutions currently in the market
Big Data - Overview
• Big Data 5 V's: Volume, Velocity, Variety, Veracity => Value
• Platform of today – a set of relatively split-up components
  • Data is stored on HDFS – the file system
  • A catalogue of the data and its schema is maintained in another service – TBC!
  • Query front ends – query engines based on different requirements
Platform Architecture – Modern Architecture
[Figure: data flows from Data Sources through Acquisition and Data Systems into the Big Data Platform, then out to Data Marts, UI/API, and Apps. Platform zones:]
• Staging Zone – ETL and data standardization
• Pristine Archive – compressed (gzip etc.)
• Data Warehouse – immutable data
• Analytics Zone – allocated data changes
• Schema Catalogue – well-defined reference to data structures and attributes
• Data Ledger – track data and its access with lineage and operations
Source: Hortonworks
Why do we need to manage metadata for Big Data platforms?
• Large volumes of data landing in Hadoop/Big Data
• A growing number of users working with the data
• The need for effective control & consumption of data

The implementation needs to:
• Offer good data visibility across your cluster
• Capture data lineage across source systems and in the platform
• Audit and record operations that are performed in the platform
• Enforce policies that are defined by the platform stewards
• Help reduce data redundancy on the platform
Source: Cloudera
Metadata in Action
Metadata – What is it? Data about Data!
• Business Metadata supplies the business context around data, such as the business term's name, definition, owners or stewards, and associated reference data
• Technical Metadata provides technical information about the data, such as the name of the source table, the source table column name, and the data type (e.g., string, integer)
• Operational Metadata furnishes information about the use of the data, such as date last updated, number of times accessed, or date last accessed
Source: Informatica
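The three metadata types can be modeled as fields of a single record. A minimal sketch in Python, assuming illustrative class and field names that are not from the talk:

```python
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    # Technical metadata: where the data physically lives and its type
    source_table: str
    column_name: str
    data_type: str
    # Business metadata: the business context around the data
    business_term: str = ""
    steward: str = ""
    # Operational metadata: how the data is used
    last_updated: str = ""
    access_count: int = 0

col = ColumnMetadata(
    source_table="sales_raw",
    column_name="cust_id",
    data_type="string",
    business_term="Customer Identifier",
    steward="finance-data-team",
)
print(col.business_term)  # Customer Identifier
```

Keeping all three layers on the same record makes it easy to answer both "what is this column?" and "who uses it?" from one lookup.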
Why do I need all this metadata?
• The data lake will contain all types of data – log streams (Kafka), DBs (Sqoop)… don't let your lake turn into a swamp!
• Consistency of definitions – to reconcile differences in terminology such as "clients" vs. "customers," "revenue" vs. "sales"
• Clarity of data lineage – about the origins of a data set; can be granular enough to define information at the attribute level, including operations on it
• To understand data usage on your cluster, and to optimize queries and views
• Compliance and regulatory:
  • Compliance – capture, store and move data – Sarbanes-Oxley, HIPAA, Basel II
  • Security – authorization, authentication – handling sensitive data
  • Auditing – recording every attempt to access
  • Archive & retention – data life cycle policies
Source: Teradata/Techtarget
Metadata System Architecture
Topologically, a metadata repository architecture follows one of three styles:
• Centralized metadata repository
  Pros: efficient access, adaptability, scalability and high performance
  Cons: single point of failure and continuous synchronization
• Distributed metadata repository
  Pros: access to metadata repos in real time, up-to-date metadata
  Cons: overhead in maintaining the configuration of source system changes, and HA
• Federated or hybrid metadata repository
  Central definition storage with references to the proper locations of the accurate definitions
Source: Techtarget
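The federated/hybrid style can be sketched as a central catalog that stores only references, resolving each term against the source repository that owns the authoritative definition. All names below are illustrative, not from any real catalog product:

```python
# Minimal sketch of a federated (hybrid) metadata repository:
# the central store keeps a reference per term; the definition
# itself stays in (and is fetched from) the owning source system.

class FederatedCatalog:
    def __init__(self):
        self._sources = {}  # source name -> source repo (here, a dict)
        self._index = {}    # business term -> (source name, key)

    def register_source(self, name, repo):
        self._sources[name] = repo

    def add_reference(self, term, source, key):
        self._index[term] = (source, key)

    def resolve(self, term):
        # Follow the reference so the answer is always up to date
        source, key = self._index[term]
        return self._sources[source][key]

# Stand-in for e.g. a Hive metastore
hive_meta = {"customers": {"columns": ["id", "name"]}}

catalog = FederatedCatalog()
catalog.register_source("hive", hive_meta)
catalog.add_reference("Customer Master", "hive", "customers")
print(catalog.resolve("Customer Master"))
```

Because `resolve` reads through to the source at lookup time, the central store never holds a stale copy – the trade-off the slide describes versus the centralized style.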
Use-cases for the need for Metadata

Use Cases – Analytics
1. Finding the data: data scientists spend a lot of time finding the correct columns for variable selection
   • Around 80% of the data scientist's time goes to column investigation with SMEs
2. Profile of data: reduce the amount of time spent on data profiling via ad-hoc queries
   • ~78% of the queries run on the cluster are profiling queries
3. Track the transformation: data scientists would like to understand how data sets are derived
   • Not fully tracked, except at a high level
Source: Aetna
1. Finding the data: Challenges
• Hive requires relatively manual traversal of the schema to find the table and columns
• HDFS also requires traversal of the directory listing to find a file
• Any documentation (external to the system) becomes outdated and is not always reliable
• No simple way to add business metadata
Source: Aetna
HDFS/Hive Architecture
Source: hadoop.apache.org; Ben Lever (SlideShare)
1. Finding the data: Solutions
• Run-time capture of Hive and HDFS metadata, stored in a repository
• Provide an API to query the metadata and search across it
• Provide an API or other ways to enrich the data with its business context

[Figure: Apache Atlas sits over the business metadata and the technical/physical metadata captured from Hive, HDFS, and ingestion (Sqoop)]
Source: Aetna
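The capture / enrich / search loop described above can be sketched with a small in-memory store. This is not the real Apache Atlas API (Atlas exposes this over REST); all names here are illustrative:

```python
# Minimal sketch of "capture technical metadata at run time,
# enrich it with business context, then search across both".

class MetadataRepository:
    def __init__(self):
        self._entities = {}  # qualified name -> attribute dict

    def capture(self, qualified_name, **technical_attrs):
        # Called by ingestion hooks (e.g. on Hive/Sqoop events)
        self._entities.setdefault(qualified_name, {}).update(technical_attrs)

    def enrich(self, qualified_name, **business_attrs):
        # Attach business context to an already-captured entity
        self._entities[qualified_name].update(business_attrs)

    def search(self, text):
        # Free-text search across all attribute values
        text = text.lower()
        return [name for name, attrs in self._entities.items()
                if any(text in str(v).lower() for v in attrs.values())]

repo = MetadataRepository()
repo.capture("hive.sales.cust_id", data_type="string", source="sqoop")
repo.enrich("hive.sales.cust_id", business_term="Customer Identifier")
print(repo.search("customer"))  # ['hive.sales.cust_id']
```

The point of the two-step design is that technical capture can be fully automated while business enrichment comes later from stewards, yet both end up searchable in one place.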
2. Profile of Data: Challenges and Solutions
Challenges:
• Access to the Hive metastore will introduce latency in production
• Lack of comprehensive information provided by the Hive metastore
[Chart: average daily queries – 78% profiling, 18% exploratory, 4% production]
Solutions:
• Provide a system with business and technical data that are cross-referenced
• Have a framework for the data scientist to accommodate additional profiling
Source: Aetna
3. Track the transformation: Challenges and Solutions
Challenges:
• Documenting transformations is manual and difficult to scale
• A mechanism for auditing data pipelines is still lacking
• Data quality and provenance are too manual
Solutions:
• Leverage metadata already captured to construct transformations
• Provide an API to query transformations
• Provide a visualization for the transformations
Source: Aetna
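Leveraging already-captured metadata for lineage amounts to recording each transformation as an edge (inputs → output) and walking the graph upstream. A minimal sketch, with illustrative dataset and operation names:

```python
# Minimal sketch of a lineage ledger: every transformation is an
# edge, and "how was this data set derived?" is an upstream walk.

from collections import defaultdict

class LineageLedger:
    def __init__(self):
        # dataset -> list of (operation, input dataset) that produced it
        self._parents = defaultdict(list)

    def record(self, inputs, operation, output):
        for src in inputs:
            self._parents[output].append((operation, src))

    def upstream(self, dataset):
        # Depth-first walk back to the original sources
        found = []
        for op, src in self._parents[dataset]:
            found.append((src, op, dataset))
            found.extend(self.upstream(src))
        return found

ledger = LineageLedger()
ledger.record(["raw.orders"], "standardize", "staging.orders")
ledger.record(["staging.orders"], "aggregate", "marts.daily_sales")
for src, op, dst in ledger.upstream("marts.daily_sales"):
    print(f"{src} --{op}--> {dst}")
```

The same edge list that answers the API query also feeds a visualization directly, since it is already a graph.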
What do we need?
1. A searchable platform for all data types, covering business and technical metadata
2. A data profile store with basic metrics of the data
   • Min
   • Max
   • Column distribution
3. Visual lineage for the data flow from the source system to the different components within the platform
   • ETL operations – high-level view
   • Analytics queries
4. Automated metadata-driven data ingestion, and thus management
   • The data lake concept relies on capturing a robust set of attributes for every piece of content within the lake
   • Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility
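The profile-store metrics listed in point 2 (min, max, column distribution) can be computed once per column and stored, instead of being re-derived by ad-hoc cluster queries. A minimal sketch:

```python
# Minimal sketch of profiling one column: the basic metrics the
# slide lists, computed in one pass over the values.

from collections import Counter

def profile_column(values):
    return {
        "min": min(values),
        "max": max(values),
        "distribution": Counter(values),  # value -> frequency
    }

ages = [34, 29, 34, 51, 29, 34]
stats = profile_column(ages)
print(stats["min"], stats["max"])  # 29 51
print(stats["distribution"][34])   # 3
```

Serving these precomputed numbers from the metadata store is what removes the ~78% of cluster queries that were only profiling.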
Solutions for Hadoop

Apache Atlas – deep dive
Apache Atlas capabilities: Overview
• Data Classification
  • Import or define taxonomy business-oriented annotations for data
  • Define, annotate, and automate capture of relationships between data sets
  • Export metadata to third-party systems
• Centralized Auditing
  • Capture security access information
  • Capture the operational information for executions, steps, and activities
• Search & Lineage (Browse)
  • Text-based search locates relevant data and audit events across the data lake quickly and accurately
  • Browse visualization of data-set lineage, allowing users to drill down into operational, security, and provenance-related information
• Security & Policy Engine
  • Rationalize compliance policy at runtime based on data classification schemes
Source: Hortonworks
Open-source incubator project
Demo
Apache Atlas in action!
Possible solutions for other platforms

Netflix – Managing Data Platforms
Source: Netflix

Possible Solutions for other Platforms
Metacat
• Applies metadata management at the service layer
• Federated metadata catalog for the whole data platform
• Proxy service to different metadata sources
• Data metrics, data usage, ownership, categorization and retention policy …
• Common interface for tools to interact with metadata

Tracking Data Differences
• Applies metadata management at the service layer
• Tracks the changes to documents/entities
• Custom change tracking through logs collected in MongoDB, or via a module called MongoID
Netflix OSS
Where Else?

Docker image labels (JSON):
{ "Description": "A containerized foobar", "Usage": "docker run --rm example/foobar [args]", "License": "GPL", "Version": "0.0.1-beta", "aBoolean": true, "aNumber": 0.01234, "aNestedArray": ["a", "b", "c"] }

HTML meta tag:
<meta name="description" content="155 characters of message matching text with a call to action goes here">

Maven POM:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.JOSA.Meta</groupId>
  <artifactId>project</artifactId>
  <version>1.0</version>
</project>
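The image-label block shown above is plain JSON, so any JSON parser reads it directly; a minimal sketch:

```python
import json

# Parse the metadata block from the slide (same content, as a string)
labels = json.loads("""
{ "Description": "A containerized foobar",
  "Usage": "docker run --rm example/foobar [args]",
  "License": "GPL",
  "Version": "0.0.1-beta",
  "aBoolean": true,
  "aNumber": 0.01234,
  "aNestedArray": ["a", "b", "c"] }
""")

print(labels["Version"])       # 0.0.1-beta
print(labels["aNestedArray"])  # ['a', 'b', 'c']
```

The same "data about data" idea recurs in each example: a small structured descriptor travels with the artifact (image, web page, Maven project) it describes.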
Notes - Summary
• Consider the different types of metadata you need to manage
• Build a robust descriptive dictionary for the data
• Manage metadata as a team effort. It has a lot of benefits, so make it agile but effective.

Finally… remember that one's metadata – d/dx – is someone else's data!
Resources
• HDP 2.3 Preview Sandbox VM (Hortonworks)
  – http://hortonworks.com/hdp/whats-new/
• Apache Atlas:
  – http://atlas.incubator.apache.org/
  – http://incubator.apache.org/projects/atlas.html
  – https://git-wip-us.apache.org/repos/asf/incubator-atlas.gi
• Metadata Management (General)
  – https://www.informatica.com/content/dam/informatica-com/global/amer/us/collateral/white-paper/metadata-management-data-governance_white-paper_2163.pdf
Tariq Ezzibdeh
Questions..?
Contact info: