JOSA TechTalk: Metadata Management in Big Data
TRANSCRIPT

Metadata Management in Big Data
Data Management Challenges
Tariq Ezzibdeh (@ezzibdeh)
Aim
• Outline some perspectives on metadata management principles that apply in the big data space and beyond
• Provide data governance foundations in the data space that will outlast the actual technologies and serve the needs of the future
• Discuss technologies and solutions currently in the market
Big Data - Overview
• Big Data 5 V's: Volume, Velocity, Variety, Veracity => Value
• Platform of today – a set of relatively split-up components
  • Data is stored on HDFS – the file system
  • A catalogue of the data and its schema is maintained in another service – TBC!
  • Query front ends – query engines based on different requirements
Platform Architecture – Modern Architecture
[Figure: data flows from Data Sources through Acquisition and Data Systems into the Big Data Platform, then out to Data Marts, UI/API, and Apps. Platform zones:]
• Staging Zone – ETL and data standardization
• Pristine Archive – compressed (gzip etc.)
• Data Warehouse – immutable data
• Analytics Zone – allocated data changes
• Schema Catalogue – well-defined reference to data structures and attributes
• Data Ledger – track data and its access with lineage and operations
Source: Hortonworks
Why do we need to manage metadata for Big Data platforms?
• Large volumes of data landing in Hadoop/Big Data
• A growing number of users working with the data
• The need for effective control & consumption of data

The implementation needs to:
• Offer good data visibility across your cluster
• Capture data lineage across source systems and in the platform
• Audit and record operations that are performed in the platform
• Enforce policies that are defined by the platform stewards
• Help reduce data redundancy on the platform
Source: Cloudera
Metadata in Action
Metadata – What is it? Data about Data!
• Business Metadata supplies the business context around data, such as the business term's name, definition, owners or stewards, and associated reference data
• Technical Metadata provides technical information about the data, such as the name of the source table, the source table column name, and the data type (e.g., string, integer)
• Operational Metadata furnishes information about the use of the data, such as date last updated, number of times accessed, or date last accessed
Source: Informatica
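The three metadata types can be modeled as fields of a single record. A minimal sketch in Python, assuming illustrative class and field names that are not from the talk:

```python
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    # Technical metadata: where the data physically lives and its type
    source_table: str
    column_name: str
    data_type: str
    # Business metadata: the business context around the data
    business_term: str = ""
    steward: str = ""
    # Operational metadata: how the data is used
    last_updated: str = ""
    access_count: int = 0

col = ColumnMetadata(
    source_table="sales_raw",
    column_name="cust_id",
    data_type="string",
    business_term="Customer Identifier",
    steward="finance-data-team",
)
print(col.business_term)  # Customer Identifier
```

Keeping all three layers on the same record makes it easy to answer both "what is this column?" and "who uses it?" from one lookup.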
Why do I need all this metadata?
• The data lake will contain all types of data – log streams (Kafka), DBs (Sqoop)… don't let your lake turn into a swamp!
• Consistency of definitions – to reconcile differences in terminology such as "clients" vs. "customers," "revenue" vs. "sales"
• Clarity of data lineage – about the origins of a data set; can be granular enough to define information at the attribute level, including operations on it
• To understand data usage on your cluster, and to optimize queries and views
• Compliance and regulatory:
  • Compliance – capture, store and move data – Sarbanes-Oxley, HIPAA, Basel II
  • Security – authorization, authentication – handling sensitive data
  • Auditing – recording every attempt to access
  • Archive & retention – data life cycle policies
Source: Teradata/Techtarget
Metadata System Architecture
Topologically, a metadata repository architecture follows one of three styles:
• Centralized metadata repository
  Pros: efficient access, adaptability, scalability and high performance
  Cons: single point of failure and continuous synchronization
• Distributed metadata repository
  Pros: access to metadata repos in real time, up-to-date metadata
  Cons: overhead in maintaining the configuration of source system changes, and HA
• Federated or hybrid metadata repository
  Central definition storage with references to the proper locations of the accurate definitions
Source: Techtarget
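The federated/hybrid style can be sketched as a central catalog that stores only references, resolving each term against the source repository that owns the authoritative definition. All names below are illustrative, not from any real catalog product:

```python
# Minimal sketch of a federated (hybrid) metadata repository:
# the central store keeps a reference per term; the definition
# itself stays in (and is fetched from) the owning source system.

class FederatedCatalog:
    def __init__(self):
        self._sources = {}  # source name -> source repo (here, a dict)
        self._index = {}    # business term -> (source name, key)

    def register_source(self, name, repo):
        self._sources[name] = repo

    def add_reference(self, term, source, key):
        self._index[term] = (source, key)

    def resolve(self, term):
        # Follow the reference so the answer is always up to date
        source, key = self._index[term]
        return self._sources[source][key]

# Stand-in for e.g. a Hive metastore
hive_meta = {"customers": {"columns": ["id", "name"]}}

catalog = FederatedCatalog()
catalog.register_source("hive", hive_meta)
catalog.add_reference("Customer Master", "hive", "customers")
print(catalog.resolve("Customer Master"))
```

Because `resolve` reads through to the source at lookup time, the central store never holds a stale copy – the trade-off the slide describes versus the centralized style.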
Use-cases for the need for Metadata

Use Cases – Analytics
1. Finding the data: data scientists spend a lot of time finding the correct columns for variable selection
   • Around 80% of the data scientist's time goes to column investigation with SMEs
2. Profile of data: reduce the amount of time spent on data profiling via ad-hoc queries
   • ~78% of the queries run on the cluster are profiling queries
3. Track the transformation: data scientists would like to understand how data sets are derived
   • Not fully tracked, except at a high level
Source: Aetna
1. Finding the data: Challenges
• Hive requires relatively manual traversal of the schema to find the table and columns
• HDFS also requires traversal of the directory listing to find a file
• Any documentation (external to the system) becomes outdated and is not always reliable
• No simple way to add business metadata
Source: Aetna
HDFS/Hive Architecture
Source: hadoop.apache.org; Ben Lever (SlideShare)
1. Finding the data: Solutions
• Run-time capture of Hive and HDFS metadata, stored in a repository
• Provide an API to query the metadata and search across it
• Provide an API or other ways to enrich the data with its business context

[Figure: Apache Atlas sits over the business metadata and the technical/physical metadata captured from Hive, HDFS, and ingestion (Sqoop)]
Source: Aetna
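The capture / enrich / search loop described above can be sketched with a small in-memory store. This is not the real Apache Atlas API (Atlas exposes this over REST); all names here are illustrative:

```python
# Minimal sketch of "capture technical metadata at run time,
# enrich it with business context, then search across both".

class MetadataRepository:
    def __init__(self):
        self._entities = {}  # qualified name -> attribute dict

    def capture(self, qualified_name, **technical_attrs):
        # Called by ingestion hooks (e.g. on Hive/Sqoop events)
        self._entities.setdefault(qualified_name, {}).update(technical_attrs)

    def enrich(self, qualified_name, **business_attrs):
        # Attach business context to an already-captured entity
        self._entities[qualified_name].update(business_attrs)

    def search(self, text):
        # Free-text search across all attribute values
        text = text.lower()
        return [name for name, attrs in self._entities.items()
                if any(text in str(v).lower() for v in attrs.values())]

repo = MetadataRepository()
repo.capture("hive.sales.cust_id", data_type="string", source="sqoop")
repo.enrich("hive.sales.cust_id", business_term="Customer Identifier")
print(repo.search("customer"))  # ['hive.sales.cust_id']
```

The point of the two-step design is that technical capture can be fully automated while business enrichment comes later from stewards, yet both end up searchable in one place.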
2. Profile of Data: Challenges and Solutions
Challenges:
• Access to the Hive metastore will introduce latency in production
• Lack of comprehensive information provided by the Hive metastore
[Chart: average daily queries – 78% profiling, 18% exploratory, 4% production]
Solutions:
• Provide a system with business and technical data that are cross-referenced
• Have a framework for the data scientist to accommodate additional profiling
Source: Aetna
3. Track the transformation: Challenges and Solutions
Challenges:
• Documenting transformations is manual and difficult to scale
• A mechanism for auditing data pipelines is still lacking
• Data quality and provenance are too manual
Solutions:
• Leverage metadata already captured to construct transformations
• Provide an API to query transformations
• Provide a visualization for the transformations
Source: Aetna
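Leveraging already-captured metadata for lineage amounts to recording each transformation as an edge (inputs → output) and walking the graph upstream. A minimal sketch, with illustrative dataset and operation names:

```python
# Minimal sketch of a lineage ledger: every transformation is an
# edge, and "how was this data set derived?" is an upstream walk.

from collections import defaultdict

class LineageLedger:
    def __init__(self):
        # dataset -> list of (operation, input dataset) that produced it
        self._parents = defaultdict(list)

    def record(self, inputs, operation, output):
        for src in inputs:
            self._parents[output].append((operation, src))

    def upstream(self, dataset):
        # Depth-first walk back to the original sources
        found = []
        for op, src in self._parents[dataset]:
            found.append((src, op, dataset))
            found.extend(self.upstream(src))
        return found

ledger = LineageLedger()
ledger.record(["raw.orders"], "standardize", "staging.orders")
ledger.record(["staging.orders"], "aggregate", "marts.daily_sales")
for src, op, dst in ledger.upstream("marts.daily_sales"):
    print(f"{src} --{op}--> {dst}")
```

The same edge list that answers the API query also feeds a visualization directly, since it is already a graph.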
What do we need?
1. A searchable platform for all data types, covering business and technical metadata
2. A data profile store with basic metrics of the data
   • Min
   • Max
   • Column distribution
3. Visual lineage for the data flow from the source system to the different components within the platform
   • ETL operations – high-level view
   • Analytics queries
4. Automated metadata-driven data ingestion, and thus management
   • The data lake concept relies on capturing a robust set of attributes for every piece of content within the lake
   • Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility
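The profile-store metrics listed in point 2 (min, max, column distribution) can be computed once per column and stored, instead of being re-derived by ad-hoc cluster queries. A minimal sketch:

```python
# Minimal sketch of profiling one column: the basic metrics the
# slide lists, computed in one pass over the values.

from collections import Counter

def profile_column(values):
    return {
        "min": min(values),
        "max": max(values),
        "distribution": Counter(values),  # value -> frequency
    }

ages = [34, 29, 34, 51, 29, 34]
stats = profile_column(ages)
print(stats["min"], stats["max"])  # 29 51
print(stats["distribution"][34])   # 3
```

Serving these precomputed numbers from the metadata store is what removes the ~78% of cluster queries that were only profiling.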
Solutions for Hadoop

Apache Atlas – deep dive
Apache Atlas capabilities: Overview
• Data Classification
  • Import or define taxonomy business-oriented annotations for data
  • Define, annotate, and automate capture of relationships between data sets
  • Export metadata to third-party systems
• Centralized Auditing
  • Capture security access information
  • Capture the operational information for executions, steps, and activities
• Search & Lineage (Browse)
  • Text-based search locates relevant data and audit events across the data lake quickly and accurately
  • Browse visualization of data-set lineage, allowing users to drill down into operational, security, and provenance-related information
• Security & Policy Engine
  • Rationalize compliance policy at runtime based on data classification schemes
Source: Hortonworks
Open-source incubator project
Demo
Apache Atlas in action!
Possible solutions for other platforms

Netflix – Managing Data Platforms
Source: Netflix

Possible Solutions for other Platforms
Metacat
• Applies metadata management at the service layer
• Federated metadata catalog for the whole data platform
• Proxy service to different metadata sources
• Data metrics, data usage, ownership, categorization and retention policy …
• Common interface for tools to interact with metadata

Tracking Data Differences
• Applies metadata management at the service layer
• Tracks the changes to documents/entities
• Custom change tracking through logs collected in MongoDB, or via a module called MongoID
Netflix OSS
Where Else?

Docker image labels (JSON):
{ "Description": "A containerized foobar", "Usage": "docker run --rm example/foobar [args]", "License": "GPL", "Version": "0.0.1-beta", "aBoolean": true, "aNumber": 0.01234, "aNestedArray": ["a", "b", "c"] }

HTML meta tag:
<meta name="description" content="155 characters of message matching text with a call to action goes here">

Maven POM:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.JOSA.Meta</groupId>
  <artifactId>project</artifactId>
  <version>1.0</version>
</project>
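The image-label block shown above is plain JSON, so any JSON parser reads it directly; a minimal sketch:

```python
import json

# Parse the metadata block from the slide (same content, as a string)
labels = json.loads("""
{ "Description": "A containerized foobar",
  "Usage": "docker run --rm example/foobar [args]",
  "License": "GPL",
  "Version": "0.0.1-beta",
  "aBoolean": true,
  "aNumber": 0.01234,
  "aNestedArray": ["a", "b", "c"] }
""")

print(labels["Version"])       # 0.0.1-beta
print(labels["aNestedArray"])  # ['a', 'b', 'c']
```

The same "data about data" idea recurs in each example: a small structured descriptor travels with the artifact (image, web page, Maven project) it describes.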
Notes - Summary
• Consider the different types of metadata you need to manage
• Build a robust descriptive dictionary for the data
• Manage metadata as a team effort. It has a lot of benefits, so make it agile but effective.

Finally… remember that one's metadata – d/dx – is someone else's data!
Resources
• HDP 2.3 Preview Sandbox VM (Hortonworks)
  – http://hortonworks.com/hdp/whats-new/
• Apache Atlas:
  – http://atlas.incubator.apache.org/
  – http://incubator.apache.org/projects/atlas.html
  – https://git-wip-us.apache.org/repos/asf/incubator-atlas.gi
• Metadata Management (General)
  – https://www.informatica.com/content/dam/informatica-com/global/amer/us/collateral/white-paper/metadata-management-data-governance_white-paper_2163.pdf
Tariq Ezzibdeh
Questions..?
Contact info: