TRANSCRIPT
Understanding Metadata: Why it’s essential to your big data solution and how to manage it well
Tuesday, June 21, 2016
Ben Sharma | Vikram Sreekanti
Speakers
Ben Sharma, Co-Founder & CEO – Zaloni
Ben Sharma is a passionate technologist and thought leader in big data, analytics and enterprise infrastructure solutions. Having previously worked in technology leadership at NetApp, Fujitsu and others, Ben's expertise ranges from business development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization and storage. Ben is co-author of Architecting Data Lakes and Java in Telecommunications.
Vikram Sreekanti, Software Engineer – AMPLab, UC Berkeley
Vikram Sreekanti is a software engineer working on research in the AMPLab at UC Berkeley. A graduate of Berkeley's computer science department, he will begin his Ph.D. in Fall 2016, working with Joe Hellerstein.
In today’s data environment, which mixes structured and unstructured data, metadata is more important than ever
• Metadata allows you to keep track of what data is in the data lake, its source, its format and its lineage
• Metadata allows for better change management through Impact Analysis
• The result is data visibility, reliability and reduced time to insight for your analytics
Metadata matters in a big data world
Zaloni Proprietary 3
Data architecture modernization
Traditional: Sources, ETL, EDW; Reporting, BI, Extracts
New: Various Sources (Streaming, Unstructured Data); Data Lake; Derived (Transformed), Discovery Sandbox, EDW; Reporting, BI, Extracts; Data Science, Data Discovery
Data lake reference architecture
Source System: File Data, DB Data, ETL Extracts, Streaming
Sources shown: OLTP or ODS, Enterprise Data Warehouse, Logs (or other unstructured data), Cloud Services
Data Lake: Transient Loading Zone, Raw Data, Refined Data, Trusted Data, Discovery Sandbox
Zone annotations: Original unaltered data attributes; Tokenized Data; Reference Data; Master Data; Integrate to common format; Data Validation; Data Cleansing; Aggregations; Data Wrangling; Data Discovery; Exploratory Analytics
Also labeled: Metadata, Data Quality, Data Catalog, Security
Consumption Zone: APIs; Business Analysts, Researchers, Data Scientists
• Reduced time to insight for analytics
• Modern Data architecture will require a holistic approach to metadata
Metadata improves data visibility and reliability
Type of Metadata | Description | Example
Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
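The three metadata types in the table can be pictured as one record per entity. The following is a minimal sketch, not Zaloni's data model: every class name, attribute name and sample value is an assumption chosen to mirror the table's Description and Example columns.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TechnicalMetadata:
    """Form and structure of a data set."""
    data_format: str            # e.g. "text", "JSON", "Avro"
    schema: Dict[str, str]      # field name -> field type


@dataclass
class OperationalMetadata:
    """Lineage, profile and provenance of a data set."""
    source: str
    target: str
    size_bytes: int
    record_count: int
    lineage: List[str] = field(default_factory=list)


@dataclass
class BusinessMetadata:
    """What the data means to the user."""
    business_name: str
    description: str
    tags: List[str] = field(default_factory=list)
    masking_rules: Dict[str, str] = field(default_factory=dict)


@dataclass
class EntityMetadata:
    """All three metadata types for one entity in the lake."""
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Hypothetical entity, for illustration only.
orders = EntityMetadata(
    technical=TechnicalMetadata("Avro", {"order_id": "string", "amount": "double"}),
    operational=OperationalMetadata(
        source="sftp://edge-node/orders.avro",
        target="/lake/raw/orders",
        size_bytes=2048,
        record_count=100,
        lineage=["crm_export"],
    ),
    business=BusinessMetadata(
        business_name="Customer Orders",
        description="One record per confirmed order.",
        tags=["sales"],
        masking_rules={"order_id": "tokenize"},
    ),
)
```

Keeping the three types in separate classes lets each be filled in by a different process: technical metadata at registration, operational metadata at ingestion, business metadata by data stewards.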
Considerations:
• Integration with Enterprise Metadata Management Solutions
• Automated process for new metadata to be registered in the Data Lake
• Data follows the registered metadata
Automated metadata registration
[Diagram: automated metadata registration flow, from START to END. Labels: Enterprise Metadata Repositories; API; retrieve metadata; metadata file; check-in copy to repository; Edge-node to Cluster (SFTP); Hadoop Cluster; add tags (origin info, timestamp, etc.); operational metadata file; Metadata]
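Two steps on this slide lend themselves to a small illustration: checking that incoming data follows the registered metadata, and tagging the file with origin info and a timestamp to produce operational metadata. The sketch below assumes a CSV payload with a header row; the function names, dictionary keys and sample values are hypothetical and are not part of any Zaloni API.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def follows_registered_metadata(registered_fields: List[str], payload: str) -> bool:
    """Check that the header row of a CSV payload matches the registered fields."""
    lines = payload.splitlines()
    return bool(lines) and lines[0].split(",") == registered_fields


def build_operational_metadata(
    entity: str,
    payload: str,
    origin: str,
    received_at: Optional[datetime] = None,
) -> Dict[str, object]:
    """Tag an incoming file with origin info, a timestamp and basic counts."""
    received_at = received_at or datetime.now(timezone.utc)
    lines = payload.splitlines()
    return {
        "entity": entity,
        "origin": origin,
        "received_at": received_at.isoformat(),
        "size_bytes": len(payload.encode("utf-8")),
        "record_count": max(len(lines) - 1, 0),  # exclude the header row
    }


# Hypothetical registration and incoming file.
registered = ["order_id", "amount"]
payload = "order_id,amount\n1,9.99\n2,5.00\n"

if follows_registered_metadata(registered, payload):
    operational = build_operational_metadata(
        "orders",
        payload,
        origin="edge-node-01",
        received_at=datetime(2016, 6, 21, tzinfo=timezone.utc),
    )
```

Passing `received_at` explicitly keeps the function deterministic for testing; in a real ingestion job the default (current UTC time) would normally be used.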
Data lineage example in Bedrock for impact analysis
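Impact analysis over lineage comes down to asking which entities sit downstream of a change. The sketch below is a generic breadth-first traversal, not Bedrock's implementation; the `lineage` mapping and all entity names are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def impacted_by(lineage: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every entity reachable downstream of `changed`."""
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: each key feeds the entities in its list.
lineage = {
    "crm_extract": ["customers_raw"],
    "customers_raw": ["customers_trusted"],
    "customers_trusted": ["churn_model", "sales_report"],
    "billing_extract": ["invoices_raw"],
}

affected = impacted_by(lineage, "customers_raw")
```

Here a change to `customers_raw` flags the trusted table, the model and the report, while the unrelated billing branch is left out.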
Metadata enhancing data quality and reliability
Data profiling speeds up data discovery and time to insight
Business users can quickly answer questions such as:
• How many records does an entity have? What is its total size?
• What does the activity look like for a specific entity (streaming, updated monthly, untouched from a year ago)?
• Is this entity a subset of another entity?
• Does this entity likely contain duplicates?
• Does this data apply to my target customers/market?
• What is the min/max of a particular column?
• Is this data reliable/does it have enough valid values?
Data profiling example in Mica
Capture profiling metrics for every entity
• Automatically collect profiling metrics at the:
  § Entity level (e.g., size of data set)
  § Field level (e.g., values, frequency of the field)
• Visually display metrics with metadata
• Allow data quality check rules to be created based on profiling information
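The entity-level and field-level metrics listed above, and a quality rule built on them, can be sketched in a few lines. This is an illustrative sketch only and does not reflect how Mica computes profiles; the function names, metric names and sample rows are assumptions.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_entity(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Collect entity-level and field-level profiling metrics."""
    field_names = sorted({name for row in rows for name in row})
    fields: Dict[str, Dict[str, Any]] = {}
    for name in field_names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        fields[name] = {
            "null_count": len(values) - len(present),
            "distinct_count": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "top_values": Counter(present).most_common(3),
        }
    return {"record_count": len(rows), "fields": fields}


def passes_completeness_rule(
    profile: Dict[str, Any], field_name: str, min_valid_ratio: float
) -> bool:
    """Data quality rule: enough non-null values in a field."""
    total = profile["record_count"]
    if total == 0:
        return False
    valid = total - profile["fields"][field_name]["null_count"]
    return valid / total >= min_valid_ratio


# Hypothetical sample of an entity.
rows = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": None},
    {"customer": "a", "amount": 4.5},
    {"customer": "c", "amount": 7.0},
]
profile = profile_entity(rows)
```

The profile answers several of the business questions directly (record count, min/max of a column, share of valid values), and the rule shows how a data quality check can be derived from the same numbers.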
Data catalog example in Mica
Data lifecycle management powered by Metadata
• Logical data lake that can include all tiers of storage:
  § Files, HDFS, Object store in on-premise and cloud environments
• Data lifecycle management across tiered storage environments
  § Hot -> Warm -> Cold on an entity level based on policies/SLAs
  § Across on-premise and cloud environments
  § Take advantage of various storage technologies
  § Provide data management features to automate scheduling and orchestration of data movement between heterogeneous storage environments
• Elastic and on-demand compute for various analytical workloads
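A Hot -> Warm -> Cold policy at the entity level can be expressed as a rule over operational metadata. The sketch below assumes the policy is driven by days since last access; the slide does not say which attribute drives the policy, so `TierPolicy`, its thresholds and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TierPolicy:
    """Age thresholds, in days since last access, for each storage tier."""
    hot_days: int
    warm_days: int


def assign_tier(last_accessed: date, today: date, policy: TierPolicy) -> str:
    """Pick a storage tier for an entity from its last-access date."""
    age_days = (today - last_accessed).days
    if age_days <= policy.hot_days:
        return "hot"
    if age_days <= policy.warm_days:
        return "warm"
    return "cold"


# Hypothetical policy: hot for 30 days, warm up to 180 days, then cold.
policy = TierPolicy(hot_days=30, warm_days=180)
today = date(2016, 6, 21)

recent = assign_tier(date(2016, 6, 1), today, policy)
older = assign_tier(date(2016, 3, 1), today, policy)
stale = assign_tier(date(2015, 6, 21), today, policy)
```

A scheduler could run this rule over the catalog and move only the entities whose assigned tier differs from their current one.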
Example: Metadata management in Financial Services
Source Systems: RDBMS/Mid Tier; Mainframe COBOL; Flat files; SAS files
Metadata repositories; Metadata Management solution
Extract/Read metadata; Register/update metadata
Data Acquisition Automation: Data Ingestion; Data Quality and Validation; Layout Standardization; Operational Metadata Generation
• Automated Data Acquisition Framework providing timeliness of data
• Capture Metadata in all phases: Ingestion, Transformation
• Integration with Enterprise Metadata Management
• Integrated Data Quality Analysis
DON’T GO IN THE LAKE WITHOUT US
Grounding Big Data
Vikram Sreekanti, UC Berkeley
REMEMBERING THE PAST
Data Warehouse Single Source of Truth
Enterprise Information Architecture Golden Master
…
Truth Truth
Big data took us to a new world
There were changes in volume, velocity and variety, which were challenging.
The real challenge now is the meaning and value of data, which depend critically on context.
WHAT IS DIFFERENT?
Shift in technology: Data representations
Shift in behavior: Data-driven organizations
Shift in behavior: Data-driven organizations
Data in products: Started with the Internet. Now, the Internet of Things.
Data in marketing (Gartner Group):
• By 2017: marketing spends more on tech than IT does.
• By 2020: 90% of tech budget controlled outside of IT.
MANY USE CASES
MANY CONSTITUENCIES
MANY INCENTIVES
MANY CONTEXTS
Shift in technology: Data representations
Raw data in the data lake:
• Simplifies capture
• Encourages exploration
What does it mean?
It depends on the context.
A LITTLE SCENARIO
HDFS
BITS: All the web logs from last year
VIEWS, MODELS, CODE: A script to extract orders. To be used for Market Basket analysis.
VIEWS, MODELS, CODE: A Hive table of orders. To be used for Market Basket analysis.
VIEWS, MODELS, CODE: Code to extract abandoned user sessions
VIEWS, MODELS, CODE: A retargeting model
MANY SCRIPTS
MANY MODELS
MANY APPLICATIONS
MANY CONTEXTS
Ground: A broader context for big data
THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT
Application Context: Views, models, code
Behavioral Context: Data lineage & usage
Historical Context: In and over time
APPLICATION CONTEXT: Metadata
Models for interpreting the data for use
§ Data structures
§ Semantic structures
§ Statistical structures
Theme: An unopinionated model of context
HISTORICAL CONTEXT: Versions
Web logs; Code to extract user/movie rentals; Recommender for movie licensing
Trends over time: How does a movie with these features fare over time?
Point in time: A promising new movie is similar to older hot movies at time of release!
BEHAVIORAL CONTEXT: Lineage & Usage
Why Dora?!
Data Science Recommenders: “You should compare with book sales from last year.”
Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
THE BIG CONTEXT
A NEW WORLD NEEDS NEW SERVICES
WHAT ARE WE BUILDING?
Grounding philosophy
§ Start useful, stay useful.
§ Stay general.
§ Design for scale.
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth
Also labeled: Pachyderm, Chronos
COMMON GROUND
Versions
Models
Usage
An unopinionated context model
COMMON GROUND: Models
Model Graphs
The metamodel
[Figure: two example model graphs. JSON DOCUMENT: Root; Object 2; members k1: string, k2: number, k11: string, k12; elements 1 to 3. RELATIONAL SCHEMA: Schema 1; Table 1 with Column 1 to Column c; Table t with Column 1 to Column d; foreign key.]
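The model-graph idea, where a JSON document becomes a graph with a root, members and elements, can be illustrated with a small converter. This is a sketch of the general idea only and is not Ground's metamodel or API; the path naming scheme (`Root.k2.k12[0]`), the function names and the sample document are assumptions.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def kind_of(value: Any) -> str:
    """Name the JSON kind of a value produced by json.loads."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "null"


def build_model_graph(
    value: Any,
    path: str = "Root",
    nodes: Optional[Dict[str, str]] = None,
    edges: Optional[List[Tuple[str, str]]] = None,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Turn a parsed JSON value into nodes (path -> kind) and parent/child edges."""
    nodes = {} if nodes is None else nodes
    edges = [] if edges is None else edges
    nodes[path] = kind_of(value)
    if isinstance(value, dict):
        children = [(f"{path}.{key}", child) for key, child in value.items()]
    elif isinstance(value, list):
        children = [(f"{path}[{i}]", child) for i, child in enumerate(value)]
    else:
        children = []
    for child_path, child in children:
        edges.append((path, child_path))
        build_model_graph(child, child_path, nodes, edges)
    return nodes, edges


# Hypothetical document shaped like the slide's example (members and elements).
doc = json.loads('{"k1": "a", "k2": {"k11": "b", "k12": [1, 2, 3]}}')
nodes, edges = build_model_graph(doc)
```

A relational schema could be mapped the same way, with schema, table and column nodes and an extra edge for each foreign key.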
COMMON GROUND: Versions
Model Graphs
Version Graphs
The versioning model
VERSION GRAPHS
a3eb4b765520b0d0ab90594dcf2373c1ce5dbb0b0
0e9233e8e99cccd6861d304968efa4c945a0b918
3e64220f08374629ad43ca652d4ce7cef0bdbbca
3e0bada008655fe32d7d136eac0a3f333d23ed80
fd75a4ba16f96d11f3f954854acc2d739054233
Directed Acyclic Graphs (partial orders)
In this order
In no particular order
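A version graph that is a directed acyclic graph defines a partial order: two versions are ordered when one is an ancestor of the other, and unordered when they sit on separate branches. The sketch below shows that check under the assumption that each version records its parents; the short version names are placeholders and are not the identifiers shown on the slide.

```python
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], version: str) -> Set[str]:
    """All versions reachable by following parent links from `version`."""
    seen: Set[str] = set()
    stack = list(parents.get(version, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def compare(parents: Dict[str, List[str]], a: str, b: str) -> str:
    """Relate two versions under the partial order of the version graph."""
    if a == b:
        return "same"
    if a in ancestors(parents, b):
        return "before"
    if b in ancestors(parents, a):
        return "after"
    return "unordered"


# Hypothetical history: v2 and v3 branch from v1 and are merged into v4.
parents = {"v1": [], "v2": ["v1"], "v3": ["v1"], "v4": ["v2", "v3"]}

linear = compare(parents, "v1", "v4")
branched = compare(parents, "v2", "v3")
```

`v1` and `v4` are "in this order", while the two branch versions `v2` and `v3` are "in no particular order".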
COMMON GROUND: Usage
Model Graphs
Version Graphs
Usage Graphs: Lineage
The usage model
USAGE GRAPHS
Everything can participate in usage
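If everything can participate in usage, then data, code and models all appear as nodes in the same lineage graph. The sketch below tags each node with a kind and walks the graph upstream to recover what an artifact was derived from. It reuses the items from the earlier web-log scenario, but the edges between them, the `(kind, name)` node encoding and the function name are assumptions for illustration, not Ground's usage model.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]          # (kind, name), e.g. ("code", "extract_orders")
Edge = Tuple[Node, Node]        # (input, output)


def upstream(edges: List[Edge], target: Node) -> Set[Node]:
    """Every node that `target` was directly or indirectly derived from."""
    inputs: Dict[Node, List[Node]] = {}
    for src, dst in edges:
        inputs.setdefault(dst, []).append(src)
    seen: Set[Node] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for src in inputs.get(node, []):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


# Hypothetical lineage edges for the web-log scenario.
web_logs = ("data", "web_logs_last_year")
extract_orders = ("code", "extract_orders")
orders_table = ("data", "hive_orders")
extract_sessions = ("code", "extract_abandoned_sessions")
retargeting = ("model", "retargeting_model")

edges = [
    (web_logs, orders_table),
    (extract_orders, orders_table),
    (web_logs, retargeting),
    (extract_sessions, retargeting),
]

sources = upstream(edges, retargeting)
```

Because code is a node like any other, the same query that finds the source data for the retargeting model also finds the script that produced it.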
COMMON GROUND: Models, Versions, Usage
Model Graphs
Version Graphs
Usage Graphs: Lineage
The model
INITIAL FOCUS AREAS
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Model Serving; Reproducibility
UNDERGROUND API TO SERVICES: Versioned Storage
Learn more at: http://www.ground-context.org
@vsreekanti