enterprise data classification and provenance
TRANSCRIPT
1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Enterprise Data Classification and Provenance
Apache Atlas
Shwetha Shivalingamurthy, Suma Shivaprasad
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
Agenda
• Demo
• Big Data Governance
• Overview of Atlas
• Atlas architecture
• Features and Roadmap
Demo use case – Ad network
• Matches advertiser demand with ad space supply from publishers
• Billing based on ad impressions/ad engagement
• Enables targeting, tracking and reporting of ad impressions
• Typical reports/queries:
  • Mismatch of demand and supply
  • Country/OS-wise reports
  • Top advertisers/publishers
Data landscape
[Diagram labels: User; Ad servers; AdImpression, Click, Billing logs; Metadata; Summaries; Traditional warehouse]
Data governance requirements
• Cross platform lineage – impact analysis, forensic, discovery
• Asset search
• Common Business Terms
• Compliance
Demo
• Technical and business metadata
• Cross Component Lineage
• Creating views
• Create tags
• Entity deletes
• Search using tags, attributes
• Entity audit
• Business catalog – find assets
• Flexible model, external lineage ingest
HDP 2.5
Data Governance Aspects
Data Governance:
• Data Discovery and Tagging
• Metadata Management
• Data Lineage/Provenance
• Access Management
• Data Security & Privacy
• Data Quality
• Compliance and Audit
• Data Wrangling
• Data Lifecycle Management
• Data Integration
Data governance refers to processes, methods and tools used in an enterprise for effective control of availability, usability, integrity, and security of data
Enterprise Data Governance: Apache Atlas
Data Management along the entire data lifecycle with integrated provenance and lineage capability
• Cross component lineage
Modeling with Metadata enables comprehensive business metadata vocabulary with enhanced tagging and attribute capabilities
• Common Business Language
• Hierarchically organized – No dupes!
Interoperable Solutions across the Hadoop ecosystem, through a common metadata store
• Combine and Exchange Metadata
[Diagram labels: STRUCTURED; UNSTRUCTURED; STREAMING; TRADITIONAL RDBMS; MPP APPLIANCES; METADATA; ATLAS METADATA; Kafka; Storm; Sqoop; Hive; Falcon; RANGER; Custom; Partners]
Background: DGI Community becomes Apache Atlas
Timeline:
• Dec 2014 – DGI group kickoff
• May 2015 – Apache Atlas incubation
• Jul 2015 – HDP 2.3/Apache 0.5 Foundation Release
• Aug 2016 – HDP 2.5/Apache 0.7 Release
[Diagram label: Global Financial Company]
* DGI: Data Governance Initiative
Key Benefits:
• Co-Dev = Built for real customer use cases
• Faster & Safer = Customers know business + HWX knows Hadoop
• Code contributors – Hortonworks, IBM, Aetna, Merck, Target
Architecture
Atlas Type System
• Defines model – schema of metadata
• Flexible and powerful to define any model/custom types
• Supports inheritance
• Types
  • Primitive types – bool, integer types, string, date, enum
  • Collections – array, map
  • Struct – set of attributes
  • Class – identifiable struct, hierarchy
  • Trait – set of attributes, hierarchy
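The type categories above can be sketched in a few lines of Python. This is an illustrative model only, not the Atlas API: the names `Attribute`, `ClassType`, `all_attributes` and `example_table` are hypothetical, and the sketch covers only class types with single inheritance.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Attribute:
    """One attribute in a type definition (hypothetical structure)."""
    type_name: str          # e.g. "string", "date", "map<string,string>"
    required: bool = True


@dataclass
class ClassType:
    """An identifiable struct that may inherit attributes from a supertype."""
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)
    supertype: Optional["ClassType"] = None

    def all_attributes(self) -> Dict[str, Attribute]:
        """Collect attributes from the supertype chain, then from this type."""
        inherited = self.supertype.all_attributes() if self.supertype else {}
        return {**inherited, **self.attributes}


# A base class and a subtype that inherits its "name" attribute.
data_set = ClassType("DataSet", {"name": Attribute("string")})
example_table = ClassType(
    "example_table",
    {
        "createTime": Attribute("date"),
        "comment": Attribute("string", required=False),
    },
    supertype=data_set,
)

attribute_names = sorted(example_table.all_attributes())
```

Here `attribute_names` contains the subtype's own attributes plus the inherited `name`, which is the behaviour the "Supports inheritance" bullet describes.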
Hive Model
DataSet
  metaType: ClassType
  name: string required
hive_db
  metaType: ClassType
  name: string required
  createTime: date required
  parameters: map<string,string> optional
hive_table
  metaType: ClassType
  db: hive_db required
  createTime: date required
  columns: array<hive_column> required
hive_column
  metaType: ClassType
  name: string required
  type: string required
[Relationship labels in the diagram: extends; references; references (0..n)]
Entities
Instances of types
• name: rawlogs, Guid: 1, createTime: 2015-01-01 10:00, Type: hive_db
• name: impressions, Guid: 2, Type: hive_table
• name: adv_id, type: string, Guid: 3, Type: hive_column
• name: user_id, type: string, Guid: 4, Type: hive_column
[Edge labels in the diagram: db; column; column]
[Traits in the diagram: EXPIRES_ON (Time: March, 2016); PII]
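The entities on this slide can be written down as plain Python objects to show how instances, references and traits fit together. This is a sketch, not Atlas client code: the `Entity` class and `names_with_trait` helper are hypothetical, and the choice of which entity carries which trait is an assumption, since the slide does not make that recoverable.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Entity:
    """An instance of a type, identified by a GUID (hypothetical structure)."""
    guid: str
    type_name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    traits: Dict[str, Dict[str, Any]] = field(default_factory=dict)


db = Entity("1", "hive_db", {"name": "rawlogs", "createTime": "2015-01-01 10:00"})
adv_id = Entity("3", "hive_column", {"name": "adv_id", "type": "string"})
user_id = Entity("4", "hive_column", {"name": "user_id", "type": "string"})
table = Entity(
    "2",
    "hive_table",
    # References to other entities are stored as GUIDs.
    {"name": "impressions", "db": db.guid, "columns": [adv_id.guid, user_id.guid]},
)

# Assumption: trait placement is illustrative, not taken from the slide.
user_id.traits["PII"] = {}
table.traits["EXPIRES_ON"] = {"Time": "March, 2016"}

entities = {e.guid: e for e in (db, table, adv_id, user_id)}


def names_with_trait(trait: str) -> List[str]:
    """Return the names of all entities tagged with the given trait."""
    return sorted(
        e.attributes["name"] for e in entities.values() if trait in e.traits
    )


pii_names = names_with_trait("PII")
table_db_name = entities[table.attributes["db"]].attributes["name"]
```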
Graph Engine
• Graph Database
  • Titan with storage backed by HBase
• Types and Entities are translated to the Graph Model
  • Classes, Structs and Traits map to a vertex
  • Relationships are mapped as edges
  • Rich relationships between metadata objects
• Indexing and Search
  • Indexing based on type annotations
  • External indexing – Titan backed by Solr
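The mapping of entities to vertices and relationships to edges can be sketched with a tiny in-memory property graph. This is a stand-in for illustration, assuming nothing about how Titan stores data; `PropertyGraph` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class PropertyGraph:
    """Minimal property graph: vertices with properties, labeled edges."""

    def __init__(self) -> None:
        self.vertices: Dict[int, Dict[str, Any]] = {}
        self.out_edges: Dict[int, List[Tuple[str, int]]] = defaultdict(list)

    def add_vertex(self, vertex_id: int, **properties: Any) -> None:
        self.vertices[vertex_id] = properties

    def add_edge(self, source: int, label: str, target: int) -> None:
        self.out_edges[source].append((label, target))

    def out(self, vertex_id: int, label: str) -> List[int]:
        """Follow outgoing edges with the given label."""
        return [t for (lbl, t) in self.out_edges.get(vertex_id, []) if lbl == label]


graph = PropertyGraph()
# Each entity becomes a vertex; its attributes become vertex properties.
graph.add_vertex(1, type="hive_db", name="rawlogs")
graph.add_vertex(2, type="hive_table", name="impressions")
graph.add_vertex(3, type="hive_column", name="adv_id")
graph.add_vertex(4, type="hive_column", name="user_id")
# Each reference between entities becomes a labeled edge.
graph.add_edge(2, "db", 1)
graph.add_edge(2, "columns", 3)
graph.add_edge(2, "columns", 4)

column_names = [graph.vertices[v]["name"] for v in graph.out(2, "columns")]
```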
Titan property graph model
Graph Search with Gremlin
saturn = g.V.has('name','saturn').next()
hercules = saturn.as('x').in('father').loop('x'){it.loops < 3}.next()
hercules.outE('battled').has('time', T.gt, 1).inV.name
==>cerberus
==>hydra
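For readers unfamiliar with Gremlin, the traversal above can be traced by hand in Python. The edge list is a small subset of Titan's "Graph of the Gods" sample dataset; the helper names `in_step` and `out_edges` are hypothetical and only mimic the Gremlin steps they are named after.

```python
# Edges as (source, label, target, properties).
edges = [
    ("jupiter", "father", "saturn", {}),
    ("hercules", "father", "jupiter", {}),
    ("hercules", "battled", "nemean", {"time": 1}),
    ("hercules", "battled", "hydra", {"time": 2}),
    ("hercules", "battled", "cerberus", {"time": 12}),
]


def in_step(vertex, label):
    """Follow edges with the given label backwards, like Gremlin's in()."""
    return [src for (src, lbl, dst, _) in edges if lbl == label and dst == vertex]


def out_edges(vertex, label):
    """Return (target, properties) for outgoing edges, like outE()."""
    return [
        (dst, props)
        for (src, lbl, dst, props) in edges
        if lbl == label and src == vertex
    ]


# Two hops against 'father' edges: saturn -> jupiter -> hercules.
child = in_step("saturn", "father")[0]
grandchild = in_step(child, "father")[0]

# Keep only battles with time greater than 1, then read the opponent's name.
battled_after_1 = sorted(
    dst for (dst, props) in out_edges(grandchild, "battled") if props["time"] > 1
)
```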
Search
Find relevant assets based on their attributes and associations with business terms
DSL with SQL-like syntax based on the type system
  from $type is $trait where $clause select|has $attributes, repeat
Examples
• Select columns from a hive_table where its name is "impressions" and db name is "raw"
  hive_column where table.name="impressions", table.db.name="raw"
• Select all columns from hive tables which are tagged as "PII"
  hive_column is 'PII'
Full text search
  '(rawlogs) AND hive'
  '(rawlogs OR supply*) AND hive'
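The DSL shape shown on this slide (`$type is $trait where $clause select $attributes`) can be assembled with a small helper. This is a sketch under stated assumptions: `build_dsl` is a hypothetical function, it only concatenates strings in the order the slide gives, and its output has not been validated against an Atlas server.

```python
from typing import Optional, Sequence


def build_dsl(
    type_name: str,
    trait: Optional[str] = None,
    where: Optional[Sequence[str]] = None,
    select: Optional[Sequence[str]] = None,
) -> str:
    """Assemble a query string following the pattern shown on the slide."""
    parts = [type_name]
    if trait:
        parts.append(f"is {trait}")
    if where:
        # Clauses are comma-separated, as in the slide's example.
        parts.append("where " + ", ".join(where))
    if select:
        parts.append("select " + ", ".join(select))
    return " ".join(parts)


# Columns tagged PII (trait quoted as on the slide).
pii_columns = build_dsl("hive_column", trait="'PII'")

# Columns of table "impressions" in database "raw".
impression_columns = build_dsl(
    "hive_column",
    where=['table.name="impressions"', 'table.db.name="raw"'],
)
```

Keeping the query construction in one place makes it easier to adjust quoting or clause order if the target Atlas version expects something different.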
Features and Roadmap
Apache Atlas Component Integration & Lineage
• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP
[Diagram: Apache Atlas connected to Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi, HBase, Partner, Custom. Release labels: HDP 2.3; HDP 2.5; Beyond HDP 2.5; HDP 2.5 External]
Business Catalog for Ease of Use
Key Benefits:
Organize data assets along business terms
– Authoritative: Hierarchical Taxonomy Creation
– Agile modeling: Model Conceptual, Logical, Physical assets
– Definition and assignment of tags like PII (Personally Identifiable Information)
Compliance Features: comprehensive features for compliance
– Multiple user profiles including Data Steward and Business Analysts
– Object auditing to track “Who did it”
– Metadata Versioning to track “what did they do”
Faster Insight (Roadmap)
– Data Quality tab for profiling and sampling
– User Comments
Apache Ranger: Introduction
Centralized authorization and auditing across Hadoop components
• HDFS, Hive, HBase, Knox, Storm, YARN, Kafka, Solr, ..
• Audit logs to: Solr, HDFS, RDBMS, Log4j, ..
Resource based security
• Policies for specific set of resources
• Requires revision of policies as resources get added/moved
Classification based security
• Policies for classifications and not for specific resources
• A single policy protects resources in multiple components
• As classifications for resources change, appropriate policies are applied automatically
• Enables separation of duties: resource-classification and security policies
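The difference between resource-based and classification-based security can be illustrated with a toy policy check. This is a deliberately simplified sketch, not Ranger's policy engine or API: `TagPolicy`, `resource_tags` and `is_allowed` are hypothetical, the resource names and the `analysts` group are made up, and real policies have richer allow/deny semantics.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class TagPolicy:
    """One policy per classification, independent of any specific resource."""
    tag: str
    allowed_groups: FrozenSet[str]


# Tags assigned to resources in several components (illustrative names).
resource_tags: Dict[str, Set[str]] = {
    "hdfs:/data/customers": {"PII"},
    "hive:default.customers": {"PII"},
    "kafka:clickstream": set(),
}

policies: List[TagPolicy] = [TagPolicy("PII", frozenset({"us_hr"}))]


def is_allowed(group: str, resource: str) -> bool:
    """Deny when a tag on the resource has a policy that excludes the group."""
    tags = resource_tags.get(resource, set())
    for policy in policies:
        if policy.tag in tags and group not in policy.allowed_groups:
            return False
    return True


# An untagged resource is not restricted by the PII policy...
before = is_allowed("analysts", "kafka:clickstream")
# ...but once a steward tags it, the existing policy applies with no policy edit.
resource_tags["kafka:clickstream"].add("PII")
after = is_allowed("analysts", "kafka:clickstream")
```

The single `PII` policy covers the HDFS path, the Hive table and the Kafka topic alike, which is the point of the "single policy protects resources in multiple components" bullet.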
Scalable Access Control – Reusable Tag Policy
[Diagram: two approaches compared, labeled "Not Scalable" and "Scalable"]
User group:
• AD
• Linux
Resources:
• Files
• Tables
• Topologies
Atlas Tag:
• PII
ANY asset PII:
• Files
• Tables
• Topologies
Single Admin Group Assigns
Many Stewards Tag + Single point of enforcement and audit
All future tagging is covered by existing policy
Open: Governance Ready Certification Program
Choice: Customers choose features that they want to deploy – a la carte versus vendor lock-in
Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy
Agile: Low switching costs, faster deployment and innovation
Centralized: Common SLA & common open metadata store
Flexibility: Interoperability of products through Atlas metadata
Safe: HDP at core to provide stability and interoperability
Completed:
• Waterline
• Dataguise
• Attivio
• Trifacta
Pending:
• Collibra
• Alation
• Meta Integration (Miti)
• Paxata
• Syncsort
• Talend
Roadmap…
• Multi-tenancy
• Titan 1.x Migration
• Hive Column Level Lineage
Summary
• Designed for Hadoop at platform, not application level
• High Confidence data in Hadoop for regulated verticals
• Compliance and business objectives aligned to data organization
• Faster discovery for analysts – reduce time to value
• Agile and adaptable – ensures information is current by native connectors
• Dynamic protection with Ranger in simple audited policies
Learn More:
• Apache Incubator link http://atlas.incubator.apache.org/
• Hortonworks links: http://hortonworks.com/solutions/security-and-governance/
• https://community.hortonworks.com/spaces/64/governance-lifecycle-track.html?topics=Atlas&type=question
• Atlas Technical User Guide - http://atlas.incubator.apache.org/AtlasTechnicalUserGuide.pdf
Questions
Backup
Dynamic Access Policy
Apache Ranger + Atlas Integration
How does Atlas work with Ranger at scale?
Atlas provides: Metadata
• Business Classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects: sensitive “PII” tag of department HR will be inherited by group HR > Driver
• Atlas will notify Ranger via Kafka Topic for changes
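The inheritance described above, where a tag on HR also applies to HR > Driver, can be sketched as a walk up the taxonomy. This is an illustration only: `taxonomy_parent`, `direct_tags` and `effective_tags` are hypothetical names, and the sketch assumes each term has at most one parent.

```python
from typing import Dict, Optional, Set

# Each taxonomy term points at its parent; the root has no parent.
taxonomy_parent: Dict[str, Optional[str]] = {
    "Company": None,
    "HR": "Company",
    "Driver": "HR",
}

# Tags assigned directly to a term.
direct_tags: Dict[str, Set[str]] = {"HR": {"PII"}}


def effective_tags(term: str) -> Set[str]:
    """Union of the term's own tags and those of all its ancestors."""
    tags: Set[str] = set()
    current: Optional[str] = term
    while current is not None:
        tags |= direct_tags.get(current, set())
        current = taxonomy_parent[current]
    return tags


driver_tags = effective_tags("Driver")
company_tags = effective_tags("Company")
```

`Driver` picks up `PII` from its parent `HR`, while `Company` stays untagged because inheritance flows only from parent to child.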
[Diagram: Apache Atlas connected to Hive, Ranger, Falcon, Kafka, Storm]
Atlas provides the metadata tag to create policies
Ranger provides: Access & Entitlements
• Ranger will cache tags and asset mapping for performance
• Ranger will have a policy based on tags instead of roles.
• Example: PII = <group>. This can work for many assets.
Automatic update of policies – active protection
[Diagram: Atlas publishes metadata updates through Kafka topics; Ranger consumes them]
Atlas
• Metastore: Tags, Assets, Entities
• Notification Framework
Kafka Topics
• Notification
• Metadata updates
• Message durability
Ranger
• Atlas Client: subscribes to topic, gets metadata updates
• PDP
• Resource Cache
Optimized for Speed
Event driven updates
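The event-driven update path can be sketched with an in-process queue standing in for the Kafka topic. Everything here is hypothetical: the message shape, the resource names and the helper functions are assumptions for illustration, and a real deployment would use a Kafka consumer rather than `queue.Queue`.

```python
import queue
from typing import Dict, Iterable, Set

topic: "queue.Queue[dict]" = queue.Queue()   # stand-in for a Kafka topic
tag_cache: Dict[str, Set[str]] = {}          # resource -> tags, cached on the consumer side


def publish_tag_change(resource: str, tags: Iterable[str]) -> None:
    """Producer side: emit an event whenever a resource's tags change."""
    topic.put({"resource": resource, "tags": sorted(tags)})


def drain_updates() -> int:
    """Consumer side: apply all pending events to the local cache."""
    applied = 0
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            return applied
        tag_cache[event["resource"]] = set(event["tags"])
        applied += 1


publish_tag_change("hive:default.ww_customers", {"PII"})
publish_tag_change("hive:default.tax_2010", {"EXPIRES_ON"})
applied_count = drain_updates()
```

Because authorization checks read from `tag_cache` rather than calling the metadata store, lookups stay local, which is the trade-off the "Optimized for Speed" label points at.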
Apache Ranger: Authorization and Auditing
[Architecture diagram: Enterprise Users; Ranger Administration Portal; Ranger Policy Store; Ranger Audit Store (Solr, HDFS, RDBMS, Log4j); Hadoop Components, each with a Ranger Plugin: HDFS, Hive Server2, HBase, Knox, Storm, YARN, Kafka, Solr]
Big Data Governance
Current Landscape
• Opaque data in a variety of data stores – HDFS, S3, data warehouses
• Schema is hardly sufficient – Hive Metastore, Avro, Data Warehouse
• Platform tools like Ranger and Falcon solve parts of the problem
Need for Data Governance
Organizations need data governance to understand their information and to answer questions such as:
• What do we know about our information?
• Where did this data come from and how is it being used?
• Does this data adhere to company policies and rules?
• Need for effective control and consumption of data
Atlas helps customers discover information about data objects, their meaning, location, characteristics, and usage.
Business Taxonomy
Business Taxonomy (Catalog)
The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.
Tags: Traits vs. Labels vs. Business Taxonomy
Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A tag PII can be used in HR as well as Finance or Sales.
Benefits:
A view of data assets organized by business language
Compliance, Acceptable use – Dynamic Metadata based access control
Common taxonomy through Hadoop components
Principal Roles & Activities in an Enterprise
• Data Steward – Curator, responsible for data classification – associate business taxonomy and tagging, access policies
• Data Scientist – Analyst, primary consumer of Business Taxonomy
• Administrator/Operations – Role management, Data lifecycle management (Archival, retention)
• Data Engineer – Data ingress and egress, semantic data quality
• 50% – 80%+ of time spent looking for data
• Profit Center
• Primary User of Atlas
• Enables Scientist
Goal: < 25% of time spent on finding data = Empowering scientists to spend their time uncovering insights faster
Data Governance Use Cases: Impact analysis
HortonAdNetwork – A large ad network with an international footprint and multiple publishers and advertisers across several countries
Complex ETL jobs and data pipelines processing real-time ad network data from several different sources and various data processing platforms
No easy way to determine the root cause when something is off the charts
Data analysts need effective data provenance tools for impact/root cause analysis
Cross component lineage is a must
Data Lineage (Provenance)
Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources
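Impact and root cause analysis over lineage both reduce to graph traversal. The sketch below assumes a made-up pipeline for an ad network; the dataset names, the `derived_from` mapping and the `downstream`/`upstream` helpers are hypothetical and are not taken from the demo.

```python
from collections import deque
from typing import Dict, List

# dataset -> datasets derived directly from it (hypothetical pipeline)
derived_from: Dict[str, List[str]] = {
    "ad_impressions_raw": ["impressions_hourly"],
    "clicks_raw": ["impressions_hourly"],
    "impressions_hourly": ["billing_summary", "country_report"],
    "billing_summary": [],
    "country_report": [],
}


def downstream(dataset: str) -> List[str]:
    """Impact analysis: everything derived from the dataset, breadth first."""
    seen = set()
    order: List[str] = []
    pending = deque([dataset])
    while pending:
        current = pending.popleft()
        for child in derived_from.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                pending.append(child)
    return order


def upstream(dataset: str) -> List[str]:
    """Root cause analysis: every dataset that feeds into the given one."""
    return sorted(d for d in derived_from if dataset in downstream(d))


impacted = downstream("clicks_raw")
suspects = upstream("billing_summary")
```

If `clicks_raw` is found to be corrupt, `impacted` lists the reports to distrust; if `billing_summary` looks wrong, `suspects` lists where to start looking.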
Data Governance Use Cases – Compliance
HortoniaBank – mid-size bank expanding from US to international markets
2 Customer Tables owned by BH: 50K customer records each with 38 fields (PII, PHI, PCI & non-sensitive data)
– us_customers: USA person data only
– ww_customers: multi-language, multi-country, localized person data
1 data set of prospects leased from a data broker
– tax_2010: Data lease expired already!
User | Group | Access Privileges
joe_analyst | us_employee | US data only, non-sensitive data only, rest forbidden depending on sensitivity
kate_hr | us_hr | US data only, all sensitive data (PCI, PII, PHI)

Tag Based Policies
• US HR team members can see all original data (PCI, PII, ...)
• Analysts are prohibited from viewing PII data in any of the tables
• Anyone except operations/Admin is prohibited from accessing tax_2010 after the specified date – the Expires_on policy turns off access on the configured expiry date
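The three tag-based policies above can be expressed as a short access check. This is a simplified, table-level sketch: the `tables` mapping, the expiry date of 1 March 2016 and the `admin` group name are assumptions chosen for illustration, and actual enforcement in Ranger works differently and at finer granularity.

```python
from datetime import date
from typing import Dict, Optional, Set, TypedDict


class TableMeta(TypedDict):
    tags: Set[str]
    expires_on: Optional[date]


# Table-level tags; the expiry date is a made-up value for illustration.
tables: Dict[str, TableMeta] = {
    "us_customers": {"tags": {"PII"}, "expires_on": None},
    "tax_2010": {"tags": set(), "expires_on": date(2016, 3, 1)},
}


def can_access(group: str, table: str, today: date) -> bool:
    """Apply the expiry policy first, then the PII policy."""
    meta = tables[table]
    expires_on = meta["expires_on"]
    # Expires_on: after the expiry date only operations/admin keep access.
    if expires_on is not None and today > expires_on and group != "admin":
        return False
    # PII: analysts (group us_employee) may not view PII data.
    if "PII" in meta["tags"] and group == "us_employee":
        return False
    return True


kate_sees_customers = can_access("us_hr", "us_customers", date(2016, 1, 15))
joe_sees_customers = can_access("us_employee", "us_customers", date(2016, 1, 15))
kate_sees_expired = can_access("us_hr", "tax_2010", date(2016, 6, 1))
```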
Expanded Native Connector: Dataset Lineage
[Diagram labels: Sqoop; Teradata Connector; Apache Kafka; Custom Activity Reporter; Metadata Repository; RDBMS]
Any process using Sqoop is covered
No other tool tracks IOT of the box