
Post on 07-Jun-2020


Data Innovators Summit

AI-Powered Data Cataloging

Hyoun Park, CEO & Founder

Analyst Keynote

AMALGAM INSIGHTS | Twitter: @hyounpark

The Business Imperative For Data-Driven Context

Hyoun Park, Founder and Chief Analyst

As data professionals and quantitative executives, we’ve been told to build the foundation for the “Data-Driven Enterprise”

Big Themes of the 2010s


Big Data Is The New…


After the Hype, We Got

Faster Data

Bigger Data

More Usage


And the Results:

Faster Data: Reduced Data Half-Life, Lower Predictive Value

Bigger Data: Too Much to Analyze, Too Hard to Access

More Usage: Lots of Silos, Lack of Consistency

THE QUEST FOR CONTEXT AS A BUSINESS DRIVER

“Metadata is a love note to the future.”

Jason Scott, Textfiles.com

We started with the basics:

Building a data and metadata dictionary of basic terms.


Challenges of the Data Dictionary

Often built for data practitioners, not data users

Limited to structured data

Focused on specific departments by design

Hard to maintain

Why is the Data Dictionary so hard to maintain?

The Biggest Challenge is just that Data grows too fast!

Structured Data Growth

Mobile Data Growth

All Data Growth


Structured data is estimated to double every 18-24 months, in step with Moore’s Law.

Based on its 2019 report, the CTIA states that data use is up over 73x since 2010.

That averages out to each year’s usage being roughly 170% of the prior year’s, or about 70% growth per year, every year!

Including unstructured data, we estimate that 90% of the world’s data has been created in the last two years.

This growth rate is equivalent to an increase of over 200% in data every year!
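The three growth claims above can be checked with a short compound-growth calculation. This is a minimal sketch, not the presenter's own math: the helper names are made up for illustration, and the eight-year span for the CTIA figure (2010 through 2018) is an assumption, since the slide gives only the start year and the report year.

```python
def annual_growth_factor(total_multiple: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_multiple over `years`."""
    return total_multiple ** (1.0 / years)


def percent_increase(factor: float) -> float:
    """Convert a yearly multiplier (e.g. 1.5x) into a percent increase (e.g. 50%)."""
    return (factor - 1.0) * 100.0


# Structured data: doubling every 18 to 24 months.
structured_fast = annual_growth_factor(2, 1.5)
structured_slow = annual_growth_factor(2, 2.0)

# Mobile data: 73x overall. ASSUMPTION: 8 years of compounding (2010-2018).
mobile = annual_growth_factor(73, 8)

# All data: if 90% was created in the last two years, the total is now
# 1 / (1 - 0.90) = 10 times what it was two years ago.
all_data = annual_growth_factor(1 / (1 - 0.90), 2)

for label, factor in [
    ("structured (18-month doubling)", structured_fast),
    ("structured (24-month doubling)", structured_slow),
    ("mobile (73x over 8 years)", mobile),
    ("all data (10x over 2 years)", all_data),
]:
    print(f"{label}: {factor:.2f}x per year, {percent_increase(factor):.0f}% increase")
```

Under these assumptions the mobile figure comes to about 1.71x per year, while the "90% in two years" estimate comes to a little over 3x per year, which matches the "over 200% increase" on the slide.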


The Battle for Context means that “Big Data” needs to be categorized, curated, or tossed away.

The future of AI requires Data Management at massive scale.

THE NEXT GENERATION OF METADATA MANAGEMENT:

DATA CATALOGUING


We say “data cataloguing” as a verb, not a noun

“Data cataloguing” needs to be an ongoing process, not a one-time build like a data warehouse or an ETL job.


Data Cataloguing

Discovery | Categorization | Curation

The Challenge of Discovery

TOO MANY data and metadata catalogs, repositories, glossaries, taxonomies, and ontologies scattered across your organization

[Diagram. Column labels: Data Sources (Marts, Lakes, Clouds, etc.) | Data Catalogs | Consolidated Data | Holistic Business Data. Departmental Catalog 1 covers Source 1 and Source 2; Departmental Catalog 2 covers Source 3 and Source 4. Other labels: Data Pipelines, Data Virtualization.]

The Challenge of Categorization

The initial categorization of data requires human expertise and context. AI can supplement human effort, but cannot fully replace human experience.

So Many Stakeholders!

Catalog Stakeholders

Data Analysts

Data Engineers

Data Scientists

Developers

Line-of-Business Managers

Executives

The Challenge of Curation

From now on, there will always be too much data from too many sources for humans to fully curate. AI will be necessary to provide an initial pass at curation.

The Answer?

A combination of top-down management, machine-learning-aided discovery and curation, and dedicated efforts to support metadata consistency.

The Virtuous Cycle of Context

Discovery: Data Sources and Dictionaries

Categorization: Business Experts and Subject Matter Experts

Curation: AI to provide initial results based on rules and confidence levels
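The idea of AI providing initial results "based on rules and confidence levels" might look like the sketch below. It is an illustration only: the rules, the confidence numbers, and the `review_below` threshold are all hypothetical, and a real catalog would typically learn or tune them rather than hard-code them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Suggestion:
    column: str
    category: Optional[str]    # None when no rule matched
    confidence: float          # 0.0 to 1.0
    needs_expert_review: bool  # True when a person should look at it first


# Hypothetical rules: (category, confidence, test on the lowercased column name).
Rule = Tuple[str, float, Callable[[str], bool]]

RULES: List[Rule] = [
    ("email", 0.95, lambda name: "email" in name),
    ("phone", 0.90, lambda name: "phone" in name or name.endswith("_tel")),
    ("identifier", 0.60, lambda name: name.endswith("_id")),
]


def suggest(column: str, review_below: float = 0.85) -> Suggestion:
    """Return the highest-confidence rule match, flagging weak or missing matches."""
    name = column.lower()
    matches = [(conf, cat) for cat, conf, test in RULES if test(name)]
    if not matches:
        return Suggestion(column, None, 0.0, True)
    conf, cat = max(matches)
    return Suggestion(column, cat, conf, conf < review_below)


for col in ["Customer_Email", "order_id", "notes"]:
    print(suggest(col))
```

High-confidence suggestions are pre-filled and wait for the next scheduled review, while low-confidence or unmatched columns go to business and subject matter experts first, so human time is spent where the machine is least sure.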


This is why:

AI Needs Data Management

and

Data Management Needs AI


So, the real goal?

The Context-Driven Business. Data is a means to an end, not the end goal.


Key Recommendations based on Amalgam Insights’ inquiries

1. Discover all of your metadata stores across all your departments.

Embrace the diversity of names!

Deal with the politics!
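One way to "embrace the diversity of names" is to consolidate departmental catalogs while keeping every department's own label as an alias. The sketch below assumes hypothetical department names and column names, and it only reconciles spelling variants (case, spaces, underscores); true synonyms such as "client" versus "customer" would still need a human-maintained mapping.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(name: str) -> str:
    """Lowercase and drop separators so 'Cust_ID' and 'cust id' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def consolidate(catalogs: Dict[str, List[str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Group names from every department under a shared normalized key,
    remembering which department used which original spelling."""
    merged = defaultdict(list)
    for department, names in catalogs.items():
        for name in names:
            merged[normalize(name)].append((department, name))
    return dict(merged)


# Hypothetical departmental catalogs.
catalogs = {
    "sales": ["Cust_ID", "Order Total"],
    "finance": ["cust id", "order_total", "GL Account"],
}

merged = consolidate(catalogs)
for key, aliases in merged.items():
    print(key, aliases)
```

Entries that appear under only one department (here, "GL Account") are good candidates to raise with that department's data experts.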


2. Identify Data Experts within each key business unit and department.

Cheat Code: Check with finance to see which units and departments have the biggest P/Ls.

Start with them.

3. After you have a working and standardized taxonomy, automate and conduct long-tail categorization with AI.

You’ll still need people to verify categorization, but AI can find interesting and relevant groupings.

4. Use these steps as a Virtuous Cycle for Context.

People come in for scheduled reviews, but can’t spend all their time on discovery and categorization. The goal is to maximize the value of human insights, not to turn humans into machines.

The “Data Driven Enterprise” is the Past

The “Context Driven Enterprise” is the Future

Data Innovators Summit

AI-Powered Data Cataloging

© Informatica. Proprietary and Confidential.

Data Drives All Digital Transformation Priorities

Governance | Self-Service/Advanced Analytics | Cloud Modernization | Customer Experience | Machine Learning/AI

Explosion in Data Volume

New Data Types (mobile, social, IoT)

New Users

Data in the Cloud

500 million business data users and growing

Over 94% of data center traffic will come from the Cloud

20 billion connected devices

1 billion workers will be assisted by machine learning or AI

20.6 zettabytes per year in global data center traffic

Growing Complexity of Data Landscape


Intelligent Data Cataloging is the First Step

Governance: Lineage, Change Notification; Business Context Association

Self-Service/Advanced Analytics: Lineage, Related Data, Recommendations; Collaboration – Ratings, Reviews, Certification

Cloud Modernization: Lineage, Impact Analysis; Detailed Data Usage Information

Customer Experience: Lineage; Discovery and Onboarding


Enterprise Data Catalog: Data Map for the Enterprise

• Find the data you need with simple, powerful semantic search

• Understand your enterprise data with a holistic view

• Trust your data by understanding its lineage and quality


Enterprise Data Catalog: AI, Human Knowledge and Collaboration

• AI-powered automatic discovery, enrichment and curation

• Business context via intelligent business term association

• Collaboration & social curation to tap into shared data knowledge


Open APIs for Extensibility: Extend EDC Capabilities into Your Environment

• EDC Tableau Extension - understand data in context within the native Tableau UI

• EDC + Cloud Data Integration – accelerate development; discover and select assets, auto-populate connection values

Informatica: PowerCenter | DQ | MDM | BDM | DIH | BG | ILM | Axon | Informatica Cloud

Databases: Oracle | DB2 | DB2 for z/OS | SQL Server | Sybase | Teradata | Netezza | JDBC | SQL Scripts | SAP HANA | Stored Procedures

Applications: SAP R/3 | Salesforce | Oracle | Workday

Big Data: HIVE (Cloudera, Hortonworks, MapR, IBM BigInsights, EMR, HDI) | HDFS | MapRFS | Cloudera Navigator | Atlas

Cloud Platforms: AWS S3 | AWS Redshift | Azure SQL DB | Azure SQL DW | Azure ADLS | Azure Blob | Google BigQuery | ADLS Gen 2

File Formats: CSV | Delimited | XML | JSON | Avro | Parquet | MS Excel | Adobe PDF | Flat File | MS PowerPoint | MS Word

Business Intelligence: Tableau | IBM Cognos | SAP BusinessObjects | MicroStrategy | OBIEE

Other: Microsoft SSIS | Erwin Models | PowerDesigner | Oracle Data Integrator | IBM DataStage | Custom Scanner Framework

Enterprise Data Catalog

• Semantic Search • Domain Discovery • Similarity Clustering • Business Term Association

• Relationships • Business Context • Glossary Integration • Custom Annotations

Analytics | Data Governance | Master Data Management | Cloud Modernization | Metadata Intelligence | Data Integration | Data Quality

• Discovery • Profiling • Lineage • Impact Analysis

• Reviews/Ratings • Questions/Answers • Data Certifications • Change Notifications

The Catalog of Catalogs

On-prem Databases | File Systems | BI Tools | On-prem/SaaS Apps | ETL | AWS Glue | Azure Data Catalog | ADLS | Google Data Catalog

Knowledge Graph + Powered by AI/ML

Breadth of Active Metadata

Open APIs, Full Access

Enterprise Data Catalog


Enterprise | Unified Metadata | Intelligence

Schema Inference

Recommendations

Data Tagging

Entity Discovery

Relationship Discovery

Natural Language Translation

Data Similarity

Anomaly Detection

©New York Life Insurance Co., 2019

About New York Life Insurance


Since 1845, people have worked with New York Life to protect their families and futures. We believe in the importance of human guidance and in trusted relationships built on being there when our customers need us most.

Fast Facts

• Headquarters in Manhattan

• 11,000 employees across 15 corporate locations

• 12,000 agents across over 200 field sales offices around the United States

• Providing customers with a range of products and services, including life insurance, annuities, long term care insurance, mutual funds, exchange traded funds, institutional investments, and investment services


Supporting a Logical Data Model


Axon | Enterprise Data Catalog | Informatica Data Quality

Business Analysts | Subject Matter Experts | Data Modeler | Data Steward

Data Sources

Attributes | Systems | Glossary | Data Quality | People | Data Sets

• Data elements are published in Axon and linked to source systems and attributes; preferred source identified

• Data steward prioritizes data elements for respective domain

• Metadata & lineage for each source database are scanned in EDC

• EDC columns linked to business terminology and definitions from Axon

• Data quality mapplets are created in IDQ and linked to local data quality rules in Axon

• Data discovery profiles with primary key and foreign key analysis executed on key sources

• Data stewards and subject matter experts define data elements, standardize reference values, formats, etc.

• Axon attributes are linked to columns from EDC


Data Governance Capabilities


Metadata Management

• Glossary – Business terminology

• Catalog – Technical metadata & lineage

• Auto-discovery

Data Quality Management

• Profiling

• Rules & metrics

• Identifying, prioritizing, & remediating data defects

Privacy, Risk, & Compliance

• Decision making

• Policies

• Data Access

• Appropriate use

Master Data Management

• Modern data model

• Processes to create & update

• Controls to ensure quality

Tableau and Informatica

Solution Theater Presenter:

EDC Native Extension for Tableau

Hundreds of Joint Customers

Cloud | Big Data | Traditional/Real Time

Tableau Catalog

Informatica Enterprise Data Catalog


EDC Extension & Scanner for Tableau

• Discover and understand data in context within the native Tableau user interface

• Detailed information and lineage for Tableau analytics with EDC’s native scanners


Learn More

Don’t miss the customer perspectives and demos at the AI-Powered Data Cataloging Virtual Summit:

• Data Cataloging for Data Governance: Maersk

• Data Curation & Collaboration for Self-Service Analytics: Nissan North America

• Data Lineage – The Foundational Use Case: Rabobank

• EDC Adoption Best Practices: Biogen

Thank You