ibm industry models and data lake
TRANSCRIPT
© 2016 IBM Corporation
IBM Industry Models and the IBM Data Lake
January 2017
Pat O’Sullivan – IBM Analytics
Email: [email protected]
Twitter: @PatOSullivanIBM
© 2017 IBM Corporation
Disclaimer
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
The broadening scope of analytics
[Diagram labels: SOA, Master Data Management Hub, Applications, Data Warehouse, Operational Data Store, Pattern Discovery for Analytics, Sandboxes, Analyze Values, Search for Data, Reporting, Hadoop, Data Lake]
With the addition of a business desire for real-time analytics, self-service data and increasing regulations relating to individual privacy, it becomes necessary to have a well-defined, managed and governed approach to information architecture. We call this IBM’s Data Lake.
Big Data Lakes or Swamps?
As we collect data
• Can we preserve clarity?
• Do we know what we are collecting?
• Can we find the data we need?
Are we creating a data swamp?
How do we build trust in big data?
• Do we know what data is being used for?
The Data Lake
Data Lake = Efficient Management, Governance, Protection and Access.
[Diagram labels: Data Lake, Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories]
Users supported by the Data Lake
[Diagram labels:
Data Lake (System of Insight): Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories
Users: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Lake Operations
Enterprise IT: Other Data Lakes, Systems of Engagement, Systems of Automation, Systems of Record, New Sources]
The Data Lake subsystems
[Diagram labels:
Data Lake (System of Insight): Information Management and Governance Fabric, Catalogue, Self-Service Access, Enterprise IT Data Exchange, Data Lake Repositories
Users: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Lake Operations
Enterprise IT: Other Data Lakes, Systems of Engagement, Systems of Automation, Systems of Record, New Sources]
Data lake repositories
[Diagram labels: Specialist Processing; Structured and Optimized; System-level Data (Landing Area); Accumulation of Context for Master and Reference Data; Self-managed Data; Metadata; Refined data formatted for particular consumers]
IBM Industry Data Models
IBM Industry Data Models provide pre-defined data structures which help accelerate data warehouse, data lake and business intelligence projects.
• Industry specific issues being addressed
• Integrated set of Models from business requirements to low level design
• Predefined and pretested deployment to RDBMS and HDFS environments
[Diagram labels:
Industries: Banking, Insurance, Financial Markets, Retail, Healthcare, Telecom, E&U
Business issues: Customer Insight, Profitability, Risk, Regulatory Compliance
Business content: Business Vocabulary, KPIs, Supportive Terms, Data Classifications, Business Models
Technical content: Analysis Models, Design Models, Atomic DW Models, Dimensional Models
Deployment targets: Data Warehouse, Operational Data Store, Big Data, Data Marts, Information Integration & Governance
Outcome: Project Acceleration]
IBM Industry Models and main data lake deployment paths
• Business Vocabulary is deployed to the Data Lake Catalog via tools such as InfoSphere Information Governance Catalog (IGC)
• Atomic (Inmon) and Dimensional (Kimball) Data Models are deployed to the data lake via tools such as InfoSphere Data Architect (IDA) and ERwin
• Supporting collateral: Models-specific white papers and best practice documents outlining the main deployment patterns and implementation considerations
Overall set of Models
[Diagram labels:
Business Vocabulary (IGC): Business Terms / FSDM, Supportive Content, Analytical Requirements
Analysis level Models (IDA): Business Data Model
Design level Models (IDA): Atomic Warehouse Model, Dimensional Warehouse Models]
Big Data Landscape – main components touched by the IBM Data Models
[Diagram: the Data Lake reference architecture. Labels:
Users and consumers: Line of Business Applications; Simple, Ad Hoc Discovery and Analysis; Reporting; Decision Model Management; Governance, Risk and Compliance Team; Information Curator; Data Reservoir Operations
Enterprise IT: System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources, Other Systems of Insight
Interactions: View-based Interaction, Catalog Interfaces, Curation Interaction, Raw Data Interaction, Enterprise IT Interaction, Information Service Calls, Search Requests, Report Requests, Data Access, Data Deposit, Data In, Data Out, Events to Evaluate, Notifications, Deploy Decision Models, Deploy Real-time Decision Models, Understand Information Sources, Advertise Information Source, Understand Compliance, Report Compliance
Services: Information Integration & Governance, Management
Repository groups: Descriptive Data, Harvested Data, Historical Data, Published Data
Repositories: Catalog, Information Warehouse, Deep Data, Operational History, Reporting Data Marts, Sandboxes]
For full information on the IBM Data Lake Reference Architecture, see the IBM Redbook "Designing and Operating a Data Reservoir": http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248274.html?Open
Options regarding common models/glossaries to encourage standardization and reuse
[Diagram: the Data Lake reference architecture. Labels:
Consumers: Line of Business Applications; Consumers of Insight; Simple, ad hoc Discovery and Analysis; Reporting; Analytical Insight Applications; Data Analysts/Data Scientists; Analytics Tools; Data Management Operations
Enterprise IT: System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources
Interactions: Enterprise IT Interaction, View-based Interaction, Curation Interaction, Information Service Calls, Service Interfaces, Publishing Feeds, Data Ingestion, Data In, Data Out, Data Access, Data Deposit, Search Requests, Report Requests, Deploy Decision Models, Deploy Real-time Decision Models
Services: Information Integration & Governance, Management, Execution Engines, Workflow, Monitor
Repository groups: Descriptive Data, Shared Operational Data, Harvested Data, Historical Data, Published
Repositories: Catalog, Asset Hub, Object Cache, Deep Data, Operational History, Information Warehouse, Reporting Data Marts, Sandboxes]
Shared set of term and physical asset definitions in the Catalog that underpin all queries by all users
Data Scientists can make use of predefined catalogs and are likely to create new catalog entries during their daily activities
Business Users use specific subsets of the same shared Catalog as other users to ensure consistency of language and meaning
Any published structures required by the Business are based on the same standard definitions and structures as those used elsewhere
Standardized set of Business Term and Data Model definitions used to enforce both the meaning and, where appropriate, the structure of stored data
Data Management Operations use the same shared set of models and catalog entries to build the necessary production ETL assets
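The shared-catalog idea above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Term class, the sample terms, the asset names and the subset_for helper are hypothetical and are not part of IGC or any IBM API. It only shows that a user-specific subset can be a view over the same shared definitions rather than a copy.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A business term and the physical assets tagged with it (hypothetical structure)."""
    name: str
    definition: str
    assets: list = field(default_factory=list)


# One shared catalog; every user group reads the same definitions.
CATALOG = {
    "Patient": Term("Patient", "A person receiving care", ["warehouse.patient"]),
    "Claim": Term("Claim", "A request for payment", ["warehouse.claim"]),
    "Policy": Term("Policy", "An insurance contract"),
}


def subset_for(term_names):
    """Return a view of the shared catalog restricted to the given terms.

    The view references the same Term objects, so a change made in the
    shared catalog is visible through every subset.
    """
    return {name: CATALOG[name] for name in term_names if name in CATALOG}


business_view = subset_for(["Patient", "Claim"])

# For example, a data scientist registers a new asset against an existing term.
CATALOG["Claim"].assets.append("sandbox.claim_summary")

print(business_view["Claim"].assets)
```

Because the business view holds references rather than copies, the newly registered asset is visible to business users without any synchronisation step.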
Catalog Deployment - Models in the Descriptive Data Zone
[Diagram labels:
Business Vocabulary (IGC): Business Terms / FSDM, Supportive Content, Analytical Requirements
Analysis level Models (IDA)
Design level Models (IDA): Atomic Warehouse Model, Dimensional Warehouse Models]
Purpose
Provide a standard business language and information model that can be used when discussing business concepts and related technical components.
Steps
1. Business Vocabulary Models are deployed to the Catalog (IGC), where they are used and maintained by business analysts and data stewards.
2. The Logical Data Models (e.g. Business, Atomic and Dimensional Warehouse Models) are imported into the Catalog. However, they are mastered in a modelling tool such as InfoSphere Data Architect.
Considerations
• Evolving patterns/best practices for the overall management of enterprise and LOB glossaries
[Diagram: the Data Lake reference architecture with the Descriptive Data Zone (Descriptive Data, Catalog) highlighted and callouts 1 and 2 marking the steps. Other labels: Business Data Model; Business Users; Data Scientists; Enterprise IT; Enterprise IT Interaction; Information Integration & Governance; Asset Hub; Deep Data; Operational History; Information Warehouse; Reporting Data Marts; Sandboxes]
Warehouse and Marts – Models in Integrated Warehouse Zone
[Diagram labels:
Business Vocabulary (IGC): Business Terms, Supportive Content, Analytical Requirements
Models: Atomic Warehouse Model, Dimensional Warehouse Models
Architecture: Enterprise IT, Enterprise IT Interaction, Information Integration & Governance, Descriptive Data, Catalog, Asset Hub, Operational History]
Purpose
Provide data modellers with consistent data structures for deployment across the different aspects of an integrated Information Warehouse and Marts zone.
Steps
1. The Atomic Warehouse Model is used as the basis for the Inmon-style central relational Information Warehouse.
2. The Dimensional Warehouse Model is used as the basis for the Kimball-style Dimensional Information Warehouse.
3. The Dimensional Warehouse Model provides the business-issue-specific structures to enable the deployment of Reporting Data Marts.
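As a rough illustration of the difference between the atomic (Inmon-style) and dimensional (Kimball-style) structures named in the steps, the following Python sketch derives a small dimension and fact table from normalised records. The patient and claim records, the field names and the claim_amount_by_region helper are hypothetical examples and are not taken from the IBM models themselves.

```python
# Atomic (normalised, Inmon-style) records: each entity is stored once.
patients = {
    1: {"name": "A. Smith", "region": "North"},
    2: {"name": "B. Jones", "region": "South"},
}
claims = [
    {"claim_id": 10, "patient_id": 1, "amount": 120.0},
    {"claim_id": 11, "patient_id": 1, "amount": 80.0},
    {"claim_id": 12, "patient_id": 2, "amount": 50.0},
]

# Dimensional (Kimball-style) structures derived from the atomic data:
# a dimension keyed by patient_key (here simply the source id) and a
# fact table holding the measures.
dim_patient = {pid: {"patient_key": pid, **attrs} for pid, attrs in patients.items()}
fact_claim = [
    {"patient_key": c["patient_id"], "claim_amount": c["amount"]} for c in claims
]


def claim_amount_by_region(facts, dim):
    """A typical data mart query: sum a measure grouped by a dimension attribute."""
    totals = {}
    for row in facts:
        region = dim[row["patient_key"]]["region"]
        totals[region] = totals.get(region, 0.0) + row["claim_amount"]
    return totals


print(claim_amount_by_region(fact_claim, dim_patient))
```

The atomic records stay the single source of detail, while the fact and dimension structures are shaped for one reporting question.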
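[Diagram: the Data Lake reference architecture with the Integrated Warehouse & Marts Zone highlighted and callouts 1 to 3 marking the steps. Other labels: Deep Data; Information Warehouse; Reporting Data Marts; Business Users; Analysis level Models (IDA); Design level Models (IDA)]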
Big Data Deployment – Models in the Landing Area Zone
[Diagram labels:
Business Vocabulary (IGC): Business Terms, Supportive Content, Analytical Requirements
Models: Atomic Warehouse Model, Dimensional Warehouse Models
Architecture: Enterprise IT, Enterprise IT Interaction, Information Integration & Governance, Asset Hub, Information Warehouse]
Purpose
Provide the basis for a consistent and appropriate use of schemas in the different repositories in the Landing Area Zone.
Steps
1. The Atomic Warehouse Model is used as the basis for the deployment of both schema-at-write and schema-at-read Hadoop Deep Data structures.
2. The Atomic Warehouse Model may provide the basis for the deployment of schema-at-read Operational History raw data structures.
Considerations
• Further investigation is needed into the potential role for DWM deployments to Hadoop-based technology
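The distinction between schema-at-write and schema-at-read in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SCHEMA dictionary, the in-memory lists standing in for repositories, and all function names are hypothetical, and no Hadoop API is involved.

```python
import json

# Hypothetical schema derived from a data model: field name -> Python type.
SCHEMA = {"claim_id": int, "amount": float}


def conform(record, schema):
    """Coerce a record to the schema; raises KeyError or ValueError if it does not fit."""
    return {name: typ(record[name]) for name, typ in schema.items()}


def write_with_schema(store, record):
    """Schema-at-write: conform before storing, so the store holds only conforming rows."""
    store.append(conform(record, SCHEMA))


def write_raw(store, line):
    """Schema-at-read, write side: keep the raw text untouched."""
    store.append(line)


def read_with_schema(store):
    """Schema-at-read, read side: apply the schema only when the data is queried."""
    return [conform(json.loads(line), SCHEMA) for line in store]


structured, raw = [], []
write_with_schema(structured, {"claim_id": "10", "amount": "120.5"})
write_raw(raw, '{"claim_id": "11", "amount": "80", "note": "extra field kept in raw data"}')

print(structured)
print(raw)
print(read_with_schema(raw))
```

In the schema-at-read case the raw line keeps its extra field, and a different schema could be applied to the same stored data later.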
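[Diagram: the Data Lake reference architecture with the Landing Area Zone highlighted and callouts 1 and 2 marking the steps. Other labels: Deep Data; Operational History; Reporting Data Marts; Sandboxes; Descriptive Data; Catalog; Business Users; Data Scientists; Analysis level Models (IDA); Design level Models (IDA)]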
Summary Picture
[Diagram: the Data Lake reference architecture overlaid with the models. Labels:
Models: Business Vocabulary; Logical Model Atomic; Logical Model Dimensional; Physical Model Hadoop; Physical Model RDBMS; Physical Model Dimensional
Architecture: Enterprise IT, Enterprise IT Interaction, Information Integration & Governance, Descriptive Data, Catalog, Shared Operational Data, Asset Hub, Harvested Data, Historical Data, Deep Data, Operational History, Information Warehouse, Reporting Data Marts, Sandboxes
Users: Business Users, Data Scientists]
Mappings to inform common Business Meaning using the Business Vocabulary in IGC
Generation of Technical Structure using the ER Data Models in an ER tool (e.g. IDA)
Legend
Use of Business Vocabulary to understand Business Meaning by Users
• The Business Vocabulary Terms in IGC can be used to enforce common business meaning throughout the Data Lake landscape
• The output of the various Logical Models can be used to define the technical structure of assets in the lake that need to be created where a predefined schema is required (e.g. schema-at-write)
Three different lifecycles relating to the evolution of the models with the Data Lake
• Maintenance of the Business Language (cycle: Analysis, Refine, Deploy, Review, Requirement; artifacts: AR, BT, SG): the use of the Industry Models Business Vocabularies to enable a common Business meaning of language by all Data Lake users
• Development of the ER/UML Models (cycle: Analysis, Design, Generate, Review, Requirement; artifacts: BDM, AWM, DWM): the use of the ER and UML models to enforce a common structure of artifacts where required in the Data Lake
• Management of the runtime production environment (artifacts: BT, AWM (Physical), DWM (Physical); also shown: Data Lake Catalog, Data Lake Repositories, Data, Data Lake Users): the use of the Industry Models Business Vocabularies and derived physical assets in the creation and ongoing management of the Data Lake
Legend
BT - Business Terms
AR - Analytical Requirements
SG - Supportive Glossaries
BDM - Business Data Model
AWM - Atomic Warehouse Model
DWM - Dimensional Warehouse Model
The Repositories used by the Data Lake Lifecycles
[Diagram labels:
Environments: Business Language Environment, Modelling Environment, Runtime Data Lake Environment
Repositories: IGC Dev Repository; IGC Production Repository; Collaboration/Versioning Repository (e.g. RTC); Data Lake Catalog; Data Lake Repositories; Data Repositories RDBMS; Data Repositories HDFS
Tools and interfaces: IGC Browser, IGC for Eclipse, IGC Anywhere/REST, IGC Workflow, IDA, IMAM, IMAM IDA Import
Artifacts: Physical Data Model]
Lifecycle 1 - Maintaining the Business Language of the Data Lake
Objective: The creation and ongoing maintenance of the common Business Language to be used by all users to describe the various components of the Data Lake or underpin the Data Lake
Roles Involved: Business user reps, Business SMEs, Business Language Stakeholders
[Diagram: Maintenance of the Business Language cycle (Analysis, Refine, Deploy, Review, Requirement) with artifacts AR, BT, SG]
Considerations:
• Determining the needs of the different users of the Data Lake (different uses, need for different dialects, amount of technical metadata in the Language)
• Determining the approach to building the business language, and the overall flow for creation, promotion and maintenance of terms
• Defining the specific glossary suitable for pure business users versus Business Analysts, Data Scientists, Data Modellers and IT staff
• Determining the role of the IBM Industry Models in building out the Business Language
Lifecycle 2 - Developing the technical Models
Objective: The use of the ER and UML models to enforce a common structure of artifacts where required in the Data Lake
Roles Involved: Modellers, Business SMEs
[Diagram: Development of the ER/UML Models cycle (Analysis, Design, Generate, Review, Requirement) with artifacts BDM, AWM, DWM]
Considerations:
• Ensuring the appropriate communications between the Data Modellers and the Business Users
• Determining when to use and when not to use Data Models for the Data Lake repositories
• Determining the ongoing use of a Canonical Platform-Independent Logical Model as a basis for the deployment of the different types of platform-specific physical Models required across the Data Lake Repositories
• Determining the specific data modelling approaches and scenarios for deploying to the different Data Lake repositories
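The consideration about a canonical platform-independent logical model can be illustrated with a small Python sketch that generates two platform-specific table definitions from one logical entity. The LOGICAL_CLAIM entity, its attributes, the TYPE_MAP mappings and the physical_ddl helper are all assumptions made for this example; modelling tools such as IDA perform this kind of transformation with far richer rules.

```python
# Hypothetical logical entity: attribute name -> platform-independent type.
LOGICAL_CLAIM = {
    "claim_id": "integer",
    "status": "text",
    "amount": "decimal",
    "claim_date": "date",
}

# Assumed type mappings per target platform; a real project would tune these.
TYPE_MAP = {
    "rdbms": {"integer": "INTEGER", "text": "VARCHAR(255)",
              "decimal": "DECIMAL(18,2)", "date": "DATE"},
    "hive": {"integer": "INT", "text": "STRING",
             "decimal": "DECIMAL(18,2)", "date": "DATE"},
}


def physical_ddl(table, entity, platform):
    """Generate a platform-specific CREATE TABLE statement from the logical entity."""
    types = TYPE_MAP[platform]
    cols = ",\n".join(f"  {name} {types[ltype]}" for name, ltype in entity.items())
    return f"CREATE TABLE {table} (\n{cols}\n)"


rdbms_ddl = physical_ddl("claim", LOGICAL_CLAIM, "rdbms")
hive_ddl = physical_ddl("claim", LOGICAL_CLAIM, "hive")
print(rdbms_ddl)
print(hive_ddl)
```

Both physical definitions share attribute names and order because they come from the same logical entity; only the platform types differ.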
Lifecycle 3 - Deploying the Models into the runtime Data Lake environment
Objective: The use of the Industry Models Business Vocabularies and derived physical assets in the creation and ongoing management of the Data Lake
Roles Involved: Business user reps, Modellers, Data Lake Ops staff
[Diagram: Management of the runtime production environment. Labels: BT, AWM (Physical), DWM (Physical), Data Lake Catalog, Data Lake Repositories, Data, Data Lake Users]
Considerations:
• Determining how to deploy the Business Language for optimal use by the different Data Lake users (managing access to the different terms, handling of ongoing updates)
• Determining the strategy for the ongoing association of the Business Terms with Data Assets (which users tag new data elements with the Business Language, and when)
• Determining the approach for the Data Lake ops staff to deploy the physical Data Models, and how feedback to the Data Modellers is handled
• Determining how to incorporate the Data Model artifacts into the ongoing Data Lake governance aspects
Industry Models Hadoop deployment example – low level
[Diagram labels:
Sample Source Data: Claim File, Patient Information File
Data Transformation Process (Hive, Spark, Pig, ETL, ..), shown once per source file
HDFS: /data/udmh/patient/<date>/<version>/.. Data files.. and /data/udmh/claim/<date>/<version>/.. Data files..
Hive Metastore: Patient party ext Table, Claim ext Table
Access: HIVE, Vendor SQL for Hadoop interface, Downstream Data Transformation processes
Logical Data Model: Patient, Claim, Patient / Claim
Physical Data Model: Patient, Claim
Callouts 1, 2 and 3 mark the three possible deployment paths]
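To make the external-table part of this example more concrete, here is a minimal Python sketch that builds a Hive CREATE EXTERNAL TABLE statement over files under the /data/udmh/<entity>/<date>/<version>/ layout shown in the diagram. The column list, the comma-delimited text file format, the sample date and version values, the table naming and the hive_external_table_ddl helper are all assumptions for illustration; the slide does not specify them.

```python
def hive_external_table_ddl(entity, columns, date, version):
    """Build a Hive CREATE EXTERNAL TABLE statement over files already in HDFS.

    The table only describes the files; dropping it leaves the data in place.
    """
    location = f"/data/udmh/{entity}/{date}/{version}"
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {entity}_ext (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical columns for the claim files; real columns would come from
# the physical data model.
ddl = hive_external_table_ddl(
    "claim",
    [("claim_id", "INT"), ("patient_id", "INT"), ("claim_amount", "DECIMAL(18,2)")],
    date="2017-01-31",
    version="v1",
)
print(ddl)
```

Generating the statement from the model, rather than writing it by hand, keeps the Hive Metastore definitions aligned with the physical data model as new dated or versioned directories arrive.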
Mapping of incoming new structures in the Data Lake
[Diagram labels:
Environments: Runtime Data Lake Environment
Repositories: IGC Dev Repository; IGC Production Repository; Data Lake Catalog; Data Lake Repositories; Data Repositories RDBMS; Data Repositories HDFS
Tools and interfaces: IDA, IGC for Eclipse, IGC Anywhere/REST, IGC Browser, IGC Workflow, IMAM, IMAM IDA Import
Artifacts: Physical Data Model, New HDFS Structure
Callouts 1, 2a, 2b and 2c mark the mapping options]
Question: what are the best practices for the “bottom-up” mapping of a new structure in the data lake that was not originally derived from a Data Model?
1. Direct mapping from the Physical Asset to the appropriate Term in the Catalog
2. Indirect mapping via a specifically created data model (actual mapping done either via BGE or in BG Browser)
   a. Reverse engineer a new model from the HDFS Structure
   b. Import the Data Model into the Catalog
   c. Import the mappings into the Catalog from IDA (is the mapping done in IDA via BGE?)
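Option 1 above, direct mapping from a physical asset to a catalog term, could be supported by a simple name-matching pass that proposes candidate terms for review. The Python sketch below is only one possible approach, not a description of how IGC works: the TERMS list, the sample column names, and the normalise and suggest_terms helpers are hypothetical.

```python
import re

# Hypothetical catalog terms; in practice these would come from the catalog.
TERMS = ["Patient", "Claim", "Claim Amount", "Date Of Birth"]


def normalise(name):
    """Lower-case a name and split it into words on non-alphanumeric characters."""
    return tuple(w for w in re.split(r"[^a-z0-9]+", name.lower()) if w)


def suggest_terms(columns, terms):
    """Suggest a term for each column whose normalised name matches exactly, else None."""
    index = {normalise(t): t for t in terms}
    return {col: index.get(normalise(col)) for col in columns}


suggestions = suggest_terms(["claim_amount", "DATE-OF-BIRTH", "pat_id"], TERMS)
print(suggestions)
```

Columns left without a suggestion, such as pat_id here, would still need a manual decision, which is where the indirect route through a reverse-engineered model may help.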
Model artifacts in the Data Lake Runtime environment – main usage patterns
There are three main ways in which the data model artifacts are used in or impact the Data Lake runtime environment
• Industry Model artifacts are deployed into the Data Lake runtime environment
   • Most likely as an output from the two lifecycles “Maintaining the Business Language” and “Developing the Technical Models”
• Industry Model artifacts deployed in the Data Lake are used by and affected by Data Lake users
   • For example, Data Lake users provide feedback on changes/corrections/additions to the model artifacts
• Industry Model artifacts deployed in the Data Lake are impacted by new or changed data coming into the Data Lake Repositories
   • The most obvious example is the need for new mappings to a new or changed Repository brought into the Data Lake
REFERENCE MATERIAL
New Information Architectures and Capabilities
Designing and Operating a Data Reservoir
Description of the behaviour and processes that make up a data reservoir (IBM’s Data Lake)
Blog
• 5 things to know about a data reservoir: https://www.ibm.com/developerworks/community/blogs/5things/entry/5_things_to_know_about_data_reservoir?lang=en
Redbook
• http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248274.html?Open
IBM Industry Models and Data Lake publications so far:
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14877USEN
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14872USEN
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14911IEEN&