data-ed webinar: design & manage data structures
TRANSCRIPT
Data structures enable you to store and organize data so that it can be used efficiently. But how do you know which one to apply? There is a difference between structuring master data, reference data and analytics data. This webinar will discuss the various data structures available and when to use each one. We will show how data structures should support your organizational strategy and how each method can contribute to business value.
Learning Objectives:
• Application of correct data structures to fit business needs
• How different structures create different business value
Date: July 8, 2014
Time: 2:00 PM ET
Presented by: Dave Marsh & Peter Aiken
Copyright 2013 by Data Blueprint
Welcome: Design/Manage Data Structures
Get Social With Us!
• Like Us on Facebook: www.facebook.com/datablueprint
– Post questions and comments
– Find industry news, insightful content and event updates
• Join the Group: Data Management & Business Intelligence
– Ask questions, gain insights and collaborate with fellow data management professionals
• Live Twitter Feed: Join the conversation!
– Follow us: @datablueprint @paiken
– Ask questions and submit your comments: #dataed
Presented by Dave Marsh & Peter Aiken, Ph.D.
Design & Manage Data Structures
Macro Level
Your Presenters
Dave Marsh
• Lead Data Consultant, Data Blueprint
• 30+ years experience designing and building solutions for the private and public sectors
• Architecture/Design experience in:
– Transactional processing
– Shop floor automation
– Data Warehousing
– Identity Management
– Mobile
Peter Aiken
• 30+ years DM experience
• 9 books/many articles
• Experienced with 500+ data management practices
• Multi-year immersions: US DoD, Nokia, Deutsche Bank, Wells Fargo, & Commonwealth of VA
Outline: Design/Manage Data Structures
• Context: Data Management/DAMA/DM BoK/CDMP?
• What is a data structure?
• Structured data storage, a bit of history and context
• Why are data structures important?
• Data Personas/Usage (interest over time)
• Data Topology and alignment to the data audience
• Internal data structures to fit the needs
• Q & A?
Maslow's Hierarchy of Needs
You can accomplish Advanced Data Practices without becoming proficient in the Basic Data Management Practices. However, this will:
• Take longer
• Cost more
• Deliver less
• Present greater risk
Data Management Practices Hierarchy
Advanced Data Practices
• MDM
• Mining
• Big Data
• Analytics
• Warehousing
• SOA
Basic Data Management Practices
• Data Program Management
• Data Stewardship
• Data Development
• Data Support Operations
• Organizational Data Integration
Data Management is an Integrated System of Five Practice Areas
[Diagram labels: Organizational Strategies, Goals, Direction, Guidance, Feedback, Data Program Coordination, Organizational Data Integration, Data Stewardship, Data Development, Data Support Operations, Standard Data, Business Data, Integrated Models, Application Models & Designs, Implementation, Data Asset Use, Business Value]
• Leverage data in organizational activities
• Data management processes and infrastructure
• Combining multiple assets to produce extra value
• Organizational-entity subject area data integration
• Provide reliable data access
• Achieve sharing of data within a business area
Five Integrated DM Practice Areas
• Data Program Coordination: Manage data coherently.
• Organizational Data Integration: Share data across boundaries.
• Data Stewardship: Assign responsibilities for data.
• Data Development: Engineer data delivery systems.
• Data Support Operations: Maintain data availability.
DAMA DM BoK & CDMP Data Management Functions
• Published by DAMA International
– The professional association for Data Managers (40 chapters worldwide)
– DMBoK organized around:
  • Primary data management functions focused around data delivery to the organization (more at dama.org)
  • Several environmental elements
• CDMP – Certified Data Management Professional
– DAMA International and ICCP
– Membership in a distinct group made up of your fellow professionals
– Recognition for your specialized knowledge in a choice of 17 specialty areas
– Series of 3 exams
– For more information, please visit:
  • http://www.dama.org/i4a/pages/index.cfm?pageid=3399
  • http://iccp.org/certification/designations/cdmp
What is a data structure?
• "An organization of information, usually in memory, for better algorithm efficiency, such as queue, stack, linked list, heap, dictionary, and tree, or conceptual unity, such as the name and address of a person. It may include redundant information, such as length of the list or number of nodes in a subtree." (http://www.nist.gov/dads/HTML/datastructur.html)
• Some data structure characteristics
– Grammar for data objects
  • Grammar is the principles or rules of an art, science, or technique, as in "a grammar of the theater"
– Constraints for data objects
– Sequential order
– Uniqueness
– Arrangement
  • Hierarchical, relational, network, other
– Balance
– Optimality
How are data structures expressed as architectures?
• Details are organized into larger components
• Larger components are organized into models
• Models are organized into architectures
[Diagram: components labeled A, B, C and D]
How are data structures expressed as architectures?
• Attributes are organized into entities/objects
– Attributes are characteristics of "things"
– Entities/objects are "things" whose information is managed in support of strategy
– Examples
• Entities/objects are organized into models
– Combinations of attributes and entities are structured to represent information requirements
– Poorly structured data constrains organizational information delivery capabilities
– Examples
• Models are organized into architectures
– When building new systems, architectures are used to plan development
– More often, data managers do not know what existing architectures are and therefore cannot make use of them in support of strategy implementation
– Why no examples?
A Model Specifying Relationships Among Important Terms
[Built on definition by Dan Appleton 1983]
[Diagram labels: Fact, Meaning, Data, Request, Information, Strategic Use, Intelligence]
1. Each FACT combines with one or more MEANINGS.
2. Each specific FACT and MEANING combination is referred to as a DATUM.
3. An INFORMATION is one or more DATA that are returned in response to a specific REQUEST.
4. INFORMATION REUSE is enabled when one FACT is combined with more than one MEANING.
5. INTELLIGENCE is INFORMATION associated with its USES.
Wisdom & knowledge are often used synonymously
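The fact/meaning/datum/information definitions above can be illustrated with a minimal Python sketch. The class and function names (Datum, make_data, answer_request) and the sample values are hypothetical, chosen only for illustration; they are not part of the presenters' model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datum:
    """One specific FACT paired with one MEANING."""
    fact: str
    meaning: str


def make_data(fact, meanings):
    """Combine one FACT with one or more MEANINGS: one DATUM per pair."""
    return [Datum(fact, meaning) for meaning in meanings]


def answer_request(data, requested_meaning):
    """INFORMATION: the DATA returned in response to a specific REQUEST."""
    return [d for d in data if d.meaning == requested_meaning]


# One fact carried under two meanings, which is what enables reuse.
data = make_data("23060", ["customer postal code", "sales region key"])
information = answer_request(data, "sales region key")
print(information)
```

The same stored fact answers two different requests, which is the reuse described in item 4.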
History (such as it is)
• Automate existing manual processing
• Data management was:
– Running millions of punched cards through banks of sorting, collating & tabulating machines
– Results printed on paper or punched onto more cards
– Data management meant physically storing and hauling around punched cards
• Tasks (check signing, calculating, and machine control) were implemented to provide automated support for departmental-based processing
• Creating information silos
• Data Processing Manager
Chief Information Officer
CFO Necessary Prerequisites/Qualifications
• CPA
• CMA
• Masters of Accountancy
• Other recognized degrees/certifications
• These are necessary but insufficient prerequisites/qualifications
CIO Qualifications
• No specific qualifications
• Typically technological fields:
– Computer science
– Software engineering
– Information systems
• Business
– Master of Business Administration
– Master of Science in Management
• Business acumen and strategic perspectives have taken precedence over technical skills.
– CIOs appointed from the business side of the organization
  • Especially if they have project management skills.
What do we teach knowledge workers about data?
What percentage of them deal with it daily?
Data Leverage
[Diagram labels: Less ROT, Technologies, Process, People]
• Permits organizations to better manage their sole non-depletable, non-degrading, durable, strategic asset (data)
– within the organization, and
– with organizational data exchange partners
• Leverage
– Obtained by implementation of data-centric technologies, processes, and human skill sets
– Increased by elimination of data ROT (redundant, obsolete, or trivial)
• The bigger the organization, the greater the potential leverage
• Treating data more asset-like simultaneously
1. lowers organizational IT costs and
2. increases organizational knowledge worker productivity
Data Structure Questions
[Diagram labels: Program D, Program E, Program F, Program G, Program H, Program I, Application domain 2, Application domain 3]
• Who makes decisions about the range and scope of common data usage?
Running Query
Optimized Query
Repeat 100s, thousands, millions of times ...
Death by 1000 Cuts
5 Basic Data Structures
• Flat File: Records are typically sorted according to some criteria and must be searched from the beginning for each access
– Program: Must start at the beginning and read each record when looking for person "Townsend"
• Indexed Sequential File: Built-in index permits location of records of persons with last names starting with "T"
– Program: Where is the record for person "Townsend"?
– Index: Start looking here where the "Ts" are stored
• Hierarchical Database: Records are related to each other hierarchically using 'parent child' relationships
• Network Database: Records are related to each other using arranged master records associated with multiple detail records using linked lists and pointers
• Relational Database: Records are related to each other using relationships describable using relational algebra
Other labels shown: Associative, Concept-oriented, Multi-dimensional, XML database, 3NF, Star schema, Data Vault
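The difference between the flat file and the indexed sequential file can be sketched in Python. This is only an illustration under simplifying assumptions: the records are a hypothetical in-memory list of last names, and the "index" is a plain dictionary from first letter to starting position rather than a real file index.

```python
# Hypothetical records, sorted by last name as in a sorted flat file.
records = ["Adams", "Baker", "Marsh", "Taylor", "Townsend", "Young"]


def flat_file_search(name):
    """Start at the beginning and read each record; return (position, reads)."""
    for reads, record in enumerate(records, start=1):
        if record == name:
            return reads - 1, reads
    return None, len(records)


# Index: first letter -> position of the first record starting with it.
index = {}
for position, record in enumerate(records):
    index.setdefault(record[0], position)


def indexed_search(name):
    """Jump to where the name's first letter is stored, then read forward."""
    start = index.get(name[0])
    if start is None:
        return None, 0
    reads = 0
    for position in range(start, len(records)):
        reads += 1
        if records[position] == name:
            return position, reads
        if records[position][0] != name[0]:
            break
    return None, reads


print(flat_file_search("Townsend"))  # position and number of records read
print(indexed_search("Townsend"))
```

Both searches find the same record; the indexed version simply reads fewer records because it starts where the "Ts" are stored.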
Data structures organized into an Architecture
• How do data structures support organizational strategy?
• Consider the opposite question:
– Were your systems explicitly designed to be integrated or otherwise work together?
– If not, then what is the likelihood that they will work well together?
– In all likelihood your organization is spending between 20-40% of its IT budget compensating for poor data structure integration
– They cannot be helpful as long as their structure is unknown
• Two answers/two separate strategies
– Achieving efficiency and effectiveness goals
– Providing organizational dexterity for rapid implementation
No Single Data Store
[Image label: Single Data Store]
• The thought of a single monolithic data store which can service all of an organization’s information needs has long since been abandoned. In the modern data management topology, multiple data stores are created to service specific processing needs and user groups within the organization.
• Implications:
– The needs and characteristics of the many audiences served by the data structures
– Data lifecycle
– The design styles (old and new) utilized to organize the data to service the audiences
– A breakdown of the various stores
– The resultant store characteristics
Conclusions
• 1 is not enough
• Most organizations have far too many different data structures, and they become barriers to progress and integration
• Not much expertise is available to figure out these challenges
Data Personas (The Requirements)
• Operational Performer: Interested in alerts, notifications and reporting based on current-value (real-time) data. They use the information to make decisions and changes in the transactional systems. These changes are targeted to improve the organization's ability to deliver in the short term.
• Operational Analyst (Manager): Interested in aggregated real-time data for their domain of responsibility. The data is displayed using visualization techniques such as scorecards, charts and reports, preferably within a single dashboard. They search for favorable/unfavorable trends that indicate adjustments are needed in staff and resource allocations.
• Data Analyst: Responsible for supporting detailed and typically complex analysis requests from business users/consumers of data. The analyst role spans both the operational and historical time windows, so analysts need to be versed in both the operational and analytic environments.
• Data Miner/Scientist: Responsible for using statistical and machine learning techniques to identify patterns in the data. These patterns are correlated into insights and actions for better business outcomes. The miner may use operational and historical data for research.
• Executive Consumer: Receives the data through summary dashboards with drill-down/drill-through capabilities. Requests detailed analysis and reporting on High Value Questions from the Data Analysts and Data Miners. These consumers look at the data to make short- and long-term decisions to improve organizational efficiency and customer experience.
[Diagram labels: Operational, Analytic]
Persona Data Interest
• Operational interest is high when data is introduced to the operational stores. This interest wanes over time.
• Analytic interest is low when data is first introduced. The interest increases as the data is collected and combined with other enterprise data.
[Chart: Interest (vertical axis) against Time (horizontal axis), with two curves: Operational Interest and Analytic Interest]
Data Topology Today
Data Store Purpose: a review of the Data Topology
• Master Data
– Master Data is the term used to describe the data domains that drive business activities. Master data is the data that must first be in place before business transactions can occur. Master data is often shared across the organizational business units and it is typically at the center of business strategies. The transaction defines the business/process event (order, dispatch, sales) while the Master Data describes the ‘who’ (customers, drivers, account reps), the ‘what’ (load), the ‘when’ (date, time) and the ‘where’ (origin and destination location).
• Online Transaction Processing (OLTP)
– “Transactional data” is the term used to describe the data involved in the execution of the business activities. Transactional data associates master data (e.g. customers and products) with a business activity that often represents a unit of work, such as the creation of an order.
• The Master Data and OLTP stores are where data is initially created and persisted within the organization’s data, and thus they carry a special classification of System of Record (SOR). They are created to capture the transactional data as it arrives and make the data available for the processes and services. The data arrives in these databases through manual entry or automated feeds. These data stores are logically (and sometimes physically) separated by the transactional subject area they are created to serve.
[Diagram labels: OLTP 1, OLTP 2, OLTP n..., Master Data]
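The relationship between master data and transactional data described above can be sketched as follows. This is a minimal illustration, not a design from the webinar: the Customer, Location and Dispatch classes, the record_dispatch helper and all sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Customer:            # master data: the 'who'
    customer_id: int
    name: str


@dataclass(frozen=True)
class Location:            # master data: the 'where'
    location_id: int
    city: str


@dataclass(frozen=True)
class Dispatch:            # transactional data: the business event
    dispatch_id: int
    customer_id: int
    origin_id: int
    destination_id: int
    dispatched_at: datetime


customers = {1: Customer(1, "Acme Freight")}
locations = {10: Location(10, "Richmond"), 20: Location(20, "Norfolk")}


def record_dispatch(dispatch, log):
    """Accept a transaction only if the master data it refers to is in place."""
    if dispatch.customer_id not in customers:
        raise KeyError(f"unknown customer {dispatch.customer_id}")
    for location_id in (dispatch.origin_id, dispatch.destination_id):
        if location_id not in locations:
            raise KeyError(f"unknown location {location_id}")
    log.append(dispatch)


dispatch_log = []
record_dispatch(Dispatch(100, 1, 10, 20, datetime(2014, 7, 8, 14, 0)), dispatch_log)
```

The transaction stores only keys; the descriptive detail lives once in the master data it points to.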
Data Store Purpose: a review of the Data Topology
• Operational Data Store (ODS)
– An Operational Data Store (ODS) is created to integrate data from two or more SORs. The ODS is normally created to satisfy reporting needs across functional SOR boundaries. The ODS should hold very little historical information and should focus on maintaining the most up-to-date data needed by the organization for daily operations. Depending on the application requirements, the ODS may institute a near real-time data feed from the source applications. The ODS is expected to be technically accurate and is considered to be an Authoritative Source. The data it contains can be used for non-critical needs instead of having to access the SOR. The more frequently the data is pushed into the ODS environment, the less reliance there will be on direct access to SORs for data reporting needs.
• Enterprise Data Warehouse (EDW)
– An Enterprise Data Warehouse (EDW) is responsible for the collection and integration of data from either SORs or the Operational Data Store. An EDW has an enterprise scope as it will pull from many (if not all) SORs. The focus of the data warehouse is to be historical in nature, and in many instances it is loaded with a latency (every 24 hours). The data warehouse is created to support historical analytics. The data warehouse is expected to be exhaustive in the data it collects, with the focus on collecting and storing the data.
[Diagram labels: Operational Data Store (ODS), Enterprise Data Warehouse (EDW)]
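The contrast between an ODS that keeps only the most up-to-date value and a warehouse that accumulates history can be sketched in Python. This is a simplified illustration: the names ods, edw, apply_change and value_as_of are hypothetical, and both stores are updated in the same call, whereas a real warehouse would typically be loaded later in a batch.

```python
from datetime import date

ods = {}   # key -> latest value only (current valued)
edw = []   # append-only (key, value, as_of) rows (historical)


def apply_change(key, value, as_of):
    """Overwrite the current value in the ODS and append a history row to the EDW."""
    ods[key] = value
    edw.append((key, value, as_of))


def value_as_of(key, when):
    """Answer 'what was it on this date?' from the warehouse history."""
    rows = [(as_of, value) for k, value, as_of in edw if k == key and as_of <= when]
    return max(rows)[1] if rows else None


apply_change("order-1", "created", date(2014, 7, 1))
apply_change("order-1", "shipped", date(2014, 7, 3))

print(ods["order-1"])                              # the current status
print(value_as_of("order-1", date(2014, 7, 2)))    # the status on an earlier date
```

The ODS can only answer what the status is now; the history rows are what let the warehouse answer what it was on an earlier date.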
Data Store Purpose: a review of the Data Topology
• Data Marts
– A Data Mart is a subset of a data warehouse. It is created to address specific questions and/or a subject area of questions. A Data Mart is built and tuned to deliver the data to the end users; it exists to get the data out of the data warehouse.
[Diagram label: Data Mart]
Data Store Purpose: a review of the Data Topology
• Event Data Store
– The data store which logs, stores and reports the discrete business and technical events which occur within the process. This data store is a critical and often overlooked data domain for managing, controlling and creating transparency into the business processes. The events are used to report the overall health of the processes in both business and technical terms. This consolidated solution is key to obtaining a 360-degree view of the processes.
• Metadata Store
– Metadata is a broad term which includes descriptive elements in both business and technical terms. It covers business terms, data element descriptions, element display formats, element valid values, element quality targets, etc. Metadata is critical to an organization as it describes the organization’s business and processing infrastructure in detail. Metadata is entertainingly defined as “data about the data”. That is, Metadata characterizes other data and makes it easier to retrieve, interpret and use information.
[Diagram labels: Metadata Store (Business Metadata, Technical Metadata), Event Data Store (Bus OPS Events, Tech OPS Events)]
Data Store Characteristics
Operational, in contrast with Analytic:
• Subject-Oriented (operational): Databases which are focused on a single or small set of business functions
– Integrated (analytic): Collecting and semantically aligning data from disparate sources to achieve a homogeneous view
• Volatile (operational): Data which may change frequently
– Non-Volatile (analytic): Data which, once entered into the database, will not change
• Atomic (operational): Low-grain data; each transaction, each order with all of the attributes
– Aggregate (analytic): A summary of multiple orders or transactions performed to transform the atomic detail into more comprehensible information
• Current Valued (operational): The data and the system represent what is current in this moment; not yesterday, not last week, but now
– Time Variant (analytic): Data is marked and stored with a date/time element so that questions of what it was yesterday and last week can be answered
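As a small illustration of the atomic versus aggregate distinction, the sketch below summarizes order-level rows into one row per customer. The order values and the aggregate_by_customer helper are hypothetical and serve only to show the transformation.

```python
from collections import defaultdict

# Atomic grain: one row per order, with its attributes (hypothetical values).
orders = [
    {"order_id": 1, "customer": "A", "amount": 120.0},
    {"order_id": 2, "customer": "B", "amount": 80.0},
    {"order_id": 3, "customer": "A", "amount": 50.0},
]


def aggregate_by_customer(rows):
    """Summarize atomic order rows into one aggregate row per customer."""
    totals = defaultdict(lambda: {"orders": 0, "amount": 0.0})
    for row in rows:
        entry = totals[row["customer"]]
        entry["orders"] += 1
        entry["amount"] += row["amount"]
    return dict(totals)


summary = aggregate_by_customer(orders)
print(summary)
```

The aggregate is easier to read but cannot be turned back into individual orders, which is why the atomic detail is kept as well.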
Data Structure Design Styles
• 3rd Normal Form (3NF) – Inmon
• Dimensional – Kimball
• Data Vault – Linstedt
Design Styles – 3NF
• 3rd Normal Form Modeling
• A mathematical data design technique founded in the early 70s by E.F. Codd.
• Organizes data in simple rows and columns (entities)
• Creates connections between the entities, called relationships, to show how the data is inter-related
• In its purest form, 3NF removes all data redundancies: a piece of data is stored only once
• 3NF is based on mathematics: given the same facts, different modelers should produce the same model.
• Creates a visual (Entity-Relationship Diagram, ERD) which may be understood by less technical personnel
• 3NF is the modeling style most popularly used for operationally focused data stores.
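The "stored only once" property of 3NF can be shown with a small sketch using Python's built-in sqlite3 module. The customer and customer_order tables and their contents are hypothetical examples, not taken from the webinar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);
INSERT INTO customer VALUES (1, 'Acme Freight', 'Richmond');
INSERT INTO customer_order VALUES (100, 1, '2014-07-08');
INSERT INTO customer_order VALUES (101, 1, '2014-07-09');
""")

# The city is stored once, so a change is one update, not one per order.
conn.execute("UPDATE customer SET city = 'Glen Allen' WHERE customer_id = 1")

rows = conn.execute("""
    SELECT o.order_id, c.city
    FROM customer_order AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)
```

Because orders refer to the customer by key, every order sees the corrected city after a single update.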
Design Styles – Dimensional Modeling
• A data design approach created and refined by Ralph Kimball in the 80s
• Organizes data in Facts and Dimensions
– Fact tables record the events (what) within the business domain
– Dimension tables describe who, when, how and where
• Created to exploit the capabilities of the relational database to retrieve and report against large volumes of data.
• There are 2 variations of Dimensional Modeling:
– Star Schema
– Snowflake
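A star schema can be sketched in plain Python with one fact table surrounded by dimension tables. The table names (fact_sales, dim_customer, dim_date), the keys and the sales_by helper are hypothetical and only illustrate the fact/dimension split described above.

```python
from collections import defaultdict

# Dimension tables: descriptive context (who, when), keyed by surrogate key.
dim_customer = {
    1: {"name": "Acme Freight", "segment": "Retail"},
    2: {"name": "Blue Line", "segment": "Wholesale"},
}
dim_date = {
    20140708: {"month": "2014-07"},
    20140812: {"month": "2014-08"},
}

# Fact table: one row per event (what), holding keys and measures only.
fact_sales = [
    {"customer_key": 1, "date_key": 20140708, "amount": 100.0},
    {"customer_key": 2, "date_key": 20140708, "amount": 250.0},
    {"customer_key": 1, "date_key": 20140812, "amount": 40.0},
]


def sales_by(dimension, attribute, key_column):
    """Total the fact measure grouped by one attribute of one dimension."""
    totals = defaultdict(float)
    for row in fact_sales:
        totals[dimension[row[key_column]][attribute]] += row["amount"]
    return dict(totals)


print(sales_by(dim_customer, "segment", "customer_key"))
print(sales_by(dim_date, "month", "date_key"))
```

The same fact rows answer questions along any dimension, which is what makes the layout convenient for reporting.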
Design Styles – Data Vault
• Newest of the relational database modeling techniques.
• Conceived in the 1990s by Dan Linstedt
• Focuses on linking the data from multiple disparate locations without forcing the data to be semantically aligned
NOTE: There is a Data Ed presentation scheduled for 14 October 2014 to cover the details of Data Vault designs!
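The slide does not describe Data Vault's internal tables, so the sketch below goes beyond it: it assumes the hub and satellite constructs commonly used in Data Vault modeling and shows, in a very simplified form, how records from two sources can be linked by business key while each source keeps its own attribute names. All names and values are hypothetical.

```python
from datetime import date

hub_customer = {}   # business key -> where and when the key was first seen
sat_customer = []   # descriptive attributes, stored as delivered by each source


def load_customer(business_key, source, attributes, load_date):
    """Register the business key once; keep every source's attributes as-is."""
    hub_customer.setdefault(
        business_key, {"record_source": source, "load_date": load_date}
    )
    sat_customer.append(
        {
            "business_key": business_key,
            "record_source": source,
            "load_date": load_date,
            "attributes": dict(attributes),
        }
    )


# Two systems describe the same customer with different field names.
load_customer("C-001", "billing", {"cust_name": "Acme Freight"}, date(2014, 7, 1))
load_customer("C-001", "crm", {"fullName": "ACME Freight Inc."}, date(2014, 7, 2))

print(len(hub_customer), len(sat_customer))
```

Nothing is renamed or reconciled at load time; alignment of the two descriptions is left to downstream consumers.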
Summary/Take Aways
OPERATIONAL
• Master Data
– Audience served: Operations Manager, Operational Analyst
– Build characteristics: Subject Oriented, Volatile, Atomic, Current Valued
– Design style: 3NF
• OLTP
– Audience served: Operational Performer, Operations Manager
– Build characteristics: Subject Oriented, Volatile, Atomic, Current Valued
– Design style: 3NF
• ODS
– Audience served: Operational Manager, Operational Analyst, Executive Consumer
– Build characteristics: Integrated, Volatile, Atomic, Current Valued
– Design style: 3NF
• Event
– Audience served: All Personas
– Build characteristics: Integrated, Volatile, Atomic, Current Valued
– Design style: 3NF
ANALYTIC
• Data Warehouse
– Audience served: Data Miner/Scientist
– Build characteristics: Integrated, Non-volatile, Atomic, Time Variant
– Design style: 3NF trending to Data Vault
• Data Mart
– Audience served: Operational Analyst, Data Analyst, Executive Consumer
– Build characteristics: Subject Oriented, Non-volatile, Atomic -or- Aggregated, Time Variant
– Design style: Dimensional
Questions to Ask
• Are you ready for a data warehouse?
• Foundational Practices
• Is the business environment constantly evolving?
• Will you get it right the first time?
• Do you have an agreed-upon enterprise-wide vocabulary?
• Is your data warehouse intended to be the enterprise auditable system of record?
• Extract, Transform and Load
• Data Transformations
• How fast do you need results?
• Performance of inserts vs reads
• Project deliverables
Upcoming Events
August Webinar: Data Management Maturity
August 12, 2014 @ 2:00 PM ET/11:00 AM PT
Sign up here:
• www.datablueprint.com/webinar-schedule
• www.Dataversity.net
Questions?
Why Architectural Models?
• Would you build a house without an architecture sketch?
– A model is the sketch of the system to be built in a project.
• Would you like to have an estimate of how much your new house is going to cost?
– Your model gives you a very good idea of how demanding the implementation work is going to be!
• If you hired a set of builders from all over the world to build your house, would you like them to have a common language?
– The model is the common language for the project team.
• Would you like to verify the proposals of the construction team before the work gets started?
– Models can be reviewed before thousands of hours of implementation work are done.
• If it was a great house, would you like to build something rather similar again, in another place?
– It is possible to implement the system on various platforms using the same model.
• Would you drill into a wall of your house without a map of the plumbing and electric lines?
– Models document the system built in a project. This makes life easier for support and maintenance!
Inmon Implementation
Kimball Implementation
Data Vault Implementation
Hybrid Approach
• http://www.kimballgroup.com/2004/03/03/differences-of-opinion/
• Learn Data Vault – “dv-in-kimball-bus-architecture”
10124 W. Broad Street, Suite C Glen Allen, Virginia 23060 804.521.4056