The Rules of Time: Data Quality Issues for Time Varying Databases
DAMA National Capital Region – Mar 2002
Dr. Jerry Rosenbaum
ConcentrX, LLC
410-764-1843 voice
443-253-6054 mobile
410-764-2445 fax
[email protected]
Sept 2001 Concentrx, LLC 2
Outline
• Perspective
• Example
• Aspects of Time
• Example (with LDM)
• Queries
• Design Guidelines
Perspective
• Designing, building and using a time dependent database looks simple
– Just add in dates or date ranges to some tables
– Rows are only logically deleted (to maintain history)
– Make sure the SQL includes date logic
• BUT . . .
Perspective continued
• There are often many issues and a lot of complexity lurking under the covers
– You must understand the requirements
– You must understand the uses of the data
– You must be prepared to help the “ad hoc” customer obtain valid results: ferret out what the customer forgot to tell you and understand what they are really saying
• Transaction Path Analysis is very useful for physical design
Perspective continued
• Primary keys often have a time factor
• Queries must take into account the (multiple) times and / or time ranges
• Relationships between entities tend to become more complex
• The notion of referential integrity may need to change
• Training customers is difficult
• Training developers is no easier
Simple Questions
• How will we represent dates
– ymd, mdy, dmy, yd, day count since a start date
• Which calendar
– Julian, Gregorian, Hebrew, Chinese, Muslim, Hindu
• When does a day begin
– Just after midnight (local time)
– At sunset (local time)
• How about am/pm vs. 24 hour clock
• How does daylight saving time fit in
• What are the transformation rules between them
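To make the representation question concrete, here is a minimal Python sketch that renders one date in several of the layouts listed above and shows how the same text parses to two different days under mdy and dmy. The sample date and the 1900-01-01 start date for the day count are arbitrary assumptions, not values from the presentation.

```python
from datetime import date, datetime

d = date(2002, 3, 15)  # arbitrary sample date

ymd = d.strftime("%Y-%m-%d")             # year, month, day
mdy = d.strftime("%m/%d/%Y")             # month, day, year
dmy = d.strftime("%d/%m/%Y")             # day, month, year
yd = d.strftime("%Y-%j")                 # year plus day of year
day_count = (d - date(1900, 1, 1)).days  # days since an assumed start date

# The same text names two different days depending on the convention used.
ambiguous = "03/04/2002"
read_as_mdy = datetime.strptime(ambiguous, "%m/%d/%Y").date()
read_as_dmy = datetime.strptime(ambiguous, "%d/%m/%Y").date()
```

Storing dates in one unambiguous internal form and converting only at the edges avoids having to guess which convention a stored string used.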
Example - Persons Residence
Track every residence a person has lived in and when they resided at each place
• Basic table design includes
– Name
– Address
– Start date
– End date
Some Issues
• However, we are not yet done
– We must understand the business purpose for tracking the data
– We must understand how the data may be used
– We must uncover and handle possible “quirks” in the data
– Are other attributes needed
– How should we handle the primary key
Example of Business Issue
• How do we plan to use the address
– General mailings
– Bills
– Time sensitive material (e.g. auction catalog)
– Visit the person
– Call the person
– Aggregate reporting
– Etc., etc., etc.
Questions
• Is day sufficiently granular
• What if the person lives in Bombay, India and the user lives in NYC
– What do we do about the time zone difference (9.5 or 10.5 hours, depending on daylight saving time in New York), especially if it bridges days
– For this type of application we can probably ignore the time zone (unless we wish to call the person)
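The "bridges days" concern can be illustrated with a short Python sketch: one instant falls on different calendar dates in the two cities. The fixed offsets below (New York at UTC-5, India at UTC+5:30) are a simplifying assumption that ignores daylight saving time, and the sample instant is made up.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets; a production system would use a time zone database.
NEW_YORK_STANDARD = timezone(timedelta(hours=-5))
INDIA = timezone(timedelta(hours=5, minutes=30))

def local_dates(instant):
    """Return the calendar date of one instant as seen in each city."""
    return (
        instant.astimezone(NEW_YORK_STANDARD).date(),
        instant.astimezone(INDIA).date(),
    )

# 8 pm on March 1 in New York is already the morning of March 2 in India.
recorded_at = datetime(2002, 3, 1, 20, 0, tzinfo=NEW_YORK_STANDARD)
ny_date, india_date = local_dates(recorded_at)
```

If day granularity is all the application needs, the design must still state whose day is being recorded.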
More Questions
• Can there be a time with no residence
• Can a person have more than one residence at one time
– Is one residence primary and the other secondary
– Can we have a temporary overlap of times as the person moves residences
– How about winter and summer residences with each primary in its season
– Should a temporary residence be included
– Can one buy two residences on one day
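Whatever the business answers to these questions turn out to be, it helps to be able to detect gaps and overlaps in the recorded periods. The following is a minimal Python sketch, assuming inclusive (start, end) date pairs for a single person; the function name and the sample periods are hypothetical.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def find_gaps_and_overlaps(periods):
    """Return (gaps, overlaps) for a list of inclusive (start, end) dates."""
    if not periods:
        return [], []
    ordered = sorted(periods)
    gaps, overlaps = [], []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start <= covered_until:
            # This period begins before the earlier coverage has ended.
            overlaps.append((start, min(end, covered_until)))
        elif start > covered_until + ONE_DAY:
            # At least one day is not covered by any period.
            gaps.append((covered_until + ONE_DAY, start - ONE_DAY))
        covered_until = max(covered_until, end)
    return gaps, overlaps

residences = [
    (date(1995, 1, 1), date(1997, 2, 1)),
    (date(1997, 2, 1), date(1998, 6, 30)),    # shares one day with the first
    (date(1998, 9, 1), date(9999, 12, 31)),   # begins two months later
]
gaps, overlaps = find_gaps_and_overlaps(residences)
```

Whether a reported gap or overlap is an error or a legitimate situation (a move, a seasonal residence) is a business rule, not something the code can decide.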
Primary Key Questions
• Does it make sense to use
– Name + date + address sequence number
– Name + address sequence number
– Surrogate key
• If a surrogate key is used, what is the underlying business key
• What effect does this have on foreign keys
Possible Design
• So far we are led to the following possible design
– Surrogate Key
– Name
– Address
– Address Type
– Start Date
– End Date
Yet More Questions
• Do we have to track
– When we knew about a new address
– When we knew that an address is to end
• Note that these two dates can be
– Before the person moved to an address
– During the time a person is at an address
– After a person leaves an address
• This data would add two more dates to the table design
One Last Thought
• Alternative physical design could be 2 tables
• Table 1
– Person Id
– Name
• Table 2
– Person Id
– Address Seq Number
– Rest of the attributes
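One way the two-table design might look, sketched with Python's built-in sqlite3 module. All table and column names, the two "known on" columns, the 9999-12-31 open-ended value and the sample rows are illustrative assumptions rather than the presenter's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE residence (
    person_id      INTEGER NOT NULL REFERENCES person(person_id),
    address_seq    INTEGER NOT NULL,
    address        TEXT NOT NULL,
    address_type   TEXT NOT NULL,   -- e.g. primary, secondary, temporary
    start_date     TEXT NOT NULL,   -- ISO yyyy-mm-dd
    end_date       TEXT NOT NULL,   -- 9999-12-31 when still open (assumed)
    start_known_on TEXT,            -- when we learned of the new address
    end_known_on   TEXT,            -- when we learned the address would end
    PRIMARY KEY (person_id, address_seq)
);
INSERT INTO person VALUES (1, 'Smith');
INSERT INTO residence VALUES
    (1, 1, '12 Oak St', 'primary', '1995-01-01', '1998-06-30',
     '1994-12-15', '1998-05-01'),
    (1, 2, '9 Elm Ave', 'primary', '1998-07-01', '9999-12-31',
     '1998-05-01', NULL);
""")

def residences_on(person_id, day):
    """Addresses recorded as valid for the person on the given ISO date."""
    return conn.execute(
        "SELECT address, address_type FROM residence "
        "WHERE person_id = ? AND ? BETWEEN start_date AND end_date "
        "ORDER BY address_seq",
        (person_id, day),
    ).fetchall()
```

For example, `residences_on(1, '1996-03-01')` returns the Oak St row, while a date in 2001 returns the Elm Ave row.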
Key Points
• The basics of tracking time varying data appear easy
• The details cannot be ignored because they will cause changes in both the design and use of the database
• One must understand the business
• One must understand the customers
• Rules are subject to change
Aspects of Time
The degree of time dependency can vary from table to table and attribute to attribute
• Some data has no time dependency (or we don’t care about the time dependency)
• Some data is time annotated
• Other data is valid only for a specified time or time period (i.e. time period dependent)
Time Data Types
• Time Points
• Time Periods
• Time Point Categories
• Time Period Categories
• Bounded Time Periods
Events and Time
Time by itself is rarely of interest
• Events and Things are important and we may need to track time in relation to them
• An event or thing may have one or more time factors associated with it that are relevant to the business
• Time factors may be interdependent
Time Points
• Refers to a single “moment” in time
• Examples
– The time that an event happened
– The time we found out that the event happened
– The time the data about the event was entered into the system
• Any single event may have multiple point in time dimensions
Picking a Point in Time
• Suppose a widget is imported by ship
• What is the import date
– Date widget is loaded onto the ship
– Date ship arrives in U.S. port
– Date container is taken off ship
– Date customs inspector gets manifest
– Date customs inspector verifies manifest
– Etc., etc., etc.
• If widgets are subject to a quota this is very important
Time Periods
• Has a duration - beginning time point and end time point
• Examples
– U.S. government fiscal year 1999 (Oct 1, 1998 to Sept 30, 1999)
– Effective and expiration dates of an insurance policy
• An event may have multiple time periods associated with it
Time Point Categories
• Generalization of a Time Point
• Examples
– Last day of month (Jan 31, Feb 28, etc.)
– New Moon
– Mondays
• Categories must be well defined and data may be entered or calculated for each entry
• Example of use - service customer first Monday of each month
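Some time point categories can be calculated rather than entered. The Python sketch below computes two of the examples above, the last day of a month and the first Monday of a month; the function names are hypothetical, and a category such as New Moon would more likely need entered data.

```python
import calendar
from datetime import date, timedelta

def last_day_of_month(year, month):
    """Last calendar day of the month, leap years included."""
    return date(year, month, calendar.monthrange(year, month)[1])

def first_weekday_of_month(year, month, weekday=calendar.MONDAY):
    """First occurrence of the given weekday (Monday by default)."""
    first = date(year, month, 1)
    return first + timedelta(days=(weekday - first.weekday()) % 7)
```

A "service customer first Monday of each month" schedule would call `first_weekday_of_month` once per month.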
Time Period Categories
• Generalization of Time Periods
• Examples
– Fiscal Year
– Accounting Months
– Sales weeks for a retailer (often Mon - Sun and numbered sequentially from the first full week in January)
• Example of use - comparing retail sales from last year and this year
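As a sketch of how a date might be mapped to a time period category, the Python functions below assign a U.S. government fiscal year (October 1 through September 30) and a retail sales week (Monday to Sunday, numbered from the first full week in January). The function names are hypothetical, and treating the days before the first Monday of January as part of the previous sales year is an assumption; a real retailer's calendar may differ.

```python
from datetime import date, timedelta

def us_fiscal_year(d):
    """Fiscal year N runs from Oct 1 of year N-1 to Sept 30 of year N."""
    return d.year + 1 if d.month >= 10 else d.year

def _first_monday_of_january(year):
    jan1 = date(year, 1, 1)
    return jan1 + timedelta(days=(0 - jan1.weekday()) % 7)

def sales_week(d):
    """Return (sales_year, week_number) for a Mon-Sun sales week."""
    year = d.year
    start = _first_monday_of_january(year)
    if d < start:
        # Assumed rule: early January days close out the previous sales year.
        year -= 1
        start = _first_monday_of_january(year)
    return year, (d - start).days // 7 + 1
```

Comparing "this year" with "last year" then means comparing equal week numbers, not equal calendar dates.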
Bounded Time Periods
• Similar to time periods, but the span of time is not predefined
• Examples
– The period when a person works for a company (or department)
– Car ownership - day you acquire a car until the day you dispose of it
Tense
• Time factors can be
– Past
– Present
– Future
• There are often business rules about recording past and future information as well as rules for changing that data
The Global Aspect
Many companies operate in multiple time zones (including global operations)
• To correlate time factors between different time zones, a company generally sets up
– Reference time zone
– Rules for recording local time zones (or location)
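A minimal Python sketch of the two pieces named above: each event is stored as an instant in a reference time zone together with the local offset where it happened. Choosing UTC as the reference zone, the class and field names, and the fixed offsets in the sample events are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class RecordedEvent:
    occurred_utc: datetime      # instant in the reference zone (UTC assumed)
    local_offset_minutes: int   # offset of the site where it happened

    @classmethod
    def from_local(cls, local_naive, offset_minutes):
        """Build a record from a local wall-clock time and its offset."""
        zone = timezone(timedelta(minutes=offset_minutes))
        instant = local_naive.replace(tzinfo=zone).astimezone(timezone.utc)
        return cls(instant, offset_minutes)

    def local_time(self):
        """Recover the wall-clock time at the site of the event."""
        zone = timezone(timedelta(minutes=self.local_offset_minutes))
        return self.occurred_utc.astimezone(zone)

# Local clocks suggest New York happened first; reference time says otherwise.
new_york = RecordedEvent.from_local(datetime(2002, 3, 1, 23, 0), -300)
bombay = RecordedEvent.from_local(datetime(2002, 3, 2, 9, 0), 330)
bombay_was_first = bombay.occurred_utc < new_york.occurred_utc
```

Keeping the offset (or location) alongside the reference instant lets reports show local time without losing the ability to order events globally.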
Example – Tracking Employees
• We need to track some HR data and maintain history
– Employees: hours worked each day, salary, paychecks
– Departments: departmental manager, employees working in the department
Business Question
• Determine the number of hours a person worked during the week of January 1, 1998 (Thursday)
• If a work day includes midnight, we attribute all hours to the day in which the work period began
– Note: Midnight is the beginning of the next day
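The attribution rule can be sketched in Python as follows: every shift counts entirely toward the day it began, and the week is found from any day inside it. The function names, the `week_start` parameter (0 = Monday, as in Python's `weekday()`) and the sample shifts are assumptions for illustration.

```python
from datetime import date, datetime, timedelta

def week_bounds(any_day, week_start=0):
    """First and last date of the week containing any_day."""
    back = (any_day.weekday() - week_start) % 7
    first = any_day - timedelta(days=back)
    return first, first + timedelta(days=6)

def hours_in_week(shifts, any_day, week_start=0):
    """Total hours of shifts that began within the week of any_day."""
    first, last = week_bounds(any_day, week_start)
    total = timedelta()
    for began, ended in shifts:
        if first <= began.date() <= last:
            total += ended - began  # whole shift goes to the day it began
    return total.total_seconds() / 3600

shifts = [
    (datetime(1997, 12, 31, 9, 0), datetime(1997, 12, 31, 17, 0)),
    (datetime(1998, 1, 4, 22, 0), datetime(1998, 1, 5, 6, 0)),   # spans midnight
    (datetime(1998, 1, 5, 9, 0), datetime(1998, 1, 5, 17, 0)),
]
monday_week_hours = hours_in_week(shifts, date(1998, 1, 1), week_start=0)
sunday_week_hours = hours_in_week(shifts, date(1998, 1, 1), week_start=6)
```

The same shifts give different totals for the "week of January 1, 1998" depending on the day the week starts, which is why that rule has to be pinned down first.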
Additional Questions
• When does a work week start: Friday, Saturday, Sunday or Monday
• The week of Jan 1 goes across a calendar year boundary, do we split the week into two
• Are there two types of weeks: tax weeks and work weeks. We use tax week for the IRS and work week to calculate payroll
• Payroll withholding rules change every year
Logical Design
Building the logical data model
• Include time independent and time annotated items
• Temporarily ignore time dependencies and treat the model as if you were looking at the business at a specific point in time.
• Add time dependencies as a second step
LDM Without time dependencies
Add Some Time Dependencies
• Employees
– Have hire and termination dates
– Change salary
– Change departments
• Departments
– Are created and eliminated
– Have changes in management
LDM With Time Dependencies
Notes
• Primary keys for tables (except PayCheck) include a start time
• All time periods include both a start time and an end time
• If we do not know the end time, should we
– Use a standard default value (preferred)
– Use a null
• This is the normalized logical model
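One likely reason the default value is preferred can be shown with Python's sqlite3 module: a plain BETWEEN test finds an open-ended row that uses a far-future default, but silently misses the same row stored with a null end. The table names are hypothetical, and 9999-12-31 is assumed as the default because it is the value used in the sample tables later in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salary_default (emp_id TEXT, salary INTEGER,
                             from_dt TEXT, to_dt TEXT NOT NULL);
CREATE TABLE salary_null    (emp_id TEXT, salary INTEGER,
                             from_dt TEXT, to_dt TEXT);
INSERT INTO salary_default VALUES ('001', 50000, '1997-02-01', '9999-12-31');
INSERT INTO salary_null    VALUES ('001', 50000, '1997-02-01', NULL);
""")

as_of = "1999-12-31"
plain = "SELECT salary FROM {table} WHERE ? BETWEEN from_dt AND to_dt"

# Default end value: the ordinary date logic works unchanged.
with_default = conn.execute(plain.format(table="salary_default"), (as_of,)).fetchall()

# Null end value: the comparison evaluates to unknown, so the row is dropped.
with_null = conn.execute(plain.format(table="salary_null"), (as_of,)).fetchall()

# Null end value handled explicitly: every query must remember this clause.
null_aware = conn.execute(
    "SELECT salary FROM salary_null "
    "WHERE ? >= from_dt AND (to_dt IS NULL OR ? <= to_dt)",
    (as_of, as_of),
).fetchall()
```

With nulls, every ad hoc query has to carry the extra clause, which is exactly the kind of detail customers forget.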
More Notes
• The LDM maintains RI
– Physical model will generally not have RI
• The business rules for integrity of the data (similar to “RI”) are critical
– The “basic business key” portions must match
– The time period of the “referenced table” must include the time period of the “referencing table”
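The two-part rule above can be sketched as a check in Python: a referencing row passes only if some referenced row has the same business key and a period that fully contains it. Function names and sample rows are hypothetical, and the sketch makes the simplifying assumption that a single referenced row must cover the whole period (it does not stitch together adjacent referenced periods).

```python
def contains(outer, inner):
    """True if the inclusive period `outer` fully covers `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def integrity_violations(referencing_rows, referenced_rows):
    """Return referencing rows with no matching, covering referenced row.

    Rows are (business_key, from_dt, to_dt) with ISO yyyy-mm-dd strings,
    which compare correctly as plain text.
    """
    bad = []
    for key, start, end in referencing_rows:
        covered = any(
            ref_key == key and contains((ref_start, ref_end), (start, end))
            for ref_key, ref_start, ref_end in referenced_rows
        )
        if not covered:
            bad.append((key, start, end))
    return bad

employees = [
    ("001", "1995-01-01", "9999-12-31"),
    ("002", "1996-04-01", "9999-12-31"),
]
memberships = [
    ("001", "1995-01-01", "1997-02-01"),
    ("002", "1996-01-01", "9999-12-31"),  # starts before employee 002 exists
    ("003", "1996-04-01", "9999-12-31"),  # no employee with this key
]
violations = integrity_violations(memberships, employees)
```

Because a declared foreign key cannot express the containment test, a check like this typically runs in load procedures or as a periodic audit.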
Still More notes
• PayCheck is still the same except there is an important business rule
• The attribute salary has become a separate table with a 1:M relationship
• The 1:1 “manages a dept” relationship became an M:M relationship
• The 1:M “member of a dept” relationship became an M:M relationship
Looking At Queries
• Consider the following tables

Employee
Emp Id  Name   Start Dt    End Dt
001     Smith  1995-01-01  9999-12-31
002     Jones  1996-04-01  9999-12-31

Member Of
Emp Id  Dept     From Dt     To Dt
001     Acct     1995-01-01  1997-02-01
001     Finance  1997-02-01  9999-12-31
002     Finance  1996-04-01  9999-12-31

Salary
Emp Id  Salary  From Dt     To Dt
001     40000   1995-01-01  1997-01-31
001     50000   1997-02-01  9999-12-31
002     90000   1996-04-01  1997-03-14
002     100000  1997-03-15  9999-12-31
Average at a Point in Time
• Average salary for Finance at the end of 1999

Select Avg(T3.Salary)
From Member_Of T2, Salary T3
Where T2.Dept = 'Finance'
And T2.Emp_Id = T3.Emp_Id
And '1999-12-31' Between T2.From_Dt And T2.To_Dt
And '1999-12-31' Between T3.From_Dt And T3.To_Dt

• We have a similar query for the average 1998 salary
Are We Comparing the Right Averages
• People have changed departments
• Average salary at the end of 1998 and 1999 reflects those people who just happened to be in Finance at those points in time
• The average salary in Finance dropped because we transferred in a low salary employee
• We must create views that take into account the organizational changes
Yesterday’s Salary with Today’s Glasses
• What is the average salary for Finance at the end of 1998 based on those in finance at the end of 1999
Select Avg(T3.Salary)
From Member_Of T2, Salary T3
Where T2.Dept = 'Finance'
And T2.Emp_Id = T3.Emp_Id
And '1999-12-31' Between T2.From_Dt And T2.To_Dt
And '1998-12-31' Between T3.From_Dt And T3.To_Dt
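The difference between the single time point query and the two time point query can be tried out with Python's sqlite3 module, loaded with the Member Of and Salary sample rows from the tables shown earlier. The snake_case table and column names and the `average_salary` helper are assumptions; the two date parameters correspond to the membership date and the salary date in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member_of (emp_id TEXT, dept TEXT, from_dt TEXT, to_dt TEXT);
CREATE TABLE salary    (emp_id TEXT, salary INTEGER, from_dt TEXT, to_dt TEXT);
INSERT INTO member_of VALUES
    ('001', 'Acct',    '1995-01-01', '1997-02-01'),
    ('001', 'Finance', '1997-02-01', '9999-12-31'),
    ('002', 'Finance', '1996-04-01', '9999-12-31');
INSERT INTO salary VALUES
    ('001',  40000, '1995-01-01', '1997-01-31'),
    ('001',  50000, '1997-02-01', '9999-12-31'),
    ('002',  90000, '1996-04-01', '1997-03-14'),
    ('002', 100000, '1997-03-15', '9999-12-31');
""")

def average_salary(dept, membership_date, salary_date):
    """Average salary on salary_date of those in dept on membership_date."""
    row = conn.execute(
        """
        SELECT AVG(s.salary)
        FROM member_of AS m
        JOIN salary AS s ON s.emp_id = m.emp_id
        WHERE m.dept = ?
          AND ? BETWEEN m.from_dt AND m.to_dt
          AND ? BETWEEN s.from_dt AND s.to_dt
        """,
        (dept, membership_date, salary_date),
    ).fetchone()
    return row[0]

# One time point: who was in Finance at the end of 1996, at 1996 salaries.
point_in_time = average_salary("Finance", "1996-12-31", "1996-12-31")

# Two time points: today's (1999) Finance staff, at their 1996 salaries.
todays_glasses = average_salary("Finance", "1999-12-31", "1996-12-31")
```

With this sample data the first call averages only employee 002, while the second also includes employee 001, who was still in Acct at the end of 1996, so the two "1996 averages" differ.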
Query Notes
• Queries that involve one time point (or period) are usually straightforward
• Queries involving two time points (or periods) can cause significant confusion
• Watch out for a set of two (or more) queries which involve more than one time point (or period). They look deceptively simple
Design Guidelines
• First build the logical data model
– Include time independent and time annotated items (e.g. date of birth)
– Temporarily ignore time dependencies and treat the model as if you were looking at the business at a specific point in time
– Make notes about all time dependent attributes and entities and relationships
• Gather potential queries, query sets and reports
Design - 2
• The LDM should be in Third Normal Form
• The primary keys in this model will be the basic business keys for future integrity rules
• Do not combine entities in 1:1 relationships unless they truly represent the same thing or concept
• In general column vectors are preferred to row vectors
• Delay design changes for physical considerations until later
Adding in the Time Factors
• Individual Attributes
• Groups of Attributes
• 1:1 Relationships
• 1:M Relationships
• M:M Relationships
• N-ary Relationships
• Integrity Rules
• Multiple Time Factor Case
Integrity Rules
• Referential Integrity often does not hold in the physical database design
– There is no exact matching of primary and foreign keys
• A business integrity rule usually replaces RI
– An exact match of the business key (like RI)
– A rule for how the time factors must relate to each other
Hard Problem
• The design of the database is the easy problem
• Training customers to properly understand and use a time varying database is hard and you should not underestimate the task.
Thank you for your patience
Questions
Dr. Jerry Rosenbaum
ConcentrX, LLC
410-764-1843 voice
443-253-6054 mobile
410-764-2445 fax