



Temporal Dimensional Modeling, Rönnbäck and Regardt, preprint 2019 rev. 4

Temporal Dimensional Modeling

Lars Rönnbäck*

Olle Regardt

Department of Computer Science, Stockholm University · [email protected]

Abstract

Dimensional modeling has for the last decades been one of the prevalent techniques for modeling data warehouses. As initially defined, a few constructs for recording changes were provided, but due to complexity and performance concerns many implementations still provide only the latest information. With increasing requirements to record changes, additional constructs have been suggested to manage change, mainly in the form of slowly changing dimension types. This paper will show that the currently existing types may lead to undesired anomalies, either at read or at write time, making them unsuitable for application in performance critical or near real-time data warehouses. Instead, based on current research in temporal database modeling, we introduce temporal dimensions that make facts and dimensions temporally independent, and therefore suffer from none of said anomalies. In our research, we also discovered the twine, a new concept that may significantly improve performance when loading dimensions. Code samples, along with query results showing the positive impact of implementing temporal dimensions compared to slowly changing dimensions, are also presented.

dimensional modeling · microbatch · data warehouse · slowly changing dimension · high performance · twine · temporal dimension · real-time analytics

1. Introduction

For the last 25 years, two techniques have been dominating when it comes to data warehouse implementations: (Inmon 1992) with his normalized model and (Kimball 1996) with his dimensional modeling. Initially, few constructs or guidelines for managing data that changes over time were available, but as such requirements became more common, temporal features were retrofitted onto these technologies (Bliujute et al. 1998; Body et al. 2002). This paper focuses on slowly changing dimension types (Kimball 2008; Ross 2013), which were introduced to manage change in dimensional modeling, and how well these fit with current requirements for high performance. Along with the traditional types of dimensions, a simple and new type of dimension is introduced, called a temporal dimension. It is based on current research in temporality (Hultgren 2012; Rönnbäck et al. 2010; Golec, Mahnič, and Kovač 2017), and suffers from none of the refresh anomalies associated with slowly changing dimensions (R. J. Santos and Bernardino 2008; Slivinskas et al. 1998). For most types of slowly changing dimensions the fact table and its dimension tables become temporally dependent, leading to update anomalies and performance degradation (Araque 2003). This is not the case when using temporal dimensions, in which the tables are temporally independent, such that existing relationships are preserved regardless of all changes. We, the authors, recommend that this new type of dimension should become the de facto standard for managing change in dimensional modeling, particularly when considering real-time analytics, a growing requirement within data warehousing (Russom, Stodder, and Halper 2014; Azvine, Cui, and Nauck 2005). While much research has been done into strategies for loading data at near real-time (Jörg and Dessloch 2009; Vassiliadis and Simitsis 2009), little has been done with respect to finding the most suitable modeling techniques.

* Thanks to Up to Change (www.uptochange.com) for sponsoring research on anchor and transitional modeling.


The temporal dimension and many of the existing types support all three query categories that may be relevant when retrieving information that changes over time (Golfarelli and Rizzi 2009; Ahmed, Zimányi, and Wrembel 2015):

tiy (Today is Yesterday), in which all facts reference the current dimensional value.

toy (Today or Yesterday), in which all facts reference the dimensional value that was in effect when the fact was recorded.

yit (Yesterday is Today), in which all facts reference the dimensional value that was in effect at a certain point in time.

The paper has a running example and is structured as follows. This section served as an introduction to the topics discussed, Section 2 presents slowly changing dimensions, their models, and issues stemming from how they are implemented, and Section 3 introduces our new concept of a temporal dimension. Following that, Section 4 presents results from reading from and writing to the different types of dimensions, Section 5 contrasts our work with related research, and Section 6 ends the paper with conclusions and suggestions for further research.

2. Slowly Changing Dimensions

A running example will be used to illustrate the differences found in the models of the slowly changing dimensions of type 1–7, the "junk" type, and the temporal dimension. The example model, containing a simple fact and dimension table, can be seen in Figure 1. Each column is contained in a box, which is lined and shaded when that column is part of the primary key in the table, and white with dotted lines if not. Slanted text is used for columns that have nothing to do with the keeping of a history of changes and upright text for the columns necessary to keep such a history. Relationships between tables are shown along with their cardinalities.

To keep the model uncluttered and to the point, it is the simplest possible that still conveys the construction of the different types. The dimension table has a single value, dimProperty, along with a primary key, dim_ID. The fact table references this key from a column with the same name, along with one measure, factMeasure, and the date, factDate, for which this measure was recorded¹. The primary key in the fact table is a combined key, consisting of all dimension table references (in our case one) and the date that the measures were recorded.

¹ Normally dates are referenced from a fact table to a time dimension, but since that would only introduce additional, and for all illustration and testing purposes equal, complexity, it was simplified to a column directly in the fact table.

The numbering of the types is taken from (Kimball 1996) for types 1 to 3, whereas 4 to 7 and "junk" come from (Ross 2013) and (Wikipedia 2019), favoring those definitions that actually capture a history of changes. There is, in fact, also a slowly changing dimension of type 0, but in that type changes are simply discarded, resulting in information that is not up to date. As this is less than desirable, type 0 will not be discussed further. It is well known that change is prevalent and integral to almost all information, so with no further ado, these are the types of solutions retrofitted onto dimensional modeling, trying to solve the problem of managing change.

2.1. Type 1

Figure 1: Dimensional model, with a slowly changing dimension of type 1, also coinciding with type 0. [Diagram: Fact (dim_ID, factDate, factMeasure) references Dimension_SCD1 (dim_ID, dimProperty), cardinality 1..n to 1.]

In type 1, depicted in Figure 1, destructive updates of dimProperty are used to bring information up to date. The history is, of course, lost, and this type can only support tiy. This type is commonly used, despite its inability to keep previous versions, probably because the other types of slowly changing dimensions result in added complexity. However, as the requirement is to keep a history of the changes made, this type is discussed less, but will act as a point of reference when performing the experiments that measure the performance of read and write operations.
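As a point of reference, bringing a type 1 dimension up to date is a single destructive update. A minimal sketch using the tables of the running example follows; the literal values are hypothetical and not taken from the experiment.

-- Type 1: overwrite the property in place, discarding the previous value.
update Dimension_SCD1
set dimProperty = 'new value'  -- hypothetical replacement value
where dim_ID = 42;             -- hypothetical surrogate key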

2.2. Type 2

In type 2, depicted in Figure 2, instead of updating dimProperty a new row is added with the new value. This row will have a different dim_ID, so in order to be able to correlate this change with the row containing the original value, some portion of the dimension must remain static over changes. This is represented by the dimStatic column. If all values can change, this may instead be introduced through a generated identifier. Furthermore, to know the order of changes and when they took effect, a dim_ValidFrom date is added.

Figure 2: Dimensional model, with a slowly changing dimension of type 2. [Diagram: Fact (dim_ID, factDate, factMeasure) references Dimension_SCD2 (dim_ID, dim_ValidFrom, dimProperty, dimStatic), cardinality 1..n to 1.]

Facts with factDate after dim_ValidFrom will reference the new value, and considering a model that regularly is receiving data, this will build a toy connection between facts and dimension values. This can lead to confusion, particularly if aggregating measures over dimProperty, which now yields at least two different aggregates, given the circumstances. Bringing facts into a tiy or yit view is possible through the combination of dimStatic and dim_ValidFrom, but the resulting query is complex and needs additional indexing in order to be performant.

If the use case of tiy is much more common, it is possible to update fact references when values change. This will instead take a heavy hit on performance at write time, since part of a primary key in the fact table is updated.
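To illustrate the write pattern, recording a change in a type 2 dimension adds a row rather than updating one. A minimal sketch under the running example's layout; the sequence used for new surrogate keys and the literal values are hypothetical.

-- Type 2: a change becomes a new row with a new surrogate key,
-- correlated with earlier versions through dimStatic.
insert into Dimension_SCD2 (dim_ID, dim_ValidFrom, dimProperty, dimStatic)
select
    next value for DimensionSequence,  -- hypothetical sequence generating new surrogate keys
    '20180101',                        -- hypothetical date from which the new value is in effect
    'new value',                       -- hypothetical replacement value
    dimStatic                          -- static part carried over from the superseded version
from
    Dimension_SCD2
where
    dim_ID = 42;                       -- hypothetical key of the version being superseded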

2.3. Type 3

Figure 3: Dimensional model, with a slowly changing dimension of type 3. [Diagram: Fact (dim_ID, factDate, factMeasure) references Dimension_SCD3 (dim_ID, dimProperty_V1, dimProperty_V2), cardinality 1..n to 1.]

In type 3, depicted in Figure 3, columns are added to manage change. The column dimProperty is split into dimProperty_V1 and dimProperty_V2, if only two versions are of interest. This limits the number of changes that can be kept track of, since regularly adding columns to the table will make it unwieldy. Any existing queries will also have to be continuously maintained in order to cope with the addition of new columns. Furthermore, if only a few of the rows in the dimension change, most rows will never have seen any changes and their columns then contain empty or dummy values for all but the first version, which is a waste of space.

2.4. Type 4

In type 4, depicted in Figure 4, a type 1 destructive update is used on the actual value in the dimension table, producing a tiy connection. However, this is complemented by a second "history" table, to which the row with the value about to be updated first is copied. The primary key in the history table consists of the combination of the columns dim_ID and dim_ValidFrom. Historical rows thereby retain their original dim_ID, making it easier to find them, but since the key now is composed it cannot be referenced from the fact table. The dashed reference line intends to show that a non-enforced relationship nevertheless exists.

The drawback is that a performance hit is taken at write time, since not only must a value be updated, but rows must also be copied between tables. As no referential integrity constraint exists between the fact and the history table, consistency must instead be guaranteed by other means. Queries for which yit or toy is of interest become less complex than in the case of type 2, but table elimination is no longer possible (Paulley 2000), due to the lacking referential integrity constraint. This may prove a problem in situations where many dimensions are slowly changing according to this type, since unneeded joins may be performed during execution of the query.
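The copy-then-update pattern described above could be sketched as follows; the table names follow Figure 4, while the literal values and the convention for what dim_ValidFrom records are assumptions.

-- Type 4: preserve the outgoing version in the history table ...
insert into Dimension_SCD4 (dim_ID, dim_ValidFrom, dimProperty)
select dim_ID, '20180101', dimProperty  -- hypothetical timestamp; the exact convention may vary
from Dimension_Current
where dim_ID = 42;                      -- hypothetical surrogate key

-- ... then destructively update the current table, just as in type 1.
update Dimension_Current
set dimProperty = 'new value'           -- hypothetical replacement value
where dim_ID = 42;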

Figure 4: Dimensional model, with a slowly changing dimension of type 4. [Diagram: Fact (dim_ID, factDate, factMeasure) references Dimension_Current (dim_ID, dimProperty); previous versions are copied to the history table Dimension_SCD4 (dim_ID, dim_ValidFrom, dimProperty), connected only by a non-enforced (dashed) relationship.]

2.5. Type 5

Taking ideas from type 2, type 5, depicted in Figure 5, complements the dimension with the column dim_Current_ID, making it easy to find the current information. Where a single join from the fact to the dimension table yields a toy connection, the tiy connection is only a second join away. Performance is hampered by the fact that searches need to be done over non-primary keys, which necessitates secondary indexes. Furthermore, yit still needs custom queries, and whenever a value changes updates are needed on existing rows, to repoint these towards the new latest row.
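The repointing step could look roughly like this, assuming the new current row has already been inserted; the literal values are hypothetical.

-- Type 5: repoint all versions sharing the same static part
-- to the row that now holds the current information.
update Dimension_SCD5
set dim_Current_ID = 4711    -- hypothetical dim_ID of the newly inserted current row
where dimStatic = 'ABC123';  -- hypothetical static identifier shared by all versions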

Figure 5: Dimensional model, with a slowly changing dimension of type 5. [Diagram: Fact (dim_ID, factDate, factMeasure) references Dimension_SCD5 (dim_ID, dim_ValidFrom, dimProperty, dimStatic, dim_Current_ID), where dim_Current_ID is a self-reference, cardinality 1..n to 1, pointing to the row holding the current version.]

2.6. Type 6

Trying to remedy some of the disadvantages of type 2, type 6, depicted in Figure 6, shares almost the same structure, but with a different primary key. The advantage of this type is that it is no longer needed for any part of the dimension to remain static, since the same dim_ID will remain across changes, thanks to dim_ValidFrom now being part of the primary key.

Figure 6: Dimensional model, with a slowly changing dimension of type 6. [Diagram: Fact (dim_ID, factDate, factMeasure) and Dimension_SCD6 (dim_ID, dim_ValidFrom, dimProperty), where the dimension's primary key is (dim_ID, dim_ValidFrom) and no enforceable reference exists from the fact table.]


The sacrifice being made is that referential integrity can no longer be maintained, as only half of that key is present in the fact table. While this gives a lot of flexibility, with no preference towards either of tiy, toy, or yit, problems arise if this model is the source for a cube, or any tool that requires referential integrity. A possible circumvention, although not mentioned in literature but probably existing in practice, would be to create regular views corresponding to tiy and toy, which the fact table can reference using a logical key in the tool utilizing the model.
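Such a view is not spelled out in the paper, but a tiy view over a type 6 dimension might be sketched along the following lines, exposing one row per dim_ID so that tools can treat it as a logical key; the view name is hypothetical.

-- Hypothetical "current version" view over a type 6 dimension.
create view Dimension_SCD6_tiy as
select dim_ID, dim_ValidFrom, dimProperty
from (
    select
        *,
        row_number() over (partition by dim_ID order by dim_ValidFrom desc) as ReversedVersion
    from Dimension_SCD6
) versions
where ReversedVersion = 1;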

2.7. Type 7

Combining ideas from type 2 and type 5, type 7, depicted in Figure 7, also adds a column to keep track of the current information. Rather than referencing this from the dimension, as in type 5, the fact table is extended with a column. Compared to type 5 the current information is now a single join away instead of needing a double join.

Figure 7: Dimensional model, with a slowly changing dimension of type 7, which extends the fact table with a foreign key column referencing the current dimension information. [Diagram: Fact_SCD7 (dim_ID, factDate, factMeasure, dim_Current_ID) references Dimension_SCD2 (dim_ID, dim_ValidFrom, dimProperty, dimStatic) twice, once through dim_ID for the version in effect and once through dim_Current_ID for the current version.]

Having the same issues as type 5 with additional indexing and requiring a static part, the reduced number of joins should, however, improve performance at read time. The trade-off is that more rows need updating at write time, since now fact rows have to be updated instead of dimension rows, and they tend to be much more numerous.

2.8. Junk and Mini Dimension

Figure 8: Dimensional model, with a rapidly changing dimension, through mini and junk dimensions. [Diagram: Fact (dim_ID, factDate, factMeasure) references Dimension (dim_ID); Dimension_Mini (dim_ID, dim_Junk_ID, dim_ValidFrom) references both Dimension (dim_ID) and Dimension_Junk (dim_Junk_ID, dimProperty), cardinality 1..n to 1 for each relationship.]

A final approach, depicted in Figure 8, not having been given a type, is still a construction that aims to manage change. The introduction of a "junk" and "mini" dimension is recommended when changes occur more rapidly than what is suitable for a slowly changing dimension. It is not clear exactly what rapidity necessitates this, but most likely when some part of the dimension changes considerably more often than other parts.

The parts that change rapidly, dimProperty in our case, are moved to a junk dimension, having its own primary key. Parts that are static or change slowly may remain in the original dimension, which is of a suitable slowly changing dimension type. All that remains in the original dimension in the example is dim_ID. A mini dimension is then introduced that references both the original dimension and the junk dimension, in which the primary key is extended with dim_ValidFrom.
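A minimal tiy-style sketch of assembling facts with their property under this construction, assuming the tables of Figure 8 and picking the mini row currently in effect for each dim_ID, could read:

-- Junk/mini: fact -> original dimension -> mini dimension -> junk dimension.
select f.dim_ID, f.factDate, f.factMeasure, j.dimProperty
from Fact f
join Dimension d
  on d.dim_ID = f.dim_ID
cross apply (
    select top 1 m.dim_Junk_ID
    from Dimension_Mini m
    where m.dim_ID = d.dim_ID
    order by m.dim_ValidFrom desc
) current_mini
join Dimension_Junk j
  on j.dim_Junk_ID = current_mini.dim_Junk_ID;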

This is actually the only model in which both referential integrity is maintained as well as being agnostic with respect to tiy, toy, and yit. It suffers from now needing three joins in order to assemble the information, along with complex query logic. Furthermore, it relies upon the modeller making perfect assumptions about which columns will change rapidly, slowly, or not at all. Of course, all columns could be moved to the junk dimension, but if they change at very different rates, a lot of information will be unnecessarily duplicated.

3. Temporal Dimension

Surprisingly, there is a model that is simpler in nature than many of the ones already presented that maintains referential integrity, is agnostic with respect to how the information should be temporally retrieved, and that has very respectable read and write performance. The key is the realization that temporal information benefits from having its immutable and mutable parts separated from each other and that minimal assumptions should be made concerning the future of the information being modeled.

Figure 9: Dimensional model, with the temporal dimension introduced in this paper, along with an optional "valid from" column in the fact table. [Diagram: Fact (dim_ID, factDate, factMeasure, optional dim_ValidFrom) references Dimension_Immutable (dim_ID), which is in turn referenced by Dimension_Mutable (dim_ID, dim_ValidFrom, dimProperty), cardinality 1..n to 1 for both relationships.]

The hereafter dubbed temporal dimension can be seen in Figure 9. It has two parts: a mutable table, similar to a type 6 slowly changing dimension, and an immutable table, similar to what would remain of the original dimension in the junk dimension scenario on the assumption that all columns may change. Should more properties be available in the dimension and those change at very different rates, each can reside in its own mutable table, still referencing the same immutable table.

While it may look strange to introduce an immutable table with a single column in between the fact and dimension table, it ensures that referential integrity can be maintained by the database. Perhaps even stranger is the fact that the query optimizer can use these constraints to completely eliminate this table during the execution of retrieval queries, directly joining the fact and the mutable table. The immutable table will not be physically touched, and therefore behaves only like a logical modeling construct, not adversely affecting performance.

The optional dim_ValidFrom column in the fact table is used to retain which version is in effect with respect to the factDate. This will speed up toy queries to some extent, at the cost of enlarging the fact table, which then slightly slows down tiy and yit queries. Depending on the requirements this column may be omitted.
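With the optional column in place, a toy query no longer needs to derive which version was in effect; it can join the mutable table directly on the stored pair of keys. A minimal sketch, assuming the fact table has been extended with dim_ValidFrom:

-- toy using the optional dim_ValidFrom column in the fact table:
-- the version in effect was recorded at load time, so a plain equi-join suffices.
select dm.dimProperty, count(*) as numberOfFacts, avg(f.factMeasure) as avgMeasure
from Fact f
join Dimension_Mutable dm
  on dm.dim_ID = f.dim_ID
 and dm.dim_ValidFrom = f.dim_ValidFrom
where f.factDate between '20140101' and '20141231'
group by dm.dimProperty;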

4. Comparative Experiment

Experiments have been made in which read and write performance have been tested for type 1–7 and junk, compared with the presented temporal dimension. Read performance is further classified into the three query types necessary to retrieve information in tiy, toy, and yit. Write performance is measured separately for adding and updating rows in the dimension and fact tables.

4.1. Configuration

The tests were carried out on a server equipped with an Intel i7 4770 4-core CPU, 32 GB of RAM, an Intel 750 PCIe NVMe SSD as data disk, and four RAID-striped Samsung 840 Pro SATA SSDs assembled as a temp disk. The server was running a 64-bit Microsoft Windows® operating system and the chosen database engine was Microsoft SQL Server® 2017 Developer Edition.


The initial dimension data consists of 2^20 (1 048 576) unique values for dim_ID, for which one value changes 2^20 times, two values change 2^19 times, four values change 2^18 times, ..., 2^18 values change two times, and 2^19 values change one time, thus representing both slowly and rapidly changing values. This resulted in 21 495 808 dimension rows in total. The times these changes were made, captured by dim_ValidFrom, range from the end of 2017 back to January 2008, with the most rapidly changing value changing every five minutes. The initial fact data consists of 10 rows added every minute over the same time range, but captured by factDate, with each row referencing a randomly selected dim_ID. This random distribution is squarely skewed towards the values in the dimension that change more often. This resulted in 52 427 836 fact rows in total.

These values were chosen in order to achieve query execution times in the range of seconds to minutes, which was considered long enough to not be disturbed by other factors and short enough to make the experiment feasible. They also represent the concept of microbatches, achieving near real-time updated data. For testing the write performance a larger batch had to be constructed to achieve consistent measures during the experiment. The method of loading is through regular SQL statements². For writing to the dimension, all dimProperty values related to the 1 048 576 unique dim_IDs receive an update in the beginning of 2018, and equally many new values are added. One fact row per dim_ID is also added, totalling 2 097 152 new rows. In order to ensure that the tests are as equal as possible with respect to the data, the temporal dimension model was populated first, and the others were populated from it.

Figure 10 begins with a code sample showing the creation of the tables for the temporal dimension. Both the fact and mutable dimension table are partitioned, by the year of factDate and the year of dim_ValidFrom respectively, to liken them with actual implementations. Tables for type 1–7 and junk are created in a similar manner, following the models presented earlier. Partitioning can be and is used for all types where dim_ValidFrom is part of the primary key. All code for reproducing the tests is available online (Rönnbäck 2018) or by request to the corresponding author.

create table Dimension_Immutable (
    dim_ID int not null,
    primary key (dim_ID asc)
);

create table Dimension_Mutable (
    dim_ID int not null,
    dim_ValidFrom smalldatetime not null,
    dimProperty char(42) not null,
    foreign key (dim_ID) references Dimension_Immutable (dim_ID),
    primary key (dim_ID asc, dim_ValidFrom desc)
) on Yearly(dim_ValidFrom);

create table Fact (
    dim_ID int not null,
    factDate smalldatetime not null,
    factMeasure smallmoney not null,
    foreign key (dim_ID) references Dimension_Immutable (dim_ID),
    primary key (dim_ID asc, factDate asc)
) on Yearly(factDate);

create function pitDimension (
    @timepoint smalldatetime
)
returns table as return
select
    dm_in_effect.dim_ID,
    dm_in_effect.dim_ValidFrom,
    dm_in_effect.dimProperty
from (
    select
        *,
        row_number() over (
            partition by dim_ID
            order by dim_ValidFrom desc
        ) as ReversedVersion
    from
        Dimension_Mutable
    where
        dim_ValidFrom <= @timepoint
) dm_in_effect
where
    dm_in_effect.ReversedVersion = 1;

Figure 10: Code sample showing how to create the temporal dimension introduced in this paper, followed by the point-in-time parametrized view over the temporal dimension, through which the rows in effect at the given point in time can be fetched.

² This is sometimes referred to as ELT, as in Extract, Load, then Transform. While Microsoft SQL Server comes with its own streaming data ETL tool, Integration Services, it only supports type 2 out of the box.


4.2. Preparation

To further liken the experimental setup with a real-world scenario, parametrized views were created that encapsulate common operations. Such views are inlined by the database engine, "search and replaced" in the query before it is passed to the optimizer, and should therefore have no performance impact compared to having typed the entire statement to begin with.

The parametrized view in the bottom half of Figure 10 is created using a windowed function that helps finding the information in effect. This is just one of many thinkable ways to find the information in effect at a given point in time, and several methods were evaluated with respect to performance. For each type of slowly changing dimension, the best performing method of those described by (Hutmacher 2016) was picked for the experiment. For models in which the dimStatic column is present, this instead meant using a cross apply combined with an ordered top 1 operation.
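The cross apply variant mentioned here is not shown in the paper; applied to a type 2 dimension keyed by dimStatic, it could be sketched roughly as follows, with the point in time given as a hypothetical literal.

-- Point-in-time lookup for a dimStatic-based dimension: for every static
-- identifier, pick the latest version valid on or before the given timepoint.
declare @timepoint smalldatetime = '20180101';  -- hypothetical point in time

select latest.dim_ID, latest.dim_ValidFrom, latest.dimProperty, statics.dimStatic
from (select distinct dimStatic from Dimension_SCD2) statics
cross apply (
    select top 1 d.dim_ID, d.dim_ValidFrom, d.dimProperty
    from Dimension_SCD2 d
    where d.dimStatic = statics.dimStatic
      and d.dim_ValidFrom <= @timepoint
    order by d.dim_ValidFrom desc
) latest;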

While the point-in-time parametrized views provide an efficient way to find information in effect at a specific point in time, as for tiy and yit, they proved to be a poor choice for toy queries. Trying to cross apply such a view for every factDate yielded a query execution estimate in which billions of operations would take place. An attempt to run such a query was halted after four hours, during which numerous drawings saw the dawn of light on a whiteboard. Among these was one in which two timelines were drawn close together, from which the idea of a twine came.

The parametrized view in Figure 11 shows how two timelines are intertwined in order to create a lookup table between the fact and dimension keys for the information in effect. After twining the timelines, a conditional cumulative max-operation is used with a window to find the latest value from one of the timelines for each value in the other timeline over each dim_ID. Using a twine, the "neverending" query now took seconds to run. As it is a simple trick, it is bound to have been discovered by practitioners, but to the best knowledge of the authors it is neither scientifically documented nor practically discussed in conjunction with slowly changing dimensions. As such, it should be of significant interest because of its tremendous performance boost. The authors have already had uses for twines in several situations outside of the dimensional modeling domain.

create function twineFact (
    @fromTimepoint smalldatetime,
    @toTimepoint smalldatetime
)
returns table as return
select
    in_effect.dim_ID,
    in_effect.factDate,
    in_effect.dim_ValidFrom
from (
    select
        twine.dim_ID,
        twine.Timepoint as factDate,
        twine.Timeline,
        max(case when Timeline = 'D' then Timepoint end) over (
            partition by dim_ID
            order by Timepoint
        ) as dim_ValidFrom
    from (
        select
            dim_ID,
            factDate as Timepoint,
            'F' as Timeline
        from
            dbo.Fact
        where
            factDate between @fromTimepoint and @toTimepoint
        union all
        select
            dim_ID,
            dim_ValidFrom as Timepoint,
            'D' as Timeline
        from
            dbo.Dimension_Mutable
        where
            dim_ValidFrom <= @toTimepoint
    ) twine
) in_effect
where
    in_effect.Timeline = 'F';

Figure 11: Code sample showing how to create the twine parametrized view, intertwining the timelines of the fact and dimension tables between the given time points by using a conditional cumulative max-operation, windowed to find the value in effect at each point of the timeline and for each dim_ID. It yields a significant performance gain for toy type queries, compared to other known methods.


4.3. Approach

With these views in place, the actual test was set up. For each type of query (reading tiy, yit, and toy, and writing to the dimension and fact tables), a loop per dimension type was set up. This loop was run four times when measuring read times, with the first iteration being a "dry run", after which all column statistics are updated. This is because the optimizer will automatically create statistics on the first run where it deems it suitable, slightly slowing down the query and thereby disturbing the test results. Furthermore, as this is done during execution, the statistics are only based on a sample in order to not affect the performance too adversely. The explicit update of the statistics is instead based on full table scans. When measuring write times, three runs without a "dry run" were used.

The loop for queries that read data is structured as follows. The first part clears all caches possible from within SQL Server, the second sets a starting time with nanosecond precision, the actual query is silently run with its results redirected to a temporary table, after which an ending time is set. Similarly, the loop for queries that write data is structured as follows. The first part clears all caches, a transaction is opened, a dimension starting time is set, the query writing data to the dimension table is run, an ending time is set, a fact starting time is set, the query writing data to the fact table is run, an ending time is set, after which the queries are rolled back. The rollback is necessary in order for each run to be equal, and its time is not taken into account.
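One read-measurement iteration of the loop described above might be sketched as follows; the cache-clearing commands and timing functions are standard SQL Server features, while the results table and the example query are stand-ins for the actual test harness.

-- Sketch of one read iteration: clear caches, time the query, and redirect
-- its result set into a temporary table so that nothing is returned to the client.
checkpoint;
dbcc dropcleanbuffers;  -- flush the buffer pool
dbcc freeproccache;     -- flush cached query plans

declare @start datetime2(7) = sysdatetime();

select in_effect.dimProperty, count(*) as numberOfFacts
into #result
from Fact f
join pitDimension('20180101') in_effect
  on in_effect.dim_ID = f.dim_ID
group by in_effect.dimProperty;

declare @stop datetime2(7) = sysdatetime();

insert into TestTimings (elapsed_ms)  -- hypothetical table storing one measurement per iteration
values (datediff(millisecond, @start, @stop));

drop table #result;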

For both reading and writing, the difference in milliseconds between the starting and the ending time is stored for each iteration in a separate table. From these the median value is selected. A margin of error was also calculated and monitored. Should it have been large, indicating that something disturbed the testing, the tests would have been run again, but this never happened. Most of the tests were nevertheless run many times, between which refinements were made. Values for dimProperty and factMeasure were chosen in a way that verifications could be made to ensure that the results produced from the different queries were identical.

select
    in_effect.dimProperty,
    f.numberOfFacts,
    f.avgMeasure
into
    #result_tiy
from (
    select
        dim_ID,
        count(*) as numberOfFacts,
        avg(factMeasure) as avgMeasure
    from
        Fact
    where
        factDate between '20140101' and '20141231'
    group by
        dim_ID
) f
join
    pitDimension('20180101') in_effect
on
    in_effect.dim_ID = f.dim_ID;

select
    dm.dimProperty,
    count(*) as numberOfFacts,
    avg(f.factMeasure) as avgMeasure
into
    #result_toy
from
    twineFact('20140101', '20141231') in_effect
join
    Fact f
on
    f.dim_ID = in_effect.dim_ID
and
    f.factDate = in_effect.factDate
join
    Dimension_Mutable dm
on
    dm.dim_ID = in_effect.dim_ID
and
    dm.dim_ValidFrom = in_effect.dim_ValidFrom
group by
    dm.dimProperty;

Figure 12: Code sample showing the performance test query for tiy and yit, joining the point-in-time view for the temporal dimension (but with '20140101' passed to the view for yit), followed by the test query for toy, using a twine to find the version in effect.

An example query, for tiy, is shown in Figure 12 using the point-in-time view from the temporal dimension. As seen, the query calculates the number of fact rows and the average of factMeasure per dimProperty. For the particular query, this resulted in 1 014 754 rows, which is slightly less than the unique number of dim_IDs, as the randomization did not exhaust all possible references between facts and dimension values. For some types, in which fact rows are already joined to the current version, tiy can be found through a direct join of the fact and dimension table. The same calculation is also used for yit and toy and for all types of dimensions, which for yit is as simple as passing a different date to the point-in-time view, but for toy may involve a twine, depending on the type.

4.4. Results

To summarize the results, it clearly comes across that the different types of dimensions are geared towards specific use cases. These preferences fall into four categories: fast retrieval of the current version, the previous version, the historically correct version, or being temporally agnostic. In the first three, fact rows are temporally bound to a particular version in the dimension, whereas for the agnostic preference, fact rows and dimension rows are temporally independent. Temporal independence has faster write performance at the cost of slower read performance.

4.5. General Performance

As can be seen in Table 1 there is no clear winner if taking both reading and writing into account. Favoring the current version are types 1, 3, 4, 5, and 7, favoring the previous is only type 3, and favoring the historically correct are types 2, 5, and 7. Favoring no particular version are temporal, optional, type 6, and junk. If only current values are of interest, type 1 is the best choice, and if only very few versions suffice, type 3 may be an option. If ad-hoc querying over versions is necessary, requirements have to be weighed against performance considerations. Whilst those types favoring a particular version perform particularly well when retrieving that version, they suffer at write time because of additional logic to include the required temporal bond.

If, for example, in a read-intensive environment historically correct versions are necessary in most use cases and only rarely the version in effect at an arbitrary point in time, then type 2 is the best choice, given that write performance can be sacrificed. Loading the type 2 dimension table is 50% slower than for temporal and six times slower than the update done by type 1. Loading the fact table is seven times slower than for temporal and type 1, due to a twine being necessary, which cannot be done as efficiently since finding the version in effect relies on dimStatic.

If both historically correct and current versions are common in use cases, but still rarely the version in effect at an arbitrary point in time, then type 5 or type 7 could be a choice, but only after carefully considering the additional sacrifice in write performance. Writing to the dimension table of type 5 is seventeen times slower than temporal and writing to the fact table of type 7 is twenty-four times slower than temporal, due to the updates that both find the version in effect and then set it for each row. Maintaining a type 5 or type 7 dimension will over time become an insurmountable task.

If both current versions and the version in effect at an arbitrary point in time are common in use cases, then our temporal and optional, as well as type 6 and junk, are good candidates. While junk has very similar read performance to temporal, the time it takes to load its dimension tables is doubled in comparison. For type 6, showing the best write performance of all types that retain a complete history of changes, the difference seen is explained by it having sacrificed referential integrity. However, as referential integrity must be guaranteed at some point, type 6 makes this "somebody else's problem", where it nevertheless is bound to hog some resources.

        Temporal  Optional  Type 1  Type 2  Type 3  Type 4  Type 5  Type 6  Type 7   Junk
tiy         10.4      10.2     0.8     9.1     0.7     0.7     3.6     9.9     2.8    6.6
yit          6.4       6.7       —     9.1     0.7    60.2     9.1     6.3     9.1    4.1
toy         53.5      23.7       —     3.2       —    53.7     3.0    53.2     2.9   53.8
Dim         14.2      14.0     3.3    23.9     4.2     7.0   271.2     7.2    31.8   30.1
Fact        10.8      10.9    10.8    74.2    10.9    10.9    74.9     7.2   254.2   10.8

Table 1: Median execution times (in seconds) for queries reading (tiy, yit, toy) and writing (Dim, Fact) for the introduced temporal (and optional) dimension and each type of slowly changing dimension. The "skyline" presents the total execution time for all queries and type of query.

4.6. Suitability for Near Real-Time

To determine the suitability of each type of dimension in a microbatched near real-time data warehouse, a hypothetical data warehouse system is envisioned. This hypothetical system can either perform reading or writing, with one blocking the other. Simplistic, but not entirely unlike a real system. Assuming such a system is approaching real-time, there are four possible scenarios: write often and read often, write often and read seldom, write seldom and read often, and write seldom and read seldom. Looking at the last of these, it is easy to be misled into thinking that such a scenario never coincides with the requirements of real-time decision support. This is not the case, as the determining characteristic of real-time is latency (Bateni et al. 2011). Even in a write seldom and read seldom scenario, the rare read may take place arbitrarily close to the rare write, and it may be of utmost importance that the data returned is as fresh as possible.

In order to reduce latency in the hypothetical system, the write needs to block as few reads as possible, and likewise the other way around. Type 5 and 7 are very imbalanced in this respect, also having the two poorest write performances, and cannot be recommended for near real-time requirements. The imbalance is also quite large for type 2, ranking in the bottom three for write performance together with types 5 and 7. Type 4 has poor read performance and types 1 and 3 may be disqualified for not maintaining a complete history of changes, leaving temporal, optional, type 6, and junk as the candidates.

We deliberately chose to ignore a potential issue when setting up the experiment. When rows are temporally bound, this depends on the assumption that data has been arriving synchronously over time, such that all existing fact rows always reference the correct version. In practice, this does not always happen, and retrospective modifications have to be made, adversely affecting the total maintenance time for types 1, 2, 3, 4, 5, and 7. As there is no way to estimate the impact, depending on how large the error is, this may range from a quick fix to lengthy outages, unacceptable in a near real-time environment. This again leaves temporal, optional, type 6, and junk as the candidates.

4.7. Simplicity

Looking at the different models, junk is the most complex. Type 4 adds complexity in the form of a separate history table, and types 2, 5, and 7 rely on some other column being static. Type 3 could lead to unwieldy tables due to the large number of columns. Type 6 is the second best with respect to model simplicity, following type 1, which naturally has the simplest model. The temporal dimension is very similar to type 6, adding an immutable table between the fact and dimension table, still keeping it on the simpler side. It also makes no assumptions, allowing for every part of the dimension to change. Some lineage must exist if never before seen data is about to be loaded, though, making it possible to determine if it represents a change or an addition.

Querying becomes slightly more complex for the temporal dimension than in the temporally dependent types, but is helped by "hiding" the logic behind the point-in-time and twine parametrized views. Every type needed a point-in-time view for yit, except type 3, but under the presumption that keeping a single old version is enough. The types 4, 6, junk, and temporal all need the twine for toy. The "valid from" column in optional adds complexity, to the benefit of not having to twine. This again, though, assumes that once that time point is set it is correct and will never have to be changed. Given that reading times for toy were only halved, this may not be advantageous enough to warrant its introduction.

5. Related Research

A lot of research has been done concerning performance of non-temporal dimensional models, much of which uses the Star Schema Benchmark, a modification of TPC-H (Rabl et al. 2013; O'Neil et al. 2009). More recent research tends to use the newer TPC-DS benchmark model (Nambiar and Poess 2006), but only one of its dimensions has temporal characteristics, in the form of type 5. Given the state of established benchmarks, having forgotten about temporal requirements, the decision was made to produce our own test data.

A brief comparison of a subset of the types of slowly changing dimensions has been done by (V. Santos and Belo 2011), where, using only logical reasoning, they come to the conclusion that type 4 is superior to all other types. Given our findings, that seems like a poor conclusion. No previous work contrasting all types of dimensions could be found.

(Johnston 2014) discusses the need for new thinking when it comes to slowly changing dimensions and suggests a bitemporal dimension, albeit not readily implementable in current database technologies. His dimension is type 2-like as it relies on a "durable identifier" apart from the unique surrogate primary key, and therefore cannot be said to be completely temporally independent.

Big Data platforms, like Hadoop, rely on immutability of existing data. (Akinapelli, Shetye, and Sangeeta 2017) points out that slowly changing dimensions are particularly hard to manage in this respect. Thanks to the temporal decoupling between facts and dimensions when using our temporal dimension, databases can be insert only, which may open up for easier implementations on such platforms.

6. Conclusions and Further Research

Most, if not all, analytical tools have support for slowly changing dimensions, but to varying degrees with respect to different types. To name a few: Tableau, MicroStrategy, Alteryx, SAP, Cognos, SAS, Analysis Services, and Power BI. Many database vendors also have guides for how to implement them, like Microsoft, Oracle, Vertica, Hive, SAP HANA, PostgreSQL, and IBM DB2, to name a few. ETL tools have special adapters for loading data into slowly changing dimensions, like Informatica, Microsoft Integration Services, Oracle Data Integrator, Pentaho, IBM DataStage, Apache Kafka, and others. This, if nothing else, should indicate that slowly changing dimensions are a hot topic, but as it stands, with little scientific grounding. We aimed to put them on better footing by challenging traditional types with a type of our own. There may still be situations where the traditional types may be of preference, but the introduced temporal dimension fares so well that we suggest it be the default choice when a history of changes is required. This suggestion is further accentuated when a data warehouse approaches near real-time updates. To summarize its advantages:

• It is temporally independent from the fact table, such that it may be refreshed asynchronously and favors no particular version in time.

• It has a simple model that extends upon already familiar types and it does not rely upon any part of the dimension being static.

• It is free from assumptions about the future and it may even hold future versions that will come into effect at some later time.

• It is modular, and parts that change at different rates can be broken apart while keeping the same immutable connection to the fact.


• It maintains referential integrity in the database, and when changes occur only inserts are used to capture these, never updates.

• Its performance is close to or better than all other types with similar capabilities, when both reading and writing are weighed in.

Along the way, we also discovered the twine: a new strategy for finding the version in effect for each fact by the times these facts were recorded. The twine is essentially a union of timelines that is sorted, with an additional column containing a conditionally repeated value. It performs much better than any approach trying to produce a temporal join through some inequality condition. Twining may help many existing implementations gain better performance, as it can be used in any scenario where a temporal join is required.

There are many venues for further research. It would be interesting to utilize temporal tables, added in the ANSI SQL:2011 standard (Kulkarni and Michels 2012), to see if the results would be the same. We used SQL to drive the loading of the dimensions, so an investigation into how streaming data ETL tools compare is warranted. Running the same data set and queries in other database engines could give insights into possible internal optimizations that may have affected our experiment. Extending the idea of our temporal dimension to become bitemporal should be straightforward as it is already temporally decoupled from the fact table. Finding the threshold for when changes are rapid enough, compared to slower changing parts, that breaking the mutable dimension apart would yield even better performance is also of interest. Skewing the data differently in our experiment may also produce different results. Combining the idea of temporal dimensions with slowly changing measures (Goller and Berger 2013) is also highly relevant.
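As a pointer for that line of research, the mutable part of the dimension could be turned into a system-versioned temporal table in SQL Server 2016 and later, which implements the SQL:2011 temporal features; a minimal sketch, not something evaluated in the paper:

-- Sketch: letting the database engine keep the version history of the mutable part.
create table Dimension_Mutable_SV (
    dim_ID int not null primary key,
    dimProperty char(42) not null,
    ValidFrom datetime2 generated always as row start not null,
    ValidTo datetime2 generated always as row end not null,
    period for system_time (ValidFrom, ValidTo)
)
with (system_versioning = on (history_table = dbo.Dimension_Mutable_SV_History));

-- Rows in effect at a given point in time can then be fetched with, for example:
-- select * from Dimension_Mutable_SV for system_time as of '20180101';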

References

Ahmed, Waqas, Esteban Zimányi, and Robert Wrembel (2015). "Temporal Data Warehouses: Logical Models and Querying." In: EDA, pp. 33–48.

Akinapelli, Sandeep, Ravi Shetye, and T Sangeeta (2017). "Herding the elephants: Workload-level optimization strategies for Hadoop." In: EDBT, pp. 699–710.

Araque, Francisco (2003). "Real-time Data Warehousing with Temporal Requirements." In: CAiSE Workshops, pp. 293–304.

Azvine, Behnam, Zheng Cui, and D. D. Nauck (2005). "Towards real-time business intelligence". In: BT Technology Journal 23.3, pp. 214–225.

Bateni, MohammadHossein et al. (2011). "Scheduling to minimize staleness and stretch in real-time data warehouses". In: Theory of Computing Systems 49.4, pp. 757–780.

Bliujute, Rasa et al. (1998). Systematic change management in dimensional data warehousing. Tech. rep. Citeseer.

Body, Mathurin et al. (2002). "A multidimensional and multiversion structure for OLAP applications". In: Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP. ACM, pp. 1–6.

Golec, Darko, Viljan Mahnič, and Tatjana Kovač (2017). "Relational model of temporal data based on 6th normal form". In: Tehnički vjesnik 24.5, pp. 1479–1489.

Golfarelli, Matteo and Stefano Rizzi (2009). "A survey on temporal data warehousing". In: International Journal of Data Warehousing and Mining (IJDWM) 5.1, pp. 1–17.

Goller, Mathias and Stefan Berger (2013). "Slowly changing measures". In: Proceedings of the sixteenth international workshop on Data warehousing and OLAP. ACM, pp. 47–54.

Hultgren, Hans (2012). Modeling the agile data warehouse with data vault. New Hamilton.

Hutmacher, Daniel (2016). Last row per group. URL: https://sqlsunday.com/2016/04/11/last-row-per-group/ (visited on 07/11/2018).

Inmon, William H (1992). Building the data warehouse.

Johnston, Tom (2014). Bitemporal data: theory and practice. Newnes.

Jörg, Thomas and Stefan Dessloch (2009). "Near real-time data warehousing using state-of-the-art ETL tools". In: International Workshop on Business Intelligence for the Real-Time Enterprise. Springer, pp. 100–117.

Kimball, Ralph (1996). The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. New York, NY, USA: John Wiley & Sons, Inc. ISBN: 0-471-15337-0.

Kimball, Ralph (2008). "Slowly changing dimensions". In: Information Management 18.9, p. 29.

Kulkarni, Krishna and Jan-Eike Michels (2012). "Temporal features in SQL:2011". In: ACM SIGMOD Record 41.3, pp. 34–43.

Nambiar, Raghunath Othayoth and Meikel Poess (2006). "The making of TPC-DS". In: Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment, pp. 1049–1058.

O'Neil, Patrick et al. (2009). "The star schema benchmark and augmented fact table indexing". In: Technology Conference on Performance Evaluation and Benchmarking. Springer, pp. 237–252.

Paulley, Glenn Norman (2000). "Exploiting functional dependence in query optimization".

Rabl, Tilmann et al. (2013). "Variations of the star schema benchmark to test the effects of data skew on query performance". In: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering. ACM, pp. 361–372.

Rönnbäck, Lars (2018). Temporal Dimensional Modeling Performance Suite. URL: https://github.com/Roenbaeck/tempodim/ (visited on 02/01/2019).

Rönnbäck, Lars et al. (2010). "Anchor modeling – Agile information modeling in evolving data environments". In: Data & Knowledge Engineering 69.12, pp. 1229–1253.

Ross, Margy (2013). "Design Tip #152: Slowly Changing Dimension Types 0, 4, 5, 6 and 7". In: Kimball Group.

Russom, Philip, David Stodder, and Fern Halper (2014). "Real-time data, BI, and analytics". In: Accelerating Business to Leverage Customer Relations, Competitiveness, and Insights. TDWI best practices report, fourth quarter, pp. 5–25.

Santos, Ricardo Jorge and Jorge Bernardino (2008). "Real-time data warehouse loading methodology". In: Proceedings of the 2008 international symposium on Database engineering & applications. ACM, pp. 49–58.

Santos, Vasco and Orlando Belo (2011). "No need to type slowly changing dimensions". In: IADIS International Conference Information Systems, pp. 11–13.

Slivinskas, G et al. (1998). "Systematic Change Management in Dimensional Data Warehousing".

Vassiliadis, Panos and Alkis Simitsis (2009). "Near real time ETL". In: New Trends in Data Warehousing and Data Analysis. Springer, pp. 1–31.

Wikipedia (2019). Slowly Changing Dimension. URL: https://en.wikipedia.org/wiki/Slowly_changing_dimension (visited on 02/01/2019).
