open data warehouse building a data warehouse with pentaho eng... · 2019. 2. 26. · itnoum best...

24
Open Data Warehouse Building a Data Warehouse with Pentaho Best Practice it-novum.com

Upload: others

Post on 30-Dec-2020

8 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

Open Data WarehouseBuilding a Data Warehouse with Pentaho

Best Practice

it-novum.com

Page 2: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

1. Objectivesandbenefits 3

2. Management Summary 4

3. Buildingadatawarehousesystem 5

4. Datasources 6

5. Datacapture 7

5.1 DesigningtheETLprocess 8

5.1.1 BestpracticeapproachtoworkingwiththeLookupandJoinoperations 8

5.2 Practicalimplementation 11

5.3 Datastorage&management 17

5.3.1 Performanceoptimization 18

5.4 Dataanalysis 20

5.5 Datapresentation 21

6.Conclusionandoutlook 22

7. Appendix 23

Contents

2

Page 3: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

1. Objectives and benefitsThisbestpracticeguideshowshowapowerful,high-performancedatawarehousesystemcan

bebuiltwithopensourceproducts.Thisbestpracticeguideisofinteresttoyouifyou

и arethinkingaboutintroducingadatawarehousesystemintoacompanyandarelookingfor

anappropriatesolution

и aremoreinterestedinsoftwareadaptationthaninvestingyourmoneyinsoftwarelicensing

и wantanopenandcustomizablesolutionthatcanquicklybeputinplaceandwillgrowwith

yourneeds

и alreadyhaveopensourcesolutionsinmind,butareunsurehowtoapproachsuchaproject

и havealimitedprojectbudget

Youwillbenefitfromreadingthisbestpracticeguide,becauseit

и clearlydemonstratesanddescribesanexamplearchitectureforanopensource-baseddata

warehouse

и givesyoutipsfordesigningandimplementingyoursystem

и providesaschemathatyoucanmodifytosuityourownindividualneeds

и containsrecommendationsforprovensoftwareproducts

3

Page 4: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

2. Management SummaryCompaniesandtheenvironmentsinwhichtheyexistareproducingeverfasterandevergreater

amountsofdata.Theresultingdatacontainsanenormouseconomicpotential–butonlywhen

ithasbeenproperlyprocessedandanalyzed.Adatawarehousesolutionprovidestheidealbasis

forthisprocessingandanalysis.Settingupanddevelopingadatawarehouse,however,often

entailscostlylicensingandhardwareissues.Thisdetersmainlysmallandmedium-sizedcom-

paniesfromundertakingsuchaproject,eventhoughcost-effectiveopensourcesolutionshave

longbeenavailableinthebusinessintelligencesector.Arethesesolutionssuitable,however,

fordevelopingahigh-performancedatawarehousesystemoristherereallynoalternativeto

commercialsoftwareproductswhenitcomestoperformanceandfunctionality?

Thisbestpracticeguideaddresseswhetherandhowapowerfuldatawarehousecanbebuilt

usingopensourcesystems.

Forthispurpose,wewillbedesigningandbuildingaprototypicalexamplearchitectureusing

thePentahoandInfobrightsolutions.Wehaveimplementedthisarchitecture–withitsessential

features–inanumberofprojects,manyofwhichhavebeeninsuccessfuluseforyears.

Foreaseofunderstanding,thisbestpracticeguidehasbeenorganizedinthefollowingmanner:

и Generalconstructionofadatawarehouse

и Connectingdatasources

и DesigningtheETLprocess

и Datastorage&management

и Dataanalysis

и Datapresentation

и Conclusions

Intheprocess,allpertinentpointsassociatedwiththedesignandimplementationofadata

warehousewillbecovered.

4

Page 5: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

3. Building a data warehouse systemAdatawarehousesystemconsistsoffivelevels:Datasources,dataacquisition,datamanage-

ment,dataanalysisanddatapresentation.Theindividuallayersarerealizedbymeansofvari-

ousconceptsandtoolsandtheirrespectiveimplementationsarebasedontherequirementsof

thegivencompany.

Thedatawarehousesystemusedforourprototypeisstructuredasshowninthefigurebelow.

Allcomponentsareopensource.TheMicrosoftdatabaseAdventureWorkswillserveasthedata

source.TheETLprocessatthedataacquisitionlevelwillberealizedwithPentahoDataIntegra-

tion.ThistoolispartoftheopensourcesolutionPentahoBusinessAnalyticsSuite,which,inour

example,willbeusedfordataanalysisandpresentation.

Theactualdatawarehouseusedwithinthedatamanagementsystemwillbetheanalytical

databasemanagementsystem,Infobright.PentahoMondrianwillbeusedastheOLAPserver.

TherequiredXMLschemawillbegeneratedwithPentahoSchemaWorkbench.Oncethese

componentshavebeenimplemented,thedataatthedatapresentationlevelcanbepreparedin

theformofanalyses,dashboards,andreportsusingavarietyofPentahotools.Inthefollowing

chapters,theindividuallayersandtheirassociatedtoolswillbedescribedindetail.

Structureofthesamplearchitecture

5

Page 6: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

4. Data sourcesInourprototypesystem,wewillbeusingAdventureWorks2008R2(AWR2)asthebasisforour

sampledata.ThisOLTPdatabasefromMicrosoftcanbedownloadedasafreebackupfilehere:

http://msftdbprodsamples.codeplex.com/Thedatarepresentsafictitious,multinationalcompa-

nyintheproductionsectorthatspecializesinthemanufactureandsaleofbicyclesandrelated

accessories.Thedatamodelconsistsofmorethan70tablesdividedintofivecategories:Human

Resources,Person,Production,Purchasing,andSales.

Thedatamodelissimilartothedatabasestructuresofrealcompaniesintermsofscenariosand

complexityofdatabasestructureandis,therefore,wellsuitedfordemonstrationandtesting

purposes.Thebackupfile(.bak)providedcanbeeasilyintegratedintoMicrosoftSQLServer.To

dothis,openSQLServerManagementStudioand,clickingonthedatabasesymbol,openthe

restoredatabaseinterface(seeillustration).ThenewdatabasewillthenappearintheObject

Explorer.

MicrosoftSQLServerisnot,ofcourse,opensourcesoftware.Wehave,however,employeditfor

thisexamplesothatwecanusethetestdatasuppliedbyMicrosoft.Alternatively,thebackup

canbeconvertedintoCSVfiles.Thesefilescanthenbeloadedintoanyopensourcedatabase

viatheBulkLoadoption.

Importthebackupfile

6

Page 7: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

5. Data captureData acquisition is concerned with the extraction, transformation and loading (ETL) of operational data into the

data warehouse. The goal is to achieve a uniform, high-quality base of data in the data warehouse that can be ana-

lyzed as needed. Planning the data model is also one of the tasks associated with data capture.

The data model then forms the basis for the later transformation of the data. Certain fundamental decisions have to

be made concerning the business process to be analyzed, the level of detail for the analysis as well as any possible

dimensions and values.

Wewillbeevaluatingsalesdataaspartofthepresentprototype.Nochangeswillbemadeinthe

levelofdetailforthedata,soallhierarchylevelswillberetained.Overall,fivedimensionsareto

becreatedwithrespecttotheOLTPdatabase.Thesearedate,customer,product,salesterri-

tory,andcurrency.Asforthemetrics,itshouldbepossibletoanalyzequantity,shippingcosts,

discountsgranted,taxes,unitcostsandrevenue.Thedatawillbepresentedintheformofastar

schema,becausethisisthebestmodelingvariantforachievingourgoals(seefigure).Itisalso

possible,however,touseasnowflakeorflatschema.

StarschemabasedontheAWR2database

7

Page 8: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

DatabaseLookup

5.1 Designing the ETL processFortheimplementationoftheETLprocess,wewillusetheopensourcetoolPentahoData

Integration(PDI).Thistoolprovidespredefinedstepsforextracting,transforming,andloading

data.Thesecanbecreatedviaagraphicaldrag-and-dropinterfacecalledSpoon.UsingSpoon,

professionalETLroutinesintheformoftransformationsandparentjobscanbebuiltwithvery

littleeffort.Thus,theamountofmanualprogrammingrequiredisverysmall,makingthistool

idealforthenoviceuser.

5.1.1 Best practice approach to working with the Lookup and Join operations

ThedimensionsforthestarschemarequireinformationfromdifferenttablesintheOLTPda-

tabase.Tobeabletocompareandcombinethedata,PentahoDataIntegrationoffersLookup

andJoinoperations.Eachstepcanbechangedoraltered,soitisimportanttheyarehandled

correctly.Applyingtheminthewrongcontextcanleadtoperformanceproblems.Since,asa

rule,manytablesmustoftenbecombinedwithintheETLprocess,choosingtherightstepswill

significantlyincreaseprocessingspeed.Wewill,therefore,explainthemostimportantsteps

below.

Database LookupDatabaseLookupenablesthecomparisonofdatafromdifferenttablesinadatabase,whichal-

lowsfortheintegrationofnewattributesintotheexistingdatabasetable.DatabaseLookupwas

designedforsmalltomedium-sizeddatasets.Particularattentionneedstobepaidtohowthe

cacheisconfiguredhere,becauseusinganactivecachecansignificantlyshortentheprocessing

time.Afurtherincreaseinperformancecanalsobeachievedbyselectingthecheckbox„Load

AllDataFromTable“.Particularlywithlargerdatasets,cachingcan,however,overloadtheJava

stackspacecausingthetransformationtocollapse.

8

Page 9: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

Stream LookupStreamLookupissimilartoDatabaseLookupwithactivecaching,butthedatabeingcompared

canbeextracteddirectlyfromthestream.Duetocontinuouscaching,StreamLookupisvery

resourceintensiveandshouldonlybeusedwithsmallerdatasetsthatcannotbeloadeddirectly

fromadatabase.Byselectingthecheckbox„PreserveMemory(CostsCPU)“,itispossibleto

reducethememoryfootprint,butattheexpenseoftheCPU.Whenthisfunctionisactivated,

theloadeddataisencodedduringsorting(hashing).Adisadvantageofthis,however,islonger

running times.

Merge JoinMergeJoincarriesoutatraditionaljoinoperationwithinthedataintegrationcomponent.A

distinctionhastobemadeherebetweenInner,FullOuter,LeftOuterandRightOuterJoins.

MergeJoinisparticularlysuitableforlargedatasetswithhighcardinality,thatis,manydifferent

attributevalueswithinacolumn.Fromthebeginning,thedatamustbesortedaccordingtothe

attributesbeingcompared.Wheneverpossible,thisshouldbeperformeddirectlyonthedataba-

sevia„OrderBy“.Alternatively,thiscanalsobedoneviaSortRows,however,thiswillimpacton

performance.

StreamLookup

StreamLookup

9

Page 10: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

Database JoinUsingDatabaseJoinitisalsopossibletolinkdatabasetablesandcreateandexecutenative

queries.Theparametersusedintheprevioustransformationcanbeintegratedintothenative

query.Aquestionmarkisusedasaplaceholderandisreplacedbythecorrespondingparameter

whenthequeryisrun.Whenintegratingmultipleparameters,itiscrucialthatthesebeplacedin

theircorrectorderinthe„Theparameterstouse“area.

Dimension Lookup / UpdateTheDimensionLookup/UpdatestepcombinesthefunctionalityofInsert/Updatewiththat

ofDatabaseLookup.Itispossibletoswitchbetweenbothfunctionsbyselectingordeselecting

thecheckbox„UpdateTheDimension“.Inaddition,thisstepallowsfortheefficientintegration

oftheSlowlyCachingdimension.TheLookup/Updatedimensionwasdevelopedforspecial

applicationscenariosandis,therefore,veryprocessorintensive.Wementionithereonlyinthe

interestsofcompleteness.

DatabaseJoin

DimensionLookup/Update

10

Page 11: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

Performance TestTohighlightthedifferencesinperformance,wehaveexaminedthevariousstepsinconnection

withtwodatasetsthatcorrespondtotheexpectedamountsofdatainourdemo.

Asdemonstratedinthetable,DatabaseJoinhighlightsmajorperformanceproblemswhen

increasingdatavolumesareinvolved.Fortheremainingsteps,however,thereishardlyany

increaseinprocessingtimedespitethesignificantlylargeramountsofdata.

Operation 40.000 Data Sets 152.500 Data Sets

DatabaseLookup 0,9sec. 1,5sec.

StreamLookup 1,6sec. 2,3sec.

MergeJoin 1,3sec. 2,6sec.

DatabaseJoin 8,2sec. 3min.

5.2 Practical implementationTheETLprocessesconformtothespecificationslaiddownforamulti-dimensionaldatamodel.

Atransformationiscreatedperdimensionorfacttable.Thisprovidesabetteroverviewand

errorscanbemoreeasilylocated.Inaddition,theriskofoverloadingtheserverisalsoreduced.

Thevarioustransformationsarecombinedandcontrolledviaacentraljob:

TransformationsofthePerformanceTest

PerformanceTestResults

11

Page 12: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

Beforetheactualtransformationprocessesforpreparingandpopulatingthedatawarehouse

canbegin,thedatabaseconnectionstothedatasourceandthedatawarehousemustfirstbe

defined.NewdatabaseconnectionscanbecreatedinPentahoDataIntegrationunderthetab

View>DatabaseConnections.Theselectionofthecorrectdatabaseanddriveraswellasuser-

specificsettings,suchasdatabasenameoruser,isdecisive.

ItisimportanttoensurethattheJDBCdriversforbothdatabasesareinthelibdirectoryofthe

DataIntegrationTool.Bydefault,MySQLcanbeaccessedviaport1433andtheInfobrightdata-

baseviaport5029.

ParentJob

Databaseconnectiondetails

12

Page 13: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

DimCurrencyThedimensionDimCurrencyisgeneratedbasedonthetableSales.Currency.Onlythenameand

theabbreviationfortherespectivecurrencyareapplied.Whenthishasbeendone,anidentifica-

tionnumber(CurrencyID)isgeneratedfortheindividualtuples,whichactsastheprimarykey.

Then,usingthestepTableDimCurrency,thetableiscreatedandpopulated.

DimCustomerThesecondtransformation,DimCustomer.ktralsodoesnotrequireajoinoperation.Thedi-

mensionisbasedonthetableSales.vIndividualCustomer.Thisonlyrequiresslightchangeswith

regardtothedimensiontableDimCustomer,whichneedstobecreated.Thefirstandlastname

willbecombinedinacommondatafieldforlateranalysis.Theprimarykeynameismodifiedto

suitthedesignationwithinthemulti-dimensionaldatamodel.

DimCurrency.ktr

DimCustomer.ktr

13

Page 14: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

DimPersonThetransformationDimPerson.ktrisbasedonthetablePerson.Person.Thisisexpandedto

includePerson.PersonPhoneandPerson.EmailAdressusingDatabaseLookup.Inbothcases,

BusinessEntityIDisusedforcomparisonasitservesastheprimarykeyforallthreetables.

DimProductThedimensionDimProductisgeneratedfromthreetablesoftheOLTPdatabase.Thetransfor-

mationstartswiththetableProduction.Product.Asafirststep,wewillincludetheattributes

NameandProductCategoryIDfromthetableProduction.ProductSubcategory.Thesecondof

theseattributeswillserveasthekeyforthesecondLookupoperationperformedonthetable

Production.ProductCategory.Thenextstep,SelectProduct,isusedtocleanupandrename

datafieldsbeforepopulatingthetablewithdata.Whilethisisbeingdone,anSQLscriptisrun

inparalleltocreateanadditionalproductdatasetnamedFreight.Thiswillenableustofilteron

freightcostsforindividualshipmentsduringourlateranalyses.

DimPerson.ktr

DimProduct.ktr

14

Page 15: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

DimSalesTerritoryThe transformation DimSalesTerritory.ktr essentially corresponds to the previous proces-ses. It is based on the table StateProvince. During the Lookup operation, the attributes Name and Group from the tables CountryRegion and SalesTerritory are added to this table. After this has been successfully completed, the dimension is linked to the table Address using a Merge Join. Prior to this, the address lines within the step Contact Fields are combined, which will prove useful for later analysis. Merge Join is realized as a Right Outer Join, ensuring the address information is only updated for the dimension‘s data sets. In this transformation, Merge Join is, therefore, the most efficient solution.

DimDateThe dimension DimDate cannot be extracted from the OLTP database, so it must be created manually. The basis of this transformation is a CSV document created using Excel. The file contains a unique identification number, the specific date, the corresponding day of the week as well as day, month and year – all separated. The dimension also covers the period from 2003 to 2013. The corresponding CSV file can be integrated into Pentaho Data Integration using the step CSV Input File. The data can then be processed and / or loaded into the target database (as depicted in the figure).

DimSalesTerritory.ktr

DimDate.ktr

15

Page 16: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

FactSalesThisusesthetableSales.SalesOrderHeader.Italreadycontainsmanyimportantkeysandvalues

suchasFreightandTaxAmt.Theconnectionsare,however,rarelydesignedtoworkquickly.This

iswhyOrderDateandShipDatearereplacedbyDimDateinthefirstsection,whichwasgenera-

tedwithinthedimension.Afterthis,theattributeToCurrencyCodefromthetableSales.Curren-

cyRateisintegratedusingaLookup.Indoingso,CurrencyRateIDisusedasthekey.Sincefor

manyordersnocurrencyconversionwillbeperformed(transactionswithinthedollarzone),the

columnvalueisoftennull.ThenullvaluesarereplacedinthenextstepbytheabbreviationUSD

fordollars.Then,theattributeisreplacedbythebetterperformingCurrencyIDfromthedimen-

sionDimCurrency.

Now,thetableSalesOrderDetailcanbeintegratedusingaLeftOuterJoin.Thistableprovides

detailedinformationontheindividualorders.Thisjoincombinesdatasetsofdifferentaggre-

gationlevels,sothesewillhavetobesubsequentlymodified.Thedatastreamisthensplitfor

furtherprocessing:Theaimhereistocreateanadditionaldatasetperorderthatincludesthe

aggregatedvaluesFreightandTaxAmt.Thisisimplementedintheupperpath.Thelowerpart,

however,processesthedatasetsfortheindividualorderitems.Asanalternative,itisalso

possibletodistributethevalueoftheitemsFreightandTaxAmtequallyamongtheindividual

positionsofeachrespectiveorder.

Inthelowersection,onlythetwoparentvaluesFreightandTaxAmtaresetto0.Inparallelto

this,thedatasetsintheupperpartarefilteredbydistinctstatement.Then,variousattributes

suchasProductIDorUnitPricearesetto0.Now,anewvalueiscreatedfromthesumofthe

taxesandfreightcosts.ThisiscorrespondinglydesignatedasLineTotal.Afterfurthertermshave

beenmodified,thetwodatastreamsarejoinedagainbeforeafinalselectionoftheattributesis

carriedout.Thefinishedfacttablenowcontainssevenkeysandsixvalues.

16

Page 17: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

5.3 Data storage & managementDependingonthetypeofdataandtheobjectivespursued,anumberofdifferentdatabasema-

nagementsystems(DBMS)canbeintegratedatthedatamanagementlevel.Here,adistinction

mustbemadebetweenrelationalandnon-relationalsystems.Non-relationalsystems,which

arealsoreferredtoasNoSQLdatabases,includeMongoDB,HBaseandCouchDB.Therelational

DBMSincludetraditional,row-orientedproductssuchasMySQLandPostgresaswellasanalyti-

cal,column-orientedsystemssuchasInfobright,InfiniDBandMonetDB.

WithintheexamplearchitecturewehaveusedInfobrightasthedatabasemanagementsystem

forourdatawarehouse.InfobrightisananalyticaldatabasebasedonMySQL.Itis,therefore,

compatiblewithavarietyofwell-knownMySQLtools,butalsoofferssignificantlybetterper-

formancewhendealingwithlargeamountsofdata.AtthecoreofInfobrightistheBrighthouse

Engine,whichhasbeenspeciallydevelopedforanalyzinglargedatasets.Infobrightisavailable

inbothafreeCommunityEditionandapaidEnterpriseEdition.TheEnterpriseEditionoffers

significantadvantagesintermsoffunctionalityandperformance,whichiswhywewillbeusing

itforourprototypedatawarehouse.

TheinstallationandintegrationofInfobrightiseasyduetoitshighdegreeofautomation.Large

partsoftheconfigurationandmanual„tweaking“ofthesystem,includingindexing,areomit-

ted.Thecorrespondingsettingsarecarriedoutbythesystemautomatically,butcanalsobe

FactSales.ktr

17

Page 18: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

modifiedretrospectively.Theconfigurationfilesmy-ib.iniandbrighthouse.iniarelocatedinthe

installationdirectory.

5.3.1 Performance optimizationInfobright‘salreadyhighperformanceandcompressionratecanbeenhancedevenfurther.This

isimportant,forexample,forthedatatypesChar/Varchar,becausetheyweakensystemperfor-

manceandsignificantlyincreasememoryrequirements.Infobrightoffersseveralsolutionshere,

includingLookupColumnsandDomainExpert.Bothoftheseareusedinourprototypesystem.

Lookup ColumnsTheprinciplebehindLookupColumnsissimilartothatofacompressedglossaryordictiona-

ry.WhenusingLookupColumns,allvaluesaretransformedintocorrespondingindicesofthe

typeIntegerwithinacolumn.Inparallel,adirectoryiscreatedthatstorestheindexesandthe

correspondingChar/Varcharvalues.Comparabletermsarerepresentedbythesameindex.This

improvesperformanceandreducesthememoryrequirementsofthedatabase.

LookupColumnsaredefinedwhenthetableiscreatedwithintheCreatestatement.Subse-

quentintegrationusingtheAlterstatementisnotpossible.Therespectivecolumniscompleted

viatheadditional„comment‚lookup‘“.CardinalityplaysacrucialrolewhencreatingLookup

Columns.Forratiosupto10to1,theyaredefinitelyworthusing.Thismeansthatforacolumn

with100,000values,amaximumof10,000differentinstancescanexist.Thus,forhighercardina-

lities,anypossiblegainsinperformancemustbecarefullyweighedupagainstalongerprocess

loadingtime.

18

Page 19: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

DomainExpertInfobrightoffersanothertypeofoptimization:DomainExpert.Thistechnologyenablestheuser

tocreateanattribute‘sstructureintheformofarule.Byusingthistechniquethewaydatais

processed,stored,andretrievedissustainablyimproved.Accordingtothedeveloper,queries

willrunupto50%fasterandsignificantlybetterdatacompressioncanbeachieved.

DomainExpertcanonlybeusedwithcolumnswhosevalueshaveaparticularpatternoracer-

tainstructure.Forinstance,theseinclude,IPaddresses(Number.Number.Number.Number)or

emailaddresses([email protected]):

Onlyoneofthetwomethodsdescribedcanbeappliedpercolumn.Ingeneral,LookupColumns

shouldbepreferred.DomainExpertismerelyanalternativethatcanbeusedifacolumnhasa

fixedpattern,butduetoatoohighcardinalityLookupColumnscannotbeused.

Table Column Pattern

DimCustomer EmailAddress %s@%s.%s

DimDate Date %d.%d.%d

Examplepattern

DomainExpertrulesinourexamplearchitecture

19

Page 20: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

5.4 Data analysisData analysis within a data warehouse architecture is based on the principle of Online Analytical Processing (OLAP).

OLAP makes it possible to carry out dynamic analysis in multi-dimensional spaces, regardless of the physical data

storage used. OLAP has a number of variations: relational (ROLAP), multi-dimensional (MOLAP) and hybrid (HOLAP)

OLAP.

Withinourprototypearchitecture,wehaveimplementedthedataanalysislevelusingtheopen

sourceserverMondrian.MondrianisbasedontheROLAPorrelationalapproach.Usingadata

cube,ittransformsMDXqueriesintomultipleSQLqueriesand,conversely,processesrelational

resultsmulti-dimensionally.Inaddition,MondrianusestheinformationprovidedbyPentaho

SchemaWorkbenchtogeneratetheXMLschema.Thistoolletsyouestablishadirectconnection

tothedatabase,whichcanbeusedtostoreattributesintheschemaviaadrop-downlist.The

connectionisestablishedinthesamewayasPentahoDataIntegration.

Theschemaiscreatedaccordingtoafixedstructure:Thebasishereisthecube,whichiscreated

byclickingonthecubesymbol.Next,thefacttableFactSalesneedstobedefined,afterwhich

theindividualdimensionscanbesetup.Inadditiontothehierarchiesandtheirassociated

levels,theintegrationoftherespectivedimensiontablemustalsobeperformed.

ItmaybeadvisabletoseparatemultiplehierarchiesofadimensiontableintheXMLschema

intotheirowndimensions.Thisallowsthehierarchiestobeintegratedontothedifferentaxes

inparallel.WeusedthisprincipleinourexamplearchitectureforthedimensionsShipDateand

OrderDate.Specialattentionshouldbepaidtothespecificorderofthehierarchyleveland/or

attributes.

Afterthishasbeendone,thevariouskeyfigurescanbeintegratedintotheXMLschema.Choo-

singtherightaggregatorhereiscrucial.ThevaluesTurnover,TaxAmt,OrderQty,Freightare

addedtogetherusingsum.AddingtogetherthevaluesUnitPriceandUnitPriceDiscountwould

notbemeaningful.Instead,wederivetheiraverageusingavg.

20

Page 21: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

Whensettinguptheschema,specialattentionmustbepaidtothespecificcharacteristicsof

therespectiveanalysistools.Forad-hocanalyses,SaikuorPivot4Jcanbeusedinsteadofthe

PentahoAnalyzer.PentahoAnalyzerreliesonthedefinedhierarchiesforlabelingdimensions

ratherthantheactualdimensionsthemselves.

5.5 Data presentationThe data presentation level is the interface between the system and the end user. Various graphi-

cal tools can be used to select, analyze, and visualize the data processed by the data warehouse.

Included among these are ad-hoc analyses, dashboards, and reports.

Forouropensourcedatawarehouse,wehavechosentousePentahoAnalyzerandAnalyzer

Reportforthedatapresentationandvisualizationlayer.AnalyzerReportprovidesagraphical

draganddropinterfacewithwhichindividualfieldscaneasilybeaddedtoananalysis.Using

variouscombinationsofattributesandvaluesaswellasfilters,differentviewsofthedatacanbe

created.Butbeforethiscanbedone,theschemaaswellasthecorrespondingdatabaseconnec-

tionmustfirstbesetupwithintheUserConsole.ThisisdoneviathemenuitemManageData

Sources>NewDataSource.Oncethishasbeendone,itispossibletoproducecomprehensive

analysesofthedatastoredinthedatawarehouse.

Excerptsfromtheschema

21

Page 22: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

6. Conclusion and outlookThearchitectureillustratedinthisdocumentprovesthatitispossibletoconstructahigh-per-

formance,user-friendlydatawarehousesystemwithopensourcesoftware.Neitherduringthe

constructionnortheoperationofthedatawarehousewereweabletofindanyfaults,deficienci-

esorlossofperformanceforouropensourcesolutionsincomparisonwithproductsofferedby

majormanufacturers.

Wethereforerecommendthatyoutakeaseriouslookatopensourcebusinessanalyticssoft-

wareproducts.TherearenowanumberofexcellentopensourcetoolsfordataanalysisandBig

Datascenariosthatmorethancompetewiththeclosedsource„bigboys“.Smallandmedium-

sizedbusinessescandefinitelybenefithere,becauseforthem,themajorityofsolutionsprovi-

dedbylargesoftwarevendorsaretoopowerfulandtooexpensivetouse.Largecorporateusers,

bycontrast,willbenefitfromtheopenarchitectures,interfaces,andextremeflexibilitythatan

opensourcedatawarehousesystemcanoffer.

22

Page 23: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho

7. AppendixTable Column

• DimDate • w_day

• month

• DimCustomer • Titel

• City

• StateProvinceName

• PostalCode

• CountryRegionName

• DimPerson • PersonType

• Titel

• FirstName

• LastName

• DimProduct • ProductCategoryName

• ProductSubcategoryName

• Color

• ProductLine

• CLASS

• Style

• DimSalesTerritory • City

• StateProvinceCode

• StateProvinceName

• CountryRegionCode

• LocationName

• GROUP

LookupColumnswithintheexamplearchitecture

23

Page 24: Open Data Warehouse Building a Data Warehouse with Pentaho ENG... · 2019. 2. 26. · itnoum Best Practice Open Data Warehouse — Building a Data Warehouse with Pentaho Database

it-novum Profile

Your contact person for Business Intelligence and Big Data: Stefan Müller [email protected]+49(0)661103942

Leading in Business Open Source solutions and consultingit-novumistheleadingITconsultancyforBusinessOpenSourceintheGerman-speakingmarket.Foundedin2001

it-novumtodayisasubsidiaryofthepublicly-heldKAPBeteiligungs-AG.

Weoperatewith85employeesfromourmainofficeinFuldaandbranchofficesinDüsseldorf,Dortmund,Vienna

andZurichtoservelargeSMEenterprisesaswellasbigcompaniesintheGerman-speakingmarkets.

it-novumisacertifiedSAPBusinessPartnerandlongtimeaccreditedpartnerofawiderangeofOpenSourcepro-

ducts.WemainlyfocusontheintegrationofOpenSourcewithClosedSourceandthedevelopmentofcombined

OpenSourcesolutionsandplatforms.

DuetotheISO9001certificationit-novumbelongstooneofthefewOpenSourcespecialistswho

canprovethebusinesssuitabilityoftheirsolutions,provenbyinternationalqualitystandards.

More than 15 years of Open Source project experience ▶ OurportfoliocontainsawiderangeofOpenSourcesolutionswithintheapplicationsandinfrastructureareaas

wellasownproductdevelopmentswhicharewell-establishedinthemarket.

▶ As an IT consulting companywith a profound technical know-howwithin the BusinessOpen Source areawe

differentiateourselves from thebig solutionproviders’ standardofferings.Becauseour solutionsarenotonly

scalableandflexiblebutalsointegrateseamlesslyinyourexistingITinfrastructure.

▶ Wecanassemlemultidisciplinaryprojectteams,consistingofengineers,consultantsandbusinessdataproces-

singspecialists.Thuswecombinebusinessknow-howwithtechnologicalexcellencetobuildsustainablebusiness

processes.

▶ Ourtargetistoprovideyouwithahigh-qualitylevelofconsultingduringallprojectphases–fromtheanalysisand

conceptionuptotheimplementationandsupport.

▶ Asadecision-makingbasispriortotheproject’sstartweofferyouaProof-of-Concept.Throughareal-casesimula-

tionandadevelopedprototypeyoucandecideonanewsoftwarewithouttakinganyrisks.Moreover,youbenefit

from:

и Securityandpredictability

и Clearprojectmethodology

и Sensiblecalculation

it-novum GmbH GermanyHeadquartersFulda:EdelzellerStraße44·36043FuldaPhone:+49(0)661103333BranchesinDüsseldorf&Dortmund

it-novum GmbH SwitzerlandHotelstrasse1·8058ZurichPhone:+41(0)445676207

it-novum branch AustriaAusstellungsstraße50/ZugangC·1020ViennaPhone:+4312057741041