Enterprise Data Warehouse Optimization: 7 Keys to Success
TRANSCRIPT
© Hortonworks Inc. 2011–2017. All Rights Reserved
Scott Gnau, CTO, Hortonworks, @Scott_Gnau
David Loshin, President, [email protected]
Legacy Architectures Impede Performance
[Diagram: EDW, with the dimensions Capital Costs, Operations Costs, Scalability, Analytic Flexibility, Time to Value, Data Quality, and Data Variety]
© 2017 Knowledge Integrity, [email protected] (301) 754-6350
• Data warehouse performance is no longer solely defined in terms of computation speed
• Optimal performance reflects the ability to maximize value across a range of dimensions
• The static design of legacy platforms has not kept pace with the growing desire for business intelligence and analytics
Step 1: Leverage Horizontal Scalability
• DW appliances require significant capital investment
– System must be sized to meet anticipated needs
– Allows for unused capacity at beginning
– Requires increased “step-up” investments on regular intervals
• Hadoop finesses this challenge
– Relies on commodity components
– Start with what you need, grow with increased demand
– Introduce newer hardware seamlessly
– Exploit innovations to speed performance (e.g., Stinger.next, Low Latency Analytical Processing)
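The contrast between step-up appliance sizing and incremental scale-out can be sketched with a little arithmetic. This is only an illustration: the demand series, the 100 TB appliance increment, and the 8 TB node size are hypothetical values, not figures from the presentation.

```python
import math

def provisioned(demand, increment):
    """Capacity bought for each period when capacity comes in fixed increments."""
    return [math.ceil(d / increment) * increment for d in demand]

def unused(capacity, demand):
    """Total capacity paid for but not used, summed over all periods."""
    return sum(c - d for c, d in zip(capacity, demand))

# Hypothetical storage demand in TB per quarter (assumed for illustration).
demand_tb = [12, 18, 27, 41, 60, 85]

# Appliance: sized up front in large steps (assumed 100 TB increments).
appliance = provisioned(demand_tb, increment=100)
# Scale-out cluster: grows one commodity node at a time (assumed 8 TB per node).
cluster = provisioned(demand_tb, increment=8)

print("appliance unused TB-quarters:", unused(appliance, demand_tb))
print("cluster unused TB-quarters:", unused(cluster, demand_tb))
```

The smaller the purchase increment, the closer provisioned capacity tracks actual demand, which is the point of "start with what you need, grow with increased demand".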
[Diagram: four racks, each with a rack switch, a NameNode, and four DataNode & TaskTracker nodes]
Step 2: Augment EDW Storage with Hive
• The value of existing EDW investments can be extended using a Hybrid Architecture
• Hive continues to evolve with innovative performance improvements:
– In-memory caching and persistent query executors
– Column-oriented distributed data organization
– Improved security using Apache Ranger
– SQL ACID Merge
[Diagram: Hadoop Cluster and EDW]
Step 3: Increase Data Flexibility
• Conventional data warehouse architectures are organized using a dimensional model
– Facts represent events
– Dimensions characterize the facts
• The dimensional model is suited to typical DW operations
– Aggregation and rolled-up reporting
– “Slice and dice”
• However, this model forces all data into a predetermined schema (“schema-on-write”)
– Introduces bias, creates constraints and limits data flexibility
• Alternative: schema-on-read
– Data sets are captured in their source formats
– Frees data consumers to apply their own organization
– Allows logical structure to be layered on top of data in source format
– Enables use of creative algorithms for analytics, text mining, and machine learning
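A minimal sketch of schema-on-read, assuming the source data is JSON lines: the raw records are kept exactly as captured, and each consumer layers its own logical structure on top at read time. The sample records, field names, and the `read_with_schema` helper are hypothetical and exist only for illustration.

```python
import json

# Raw events stored in their source format (hypothetical sample data).
RAW_LINES = [
    '{"ts": "2017-01-05T10:00:00", "user": "u1", "page": "/home", "ms": 120}',
    '{"ts": "2017-01-05T10:00:03", "user": "u2", "page": "/cart", "ms": 340, "ref": "ad"}',
    '{"ts": "2017-01-05T10:00:09", "user": "u1", "page": "/cart", "ms": 95}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the fields a consumer asks for,
    converting each one with the function that consumer supplies."""
    for line in lines:
        record = json.loads(line)
        yield {name: convert(record.get(name)) for name, convert in schema.items()}

# Consumer A (performance analysis) only cares about page latency.
latency_schema = {"page": str, "ms": int}
# Consumer B (marketing) cares about referrals; a missing "ref" means direct traffic.
marketing_schema = {"user": str, "ref": lambda value: value or "direct"}

latency_rows = list(read_with_schema(RAW_LINES, latency_schema))
marketing_rows = list(read_with_schema(RAW_LINES, marketing_schema))

print(latency_rows)
print(marketing_rows)
```

Neither consumer's view required reloading or reshaping the stored data, and a third consumer could add another schema without affecting the first two.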
Step 4: Use Unstructured Data
• Data warehouses are engineered around structured data
• Many sources of an increasing volume of unstructured data
– Apps running on Internet-connected devices generate text streams
– Machine-generated unstructured content
– Semi-structured sources
• Applications that consume both structured and unstructured data provide fuller visibility into analytical results
• Tools like Lucene, Solr, Mahout, and other text analytics libraries help to parse and tag unstructured text
[Diagram: Ingest, Parse, Tag, Organize pipeline, with Lucene, Solr, and Mahout]
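The ingest, parse, tag, organize flow can be sketched in a few lines. This is a toy stand-in using only the Python standard library; it does not use Lucene, Solr, or Mahout, and the tag vocabulary and sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical tag vocabulary: tag name -> words that trigger it.
TAG_TERMS = {
    "billing": {"invoice", "refund", "charge"},
    "outage": {"down", "offline", "error"},
}

def parse(text):
    """Lowercase the raw text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tag(tokens):
    """Return, in sorted order, every tag whose vocabulary overlaps the tokens."""
    words = set(tokens)
    return sorted(name for name, terms in TAG_TERMS.items() if words & terms)

def organize(documents):
    """Build a tag -> list of document ids index from (id, text) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents:
        for name in tag(parse(text)):
            index[name].append(doc_id)
    return dict(index)

# Ingest: hypothetical free-text messages with ids.
docs = [
    (1, "Service is DOWN again, error on login"),
    (2, "Please refund the duplicate charge on my invoice"),
    (3, "Great app, thanks!"),
]

index = organize(docs)
print(index)
```

Once text is tagged and indexed this way, the tags can be joined with structured warehouse data like any other attribute.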
Step 5: Data Discovery
[Diagram: Data Ingestion & Transformation]
• Data imported into the data warehouse is homogenized and organized within predefined data models
• This constrains downstream consumers
[Diagram: five Data Discovery & Preparation stages]
• Data discovery allows each user to configure the data for their specialized purposes
Step 6: Offload ETL to Hadoop
• 60-70% of the effort of data warehousing is attributed to extraction, transformation, and loading (ETL)
• Hadoop is a natural platform for ETL processing:
– ETL is inherently data parallel, enabling faster execution
– Development time can be drastically reduced with faster dev/test/debug cycle
– Resources can be dynamically apportioned and released when ETL processing is completed, lowering costs
• Apache Hive supports SQL ACID Merge which handles inserts, updates, and deletes in a single pass
• Allows for in-database transformations without need for massive refreshes
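The idea of handling inserts, updates, and deletes in a single pass can be illustrated with a plain Python stand-in for merge semantics. This is not Hive's implementation or API; the `merge` function, the row layout, and the `deleted` flag are assumptions made for the example.

```python
def merge(target, source):
    """Apply a batch of change rows to a keyed table in one pass.

    target: dict mapping id -> row (the existing table).
    source: list of change rows; a row with "deleted": True removes its key.
    Returns a new dict and leaves `target` untouched.
    """
    result = dict(target)
    for row in source:
        key = row["id"]
        if key in result:
            if row.get("deleted"):
                del result[key]                                  # matched and flagged: delete
            else:
                result[key] = {"id": key, "name": row["name"]}   # matched: update
        elif not row.get("deleted"):
            result[key] = {"id": key, "name": row["name"]}       # not matched: insert
    return result

# Hypothetical customer table and one batch of changes.
customers = {
    1: {"id": 1, "name": "Ann"},
    2: {"id": 2, "name": "Bob"},
}
changes = [
    {"id": 1, "name": "Anne"},     # update
    {"id": 2, "deleted": True},    # delete
    {"id": 3, "name": "Cy"},       # insert
]

merged = merge(customers, changes)
print(merged)
```

Because each change row is examined once and routed to exactly one action, the table is brought up to date without separate insert, update, and delete jobs or a full refresh.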
Step 7: Operational Data Governance
• Delegating more responsibility to the consumer community poses a risk of inconsistent interpretation and use
• Institute operational data governance to support versioning, lineage, and provenance
– Metadata management
– Data lineage
– Archiving policies
– Versioning policies
– Data security and protection
• Apache Atlas is an open source component of the Hadoop ecosystem that captures data definitions, hierarchical taxonomies, data elements and their relationships, and lineage
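Lineage and provenance questions reduce to walking a graph of "derived from" relationships. The sketch below shows that idea with a plain dictionary; it does not use the Apache Atlas API, and the dataset names and edges are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets derived directly from it.
LINEAGE = {
    "raw.clickstream": ["staging.sessions"],
    "staging.sessions": ["mart.daily_visits", "mart.funnel"],
    "raw.orders": ["mart.funnel"],
}

def reachable(start, edges):
    """Breadth-first walk returning every node reachable from `start`."""
    seen, queue = [], deque([start])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen

def downstream(dataset):
    """Impact analysis: everything that depends on `dataset`."""
    return reachable(dataset, LINEAGE)

def upstream(dataset):
    """Provenance: every source that feeds `dataset`."""
    parents = {}
    for source, children in LINEAGE.items():
        for child in children:
            parents.setdefault(child, []).append(source)
    return reachable(dataset, parents)

print(downstream("raw.clickstream"))
print(upstream("mart.funnel"))
```

The downstream walk answers "what breaks if this source changes", and the upstream walk answers "where did this number come from", which are the two questions lineage metadata exists to support.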
Modernization: Evolving the Hybrid EDW
• Conventional RDBMS-based data warehouses have served organizations well, but are being eclipsed by newer technologies
• Scalable systems built on commodity components are rapidly being adopted for business intelligence and analytics applications
• Optimize the EDW using an evolutionary approach to embracing Hadoop:
– Expand the storage footprint
– Increase computational power
– Broaden the scope of application support
– Lower costs
Questions & Suggestions
• www.knowledge-integrity.com
• www.dataqualitybook.com
• www.decisionworx.com
• If you have questions, comments, or suggestions, please contact me
David Loshin
301-754-6350
loshin@knowledge-integrity.com
The Next Gen EDW is the Big Data Warehouse
→ In Forrester’s 2016 global survey, 59% of respondents stated that leveraging big data and analytics was a critical or high priority.
Companies Are Looking to Big Data for EDW Optimization
→ 82% of 2550+ respondents are looking to Big Data for EDW Optimization rather than a straight replacement. – 2016 Big Data Maturity Survey
Hortonworks Connected Data Platforms and Solutions
Hortonworks Solutions: Enterprise Data Warehouse Optimization; Cyber Security and Threat Management; Internet of Things and Streaming Analytics
Hortonworks Connection: Subscription Support, SmartSense, Premier Support, Educational Services, Professional Services, Community Connection
Cloud: Hortonworks Data Cloud, AWS, HDInsight
Data Center: Hortonworks Data Suite, HDF, HDP
Drivers of a Modern BI Infrastructure
Deeper and Broader Data Sets
Complete Data ‘Provenance’
Leading Analytics and Tools
Integrate non-EDW data and EDW data
Total Cost of Ownership
Open Source Transformational Impact to EDW
Unmatched Economics: supports low-cost data-center and cloud architectures for Enterprise Apache Hadoop
Eliminates Risk and Ensures Integration: prevents vendor lock-in and speeds ecosystem adoption of ODPi-compliant core
[Chart: Cost Efficiency and Data Variety axes, positioning EDW, Proprietary Hadoop, Hortonworks Open Source, and RDBMS]
But, why aren’t more companies running to this solution?
Risky
Hadoop requires a bunch of new skill sets
It’ll take a long time
There’s too much manual coding required
It’s hard to integrate to my BI tool stack
Legacy EDW Solution
Using Hadoop to Optimize the Data Warehouse
→ Augment EDW with Hive
→ Offload ETL to Hadoop
→ Data Governance
Augment current EDW with Hive
Hive LLAP GA: Interactive query in seconds, 10X fast join performance
Ease of Use and Adoption: SQL Standard ACID Merge
Enterprise Readiness: Supports all TPC-DS Queries
Streamlined Operations: Hive Views
Hive 2 with LLAP: 26x Performance Boost at 1 TB Scale
[Chart: Hive 1/Tez time (s) and Hive 2/LLAP time (s) per query (lower is better), with speedup (x factor) on a second axis. Hive 2 with LLAP averages 26x faster than Hive 1.]
Hive LLAP in HDP 2.6: Stable Performance with High Concurrency
Concurrent Queries    Average Runtime
5                     7.76s
25                    36.24s
100                   102.89s
5 to 25 concurrent queries: 5x queries, 4.6x runtime difference
25 to 100 concurrent queries: 4x queries, 2.8x runtime difference
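The runtime-difference figures on this slide follow directly from the reported averages, as this short check shows. The `scaling` helper is just a name chosen for the example; the input numbers are the ones on the slide. The ratios work out to roughly 4.67 and 2.84, which the slide gives as 4.6x and 2.8x.

```python
# (concurrent queries, average runtime in seconds) as reported on the slide.
results = [(5, 7.76), (25, 36.24), (100, 102.89)]

def scaling(rows):
    """For each consecutive pair of rows, return
    (load multiple, runtime multiple)."""
    pairs = []
    for (q0, t0), (q1, t1) in zip(rows, rows[1:]):
        pairs.append((q1 / q0, t1 / t0))
    return pairs

for load_x, runtime_x in scaling(results):
    print(f"{load_x:.0f}x queries -> {runtime_x:.2f}x runtime")
```

In both steps the runtime grows more slowly than the number of concurrent queries, which is the "stable performance with high concurrency" claim in numeric form.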
Offload ETL to Hadoop
→ The Problem:
– EDWs can consume between 50% and 90% of resources just on ETL/ELT tasks.
– These jobs interfere with more business-critical tasks like BI and advanced analytics.
→ The Solution:
– Hive and HDP deliver ETL that scales to petabytes.
– Economical scale-out processing on commodity servers.
→ The Result:
– Better SLAs for mission-critical analytics.
– Limit EDW expansion or retire old systems.
[Diagram: ETL/ELT, Data Mart, Data Landing & Deep Archive, Cube Mart, End User Applications, Applications, End Users and Apps]
Data Governance for EDW Optimization
Industry First: Dynamic Tag-based Security Policies
[Diagram labels:
Ranger: Manage Access Policies and Audit Logs; Policies: Classification, Prohibition, Time, Location; PDP Resource Cache
Atlas: Track Metadata and Lineage; Atlas Client Subscribers to Topic, Gets Metadata Updates
Metastore: Tags, Assets, Entities
Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables]
Use Case 1: Multi-Channel Behavioral Analysis
Leading Media Company
→ Industry: Mass Media
– Largest broadcasting and cable company in the world by revenue
– Multiple channels: Cable (set-top-box), wireless devices, streaming programming
– 22 million+ subscribers (internet & video)
→ Results:
– Scalability: 480B rows, 500 nodes
– 60x query performance improvement
– Insights: New info improves negotiations
– Loyalty: Outreach to customers viewing competitive streams; ▼ churn ▲ revenue
[Diagram: Before and After architectures. Labels: Channel Feeds, Hortonworks HDP, Netezza Data Mart, AtScale Intelligence Server, Tableau + MS Excel + R, Tableau + MS Excel]
Use Case 2: Campaign Paid-Search Effectiveness
Leading Retailer
→ Industry: Retail / eCommerce
– Top US department store (by rev)
– Online sales $4B+ & growing (11%+ total)
– 800+ department stores nationwide
→ Results
– Scale: Millions of paid keywords analyzed
– Speed: Eliminate extract step
– Insight: Operationalized closed-loop analysis → insight → decision → action
– Impact: Make and save $ millions w/ instant bid decisions over 6-week season → that drives 60% annual revenue
[Diagram: Before and After architectures. Labels: Ad & Paid Keywords, Hortonworks HDP, Vertica Data Marts, AtScale Intelligence Server, Cognos + Tableau + Excel, Tableau + Excel]
Use Case 3: Client and Patient Analysis
Leading Managed Healthcare Provider
→ Industry: Managed Health Care
– Member of Fortune 100
– Health, life + other insurance products
– ~52 million members; medical/dental/pharm
→ Results
– Scalable: BI directly on 264+ nodes data
– Time: Eliminate data movement step
– 62x query performance improvement
– Speed: <2.2 second average query time
– Insight: Tableau on Hadoop for 1000+
– Security: Access control by user; HIPAA
[Diagram: Before and After architectures. Labels: Client/Patient Details, Hortonworks HDP, Netezza Data Mart, AtScale Intelligence Server, Tableau + MS Excel]
Next Step:
→ Everyone will receive a free copy of Forrester White Paper titled “The Next-Generation EDW Is The Big Data Warehouse”
→ EDW Optimization with HDP
– http://hortonworks.com/solutions/edw-optimization/
– EDW Optimization 7 min video
Thank You