BI Isn't Big Data and Big Data Isn't BI (updated)

Upload: mark-madsen

Post on 23-Jan-2018

Category:

Data & Analytics


TRANSCRIPT

  1. BI Isn't Big Data, Big Data Isn't BI. September 2015. Mark Madsen, www.ThirdNature.net, @markmadsen. [Title image text: SQL... SQL! SQL? SQL Hadoop]
  2. Third Nature Inc. Summary: Common uses and commodity technology lead to novel practices, which lead to different data and different technology needs, which lead to new architectures, which lead to common uses and commodity technology.
  3. Our ideas about information and how it's used are outdated.
  4. How We Think of Users. Our design point is the passive consumer of information. Proof: methodology. The IT role is requirements, design, build, deploy, administer. The user role is run reports. Self-serve BI is not like picking the right doughnut from a box.
  5. How We Think of Users. Our design point is the passive consumer of information. Proof: methodology. The IT role is requirements, design, build, deploy, administer. The user role is run reports. Self-serve BI is not like picking the right doughnut from a box. How We Want Users to Think of Us.
  6. How We Think of Users. What Users Really Think.
  7. We think of BI as publishing, an old metaphor. Publishing has value, but may not be actionable.
  8. Planning data strategy means understanding the context of data use so we can build infrastructure. [Diagram: Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] We need to focus on what people do with information as the primary task, not on the data or the technology.
  9. General model for organizational use of data. [Diagram: Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act within the process: usually real-time to daily.
  10. Origin of BI and data warehouse concepts. The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published: B. A. Devlin and P. T. Murphy, "An architecture for a business and information system," IBM Systems Journal, Vol. 27, No. 1 (1988). Copyright Third Nature, Inc.
  11. Origins: in 1988 there was only big hair. No real commercial email; public internet barely started. Storage state of the art: 100 MB, cost $10,000/GB. Oracle Applications v1 GL released; SAP goes public, enters US market. Unix is mostly run by long-haired freaks. Mobile was this [image]. This is the context: scarcity of data, of system resources, of automated systems outside core financials, of money to pay for storage.
  12. General model for organizational use of data. [Diagram: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe.
  13. You need to be able to support both paths. [Diagram: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act. Labels: Act on the process; Act within the process; Conventional BI, addition of EDM; Causal analysis, data science]
  14. The usage models for conventional BI. [Diagram: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. This is what we've been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP.
  15. The usage models for analytics and big data. [Diagram: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. Analytics and big data is focused on new use cases: deeper analysis, causes, prediction, optimizing decisions. This isn't ad-hoc, reporting, or OLAP.
  16. When you first give people access to information that was unavailable: "OH GOD I can see into forever"
  17. After a while it becomes the new normal.
  18. As practices evolve based on new capabilities, a new level of complexity develops over top of the older, now better understood processes, leading to new data and analysis needs.
  19. I never said the "E" in EDW meant "everything". What do you mean, "Just doughnuts"?
  20. The data warehouse vs. business agility. All the data. Common, typed, tabular data. The bottleneck is you.
  21. It's going to get a lot worse. [Diagram labels: Not E, E] Conclusion: any methodology built on the premise that you must know and model all the data first is untenable.
  22. Old market says: "There's nothing wrong with what you have, just keep buying new products from us."
  23. The emerging big data market has an answer.
  24. The data lake.
  25. The data lake after a little while.
  26. TANSTAAFL. When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won't see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
  27. "Big data is unprecedented." Anyone involved with big data in even the most barely perceptible way.
  28. We've been here before. Source: Bill Schmarzo, EMC.
  29. Big is well supported by databases now. Source: Noumenal, Inc.
  30. Orders of magnitude: 20 years ago TB, today PB. Shifts in data availability by orders of magnitude necessitate new means of managing and using it.
  31. Analytics embiggens the data volume problem. Many of the processing problems are O(n^2) or worse, so moderate data can be a problem for DB-based platforms.
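To make the O(n^2) point concrete, here is a minimal Python sketch (not from the slides) that counts the comparisons an all-pairs analysis, such as record-to-record similarity, would need. The function name `pairwise_comparisons` and the sample record counts are illustrative assumptions.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical record counts: growing the data 100x grows the work ~10,000x.
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} records -> {pairwise_comparisons(n):>22,} comparisons")
```

Ten million records is a modest table for a database to retrieve from, but comparing every pair is tens of trillions of operations, which is the sense in which moderate data becomes a computation problem.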
  32. Much of the big data value comes from analytics. BI is a retrieval problem, not a computational problem. Five basic things you can do with analytics. Prediction: what is most likely to happen? Estimation: what's the future value of a variable? Description: what relationships exist in the data? Simulation: what could happen? Prescription: what should you do?
  33. Most people do not need special technology. [Chart: number of people vs. bigness of data] The distribution of data size is about normal, yet these guys set the tone of the market today.
  34. Analytics: this is really raw data under storage. [Chart: number of jobs vs. bigness of data] Microsoft study of 174,000 analytic jobs in their cluster: median size ???
  35. Working data for analytics most often not big. [Chart: number of jobs vs. smallness of data] 14 GB.
  36. An (Overly) Simple Division of the Problem Space. Axes: computation (little to lots) and data volume (little to lots).
       Big analytics, little data: specialized computing, modeling problems: supercomputing, GPUs.
       Big analytics, big data: complex math over large data volumes requires shared-nothing architectures.
       Little analytics, little data: the entry point; SAS, SMP databases, even OLAP cubes can work.
       Little analytics, big data: the BI/DW space, for the most part, with work done in databases.
  37. What makes data "big"? Very large amounts; hierarchical structures; nested structures; linked structures; encoded values; nonstandard (for a database) types; deep structure; human-authored text. "Big" is better off being defined as "complex" or "hard to manage".
  38. Categorizing the measurement data we collect. The convenient data is the transactional data: it goes in the DW and is used, even if it isn't the right measurement. The inconvenient data is observational data: it's not neat, clean, or designed into most systems of operation. The difficult and misleading data is declarative data: what people say and what they do require ground truth. We need an architecture that supports all three categories.
  39. Transactions vs. big data. The classic example of structured data. Transaction data includes: quantification details (date, value, count); reference data for explanation (product, customer, account). Lots of meaningful information. Reference data is usually shared across the organization, hence its importance. There are two parts: an identifier to uniquely identify the subject; descriptive attributes with common or standardized value domains. [Diagram labels: Transaction details, Reference data]
  40. Today it's different data: observations, not transactions. Sensor data doesn't fit well with current methods of collection and storage, or with the technology to process and analyze it.
  41. Big data as a type of data: transactions vs. events. Transactions: each one is valuable; mutable; the elements of a transaction can be aggregated easily; a set of transactions does not usually have important ordering or dependency. Events: a single event often has no value, e.g. what is the value of one click in a series? Some events are extremely valuable, but this is only detectable within the context of other events. Elements of events are often not easily aggregated. A set of events usually has a natural order and dependencies. Immutable.
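The point that events matter in order and in context can be sketched with sessionization: grouping one user's clicks into sessions by the time gap between consecutive events. This is a minimal sketch, not from the slides. The 30-minute gap, the `sessionize` name, and the sample timestamps are all assumptions.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Group click timestamps into sessions.

    A new session starts whenever the time since the previous click
    exceeds `gap` (an assumed threshold, not one given in the slides).
    """
    sessions = []
    for ts in sorted(timestamps):  # events only make sense in time order
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical clicks for a single user.
start = datetime(2010, 1, 10, 1, 41, 44)
clicks = [start + timedelta(minutes=50), start, start + timedelta(minutes=5)]
sessions = sessionize(clicks)
print([len(s) for s in sessions])
```

No single click says anything about the visit; the session only exists once the events are ordered and compared with their neighbors.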
  42. Example big data: web tracking data
       USER_ID: 301212631165031
       SESSION_ID: 590387153892659
       VISIT_DATE: 1/10/2010 0:00
       SESSION_START_DATE: 1:41:44 AM
       PAGE_VIEW_DATE: 1/10/2010 9:59
       DESTINATION_URL: https://www.phisherking.com/gifts/store/LogonForm?mmc=linksrcemail_m100109_44IOJ1_shop&langId=1&storeId=1055&URL=BECGiftListItemDisplay
       REFERRAL_NAME: Google.com
       REFERRAL_URL: http://www.google.com/search?sourceid=navclient&aq=0h&oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
       PAGE_ID: PROD_24259_CARD
       REL_PRODUCTS: PROD_24654_CARD,PROD_3648_FLOWERS
       SITE_LOCATION_NAME: VALENTINE'S DAY MICROSITE
       SITE_LOCATION_ID: SHOPBYHOLIDAYVALENTINESDAY
       IP_ADDRESS: 67.189.110.179
       BROWSER_OS_NAME: MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
  43. Web tracking data has a nested structure
       USER_ID: 301212631165031
       SESSION_ID: 590387153892659
       VISIT_DATE: 1/10/2010 0:00
       SESSION_START_DATE: 1:41:44 AM
       PAGE_VIEW_DATE: 1/10/2010 9:59
       DESTINATION_URL: https://www.phisherking.com/gifts/store/LogonForm?mmc=linksrcemail_m100109_44IOJ1_shop&langId=1&storeId=1055&URL=BECGiftListItemDisplay
       REFERRAL_NAME: Direct
       REFERRAL_URL: (no value)
       PAGE_ID: PROD_24259_CARD
       REL_PRODUCTS: PROD_24654_CARD,PROD_3648_FLOWERS
       SITE_LOCATION_NAME: VALENTINE'S DAY MICROSITE
       SITE_LOCATION_ID: SHOPBYHOLIDAYVALENTINESDAY
       IP_ADDRESS: 67.189.110.179
       BROWSER_OS_NAME: MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
       Unstructured data embedded in the logged message: complex strings.
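A short Python sketch of what "nested structure" means in practice: two fields of the web tracking record are flat strings in the log, but each contains its own structure. The values are taken from the Google referral version of the record, with the line-wrap spaces in the URL removed (an assumption about the original). The variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# Field values copied from the example record; wrap spaces removed (assumed).
referral_url = (
    "http://www.google.com/search?sourceid=navclient&aq=0h"
    "&oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italian+rose"
    "&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s"
)
rel_products = "PROD_24654_CARD,PROD_3648_FLOWERS"

# The URL is itself a record: host, path, and a set of key/value parameters.
parts = urlsplit(referral_url)
params = parse_qs(parts.query)

referrer_host = parts.netloc
search_terms = params["q"][0]  # what the visitor typed into the search engine

# REL_PRODUCTS is a list packed into a single column.
related_products = rel_products.split(",")

print(referrer_host, "|", search_terms, "|", related_products)
```

The search phrase and the product list are the analytically interesting parts, and neither is visible to a query that treats these columns as opaque strings.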
  44. The missing ingredient from most big data.
  45. The creation, flow and use of data is different for transactions and machine-generated events. [Diagram labels: Data entry, Extract, Cleanse, Load, Use, Store, Transactions, MDM, Generate, Store, Use, Use, Cleanse, Program, Capture. Annotations: "This runs at human speed"; "This runs at machine speed, with higher latency feedback cycles"]
  46. We collect large volumes of text, a rare practice ten years ago. Today we can turn text into data: categories, taxonomies; topics, genres, relationships, abstracts; sentiment, tone, opinion; words & counts, keywords, tags; entities (people, places, things, events, IDs).
  47. You can store this data in an RDBMS, but
  48. Example data: Twitter message API payload. Looks like: [image]. This is really just a record format much like a DB row: datetime, user ID, name, location, description, message, message metadata, etc. But it's in JSON or XML.
  49. "@markmadsen Check out: From #MongoDB to #Cassandra: Why The Atlas Platform Is Migrating http://owl.li/cvxFK" A tweet has lots of fields, but one important one. The payload is free text but has other elements: To username, Hashtag, Hashtag, URL. From these things you likely want to generate or link to reference data.
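As a rough illustration of pulling those elements out of the free-text payload, the following Python sketch applies simple regular expressions to the example tweet. The patterns and the `extract_elements` name are assumptions made for this sketch. Real tweet parsing has more edge cases than these patterns cover.

```python
import re

tweet = (
    "@markmadsen Check out: From #MongoDB to #Cassandra: "
    "Why The Atlas Platform Is Migrating http://owl.li/cvxFK"
)


def extract_elements(text: str) -> dict:
    """Split a message payload into candidate reference-data keys."""
    return {
        "mentions": re.findall(r"@(\w+)", text),   # "To username"
        "hashtags": re.findall(r"#(\w+)", text),   # topic tags
        "urls": re.findall(r"https?://\S+", text), # linked content
    }


elements = extract_elements(tweet)
print(elements)
```

Each extracted value is a candidate key into reference data: a user record, a topic, or a linked document.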
  50. Internal payload elements form a new graph. The @ elements point to other records and create a deeply linked structure. You have to assemble the linked structure to see what's really there, which means repeated scanning of some/all of the data. The derived pattern is interesting data, sometimes more than the individual messages.
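A minimal sketch of assembling that linked structure, assuming messages arrive as (sender, text) pairs. The sample messages, the user names, and the `mention_graph` helper are hypothetical and only illustrate the idea of deriving a graph from @ elements.

```python
import re

# Hypothetical messages as (sender, text) pairs.
messages = [
    ("alice", "@bob have you seen this?"),
    ("bob", "@alice yes, cc @carol"),
    ("carol", "@bob thanks"),
]


def mention_graph(msgs):
    """Build a directed graph: sender -> set of users they mention."""
    graph = {}
    for sender, text in msgs:  # one full scan of the message data
        for target in re.findall(r"@(\w+)", text):
            graph.setdefault(sender, set()).add(target)
    return graph


graph = mention_graph(messages)

# A derived pattern: pairs who mention each other (a conversation, not a broadcast).
reciprocal = {
    tuple(sorted((a, b)))
    for a, targets in graph.items()
    for b in targets
    if a in graph.get(b, set())
}
print(graph)
print(sorted(reciprocal))
```

The reciprocal pairs exist in no single message. They only appear once the whole set has been scanned and linked, which is why this kind of analysis rereads the data.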
  51. There are many patterns in the data. Follower/following networks are easy: they are explicit and independent of the events. Community detection requires looking at patterns of @ communication in addition to follow relationships. What do you do with these after discovery? [Figures: follower network; conversational communities]
  52. More data: patterns emerge from lots of event data. Patterns emerge from the underlying structure of the entire dataset. The patterns are more interesting than sums and counts of the events. Web paths: clicks in a session as network node traversal. Email: traffic analysis producing a network. The event stream is a source for analysis, generating another set of data that is the source for different analysis.
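The "clicks in a session as network node traversal" idea can be sketched by turning each session's ordered page views into edges and counting them. The page names, the sample sessions, and the `transition_counts` helper below are assumptions for illustration, not data from the slides.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one visitor viewed.
sessions = [
    ["home", "gifts", "card", "checkout"],
    ["home", "gifts", "flowers"],
    ["search", "card", "checkout"],
]


def transition_counts(paths):
    """Count page-to-page transitions (graph edges) across all sessions."""
    edges = Counter()
    for pages in paths:
        # Consecutive page views form one edge of the traversal.
        edges.update(zip(pages, pages[1:]))
    return edges


edges = transition_counts(sessions)
for (src, dst), count in edges.most_common(3):
    print(f"{src} -> {dst}: {count}")
```

The edge counts are a new data set derived from the event stream, and they can in turn feed further analysis such as finding common paths to checkout.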
  53. Big changes for data warehousing workloads. The results of analytic processing can, and often do, feed back into the system from which they originate. Much of the data is being read, written and processed in real time. Our design point was not changing tables and ephemeral patterns.
  54. Unstructured Is Not Really Unstructured. Unstructured data isn't really unstructured: language has structure. Text can contain traditional structured data elements. The problem is that the content is unmodeled.
  55. THE BIG CHANGE ISN'T TECHNOLOGY, IT'S ARCHITECTURE
  56. There are really three workloads to consider, not two: 1. Operational: OLTP systems; 2. Analytic: OLAP systems; 3. Processing: computational systems. Unit of focus: 1. Transaction; 2. Query; 3. Computation. Different problems require different platforms.
  57. Workloads
       Workload        OLTP          BI             Analytics
       Access          Read-write    Read-only      Read-mostly
       Predictability  Predictable   Unpredictable  Fixed path
       Selectivity     High          Low            Low
       Retrieval       Low           Low            High
       Latency         Milliseconds