data modelingzone geoffrey-clark-v2
DESCRIPTION
I presented these slides at Data Modeling Zone Europe on September 24, 2013.
TRANSCRIPT
![Page 1: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/1.jpg)
Physical Database Design for MPP and Columnar Databases
Geoffrey Clark
Principal at Lucidata, Inc.
September 2013
Copyright, Lucidata, 2013
![Page 2: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/2.jpg)
Conceptual, Logical, Physical
• Conceptual links to Business Strategy.
  – This is now becoming more quantitative
• Logical maps to the Business Semantics.
  – Con-way example
• Physical maps to your Data Stores
  – These will be more varied and heterogeneous in the future, due to specialization.
![Page 3: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/3.jpg)
HBR Business Strategy
The New Dynamics of Competition, Michael D. Ryall, Harvard Business Review, June 2013
Michael Porter’s Five Forces has dominated strategic and competitive analysis since 1979. This analysis has largely been conceptual in nature.
Quantitative analysis on structured data in context is changing the nature of business culture, and improving business decisions.
This drives the demand for data modeling and management.
![Page 4: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/4.jpg)
Design and Evolution
• Hierarchies
  – 14th Century Europe and the Financial Revolution
  – Aggregations & Allocations
• Cards, Tapes – physical analog media
• Computer Science
  – Moore’s Law
    • Processor Speed Improvements
    • Memory Improvements
    • Media Improvements – Punch Cards, Tape, Disk, Memory
• Design for Context & the Future
  – Character encoding - Internationalization
  – Calendars – Gregorian, Fiscal, Lunar, ... Y2K?
• Files and Fields
  – Separation of Data and Metadata
  – Modern versions -> XML, JSON
• Joins!
  – Data Sets – Super types, Sub types
  – Associations describe Networks!
![Page 5: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/5.jpg)
Technology’s Improvement Pace
![Page 6: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/6.jpg)
... and Demand Forecast
![Page 7: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/7.jpg)
Separation of Church and State
• Operational uses
  – Capture the data, hand-entered <- validation
  – A Data Flow, such as Order to Cash cycle
  – Con-way example of PRO(-gressive) numbers
• Analytical uses
  – Desire for reports, Reporting crashes the Operational cycle, Cash flow problem.
  – Banished from OLTP, go make an ODS
![Page 8: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/8.jpg)
The Star Schema
The purpose of business computers is to sort data. A graphical representation of sorted data is called a ‘Star Schema’.
– Michael Silves, Principal at Datamorphosis
• The right design at the right time, becomes default doctrine for DW
  – Early RDBMS (Relational Data Base Management Systems)
    • Low memory, slow disks, slow CPU
    • Big Demand, with questions that spanned the datasets
    • Performance issues over large datasets
  – Interview Business people to get questions
    • Pre-process the data, based on business questions
  – Separation into Dimensions and Facts/Metrics
    • Link to Business Semantics
    • OLAP (On-Line Analytical Processing)
    • Educate Users on Aggregation and Allocation
    • Conformed Dimensions across Departments to give an Enterprise-wide view of the data.
• But as technology changes, problems emerge
  – Ad-hoc questions require redesign & rework
  – With business hierarchies when one concept is both a fact & dimension, e.g. Shipment
  – Fact tables become difficult to distribute for MPP ... e.g. Teradata prefers a normalized DW
• Example – transportation networks
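As a minimal sketch of the separation into Dimensions and Facts/Metrics described above, the following uses Python's built-in sqlite3 module. The table names, columns, and sample values (dim_customer, dim_date, fact_shipment, revenue) are hypothetical and are not taken from the slides.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, fiscal_month TEXT);
    CREATE TABLE fact_shipment (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        revenue REAL
    );
    INSERT INTO dim_customer VALUES (1, 'West'), (2, 'East');
    INSERT INTO dim_date VALUES (20130901, '2013-09'), (20131001, '2013-10');
    INSERT INTO fact_shipment VALUES
        (1, 20130901, 100.0), (1, 20131001, 50.0), (2, 20130901, 75.0);
""")


def revenue_by_region(connection):
    """Aggregate the fact table's metric along one dimension attribute."""
    rows = connection.execute("""
        SELECT c.region, SUM(f.revenue)
        FROM fact_shipment AS f
        JOIN dim_customer AS c ON c.customer_key = f.customer_key
        GROUP BY c.region
        ORDER BY c.region
    """).fetchall()
    return dict(rows)


totals = revenue_by_region(conn)
```

The query shape (join the fact to a dimension, then group by a dimension attribute) is what a pre-built star answers quickly. A question that needs a grain or relationship the star does not hold is where the redesign and rework mentioned above comes in.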
![Page 9: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/9.jpg)
Example – Multi-Modal Freight
• Shipments are agreements between a Carrier and a Shipper to move goods between two places.
• Shipments can be split into “ProFreight” (which is assigned a cost via activity-based costing).
• Shipments/ProFreight are composed of Freight handling units.
• Freight can be “re-tendered” to another carrier, in which case it is linked to the original and the new Shipment.
• Freight moves between places on one or many “VFCs” or Containers.
• Containers are moved between places on Trips.
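The relationships listed above can be sketched as plain Python dataclasses. This is an illustrative reading of the bullets, not the presenter's actual model: the class names, field names, and the retender helper are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trip:
    origin: str
    destination: str


@dataclass
class Container:
    container_id: str
    trips: List[Trip] = field(default_factory=list)  # a container moves on trips


@dataclass
class FreightUnit:
    unit_id: str
    # Re-tendered freight is linked to both the original and the new shipment.
    shipment_ids: List[str] = field(default_factory=list)
    containers: List[Container] = field(default_factory=list)


@dataclass
class ProFreight:
    pro_id: str
    cost: float  # assigned by activity-based costing in the source description
    units: List[FreightUnit] = field(default_factory=list)


@dataclass
class Shipment:
    shipment_id: str
    carrier: str
    shipper: str
    origin: str
    destination: str
    pro_freight: List[ProFreight] = field(default_factory=list)


def retender(unit: FreightUnit, new_shipment: Shipment) -> None:
    """Hypothetical helper: link a freight unit to an additional shipment."""
    unit.shipment_ids.append(new_shipment.shipment_id)


unit = FreightUnit("F1", shipment_ids=["S1"])
original = Shipment("S1", "CarrierA", "ShipperX", "PDX", "ORD",
                    pro_freight=[ProFreight("P1", 120.0, units=[unit])])
replacement = Shipment("S2", "CarrierB", "ShipperX", "PDX", "ORD")
retender(unit, replacement)
```

Because one freight unit can point at several shipments and several containers, the structure is a network rather than a strict hierarchy, which is the property the later slides struggle to fit into a single star.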
![Page 10: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/10.jpg)
Kimball on Transportation, 3NF
![Page 11: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/11.jpg)
Kimball on Transportation, Star
![Page 12: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/12.jpg)
Table Level DW diagram
![Page 13: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/13.jpg)
Dim Modeling Dogma
• “Our carefully normalized data model cannot be translated into a star schema...”
  – Dimensional modeling is necessary in order to generate correct queries
  – Any (normalized) data model can be transformed into a dimensional model...
  – ... and there exists an algorithm to do it
![Page 14: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/14.jpg)
Dim Modeling Example
![Page 15: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/15.jpg)
Star option considered
![Page 16: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/16.jpg)
Bridge table (remember, we tried this)
We tried this with hesmith. When selecting a main hierarchy has too much of a downside, and you don’t have a weight factor …
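To illustrate why the missing weight factor matters, here is a small sketch of a bridge table between freight units and shipments. The rows, costs, and weights are invented for illustration. The point is only that without weights, a measure attached to a many-to-many member is counted once per bridge row.

```python
# Hypothetical bridge rows: (freight_unit, shipment, weight_factor).
# Weights for each freight unit sum to 1.0.
bridge = [
    ("F1", "S1", 0.75),
    ("F1", "S2", 0.25),
    ("F2", "S1", 1.0),
]
freight_cost = {"F1": 100.0, "F2": 50.0}


def cost_by_shipment(bridge_rows, cost, weighted=True):
    """Roll freight cost up to shipments through the bridge table."""
    totals = {}
    for unit, shipment, weight in bridge_rows:
        share = cost[unit] * (weight if weighted else 1.0)
        totals[shipment] = totals.get(shipment, 0.0) + share
    return totals


weighted_totals = cost_by_shipment(bridge, freight_cost)
unweighted_totals = cost_by_shipment(bridge, freight_cost, weighted=False)
```

With weights, the shipment totals still add up to the 150.0 of cost that exists. Without them, F1 is counted in full under both shipments and the grand total inflates to 250.0.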
![Page 17: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/17.jpg)
Multi-fact option considered
![Page 18: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/18.jpg)
Oracle’s Algorithmic approach
![Page 19: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/19.jpg)
Basic DW diagram
![Page 20: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/20.jpg)
Build Dimensional Model in BI
![Page 21: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/21.jpg)
Freight moves through Networks
![Page 22: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/22.jpg)
Information Factory & MPP
• Normalized Base
  – Integrate data once
    • Source -> Normalized -> Denormalized -> OK
    • Source -> Denormalized? -> Un-normalized -> ?
  – Detect problems and fix them once!
• Does not preclude Data Marts
• Massively Parallel Processing
  – Data distribution
    • Optimizations – Broadcast, Co-location, Re-distribution
    • Scalability, the quest for 1:1
    • Normalized data - reduced IO, better match for
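Co-location, one of the distribution optimizations named above, can be sketched in a few lines of Python. This is a toy model rather than any vendor's implementation: the node count, the CRC32-based hash, and the sample tables are assumptions chosen to keep the example deterministic.

```python
import zlib
from collections import defaultdict

NODES = 4  # hypothetical number of MPP nodes


def node_for(key):
    """Pick a node by hashing the distribution key (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % NODES


def distribute(rows, key_index):
    """Place each row on the node chosen by its distribution key."""
    slices = defaultdict(list)
    for row in rows:
        slices[node_for(row[key_index])].append(row)
    return slices


# Two tables distributed on the same key: shipment_id.
shipments = [(sid, f"shipper-{sid % 3}") for sid in range(1, 9)]  # (shipment_id, shipper)
freight = [(fid, 1 + fid % 8) for fid in range(100, 120)]         # (freight_id, shipment_id)

shipment_slices = distribute(shipments, 0)
freight_slices = distribute(freight, 1)


def colocated_join(left_slices, right_slices):
    """Join node by node; no row ever has to move to another node."""
    joined = []
    for node in range(NODES):
        lookup = {row[0]: row for row in left_slices.get(node, [])}
        for freight_id, shipment_id in right_slices.get(node, []):
            if shipment_id in lookup:
                joined.append((freight_id, shipment_id, lookup[shipment_id][1]))
    return joined


joined = colocated_join(shipment_slices, freight_slices)
```

Every freight row finds its shipment on its own node because both tables hash the same key. If freight were distributed on freight_id instead, the join would need a re-distribution step, or a broadcast of the smaller table to every node.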
![Page 23: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/23.jpg)
Bob Conway’s Rapid Methodology
![Page 24: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/24.jpg)
Core Model with many Roles
Transaction Tables
Reference Tables
![Page 25: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/25.jpg)
Power of Conformed Dimensions
![Page 26: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/26.jpg)
Example Data Model & Hierarchy
![Page 27: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/27.jpg)
Data Flow and Usage
![Page 28: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/28.jpg)
Cubes and In-memory BI
• Multi-Dimensional OLAP (MOLAP)
  – Drag-and-Drop OLAP environment, analysts become capable of self-service.
  – Dealt with Ragged Hierarchies, common in Financial data such as General Ledger (GL)
  – Limited by memory size
  – Pressure for more dimensionality floods cube size, build times from relational sources exceed load windows ...
• Relational OLAP (ROLAP)
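The MOLAP point about dimensionality flooding cube size comes down to multiplication. The sketch below computes an upper bound on cell count for a dense cube. The dimension cardinalities are made-up numbers, and real cubes are usually sparse, so treat the result as a ceiling rather than a measurement.

```python
from math import prod


def cube_cells(cardinalities):
    """Upper bound on cells in a dense cube: product of dimension sizes."""
    return prod(cardinalities)


# Hypothetical cardinalities: days, customers, terminals.
three_dims = cube_cells([365, 1000, 50])
# Adding one more dimension (say, 200 service types) multiplies the bound.
four_dims = cube_cells([365, 1000, 50, 200])
growth = four_dims // three_dims
```

Under these assumed numbers, one extra dimension takes the bound from about 18 million cells to 3.65 billion, which is how build times come to exceed load windows.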
![Page 29: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/29.jpg)
But a network this size choked it
![Page 30: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/30.jpg)
Columnar vs Row-wise
• Physically store data by Column vs Row
  – Rather like Fifth Normal Form.
  – If Semantically Organized, then Rapid Response to user’s ad-hoc aggregation requests.
  – Prefers batch loading, always loads once per column, even if loading one row.
• Continues to Appear and Operate as a normal Row-wise cousin.
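A toy comparison of the two layouts, using in-memory Python structures in place of disk pages. The column names and values are hypothetical. The sketch shows the two properties claimed above: an aggregate reads only one column, and loading a single row costs one write per column.

```python
# Row-wise layout: each record keeps all of its fields together.
rows = [
    {"shipment_id": 1, "origin": "PDX", "weight_kg": 120.0},
    {"shipment_id": 2, "origin": "SEA", "weight_kg": 80.0},
    {"shipment_id": 3, "origin": "PDX", "weight_kg": 200.0},
]


def append_row(columns, row):
    """Load one row into the columnar layout; returns the number of column writes."""
    writes = 0
    for name, value in row.items():
        columns.setdefault(name, []).append(value)
        writes += 1
    return writes


def to_columns(row_list):
    """Columnar layout: one list per column, values kept in row order."""
    columns = {}
    for row in row_list:
        append_row(columns, row)
    return columns


columns = to_columns(rows)

# Row-wise aggregation walks every record to pick out one field.
row_total = sum(row["weight_kg"] for row in rows)
# Columnar aggregation reads only the one column it needs.
column_total = sum(columns["weight_kg"])

# Loading a single new row still touches every column.
writes_for_one_row = append_row(
    columns, {"shipment_id": 4, "origin": "SFO", "weight_kg": 50.0}
)
```

Both totals match, so the columnar store can keep appearing to users as an ordinary table. The per-column write cost is the reason batch loading is preferred.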
![Page 31: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/31.jpg)
Columnar IO example
Compression becomes much more effective
Reading a Column is like reading a Row
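One reason compression becomes more effective is that a column holds values of one kind, often with long repeats. Run-length encoding is a simple way to see this. It is used here as an illustrative technique only, since the slides do not say which compression schemes the databases apply, and the sample column is invented.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sorted column of origin terminals.
origin_column = ["PDX"] * 4 + ["SEA"] * 3 + ["SFO"] * 2
encoded = rle_encode(origin_column)
decoded = rle_decode(encoded)
```

Nine stored values shrink to three pairs here. In a row-wise layout the same values are interleaved with other fields, so runs like these rarely sit next to each other.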
![Page 32: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/32.jpg)
Design Pattern for Log Data
Data Stewards for Master Data
Data Stewards for Metadata
Architects integrate data and metadata
Architects organize data for analysis with physical in mind
Architects identify levels for analysis, and distribution
Columnar
MPP
![Page 33: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/33.jpg)
Importance of Reference Data
![Page 34: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/34.jpg)
Infobright’s Database Landscape 2011
![Page 35: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/35.jpg)
Analytic Database Comparison
Actian ParAccel
IBM Netezza
HP Vertica
Greenplum
Teradata
Sybase IQ
![Page 36: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/36.jpg)
Gartner’s Magic Quadrant
![Page 37: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/37.jpg)
Hadoop (Cloudera & Hortonworks)
“Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and analytics technology stacks.”
– Philip Russom, TDWI
http://tdwi.org/webcasts/2013/04/integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx
![Page 38: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/38.jpg)
Hadoop for Analytics?
Analytics performs best on Structured Data, for good reasons.
Maintain MPP strengths in the solution through Architecture.
![Page 39: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/39.jpg)
Message from Hortonworks (Hadoop)
“Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and analytics technology stacks.”
– Philip Russom, TDWI
http://tdwi.org/webcasts/2013/04/integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx
![Page 40: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/40.jpg)
Hadoop as ETL
![Page 41: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/41.jpg)
Data Flow Reference Architecture
![Page 42: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/42.jpg)
Message from Neo4J NoSQL
![Page 43: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/43.jpg)
Message from MongoDB (NoSQL)
http://www.slideshare.net/fullscreen/mongodb/schema-design-by-example/1
![Page 44: Data modelingzone geoffrey-clark-v2](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5455ee5aaf795998788b4b23/html5/thumbnails/44.jpg)
Message from Couchbase (NoSQL)
http://www.couchbase.com/why-nosql/nosql-database