How Real Time Data Changes the Data Warehouse
DESCRIPTION
Surveys show a growing demand for more up-to-date data in our BI environments. Meeting this demand requires moving from a strict reliance on nightly batch-style ETL to other methods. What is often ignored is how this affects the data warehouse. The shift introduces new technology and methods, which means the warehouse must support new types of workloads.
• Methods and tools for processing up-to-date data
• New requirements for your data warehouse database or platform
• What to look for as you address these requirements
TRANSCRIPT
Attribution-NonCommercial-No Derivative
http://creativecommons.org/licenses/by-nc-nd/3.0/us/
How Real Time Data Requirements Change the Data Warehouse Environment
Mark Madsen, September 17, 2008
www.ThirdNature.net
Slide 2
Third Nature, January 2008, Mark Madsen
Outline
What’s real-time about?
Impacts on the data warehouse architecture
Delivering data to users
Extracting the data
Storing the data
Operations
Getting started
Slide 3
Speeding Up the Data Warehouse
Why?
Faster reaction time
Reduced decision time
New process capabilities
Slide 4
Which Decisions Benefit?
Most real time needs will be driven by operational decision making, not strategic decisions.
Characteristic | Strategic | Operational
Decision time | Flexible, long cycle | Constrained, short cycle
Decision scope | Broad, organizational | Narrow, departmental or process
Decision model | Complex | Simple
Data latency | High, history is core to decisions | Low, recent data is core to decisions
Data scope | Many sources, many types, aggregated | Few sources, structured, detailed
Slide 5
Strategy, Decisions and Data Latency
Increase share of low to mid market customers
Efficient sourcing
Consolidate suppliers
Decrease Out of Stocks
Tactics
Reduce cost of products sold
Strategy
Goal
Improve promotional performance
Catch out of stocks before they occur
Improve delivery compliance
Reports & spreadsheets
Dashboards, alerts & scorecards
Real time alerts & embedded analytics
BI Needs
Slide 6
What People Are Doing Today
[Chart: percentage breakdown (0% to 100%) for 2002, 2004 and 2006 by category: Monthly, Weekly, Daily, Multiple times per day, On demand]
Sources: TDWI, Gartner
At the same time, data volumes are rising for most data warehouses at 50% to 100% per year.
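To make the growth figure concrete, the short calculation below compounds the 50% and 100% annual rates over a few years. It is only a worked illustration: the 1 TB starting size and the three-year horizon are assumptions chosen for the example, not figures from the surveys.

```python
def projected_size(initial_tb: float, annual_growth: float, years: int) -> float:
    """Compound growth: warehouse size after `years` at a fixed annual rate."""
    return initial_tb * (1 + annual_growth) ** years


# Hypothetical 1 TB warehouse, projected three years out at both ends of the range.
low_estimate = projected_size(1.0, 0.5, 3)   # 50% per year
high_estimate = projected_size(1.0, 1.0, 3)  # 100% per year

print(f"50% growth:  {low_estimate} TB after 3 years")
print(f"100% growth: {high_estimate} TB after 3 years")
```

Under these assumptions the warehouse grows to between roughly 3.4 and 8 times its starting size in three years, while the load window is being asked to shrink.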
Slide 7
BI Efforts Involving Real Time Data Access
Terms you may hear from the BI market that imply real time:
• Operational BI
• Embedded analytics
• Decision automation
• Complex event processing
• Event-driven BI
• Process-driven BI
They are all similar in requiring some level of low latency data access.
Slide 8
Impacts on the DW Architecture
Source Environments: Databases, Documents, Flat Files, XML, Queues, ERP, Applications
Data Consumers: Databases, Dashboards, OLAP, Productivity, BAM/BPM, Reporting, Analytics, Applications
DW Platforms (diagram labels): Delivery, Warehouse Database, ETL, Mart, ODS, EDR, EII, Content Store
Adding current data to the system requires effort at all three layers.
Slide 9
One Architecture or Two?
In-line with process:
• Real time data flows separately from the warehouse data
• May include a low-latency data store in the real time environment
• This model may be needed for extremely low latency data
• More applicable for event-driven usage
Out of band:
• Data to the consumer first flows through the DW
• Unified architecture for both low and high latency data
• More applicable for on-demand usage
[Diagram labels, two architectures: Process, Batch DW, BI, RT BI. One architecture: Process, DW, BI & RT BI.]
Slide 10
User Interface: Two BI Usage Models
Demand driven
• Users ask for current data
• Most BI tools work this way
• Harder to adapt these tools to event-driven models
Event driven
• System takes action based on data, e.g. alerts, rule engines
• May not have (or need) an end user interface
• Need understanding of decision & action process for this model
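The difference between the two usage models can be sketched in a few lines. This is a minimal illustration, not a real BI tool: the `StockFeed` class, the `low_stock_rule` function and the threshold of 10 are all hypothetical names and values invented for the example. The `query` method stands for the demand-driven model (someone asks), and `update` stands for the event-driven model (the system evaluates rules on every change and acts without a user interface).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StockFeed:
    """Hypothetical in-memory store of current stock levels per product."""
    levels: Dict[str, int] = field(default_factory=dict)
    rules: List[Callable[[str, int], None]] = field(default_factory=list)

    def query(self, product: str) -> int:
        # Demand driven: a user or BI tool asks for the current value.
        return self.levels.get(product, 0)

    def update(self, product: str, quantity: int) -> None:
        # Event driven: each change is pushed through the registered rules.
        self.levels[product] = quantity
        for rule in self.rules:
            rule(product, quantity)


alerts: List[str] = []


def low_stock_rule(product: str, quantity: int, threshold: int = 10) -> None:
    """Example rule: record an alert when stock falls below the threshold."""
    if quantity < threshold:
        alerts.append(f"low stock: {product} ({quantity})")


feed = StockFeed()
feed.rules.append(low_stock_rule)
feed.update("widget", 50)  # above threshold, no alert
feed.update("widget", 4)   # below threshold, rule fires
```

Note that the rule encodes a decision and an action, which is why the event-driven model needs an understanding of the decision process before anything is built.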
Slide 11
BI Tools Need New Capabilities
Embedding BI within applications
• UI embedding
• Full embedding
Event-based integration
Feeding BI data to applications: services, not SQL, may be desired
Custom UI code may be preferable to a BI tool
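As a rough sketch of "services, not SQL": the application calls a function and gets back JSON, and the query stays hidden behind it. Everything here is assumed for illustration, including the `inventory` table, its columns, the sample rows and the `out_of_stock_service` name; an in-memory SQLite database stands in for the warehouse.

```python
import json
import sqlite3


def open_demo_warehouse() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the warehouse (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (store_id INTEGER, product TEXT, on_hand INTEGER)"
    )
    conn.executemany(
        "INSERT INTO inventory VALUES (?, ?, ?)",
        [(1, "widget", 0), (1, "gadget", 12), (2, "widget", 7)],
    )
    return conn


def out_of_stock_service(conn: sqlite3.Connection, store_id: int) -> str:
    """Service-style call: the caller passes a store id and receives JSON."""
    rows = conn.execute(
        "SELECT product FROM inventory "
        "WHERE store_id = ? AND on_hand = 0 ORDER BY product",
        (store_id,),
    ).fetchall()
    return json.dumps({"store_id": store_id, "out_of_stock": [r[0] for r in rows]})


warehouse = open_demo_warehouse()
response = out_of_stock_service(warehouse, 1)
print(response)
```

In practice the function would sit behind whatever service interface the application developers already use; the point is only that the SQL is not the contract.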
Slide 12
The Data Integration Layer
• Integration is the most complex element of adding real time data.
• Inline vs. out of band, demand vs. event-driven BI usage create different DI requirements.
• You may not have exactly the same metrics, attributes or data extract logic.
• Don’t count on replacing the ETL batch; more likely you are augmenting it.
• You probably need to add new DI technologies to your portfolio.
• Batch performance design isn’t like real time design.
Slide 13
Speeding Up Data Integration Methods
Latency range: Hourly+ to Immediate
Methods: Single batch, Frequent batch, Mini-batch, Continuous load, Streaming
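One of these methods, the mini-batch, can be sketched as a small incremental load that runs often and copies only what is new. The example below is an assumption-laden illustration: the `source_orders` and `dw_orders` tables, the use of an increasing `id` as the watermark, and the batch size are all hypothetical, and SQLite stands in for both the source system and the warehouse.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create hypothetical source and warehouse tables in one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE dw_orders (id INTEGER PRIMARY KEY, amount REAL);
        """
    )
    return conn


def load_mini_batch(conn: sqlite3.Connection, batch_size: int = 100) -> int:
    """Copy rows newer than the highest id already loaded; return rows copied."""
    # The watermark is the largest id already present in the warehouse table.
    watermark = conn.execute(
        "SELECT COALESCE(MAX(id), 0) FROM dw_orders"
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT id, amount FROM source_orders WHERE id > ? ORDER BY id LIMIT ?",
        (watermark, batch_size),
    ).fetchall()
    conn.executemany("INSERT INTO dw_orders (id, amount) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = setup()
conn.executemany(
    "INSERT INTO source_orders (amount) VALUES (?)", [(10.0,), (20.0,), (30.0,)]
)
first = load_mini_batch(conn)   # picks up the three new rows
conn.execute("INSERT INTO source_orders (amount) VALUES (?)", (40.0,))
second = load_mini_batch(conn)  # picks up only the row added since
third = load_mini_batch(conn)   # nothing new to load
```

Running the same function every few minutes gives a frequent or mini-batch load; a nightly single batch and a streaming feed sit at the two ends of the same range and usually need different tooling.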
Slide 14
The Platform Layer: Data and Database
• Schemas will need changes.
• You don’t need to convert the entire database to a real time schema.
• One schema or two?
• Event-driven BI creates different query patterns and workloads.
• Configuration and tuning may be different than what you are used to with traditional BI.
• Application developers want services or ORMs, not SQL.
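One possible answer to "one schema or two?" is to keep low-latency rows in a separate table and expose both through a view, so only part of the database changes. The sketch below assumes this approach purely for illustration; the `sales_history`, `sales_today` and `sales_all` names and the sample rows are hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Batch-loaded history, maintained by the existing nightly ETL.
    CREATE TABLE sales_history (sale_id INTEGER, store_id INTEGER, amount REAL);
    -- Small table that receives the current day's rows at low latency.
    CREATE TABLE sales_today (sale_id INTEGER, store_id INTEGER, amount REAL);
    -- Queries that need current data read the combined view.
    CREATE VIEW sales_all AS
        SELECT sale_id, store_id, amount FROM sales_history
        UNION ALL
        SELECT sale_id, store_id, amount FROM sales_today;
    """
)
db.execute("INSERT INTO sales_history VALUES (1, 1, 20.0)")
db.execute("INSERT INTO sales_today VALUES (2, 1, 5.0)")

total = db.execute(
    "SELECT SUM(amount) FROM sales_all WHERE store_id = 1"
).fetchone()[0]
print(total)
```

Existing reports can keep reading the history table unchanged, while real time queries use the view; the trade-off is that the two tables now carry different load and query workloads on the same platform.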
Slide 15
Different Platform Workloads
Three workloads:
Data loading + Normal BI + Real time BI = complications
Source Environments: Databases, Documents, Flat Files, XML, Queues, ERP, Applications
Data Consumers: Databases, Dashboards, OLAP, Productivity, BAM/BPM, Reporting, Analytics, Applications
DW Platforms (diagram labels): Delivery, Warehouse Database, ETL, Mart, ODS, EDR, EII, Content Store
Slide 16
Development, Maintenance & Operations
• Real time decisions on real time data mean data quality plays a larger role, and it’s harder to address.
• Warehouse availability becomes much more important to the business, and it isn’t just the database – it’s everything.
• Performance and meeting strict BI SLAs will rise in importance since you are now tied in to business operations.
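A simple way to picture a latency SLA is a freshness check that compares the time of the last successful load with the agreed maximum delay. This is only a sketch under assumptions: the `is_fresh` function, the 15-minute limit and the timestamps are invented for the example, and a real check would read the load time from the warehouse's own audit data.

```python
from datetime import datetime, timedelta


def is_fresh(last_loaded: datetime, now: datetime, max_latency: timedelta) -> bool:
    """Return True when the newest loaded data is within the agreed latency."""
    return now - last_loaded <= max_latency


# Hypothetical SLA: data must be no more than 15 minutes old.
sla = timedelta(minutes=15)
now = datetime(2008, 9, 17, 12, 0)

on_time = is_fresh(datetime(2008, 9, 17, 11, 50), now, sla)  # 10 minutes old
late = is_fresh(datetime(2008, 9, 17, 11, 30), now, sla)     # 30 minutes old
```

A check like this only covers the load step; because availability "is everything", the same question has to be asked of the source feeds, the integration jobs and the delivery layer.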
Slide 17
A Prescription for Getting Started
1. Start with a decision process
2. Define data needs for the process
3. Ensure that data is available at the right latency
4. Determine appropriate data integration technologies
5. Design and initiate upstream work
6. Build
Slide 18
Thanks
Slide 19
CC Image Attributions
Thanks to the people who supplied the Creative Commons licensed images used in this presentation:
• Divers - http://flickr.com/photos/raveller/
• Fast dog - http://flickr.com/photos/marinacvinhal/379111290/
• Febo - http://flickr.com/photos/igor/419425754/
• Subway - http://flickr.com/photos/neilsphotoalbum/504517855/
• Cadillac ranch - http://flickr.com/photos/whatknot/179655095/
Page 20
About the Presenter
Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark has received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institution. He is an international speaker, a contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.