dw 2.0

Post on 24-May-2015

DESCRIPTION

This presentation discusses next-generation data warehousing. Its sources are Bill Inmon's book DW 2.0 and many other web and text resources.

TRANSCRIPT

DW 2.0: The Next Generation Data Warehousing

Lakshminarasu Chenduri

Data Warehousing Practice

AGENDA

• A quick look around Data warehousing

• Evolution of DW2.0

• Lifecycle of data

• Masterdata and Metadata

• Method of Accessing Data

• Structured/Unstructured Data

• Flow of data in DW2.0

• Master Data Management (MDM)

• Future Roadmap

• Conclusion

BACKGROUND

• A Data Warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data.

• Data mart is a data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers.

• Facts are the metrics that business users would use for making business decisions.

• Dimensions are those attributes that qualify facts. They give structure to the facts.
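
The relationship between facts and dimensions can be illustrated with a minimal sketch. The sales rows, the column names and the `total_by_dimension` helper are hypothetical and appear nowhere in the original slides; they only show a numeric fact being qualified by dimension attributes.

```python
from collections import defaultdict

# Hypothetical fact rows: "amount" is the fact (the metric),
# while "region" and "product" are dimensions that qualify it.
SALES_FACTS = [
    {"region": "North", "product": "Widget", "amount": 120.0},
    {"region": "North", "product": "Gadget", "amount": 80.0},
    {"region": "South", "product": "Widget", "amount": 200.0},
]


def total_by_dimension(facts, dimension, measure="amount"):
    """Sum one measure, grouped by the values of one dimension."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dimension]] += row[measure]
    return dict(totals)


if __name__ == "__main__":
    print(total_by_dimension(SALES_FACTS, "region"))
    print(total_by_dimension(SALES_FACTS, "product"))
```

The same fact rows answer different business questions depending on which dimension they are grouped by, which is the sense in which dimensions give structure to the facts.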

EXISTING DATA WAREHOUSES

• Data warehouses consist of:

• Operational database layer

• Data access layer

• Metadata layer

• Informational access layer

• This approach does not define limits for data retention and aging.

• Unorganized Metadata

• Query optimization is a must for such unorganized databases.

EVOLUTION TO THE DW 2.0 ENVIRONMENT

• The demand for more and different uses of technology.

• Online processing of data calls for faster processing, that is, dealing with real-time data.

• The need for integrated, corporate data.

• The need to include unstructured, textual data in the mix.

• Data storage capacity has to grow in proportion to the volume of data.

• The economics of technology: investing in new technologies and paying higher prices.

BUSINESS IMPACT

• Credit card fraud analysis

• Inventory management

• Customer profiles

• Frequent Flier programs in Airlines

• Health analysis: analysis of epidemic and endemic diseases by geographical location.

• Climate and weather reporting based on historical analysis

• SOA Governance and active-exploratory warehousing

WHAT MADE DW 2.0 SO IMPORTANT?

• DW 2.0 is the definition of data warehouse architecture for the next generation of data warehousing.

• The urge to keep up with a dynamic, fast-growing business in terms of capacity, ROI, geography, organization, etc.

• This also means a need to upgrade existing technology to one that can handle petabytes of data.

• To understand how DW 2.0 came about, consider the following shaping factors:

FACTORS THAT INFLUENCED DW 2.0

1. Business value

2. Volume of data to be integrated

3. Cost of setting up a data warehouse

4. Metadata and master data management

5. Medium of storage

6. Need for data warehouses

7. Changing business requirements

LIFECYCLE OF DATA

• As data enters the data warehouse, it begins a lifecycle.

• The phases of this lifecycle are termed sectors of data.

• During its lifecycle, a unit of data can reside in any of the four sectors.

• The following picture depicts the lifecycle of data in the DW 2.0 environment.

SECTORS OF DATA

Sectors of Data:

• Interactive Sector – Very current data from an application; may be as old as a second or a minute.

• Integrated Sector – Nearly current data; may be anywhere from an hour to a day or a week old.

• Near Line Sector – Less-than-current data; may be as old as a month or two.

• Archival Sector – Very old data; may be from a year up to 30 or 40 years old.
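
As a rough illustration, the mapping from the age of data to a sector might be sketched as below. The slides give only approximate ages, so the exact cut-offs (one hour, one week, one year) and the names `SECTOR_LIMITS` and `sector_for_age` are assumptions made for this example.

```python
from datetime import timedelta

# Assumed cut-offs: the slides describe ages loosely ("a second or a minute",
# "an hour to a day or a week", "a month or two", "a year up to 30 or 40 years"),
# so these boundaries are illustrative, not prescribed by DW 2.0.
SECTOR_LIMITS = [
    ("Interactive", timedelta(hours=1)),
    ("Integrated", timedelta(weeks=1)),
    ("Near Line", timedelta(days=365)),
]


def sector_for_age(age: timedelta) -> str:
    """Return the sector that would hold data of the given age."""
    for name, limit in SECTOR_LIMITS:
        if age <= limit:
            return name
    # Anything older than the last limit is treated as archival.
    return "Archival"


if __name__ == "__main__":
    for age in (timedelta(seconds=30), timedelta(days=2),
                timedelta(days=45), timedelta(days=3650)):
        print(age, "->", sector_for_age(age))
```

In practice the boundaries would also depend on access patterns and storage cost, not on age alone.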

STATES OF DATA

States of data:

• Application Data – Data that enters a table from an application (Interactive Sector)

• Corporate Data – Data that enters the warehouse through ETL (Integrated Sector)

• Archived Data – Data that resides in the Archival Sector

WHY DO WE HAVE DIFFERENT SECTORS?

• Query Access pattern differs for each sector

• Each sector holds a distinct volume of data for its time frame

• Each sector is best served by different technologies

• Demarcation between frequently accessed data and rarely accessed data

MASTER DATA MANAGEMENT(MDM)

• The most valuable information that a business owns, representing its core business components: the pillars of the business

• Captures the key data that is common to all parts of the organization.

• Data management teams must be able to visualize and segregate the metadata and the master data of a data warehouse/organization

• The later part of this presentation is dedicated to MDM

CONSIDERATION OF METADATA

• Metadata is a rapidly growing concern for any organization; it relates the business and technical facets of data.

• Physical embodiment of metadata.

• Metadata should be placed with the actual data. This coupling of metadata with the actual data keeps the data understandable over a longer time span.

• This helps when examining archival data; it will be clear what the data is.

METADATA INFRASTRUCTURE IN DW 2.0

• In first-generation data warehouses, there was only a single body of metadata, which exposed the business and technical views.

• In DW 2.0, the types of metadata are clearly demarcated by purpose.

• They are classified as:

• Enterprise Metadata

• Local Metadata

• Business Metadata

• Technical Metadata

METADATA EXPLAINED…

• Enterprise metadata is stored in a locale that is central to all of the tools and all of the processes that exist within the DW 2.0 environment.

• Local metadata is stored in a tool or technology that is central to the usage of the local metadata, e.g. ETL source and target objects, and DBMS directory metadata about tables, repositories, attributes, indexes, etc.

• Business intelligence (BI) universe metadata is about data used in analytical processing. This includes data quality screen specifications: the code for data quality tests, the severity score of the potential error, and the action to be taken when an error occurs.

• Technical metadata holds source descriptions of all data sources, including record layouts and column definitions, as well as business names for all tables and columns mapped to appropriate presentation server objects, join paths, computed columns, and business groupings. It may also include aggregate navigation and drill-across functionality, and it extends to the design logic and functionality of data transformations.

METADATA IN DIFFERENT SECTORS

[Figure: MD (metadata) boxes and a Metadata Repository, shown with the four sectors: Very Current Data (Interactive Sector), Near to Current Data (Integrated Sector), Less than Current Data (Near Line Sector), Old and Very Old Data (Archival Sector).]

METADATA IN ARCHIVAL SECTORS

• Metadata in Archival data is stored with the data itself.

• This is because metadata could be lost over time if it is not kept together with its associated archival data.
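
One way to picture metadata stored with the archival data itself is a self-describing archive record. The sketch below is only an illustration: the JSON layout and the helper names `build_archive_record` and `describe_column` are assumptions for this example, not a format defined by DW 2.0.

```python
import json


def build_archive_record(rows, column_descriptions, source_system):
    """Bundle archived rows with the metadata needed to interpret them."""
    return json.dumps({
        "metadata": {
            "source_system": source_system,   # where the data came from
            "columns": column_descriptions,   # what each column means
        },
        "rows": rows,
    })


def describe_column(archive_text, column):
    """Recover a column's meaning from the archive alone."""
    archive = json.loads(archive_text)
    return archive["metadata"]["columns"][column]


if __name__ == "__main__":
    text = build_archive_record(
        rows=[{"amt": 10}],
        column_descriptions={"amt": "Order amount in USD"},
        source_system="billing",
    )
    print(describe_column(text, "amt"))
```

Because the description travels with the rows, the archive stays readable even if no external metadata repository survives.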

ACTIVE AND PASSIVE REPOSITORIES

• A passive repository is one in which the metadata does not interact in any direct manner with the development and/or the query activities of the end user.

• An active repository is one in which the metadata interacts in an ongoing manner with the development and query activities of the system.

ACCESS OF DATA

• Differences in patterns and frequency of data access.

• For interactive data, units of data change within seconds, even milliseconds.

• For archival data, changes occur over quarters, years and decades.

• For faster query performance, interactive data is handled by technology optimized for it, which is more expensive.

• Archival data is the least used, so it is served by a different, less expensive set of technologies.

The pattern and frequency of data access change dramatically over time

STRUCTURED/UNSTRUCTURED DATA

• Two basic types of data:

• Structured Data

• Unstructured Data

• Structured data is data that comes in a repetitive format.

• Typical examples of structured data are data generated by bank transactions, invoices, various ticketing systems, retail transactions, etc.

• Data from these types of transactions is easily recorded in databases with suitable entities, attributes, keys and indexes.

• In short, such data is well served by standard database technology.

UNSTRUCTURED DATA

• Unstructured data, on the other hand, is classified into two types:

• Textual Unstructured Data (TUD), such as emails, text messages, PowerPoint presentations, text documents, telephone conversations, etc.

• Non-textual Unstructured Data (NTUD), such as photographic images, X-rays, MRI images, diagrammatic illustrations, etc.

• Current technology is not able to handle non-textual data.

• Unstructured textual data can be captured and analyzed, but not easily with current database technology.

• Since TUD is not repetitive, handling it takes specialized effort. This does not mean that TUD has no value; it is valuable indeed.

BLATHER

• Of the many challenges in DW 2.0, screening unstructured data is a significant one.

• When we consider unstructured data, we are interested only in the data that adds value to the business or that becomes a key business metric in itself.

• Consider a text message saying “Hi honey, Bit busy here, will come late tonight”.

• Such personal emails and SMS messages become part of the unstructured data. Data of this type, which is of no use to the business, is called “blather”.

• Screening out and eliminating such data consumes a lot of effort.
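
A very simple screen for blather might be sketched as follows. It is a deliberately naive keyword check, not the screening method of DW 2.0: the `BUSINESS_TERMS` vocabulary and the `is_blather` function are hypothetical, and a real screen would need a vocabulary tuned to the organization and far more linguistic care.

```python
import re

# Hypothetical business vocabulary, assumed for this example only.
BUSINESS_TERMS = {"invoice", "order", "contract", "shipment", "refund", "claim"}


def is_blather(text, business_terms=BUSINESS_TERMS):
    """Treat a message as blather when it mentions no business term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words.isdisjoint(business_terms)


if __name__ == "__main__":
    messages = [
        "Hi honey, Bit busy here, will come late tonight",
        "Please resend the invoice for order 1182",
    ]
    kept = [m for m in messages if not is_blather(m)]
    print(kept)
```

Even this toy version shows why screening is costly: every message has to be read and judged against what the business considers relevant.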

THE FLOW OF DATA IN DW2.0 ENVIRONMENT

• Data flows throughout the DW 2.0 environment.

• Data enters the Interactive Sector either directly or through ETL from an external application. Data flows to the Integrated Sector through the ETL process, coming from the Interactive Sector.

• Data flows from the Integrated Sector to the Near Line Sector or the Archival Sector as it ages.

• On a limited basis, data may flow from the Archival Sector back to the Integrated Sector, and occasionally data flows from the Near Line Sector to the Integrated Sector.
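
The permitted movements can be summarized as a small set of allowed transitions. This is a sketch under assumptions: the "External" node name and the `can_flow` helper are made up for illustration, and aging moves are modelled as leaving the Integrated Sector, in line with the later slides on data flow from the Integrated Sector to the Near Line and Archival Sectors.

```python
# Allowed (source, target) moves between sectors.
# "External" is a hypothetical label for applications outside DW 2.0.
ALLOWED_FLOWS = {
    ("External", "Interactive"),    # data enters directly or through ETL
    ("Interactive", "Integrated"),  # ETL into the Integrated Sector
    ("Integrated", "Near Line"),    # aging
    ("Integrated", "Archival"),     # aging
    ("Near Line", "Integrated"),    # occasional return flow
    ("Archival", "Integrated"),     # limited return flow
}


def can_flow(source: str, target: str) -> bool:
    """Return True if data may move from source to target."""
    return (source, target) in ALLOWED_FLOWS


if __name__ == "__main__":
    print(can_flow("Interactive", "Integrated"))
    print(can_flow("Archival", "Interactive"))
```

A check like this could guard data movement jobs so that, for example, archival data is never pushed straight back into the Interactive Sector.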

THE FLOW OF DATA IN DW2.0 ENVIRONMENT: INTERACTIVE TO INTEGRATED SECTOR

DATA FLOW FROM INTEGRATED TO NEAR LINE SECTOR

DATA FLOW FROM INTEGRATED TO ARCHIVAL SECTOR (Data ageing)

MASTER DATA MANAGEMENT

• Master data can be described by the way that it interacts with other data.

• Master Data Management (MDM) is the technology, tools, and processes required to create and maintain consistent and accurate lists of master data.

• This type of data is used repeatedly across the organization by several business processes.

• It provides a business context through the underlying data models.

IBM’s DEFINITION OF MASTER DATA MANAGEMENT

• IBM’s definition of Master Data Management:

• Decouples master information from individual applications

• Becomes a central, application independent resource

• Ensures consistent master information across transactional and analytical systems

• Simplifies ongoing integration tasks and new application development

• Addresses key issues such as data quality and consistency proactively rather than “after the fact” in the data warehouse

GROUPING OF DATA AND MASTER DATA

• Data in the corporate world is classified as:

• Unstructured

• Transactional

• Metadata

• Hierarchical

• Master

• Master data falls into four business groupings:

• People

• Things

• Places

• Concepts

DIMENSIONS OF MASTER DATA MANAGEMENT

PHASES IN DEVELOPING AN MDM PROJECT

• Identify sources of master data.

• Identify the producers and consumers of the master data.

• Collect and analyze metadata about your master data.

• Appoint data stewards.

• Implement a data-governance program and data-governance council.

• Develop the master-data model.

• Choose a toolset.

• Design the infrastructure.

• Generate and test the master data.

• Modify the producing and consuming systems.

• Implement the maintenance processes.

FUTURE ROADMAP

• Cloud computing – Building data warehouses in public clouds, which makes data location-independent and transparent.

• Private and public clouds offer a sensible way of handling data.

• But this raises a few concerns, such as:

• Data volumes

• Data privacy and governance

• Open-source data integration solutions – These will have a greater impact on cost/ROI.

• Many companies have come up with Java-based ETL engines with comparable performance; still, some concrete standards are needed.

CONCLUSION

• DW2.0 addresses various business and technical needs of Data warehousing.

• This approach also makes it easier to explore the warehouse further in the future; it offers a clear way of managing data from various perspectives.

• The evolution of the industry has brought growth in business and its perspectives; alongside it, we see major growth in technology as well.

• Greater involvement in these has led to the emergence of regulatory compliance, SOA, and mergers and acquisitions.

• This has made creating and maintaining accurate and complete master data, and DW 2.0, a business imperative.

REFERENCES

• Book References:

• DW 2.0: The Architecture for the Next Generation of Data Warehousing, by William Inmon, Derek Strauss and Genia Neushloss

• New Trends in Data Warehousing and Data Analysis by Stanislaw Kozielski and Robert Wrembel

• Enterprise Master Data Management: An SOA Approach to Managing Core Information, by Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run and Dan Wolfson.

• Web References:

• http://en.wikipedia.org/

• http://msdn.microsoft.com/en-us/library/bb190163.aspx

• http://www.sqlmag.com/

• http://www.ibm.com/
