
The New Data Warehouse Economics

Executive Summary

The real message about the economics of data warehousing is not that everything is less costly as a result of Moore’s Law [1], but rather that the vastly expanded mission of data warehousing, and analytics in general, is only feasible today because of those economics.

While “classical” data warehousing was focused on creating a process for gathering and integrating data from many, mostly internal, sources and provisioning data for reporting and analysis, today organizations want to know not only what happened, but what is likely to happen and how to react to it. This presents tremendous opportunities, but a very different economic model.

For twenty-five years, data warehousing has proven to be a useful approach to reporting on and understanding an organization’s operations. Before data warehousing, reporting and provisioning data from disparate systems was an expensive, time-consuming and often futile attempt to satisfy an organization’s need for actionable information. Data warehousing promised clean, integrated data from a single repository.

The ability to connect a wide variety of reporting tools to a single model of the data catalyzed an entire industry, Business Intelligence (BI). However, the application of the original concepts of data warehouse architecture and methodology is today weighed down by complicated methods and designs, inadequate tools and high development, maintenance and infrastructure costs.

Until recently, computing resources were prohibitively costly, and data warehousing was constrained by a sense of “managing from scarcity.” Instead, approaches were designed to minimize the size of the databases through aggregating data, creating complicated designs of sub-setted databases and closely monitoring the usage of resources.

[1] The number of transistors per square inch on integrated circuits doubled every year from the invention of the integrated circuit, and Moore predicted that this trend would continue for the foreseeable future. Current thinking puts the doubling at roughly every eighteen months.

Proprietary hardware and database software are a “sticky” economic proposition – market forces that drive down the cost of data warehousing are blocked by the locked-in nature of these tools. Database professionals continue to build structures to maintain adequate performance, such as data marts, cubes, specialized schemas tuned for a single application and multi-stage ETL [2], that slow the update of information and hinder innovation.

Market forces and, in particular, the economics of data warehousing are intervening to change the picture entirely. Today, the economics of data warehousing are significantly different. Managing from scarcity has given way to a rush to find insight in all of the data, internal and external, and to expand the scope of uses to everyone. Tools, disciplines, opportunities and risks have been turned upside-down.

How do the new economics of data warehousing play out? There are already enormous proven benefits to organizations, such as:

- Offloading analytical work from data warehouses to analytical platforms that are purpose-built for the work
- Capturing data more quickly and in greater detail, decreasing the time to decisions
- Delaying or even forgoing expensive upgrades to existing platforms
- Shortening the lag in analyzing new sources of data, or even eliminating it
- Broadening the audience of those who perform analytics and those who benefit from it
- Shortening the time needed to add new subject areas for analysis from months to days

[2] ETL, or Extract, Transform and Load, tools manage the process of moving data from original source systems to data warehouses and often include operations for profiling, cleansing and integrating the data to a single model.


Economic realities present a new era in Enterprise Business Intelligence, disseminating analytical processing throughout and beyond the organization, not just to a handful of analysts. With the accumulated effect of Moore’s Law, the relentless doubling of computing and storage capacity [3] every two years with a corresponding drop in prices, plus engineering innovation and Linux-based and open-source software, a redrawing of the data warehouse model is needed.

Expanding the scope and range of data warehousing would be prohibitively expensive based on the economics of twenty-five years ago, or even ten years ago. The numbers are in your favor.

The Numbers

Part of the driving force behind the new economics of data warehousing is simply the advance of technology. For example:

- From 1993 to 2013, the cost of a terabyte of disk storage dropped from $2,000,000 to less than $100 (this is the cost of the drives alone; storage management hardware and software are separate)
- Dynamic RAM (DRAM), the volatile memory of the computer, has seen a similar drop in price ($250/4Mb of 8-bit memory versus $450/32Gb of 64-bit memory). Today’s servers can be fitted with 64,000 times as much memory for only twice the cost of twenty years ago.
- Cost/Gb is only part of the story. There is a similarly dramatic increase in density (from 1Mb to 8Gb/chip), allowing for smaller installations that take up less space and use less energy
- A proprietary Unix server with four Intel 486 chips, 384Mb RAM and 32Gb of storage cost $650,000 in 1993, more than $1 million in constant dollars. Today’s mid-range laptops of the same capacity cost between $2,000 and $4,000.

[3] Though technically not a result of Moore’s Law, the capacity and cost of storage follow a similar and even more dramatic curve.

To put these numbers in perspective, since a chart simply cannot adequately convey the magnitude of the effect of Moore’s Law over time, the table below compares the specifications of a 1988 Ferrari Testarossa, one of the true supercars of its era, with its successor twenty-five years later, the 2013 F12, and with a 1988 Testarossa “enhanced” by twenty-five years of Moore’s Law:

                   1988 Ferrari        2013 Ferrari       1988 Ferrari “enhanced”
                   Testarossa          F12                by Moore’s Law
Weight             3,300 lb.           3,355 lb.          1.65 oz.
MPG                11                  12                 352,000
Cost New           $200,000            $330,000           $6.25
Horsepower         390                 730                12,500,000
0-60               5.3 seconds         3.1 seconds        0.0002 seconds
Top Speed          180 mph             211 mph            4,800,000 mph
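
As a rough check on the arithmetic, here is a small Python sketch (an illustration, not from the paper) that applies an assumed improvement factor of roughly 32,000, about fifteen doublings over twenty-five years, or one every twenty months, to the 1988 specifications. The results land close to the “enhanced” column above, although the 0-60 and top-speed rows in the table appear to use a somewhat smaller factor.

```python
# Illustrative sketch only: apply a Moore's Law-style improvement factor to
# the 1988 Testarossa specifications. Assumption: ~15 doublings over 25 years
# (2 ** 15 = 32,768, rounded here to 32,000), which reproduces most of the
# "enhanced" column in the table above.

FACTOR = 32_000  # assumed factor; not stated explicitly in the paper

testarossa_1988 = {
    "weight_lb": 3300,
    "mpg": 11,
    "cost_new_usd": 200_000,
    "horsepower": 390,
    "zero_to_sixty_s": 5.3,
    "top_speed_mph": 180,
}

enhanced_1988 = {
    "weight_oz": testarossa_1988["weight_lb"] * 16 / FACTOR,       # far lighter
    "mpg": testarossa_1988["mpg"] * FACTOR,                        # far more efficient
    "cost_new_usd": testarossa_1988["cost_new_usd"] / FACTOR,      # far cheaper
    "horsepower": testarossa_1988["horsepower"] * FACTOR,          # far more powerful
    "zero_to_sixty_s": testarossa_1988["zero_to_sixty_s"] / FACTOR,
    "top_speed_mph": testarossa_1988["top_speed_mph"] * FACTOR,
}

for spec, value in enhanced_1988.items():
    print(f"{spec}: {value:,}")
```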

Expensive to Build and Maintain

The reality today is that moderately priced, high-performance hardware is widely available. The cost of CPUs, memory and storage devices has plummeted in the past five to ten years, but data warehouse design, completely out of step with market realities, is still based on getting extreme optimization from expensive and proprietary hardware. A great deal of the effort involved in data modeling a data warehouse is performance-related. In some popular methodologies, six or seven different schema types are employed, with as many as twenty or thirty or more separate schemas for various use cases.

This has the effect of providing adequate query performance for mixed workloads, but the cost is limited access to the data and extreme inflexibility born of the complexity of the design. Agility is difficult due to the number of elements that must be designed and modified over time. Under today’s cost structure, it makes more sense to ramp up the hardware and simplify the models for the sake of access and flexibility.

Annual data warehouse maintenance costs are considerably higher than those of operational systems. This is partly due to the iterative nature of analytics, with changing focus based on business conditions, but that is only part of the equation. The major cause is that the architecture and best practices are no longer aligned with the technology. It takes a great deal of time to maintain complex structures, and the complexity undermines understanding and usability on the front end. For those BI tools that do provide a useful semantic layer [4], allowing business models to be expressed in business terms rather than database terms, user queries quickly overwhelm resources with queries that are complex from the database perspective. Many BI tools still operate without a semantic layer between the user and the physical database objects, forcing non-technical users to deal with concepts like tables, columns, keys and joins, which makes navigation nearly impossible. All of these problems can be solved today by taking advantage of the tools made possible by new market economics.

[4] A semantic layer translates physical names of data in source systems into common names for analysis. For example, systems 1 and 2 may name “dog” CANINE_D4 and “Chien,” respectively, but the semantic layer abstracts these terms to a common naming convention of “dog” for understandability and consistency.

In today’s BI environments, with “best practices” data warehouses, the business models are burned into the circuitry of the relational database schemas. The range of analysis possible is limited by the ability of the database server to process the queries. Some BI software has the capability to do much more, but it is constrained by the computing resources in the database server. These tools can present a visual model to business users and allow them to manipulate, modify, enhance and share their models. When a request for analysis is invoked, the tools dynamically generate thousands of lines of SQL per second against the largest databases in the world, asking the most difficult questions. The use of these tools, though they are precisely what knowledge workers need, is very limited today because most data warehouses are built on proprietary platforms that are an order of magnitude more expensive and one to two orders of magnitude slower than those available from newer entrants into the market like Vertica.
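
The semantic layer described above can be pictured with a minimal sketch (the source and column names are hypothetical, not from the paper): the layer is essentially a mapping from business terms to the physical names each source system uses, so users and tools never have to see the physical schemas.

```python
# Minimal, hypothetical sketch of a semantic layer: business terms are mapped
# to the physical table/column names used by each source system, so BI users
# query "dog" or "revenue" without knowing any physical schema.

SEMANTIC_LAYER = {
    # business term : {source system : physical column reference}  (illustrative)
    "dog":     {"system_1": "ANIMALS.CANINE_D4", "system_2": "BETES.CHIEN"},
    "revenue": {"system_1": "SALES.REV_AMT_USD", "system_2": "FACT_CA.MONTANT"},
}

def physical_name(business_term: str, source: str) -> str:
    """Resolve a business term to the physical column name in one source system."""
    return SEMANTIC_LAYER[business_term][source]

if __name__ == "__main__":
    print(physical_name("dog", "system_2"))  # -> BETES.CHIEN
```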

How This Affects Operations and Strategy

The dramatic change in the cost of computing may seem compelling, and it is, but relative hardware costs are not the entire picture. When considering the “economics” of data warehousing, it is just as important to take into account the whole picture: the “push” from technology and the lessening of managing from scarcity. For example:

- A shift in labor from highly skilled data management staff to more flexible scenarios where more powerful platforms can take over more of the work, through both brute force (faster, more complex processing) and better software designed for the new platforms
- The amount of data captured for analytics has also grown dramatically, which to some extent, but not entirely, normalizes the relative cost decrease. In addition, as today’s analytical applications become more operational, the need for speed and reduced latency (time to value) requires faster devices such as solid-state storage devices (SSDs), which cost 5x to 10x the price of conventional disk drives (but will continue to drop in relative price)
- There is a far wider audience for reporting and analytics as a result of better presentation tools and mobile platforms, which drives the requirement for fresher data and analysis, particularly on mobile devices
- Data warehousing and BI were predominantly an inward-focused operation, providing information and reporting to staff. This is no longer the case. Now analytics are embedded in connections to customers, suppliers, regulators and investors.
- The availability of cloud computing offers some very promising opportunities, particularly in agility and scale

The role of the data warehouse as the single repository of integrated historical data, with an emphasis on the repository function supplying reporting needs rather than analytics, has been reversed. BI is still useful and needed to explain the Who, What and Where questions, but there is a driving force across all organizations today to understand the “Why,” and to do something about it. Understanding the “Why” requires direct access to much more detailed and diverse data than was possible in first-generation data warehouse architectures, and it has to be handled by systems capable of orders of magnitude faster processing.

New database technology designed specifically for analytics and data warehouses, such as data warehouse/analytics appliances, columnar orientation, optimizers designed for analytics, large memory models and cloud orientation, appeared a few years ago, and these are mature products today. Most organizations are taking a fresh look at data warehouse architectures that are purpose-built for fast implementation, greater scale and concurrency, rapid and almost continuous update, lower cost of maintenance and dramatically more detailed data for analysis, in order to deliver better outcomes to the organization.


One thing is certain: the data warehousing calculus has changed dramatically in the past few years, with many options not available previously. As a result, the cost/benefit analysis has altered. In particular, data warehouse technology today can offer more function and value, often at lower cost, especially when considering Total Cost of Ownership.

Looking at Total Cost of Ownership

Total Cost of Ownership is a financial metric that is part of a collection of tools to measure the financial impact of investments. Unlike ROI (Return on Investment) or IRR (Internal Rate of Return), TCO does not directly take into account the benefits of an investment. What it does is attempt to calculate, either prospectively or retrospectively, the cost of implementing and running a project over a fixed period of time, often three to five years. A simple TCO calculation examines the replacement cost of something plus its operating costs, and compares them with some other option, such as not replacing the thing. Alternatively, it may compare the TCO of various options. People often perform an informal TCO analysis when buying a car, considering, say, a five-year TCO including purchase or lease, fuel, insurance and maintenance.

The value of a TCO model is to reduce the prominence of initial startup costs in decision making in favor of taking a longer view. For example, of three candidate data warehouse platforms, the one with the highest purchase cost may have the lowest long-term running cost because of maintenance, energy or labor costs.

One application of TCO analysis in the new economics of data warehousing might compare the costs of implementing Hadoop as a data integration platform instead of existing tools such as enterprise ETL/ELT (Extract, Transform and Load). Because Hadoop is open-source software it has no purchase or license costs (though there are commercial distributions with conventional licensing at modest cost), and it appears to be a very appealing alternative using TCO:

            Software         Software          Server            Server
            License          Annual            Purchase          Annual
            Purchase         Maintenance       w/utilities       Maintenance
ETL         $500,000         $100,000          $175,000          $35,000
Hadoop      0                0                 $40,000           0

                        ETL               Hadoop
At Inception            $810,000*         $40,000
Year 2                  135,000           0
Year 3                  135,000           0
Year 4                  135,000           0
Year 5                  135,000           0
Total 5-year TCO        $1,350,000        $40,000

* Maintenance fees are typically paid a year in advance

This example does not demonstrate the desirability of Hadoop over ETL. It demonstrates how easy it is to make mistakes with TCO by not considering all of the factors. For example, ETL technology is in its third decade of development and is designed expressly for one application: the integration of data for data warehouses. Hadoop is a general-purpose collection of tools for managing large sets of messy data. What the model above excludes is the cost of the programming effort needed for Hadoop to perform what ETL already can without programming. In addition, ETL tools have added benefits that are often not evident until a few years into the project, such as connectors to every imaginable type of data without programming, mature metadata models and services that ease enhancement and agility, developer environments with version control, administrative controls and even run-time robots that advise of defects or anomalies as they occur.

Once the costs of labor, delays and downtime are factored in, the TCO model provides a very different answer. The point of this example is not to promote ETL over Hadoop or vice versa; it is only to illustrate why it is so important, in this fairly turbulent time of change, to build your models carefully.
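
To make the point concrete, here is a minimal Python sketch (not part of the original analysis) that reproduces the five-year figures from the table above and then adds annual labor estimates. The labor numbers are purely illustrative assumptions, included only to show how easily such factors can reverse the ranking.

```python
# Illustrative 5-year TCO sketch. The license, maintenance and server figures
# come from the table above; the annual labor estimates are hypothetical
# assumptions added to show how quickly labor can dominate the comparison.

def five_year_tco(license_cost, annual_sw_maint, server_cost,
                  annual_srv_maint, annual_labor=0, years=5):
    # Inception: purchases plus the first year of maintenance, paid in advance.
    inception = license_cost + server_cost + annual_sw_maint + annual_srv_maint
    # Years 2..N: recurring maintenance. Labor recurs every year.
    recurring = (years - 1) * (annual_sw_maint + annual_srv_maint)
    labor = years * annual_labor
    return inception + recurring + labor

# Hardware/software inputs from the table; labor figures are assumptions only.
etl    = five_year_tco(500_000, 100_000, 175_000, 35_000, annual_labor=150_000)
hadoop = five_year_tco(0,       0,       40_000,  0,      annual_labor=450_000)

print(f"ETL 5-year TCO:    ${etl:,}")     # $2,100,000 with the assumed labor
print(f"Hadoop 5-year TCO: ${hadoop:,}")  # $2,290,000 with the assumed labor
```

With no labor included, the function returns $1,350,000 for ETL and $40,000 for Hadoop, matching the table; with the assumed staffing, the ordering flips.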

Unlike ROI and its various derivatives, TCO does not discount future cash flows, for two reasons: first, because the time periods are relatively short, and second, because TCO costs are not actually investments. The goal of TCO is to find (or measure) the cost of the implementation over time.

Keep in mind that a comparative analysis using TCO does not address the question of whether you should make the investments; it only compares various alternatives. The actual decision has to be based on an investment methodology that considers benefits as well as costs.

Big Data and Data Warehouses

Until now, this discussion has centered on data warehouses and BI. Big data is the newest complication in the broader field of analytics, but its ramifications affect data warehousing to a great extent. Much is made of big data being a “game changer” and the successor to data warehousing. It is not. If anything, it represents an opportunity for data warehousing to fulfill, or at least extend its reach closer to, its original purpose: to be the source of useful, actionable data for all analytical processing. Data warehouse thinking will of necessity have to abandon its vision of one physical structure to do this, however. Instead, it will have to cooperate with many different data sources as a virtualized structure: a sort of surround strategy of “quiet” historical data warehouses, extreme, unconstrained analytical databases providing both real-time update and near-time response, and non-relational big data clusters such as Hadoop. Big data forces organizations to extend the scale of their analytical operations, both in terms of volume and variety of inputs, and, equally important, to stretch their vision of how IT can expand and enhance the use of technology both within and beyond the organization’s boundaries.

Big data is useful for a wide range of applications in gathering and sifting through data, and the most often repeated uses are gathering and analyzing information after the fact to draw new insights not possible previously. However, in three to five years, big data will move into production for all sorts of real-time activities such as biometric recognition, self-driving cars (maybe ten years for that) and cognitive computing, including natural-language question-answering systems. This will only be possible because computing resources continue to expand in power and drop in relative cost. Examples include in-memory processing, ubiquitous cloud computing, text and sentiment analysis and better devices for using the information. However, as crucial as the hardware and software components are, the real advances will happen with the wide adoption of semantic technologies that use ontologies [5], graph analysis [6] and open data [7].

[5] Ontologies are a way to represent information in a logical relationship format. One benefit is the ability to create them and link them together with other, even external, ontologies, providing a way to use the meaning of things across domains.

[6] Relational databases employ a logical concept of tables: two-dimensional arrangements of identically structured rows of records. Graphs are a completely different arrangement: things connected to other things by relationships. Graphs are capable of traversing massive amounts of data quickly and are often used in search, path analysis, etc.

[7] Open data is a collaborative effort of organizations and people to provide access to clean, consistent shared information.

Economic realities are ushering in a new era in Enterprise Business Intelligence, disseminating analytical processing throughout and beyond the organization. BI is currently too expensive, too limited in scope and too rigid. The economics of computing are not reflected in current data warehousing best practices. The solution to problems of latency and depth of data is the application of the vast computing and storage resources now available at reasonable prices. Due to Moore’s Law and innovations in query processing techniques, workarounds and limitations that have become “best practices” can be abandoned. Forward-looking companies like H-P/Vertica are positioned to provide the kinds of capacity, easy scalability, affordability (both purchase cost and TCO [8]) and simplicity that allow for broadly inclusive environments with far greater flexibility. Tedious design work and endless optimization that slow a project can be eliminated with new approaches at a much lower cost.

[8] Total Cost of Ownership.

In addition to servicing the reporting and analysis needs of business people, data warehouses will soon be positioned to support a new class of applications: unattended decision support. So-called hybrid or composite systems, which mix operational and analytical processing in real time, are already in limited production and will become mainstream within the next few years. Most BI architectures today are not equipped to meet these needs.

Making data warehouses more current, flexible and useful will lead to much greater propagation of BI, improving comprehension and retention of business information, facilitating collaboration and increasing the depth and breadth of business intelligence.

Bringing data warehouse practices up to speed with the state of technology requires a reversal in polarity: the focus of data warehousing is still solidly on data and databases, and it needs to shift to the business users and practices that need its functionality. The preponderance of effort in delivering BI solutions today is still highly technical and dominated by technologists. For the effort to be relevant, the focus and control must be turned over to the stakeholders. This is impossible with current best practices.

Unlike systems designed to support business processes such as supply chain management, fulfillment, accounting or even human resources, where the business processes can be clearly defined (even though it may be arduous), BI isn’t really a business process at all, at least not yet. Traditional systems architecture and development methodologies presuppose that a distinct set of requirements can be discovered at the beginning of an implementation. Asking knowledge workers to describe their needs and “requirements” for BI is often a fruitless effort because the most important needs are usually unmet and largely unknown. Thinking out of the box is extremely difficult in this context. The entire process of “requirements gathering” in technology initiatives, and BI in particular, is questionable because the potential beneficiaries of the new and improved capabilities are operating under a mental model formed by the existing technology.

A popular alternative to this approach is the iterative or “spiral” methodology, which simply repeats the requirements gathering in steps, implementing partial solutions in between. Nevertheless, this process is still a result of managing from scarcity – models have to be built in anticipation of use. But the new economics of data warehousing are changing that. It is now possible to gather data and start investigations before the requirements are known, and to continuously expand the breadth of a data warehouse, because the tools are so much more capable of processing information.

Prescription for Realignment

No amount of interviewing or “requirements gathering” can break through and find out what sorts of questions people would have if they allowed themselves to think freely. The alternative is to ramp up the capacity, allow exploration and discovery, and guide it over a period of time. What’s needed is a new paradigm that breaks open the data warehouse to business users, enabling ad hoc and iterative analyses on full sets of detailed and dynamic data.

Unfortunately, old habits die hard. When it comes to BI, the industry is largely constrained by the drag of existing technology. What passes as acceptable BI in organizations today is rarely much more than re-platforming reports and queries that are ten to twenty years old.

For data warehousing and BI to truly pay off in organizations, IT needs to shift its focus from deciding the informational needs of the organization through technical architecture and discipline to responding to those needs as quickly as they arise and change.

Conclusion: Walk in Green Fields

The new economics of data warehousing, more than anything else, free data from the doctrine of scarcity. Many ordinary business questions depend on examining attributes of the most atomic pieces of data, and these bits of information are lost as the data is summarized. Some examples of queries that are simple in appearance but cannot be resolved with summarized data are:

- Complementarity (a.k.a. market basket analysis): What options are most likely to be purchased, sorted by profitability, in connection with others? Extra credit: which aren’t?
- Metric Filters (a metric derived within a query is used as a filter): Who are the top ten salespersons based on the number of times they have been in the top ten monthly (inclusion based on % of quota)? (See the sketch following this list.)
- Lift Analysis (what do we get from giving something away): How many cardholders will we send a gift to for making a purchase greater than $250 on their VISA card in the next 90 days, and which of them did so because they were given an incentive? What if the amount is $300? How about a visualization of the lift from $50 to $500? Extra credit: what if the threshold is greater than triple their average charge in the trailing 90 days before they were sent the promotion?
- Portfolio Analysis (to evaluate concentration risk, for example): Aggregating data obscures information, especially in a portfolio where the positive or negative impact of one or a small proportion of entities can adversely affect the whole collection. Evaluating and understanding all of the attributes of each security, for example, is essential; an average is not enough.
- Valuations (such as asset/liability matching in life insurance): Each coverage and its associated attributes (face amount, age at issue, etc.) is modeled with interest assumptions to generate cash flows that demonstrate solvency for a book of business.
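
Here is the sketch referenced in the Metric Filters item: a minimal, illustrative Python fragment (the record layout and values are hypothetical, not from the paper) showing why the question needs detail rows rather than a summary table: the filter itself, membership in the monthly top ten, has to be derived inside the query.

```python
# Hypothetical detail data: one row per salesperson per month with percent of
# quota attained. "Who was in the monthly top ten most often?" cannot be
# answered from a pre-summarized table because the filter (top-ten membership)
# must be computed month by month from the detail rows.

from collections import Counter

detail = [
    # (month, salesperson, pct_of_quota) -- illustrative values only
    ("2013-01", "Alice", 131.0), ("2013-01", "Bob", 88.0), ("2013-01", "Carol", 119.5),
    ("2013-02", "Alice", 97.0),  ("2013-02", "Bob", 142.0), ("2013-02", "Carol", 121.0),
    # ... many more rows in a real warehouse
]

TOP_N = 10  # top ten in a real data set; the toy data above has only three people

appearances = Counter()
for month in {m for (m, _, _) in detail}:
    rows = [(person, pct) for (m, person, pct) in detail if m == month]
    top = sorted(rows, key=lambda r: r[1], reverse=True)[:TOP_N]
    appearances.update(person for person, _ in top)

# Final metric filter: the salespeople who appeared in the monthly top ten most often.
print(appearances.most_common(10))
```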

All of these examples are just extrapolations of classic analytics, but enhanced by the ability to make more observations (data points) and fewer assumptions, and to do it in near real time. In addition to these examples, the following are examples of hybrid processing, involving the coordination of analytics and operational processing:

- Manufacturing Quality: Real-time feeds from the shop floor are monitored and combined with historical data to spot anomalies, resulting in drastically reduced turnaround times for quality problems.
- Structured/Unstructured Analysis: The Web is monitored for press releases that indicate price changes by competitors, causing the release to be combined with relevant market share reports from the data warehouse and broadcast to the appropriate people, all on an unattended basis.
- Personalization: B2B and B2C eCommerce dialogues are dynamically customized based on real-time analysis of rich, integrated data from the data warehouse combined with real-time feeds.
- Dynamic Pricing: Operational systems, like airline reservations, embed real-time analysis to price each transaction (also known as Yield Management).

The list is endless. Providing people (and processes) with access to this level of detail is considered too difficult and too expensive. Porting these applications to newer, commodity-based platforms and purpose-built software that exploits the new economics can enable a multitude of powerful new analytical applications.

Conclusion

The new economics of data warehousing provide attractive alternatives in both costs and benefits. While big data gets most of the attention, evolved data warehousing will play an important role for the foreseeable future. In order to be relevant, data warehouse design and operation need to be simplified, taking advantage of the greatly improved hardware, software and methods. On the benefits side, processing needs to be carefully placed so that data integration and stewardship activities are separated from those of an operational and analytical nature. The industry has moved beyond a single relational database that does everything. There are more choices today than ever before. New BI tools are arriving every day that take reporting and analysis to new levels of visualization, collaboration and distribution. Embedded analytical processes in operational systems for decision-making rely on fast, deep and clean data from data warehouses.

Data warehousing can trace its antecedents to traditional Information Technology. Big data, and especially its most prominent tool, Hadoop, trace their lineage to search and e-business. While these two streams are more or less incompatible, they are converging. Data warehouse providers are quickly adding Hadoop distributions, or even their own versions of Hadoop, to their architectures, adding further cost advantages to the collection of extremely large data sets.

Finding and retaining the talent to manage in this newly converged environment will not be easy. Current estimates are that in the U.S. alone there is an existing undersupply of “data scientists” with the skill and training to be effective. Changing the mix in the educational system will take more than a decade, but the economy will adjust much more quickly. New and better software has already arrived that allows knowledge workers without Ph.D.s or extensive backgrounds in quantitative methods to analyze data and to build and execute models.

Driven by the new economics of data warehousing, the proper design and placement of a data warehouse today would be almost unrecognizable to a practitioner of a decade ago. The opportunities are there.


ABOUT THE AUTHOR

Neil Raden, based in Santa Fe, NM, is an industry analyst and active consultant, widely published author and speaker, and the founder of Hired Brains, Inc., http://www.hiredbrains.com. Hired Brains provides consulting, systems integration and implementation services in Data Warehousing, Business Intelligence, “big data,” Decision Automation and Advanced Analytics for clients worldwide. Hired Brains Research provides consulting, market research, product marketing and advisory services to the software industry.

Neil was a contributing author to one of the first (1995) books on designing data warehouses, and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions (Prentice-Hall, 2007). He welcomes your comments at [email protected] or at his blog, Competing on Decisions.

Appendix


Relational Database Inverted: What Columnar Databases Do

For a complete description of the features and functions of Vertica, please refer to their website at www.vertica.com. What follows is a more general description of the concept. It is very slightly technical.

Classic relational databases store information in records, and collections of records in tables (actually, Codd, the originator of the relational model, referred to them as tuples and relations, hence, relational databases). This scheme is excellent for dealing with the insertion, deletion or retrieval of individual records, as the elements of each record, or attributes, are logically grouped. These records can get very wide, such as network elements in a telecom company or register tickets from a Point of Sale application. This physical arrangement closely mirrors the events or transactions of a business process.

When accessing a record, however, the database reads the entire record, even if only one attribute is needed in the query. In fact, since records are physically stored in “pages,” the database has to retrieve more than one record, often dozens, to access one piece of data. Furthermore, wide database tables are often very “sparse,” meaning there are many attributes with null values that still have to be processed when used in a query. These are major hurdles in using relational databases for analytics.

Columnar database technology inverts the database structure and stores each attribute separately. This completely eliminates the wasteful retrieval (and the burden of carrying the needless data) as the query is satisfied. As an added benefit, when dealing with just one attribute at a time, the odds are vastly increased that instances are not unique, which often leads to dramatic compression of the data (depending on the distinctness of each collection of attributes). As a result, much more data can fit in memory, and moving data into memory is much faster.
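
The general idea can be sketched in a few lines of Python (an illustration of column orientation and simple compression, not a description of Vertica’s actual storage engine): an analytic query touches only the columns it needs, and a low-cardinality column compresses dramatically with a scheme as simple as run-length encoding.

```python
# Illustrative sketch of row vs. column storage (not Vertica's implementation).
from itertools import groupby

# Row-oriented: each record keeps all of its attributes together.
rows = [
    {"order_id": 1, "state": "CA", "amount": 25.00},
    {"order_id": 2, "state": "CA", "amount": 14.50},
    {"order_id": 3, "state": "NM", "amount": 99.95},
    {"order_id": 4, "state": "CA", "amount": 12.00},
]

# Column-oriented: each attribute is stored (and scanned) separately.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An analytic query such as "total amount by state" touches only two columns;
# the order_id column is never read at all.
totals = {}
for state, amount in zip(columns["state"], columns["amount"]):
    totals[state] = totals.get(state, 0) + amount
print(totals)  # {'CA': 51.5, 'NM': 99.95}

# Low-cardinality columns compress well; run-length encoding of sorted values
# is one simple scheme (real columnar engines use several, more aggressive ones).
def run_length_encode(values):
    return [(value, len(list(group))) for value, group in groupby(values)]

print(run_length_encode(sorted(columns["state"])))  # [('CA', 3), ('NM', 1)]
```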


In summary, the Vertica Analytical Database innovations include:

- Shared-nothing massively parallel processing
- Columnar architecture residing on commodity hardware
- Aggressive data compression leading to smaller, faster databases
- Auto-administration: design, optimization, failover, recovery
- Concurrent loading and querying

© 2013 Hired Brains Inc. All Rights Reserved
