
Lab Validation Report EMC Greenplum Data Computing Appliance

By Julie Lockner & Tom Kornegay

June 2011 © 2011, Enterprise Strategy Group, Inc. All Rights Reserved.


Contents

Introduction
  EMC Greenplum Data Computing Appliance

ESG Lab Validation
  The Physical Test Bed
  The Data Model
  Load & Go
  Linear Scalability
  Analytics Ready

ESG Lab Validation Highlights

Issues to Consider

The Bigger Truth

Appendix
  DCA GP1000 Test bed Hardware Configuration Detail (Full Rack)
  RAID Configuration Details
  Test Query SQL Code

All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of the Enterprise Strategy Group, Inc., is in violation of U.S. Copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at (508) 482.0188.

ESG Lab Reports

The goal of ESG Lab reports is to educate IT professionals about emerging technologies and products in the storage, data management and information security industries. ESG Lab reports are not meant to replace the evaluation process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objective is to go over some of the more valuable feature/functions of products, show how they can be used to solve real customer problems and identify any areas needing improvement. ESG Lab's expert third-party perspective is based on our own hands-on testing as well as on interviews with customers who use these products in production environments. This ESG Lab report was sponsored by EMC Greenplum.


Introduction

Business data growth rates have been increasing exponentially in recent years and will continue to do so for the foreseeable future. “Big data,” as it is termed, has become a big challenge as organizations look to revamp their infrastructures and processes to scale to accommodate steep growth rates without the equivalent cost curve. With that in mind, ESG Lab validated the real-world performance and functional capabilities of the EMC Greenplum Data Computing Appliance (DCA). Testing was designed to assess the ease of use, performance, scalability, and analytics-readiness of the DCA platform running an extremely large data set using a real-world retail data model.

ESG research indicates that in 2011, the top two business initiatives that will have the greatest impact on IT spending are cost reduction and business process improvement (see Figure 1).1 Close behind, in the top four, is improving business intelligence and delivery of real-time analytics.

Figure 1. Business Initiatives That Will Have the Greatest Impact on IT Spending

Which of the following business initiatives do you believe will have the greatest impact on your organization's IT spending decisions over the next 12-18 months? (Percent of respondents, N=611, three responses accepted)

Cost reduction initiatives                                                                              42%
Business process improvement initiatives                                                                33%
Security/risk management initiatives                                                                    27%
Improved business intelligence and delivery of real-time business information                           19%
Regulatory compliance                                                                                   19%
"Green" initiatives related to energy efficiency and/or reducing company-wide environmental impact      16%
Business growth via mergers, acquisitions, or organic expansion                                         16%
Improved internal collaboration capabilities                                                            15%
International expansion                                                                                 11%
Increased use of social networking technology for marketing, customer outreach, market research, etc.   11%
Research and development innovation/improvement                                                         11%

Source: Enterprise Strategy Group, 2011.

1 Source: ESG Research Report, 2011 IT Spending Intentions Survey, January 2011.

The ROI of stockpiling vast amounts of data is directly correlated to an organization’s ability to leverage it. In other words, data is only an asset if you can make it one. This is apparent in the evolution of new roles such as the “data scientist” (part hacker, part quantitative analyst) that help companies derive competitive advantages from rich data stores. The more traditional act of reporting is being supplemented with data analytics and data mining disciplines requiring strong mathematical and technological repertoires. The reason for the programming aspect of the role has to do in part with the diminishing use of the traditional data warehouse model as a solution to big data. Data is simply growing too fast and changing too rapidly for architects to organize it in advance and aggregate it into a conformed model. While a conformed model fits well with the act of reporting, where questions are known and standardized, it works against data analytics and mining, where access to all data in raw form is pivotal and the questions are ad-hoc or completely unknown. In the analytics realm, attempting to tune data to match activity can be an exercise in futility. Data usage patterns in the analytics realm have no borders. The processing power, capacity, and throughput required to support analytics can easily surpass the cost/benefits of traditional stack infrastructure.

EMC Greenplum Data Computing Appliance

EMC Greenplum addresses this business problem with the Data Computing Appliance (DCA): an all-in-one data analytics solution that provides the capacity and power to handle the most punishing big data analytics at the best price/performance ratio on the market. The high-level design of the DCA is a Massively Parallel Processing (MPP) architecture that uses master servers for orchestration and segment servers to do the heavy lifting. All servers are internally connected via an interconnect bus with 24 10-Gigabit Ethernet ports and 10 Fibre Channel ports. The GP1000 model is a full rack equipped with two master servers (one live, one standby) and 16 segment servers, each capable of spawning multiple processing threads.

Figure 2. EMC Greenplum Data Computing Appliance Overview

This report examines the DCA’s out-of-the-box performance and capabilities by validating its best-in-class ingest speed, its ability to scale linearly, and its ability to execute complex analytics on terabytes of data. All of this was done using a real-world retail e-commerce data model housing over 10 TB, or 50 billion rows, of data. In particular, this report shows how a single DCA:

• Can scale quickly from a three-quarter rack to a full rack with zero downtime and without reloading data.
• Is a true MPP architectural design with linear scalability supported by an increase in performance equivalent to the number of resources added.
• Can scan billions of rows in seconds without caching.
• Is simple to administer as it requires no traditional performance tuning objects such as indexes.
• Natively integrates with popular analytics tools including Alpine Miner.


ESG Lab Validation

The real-world performance and functional capabilities of the EMC Greenplum DCA were assessed by ESG Lab via hands-on testing at the EMC Greenplum facility located in San Mateo, California. The methodology presented in this report was designed to assess the performance capabilities of a single Data Computing Appliance (DCA) running an extremely large data set using a real-world retail data model. The data model was designed to exhibit buying patterns and the cyclical characteristics of an e-commerce system using real data collected from Amazon, the US Census Bureau, the US Postal Service, and other available resources. The project benefited from EMC Greenplum marketing, engineering, architecture, and administration staff being on site for the duration of testing.

The Physical Test Bed

The starting test bed, shown in Figure 3, was an out-of-the-box DCA GP1000 with a three-quarter rack configuration yielding 108 TB of usable compressed capacity.2

Figure 3. EMC Greenplum DCA Lab Configuration

There are two master servers in every GP1000; the primary difference among the quarter, half, three-quarter, and full rack configurations is the number of segment servers.

The Data Model

The data model installed on the test machine was engineered by EMC to represent a basic retail e-commerce application. It contained a total of seven dimension tables covering basic reference information (customer, product category lookups, etc.) and three very large fact tables housing all orders, order line items, and shipping line items (see Figure 4). The tables were the only objects created in the database—there were no indexes, views, materialized views, or other common tuning objects. The dotted lines on the entity-relationship diagram represent logical references between table keys as opposed to physical (primary/foreign key) references. The database was pre-loaded with approximately 10 TB (50 billion rows) of data covering five years of transactions. The three fact tables are range partitioned monthly for the first four years and partitioned weekly for the most current year. The data was designed to reflect the typical ebb and flow of the retail business cycle with order volumes increasing around holidays and decreasing in summer months as well as other characteristics such as average order size. Other characteristics included:

• Sales volumes increase year over year to reflect a growing business
• Monthly and seasonal differences in data
• Real products, names, and demographic information
• Some products selling well, some selling poorly
• A majority of single item orders with some individuals making big purchases

2 The DCA is configured to compress data by default using a “quick compress” algorithm that compresses on write. Uncompressed usable capacity would yield 27 TB in a three-quarter rack, or one-quarter the compressed size.

Figure 4. Data Model Used During Validation Testing

Load & Go

The Load & Go test is a measurement of how quickly a database appliance can consume (or ingest) newly created data. The ESG Lab team validated the DCA’s ingest speeds by loading an additional 825 GB of data (approximately 6 months of transactions) into the DCA three-quarter rack test bed and clocking the results. This is equivalent to 1.9 billion rows of data loaded into a single logical table with 24 physical objects (6 months x 4 weekly partitions).

The DCA’s ingest performance was measured by clocking the time to insert 825 GB of data from a source table (order_lineitems_load) loaded on an ETL server into the target table (order_lineitems) on the DCA. The ETL environment was connected using 10GbE links connected directly to the DCA’s interconnect bus. The insert statement used was:

INSERT INTO retail_demo.order_lineitems SELECT * FROM retail_demo.order_lineitems_load;

The data load completed in 351 seconds, which equates to approximately 8.26 TB per hour. This exceeds the expected load rate of 7.3 TB per hour by nearly a full terabyte.3

3 Calculated using half and full rack published performance quotes in EMC Greenplum Data Computing Appliance: High Performance for Data Warehousing and Business Intelligence – An Architectural Overview, Oct 2010, pages 42-43.
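The 8.26 TB per hour figure can be reproduced from the two measurements quoted above. The following is a minimal Python sketch for illustration only, not part of the ESG test procedure. The function name `load_rate_tb_per_hour` is hypothetical, and the sketch assumes 1 TB = 1,024 GB, the convention that reproduces the reported 8.263 TB/hr (a decimal terabyte would give roughly 8.46 TB/hr instead).

```python
def load_rate_tb_per_hour(gigabytes: float, seconds: float, gb_per_tb: float = 1024.0) -> float:
    """Convert a timed load (GB moved in a number of seconds) into TB per hour."""
    gb_per_hour = gigabytes / seconds * 3600.0
    return gb_per_hour / gb_per_tb


# Measurements quoted in the report: 825 GB (about 1.9 billion rows) in 351 seconds.
measured_rate = load_rate_tb_per_hour(825, 351)          # assumes 1 TB = 1,024 GB
decimal_rate = load_rate_tb_per_hour(825, 351, 1000.0)   # alternative: 1 TB = 1,000 GB
rows_per_second = 1.9e9 / 351

print(f"Load rate (binary TB):  {measured_rate:.3f} TB/hr")
print(f"Load rate (decimal TB): {decimal_rate:.3f} TB/hr")
print(f"Approximate row rate:   {rows_per_second:,.0f} rows/s")
```

On the same inputs, the row count works out to roughly 5.4 million rows ingested per second.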


Table 1. Load & Go Performance Test Results

DCA Ingest Performance   Half Rack     3/4 Rack             Full Rack
Load Rate (baseline)     4.77 TB/hr    7.285 TB/hr (est.)   9.8 TB/hr (est.)
Load Rate (as tested)    n/a           8.263 TB/hr          n/a
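The estimated three-quarter rack baseline is consistent with a straight-line interpolation between the half rack and full rack figures. The Python sketch below shows that arithmetic under stated assumptions: the helper `interpolate_rate` is hypothetical, and the half rack count of 8 segment servers is assumed (the report states only 12 for the three-quarter rack and 16 for the full rack).

```python
def interpolate_rate(servers: int, lo_servers: int, lo_rate: float,
                     hi_servers: int, hi_rate: float) -> float:
    """Linear interpolation of load rate (TB/hr) by segment-server count."""
    fraction = (servers - lo_servers) / (hi_servers - lo_servers)
    return lo_rate + fraction * (hi_rate - lo_rate)


HALF_RACK_SERVERS = 8    # assumption: not stated in the report
FULL_RACK_SERVERS = 16   # from the report
THREE_QUARTER_SERVERS = 12

baseline = interpolate_rate(THREE_QUARTER_SERVERS,
                            HALF_RACK_SERVERS, 4.77,
                            FULL_RACK_SERVERS, 9.8)
as_tested = 8.263
margin = as_tested - baseline

print(f"Estimated 3/4 rack baseline: {baseline:.3f} TB/hr")
print(f"As tested exceeds baseline by {margin:.3f} TB/hr ({margin / baseline:.1%})")
```

The interpolated value matches the 7.285 TB/hr shown in Table 1, and the as-tested rate exceeds it by just under 1 TB/hr.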

Why This Matters

The ability to move data into an appliance quickly is crucial for productivity. In analytical applications, time spent lifting and moving large data sets is time wasted as productivity essentially comes to a halt. Fast ingest speeds on big data sets allow analysts to spend more time analyzing and less time dealing with the logistics of data collection. For mission-critical business processes that require real-time reporting, data load rates are even more critical. Reducing latency in identifying potential fraudulent transactions, monitoring risk measures, and identifying customer order system issues directly translates to business process improvements and better bottom lines.


Linear Scalability

The scalability of a database appliance is essential to managing the exponential growth of data in organizations. Capacity and performance scale-out was tested by expanding the GP1000 from its three-quarter rack starting configuration to a full rack by adding four segment servers. All 16 segment servers were pre-wired in the test bed, with only 12 in use by the master server. This was done so that the test could focus on the steps and time required to redistribute the existing data (more than 10 TB) equally across all 16 segment servers while the DCA remained online and in use.

ESG Lab Testing

ESG Lab validated the linear scalability of the GP1000 using a set of benchmark queries run before, during, and after the segment server addition; Figure 5 shows a query’s execution being monitored using the Greenplum Query Plan GUI. Additionally, ESG Lab validated redistribution of the data from 12 segment servers to 16 segment servers while the DCA was still online and open to user queries by running additional queries simultaneously. The time to complete was clocked using timestamps from a terminal’s standard output and measuring the difference from the time the redistribution command was executed to when it was completed. The only downtime required was a quick bounce of the database after the new segment servers came online so that the master server would recognize the new configuration. The total time to redistribute +10 TB of data was clocked at 1 hour and 22 minutes. This required no reloading of data by users as it was all handled dynamically by the DCA.
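For comparison with the ingest figures, the redistribution time can be expressed as a throughput. The Python sketch below is illustrative only; the function name is hypothetical, and because the report gives the data volume as "+10 TB", the result is best read as a lower bound.

```python
def redistribution_rate_tb_per_hour(terabytes: float, hours: int, minutes: int) -> float:
    """Average redistribution throughput in TB per hour for a timed run."""
    elapsed_hours = hours + minutes / 60.0
    return terabytes / elapsed_hours


# Reported run: at least 10 TB redistributed from 12 to 16 segment servers in 1 h 22 min.
rate = redistribution_rate_tb_per_hour(10, 1, 22)
print(f"Redistribution rate: at least {rate:.2f} TB/hr")
```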

Figure 5. Monitoring the Execution of the “Killer Query”

The before and after performance results of the DCA were gauged using four baseline queries of varying complexity and volume of rows scanned. The GP1000 DCA does not cache query results in memory, so performance does not deviate the second or third time the query is executed. The full results and performance difference (delta) are detailed below in Table 2.


Table 2. Three-quarter Rack to Full Rack Performance Scale-out Results

Test Query4                         Rows Scanned    3/4 Rack   Full Rack   Delta
Detail on a single order            5,253,880       6160 ms    4600 ms     -25.32%
Count order line items (1 month)    369,789,719     13028 ms   9770 ms     -25.01%
Count order line items (6 months)   1,993,167,486   64394 ms   45548 ms    -25.08%
Killer query                        Billions        178 s      109 s       -35.88%
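The deltas in Table 2 can be checked against what ideal linear scaling would predict. If runtime is inversely proportional to the number of segment servers, going from 12 to 16 servers should cut runtime by 25%. The Python sketch below recomputes each delta from the timings as listed; the function names are hypothetical and the inverse-proportionality model is an assumption used only for comparison. The first two rows reproduce the printed deltas. The listed timings for the last two rows work out to about -29.27% and -38.76%, so the printed -25.08% and -35.88% were presumably derived from timings other than the ones shown.

```python
SEGMENTS_BEFORE, SEGMENTS_AFTER = 12, 16

# (query, 3/4 rack time, full rack time) as listed in Table 2.
# The first three rows are in milliseconds, the last in seconds.
TIMINGS = [
    ("Detail on a single order", 6160, 4600),
    ("Count order line items (1 month)", 13028, 9770),
    ("Count order line items (6 months)", 64394, 45548),
    ("Killer query", 178, 109),
]


def percent_delta(before: float, after: float) -> float:
    """Relative change in runtime, in percent (negative means faster)."""
    return (after - before) / before * 100.0


def ideal_linear_delta(servers_before: int, servers_after: int) -> float:
    """Runtime change if runtime is inversely proportional to server count."""
    return (servers_before / servers_after - 1.0) * 100.0


ideal = ideal_linear_delta(SEGMENTS_BEFORE, SEGMENTS_AFTER)
deltas = {name: round(percent_delta(before, after), 2) for name, before, after in TIMINGS}

print(f"Ideal linear scaling: {ideal:.2f}%")
for name, delta in deltas.items():
    print(f"{name}: {delta:.2f}%")
```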

Why This Matters

IT and finance departments often struggle with capacity planning for stack infrastructures, which tends to make the discipline more art than science. These infrastructures have too many moving parts and can require a massive amount of overhead and calculation just to end up at an educated guess. What’s more, attempting to figure out whether you need more CPU, more throughput, faster disks, or more capacity can be a difficult balancing act often resulting in over-allocating resources or, worse, not scaling enough to meet demand. The advantage of all-in-one solutions with MPP designs on shared-nothing architectures is that when you add 25% more capacity and horsepower, you get 25% more capacity and horsepower. It’s simple arithmetic with no configuration decisions to calculate; you just need to decide if you want 25%, 50%, or 100% more. Add the fact that this can be done while systems are online with no need to physically migrate data and you have an easy-to-manage solution with predictable performance.

4 See the appendix for full SQL code of each test query.


Analytics Ready

How well an appliance integrates and facilitates analytics is key to understanding its value. From a functional standpoint, if all the data resides in the DCA, how easily can an analyst plug in their weapon of choice and get down to business? From a performance standpoint, can the DCA handle the demanding compute, IO, and throughput required to support the full analytics lifecycle? The goal with this portion of ESG Lab testing was to utilize a popular analytics tool to examine the capabilities of the DCA from an end-user perspective.

ESG Lab Testing

ESG Lab utilized Alpine Miner to design and build an end-to-end analysis with the DCA as the data repository and compute engine. Alpine Miner was chosen due to its big data predictive analytics capabilities. The technology implements analytics directly in the MPP database kernel, so data does not need to be moved outside the Greenplum database. Alpine Miner generates analytic flows based on a service-oriented architecture (SOA) and provides a quick, easy-to-use interface for constructing a multi-stage analysis including data preparation, data transformation, data modeling, and data scoring. ESG Lab used a “churn” model designed to mine several tables of data for high-risk customers. Figure 6 shows the details of each stage of the analysis starting with the raw data. ESG Lab used the Alpine Miner execution log to clock the processing time of the model.

Figure 6. Analytical Workflow Used During Lab Validation Testing


The 14-stage model ran calculations on approximately 1.8 million customers and completed in 11 minutes and 29 seconds.

Figure 7. Alpine Miner Model Log Results

Why This Matters

Data has evolved from an asset to a strategic asset. Reporting and business intelligence activities continue to play pivotal roles in organizational management and process improvement. Organizations with mature data reporting are expanding their strategies into analytics in order to explore new opportunities and identify problems before they occur; essentially, to answer questions that do not yet exist. Analytics and data mining are fast becoming a core competency and a strategic initiative. The key for IT teams is that ad-hoc analytics are far more resource-intensive (orders of magnitude more so) than repetitive reporting activities given the same set of data. The EMC Greenplum DCA is built from the ground up to handle the most punishing analytics workloads and to support the full analysis cycle from exploration to scoring by integrating with industry standard tools like Alpine Miner. In ESG Lab testing, Alpine Miner ran a full analysis cycle, from exploration to scoring, entirely inside the Greenplum database, which eliminated the data movement pains of traditional predictive analytics tools.


ESG Lab Validation Highlights

• Ingest speed on an out of the box three-quarter rack configuration was confirmed in excess of 8 TB/hr.
• ESG Lab validated that the EMC Greenplum MPP architecture scales linearly and predictably in both capacity and performance, simplifying capacity planning.
• EMC Greenplum DCA provides plug-and-play integration with popular analytics suites such as Alpine Miner, SAS, and R.
• ESG Lab confirmed the ability to produce complex analytics without the need to move or model the data or tune the database.
• Performance tests showed that EMC Greenplum’s native compress on write increases useable capacity without a negative impact on performance.
• ESG Lab was able to scale the infrastructure with only a short bounce of the database and no further downtime.

Issues to Consider

• Currently, any scaling of the EMC Greenplum DCA is handled by EMC support engineers. Depending on the customer’s point of view, having experts come on site to handle the expansion on their behalf may be seen as a benefit that decreases risk.
• The DCA does not currently have an integrated analytics package. Any analysis that surpasses the capabilities of SQL will need to be done with an external analytics suite. While in-database analytics packages can perform well and provide a level of convenience, the “analytics tool agnostic” approach of EMC Greenplum allows analysts to use the tools and techniques they are comfortable with.
• As of this writing, internal DCA storage media is exclusively provided via directly attached 600 GB SAS drives which offer a good mix of speed and capacity. EMC has indicated that the strategic roadmap for the EMC Greenplum DCA includes expanded back-end storage options.
• Companies moving from a “stack” infrastructure to an appliance may incur a material upfront investment cost in exchange for lower future costs as the environment scales.


The Bigger Truth

Organizations are facing challenges in scaling infrastructures to handle multiplying data volumes and demanding analytics. Traditional BI/reporting architectures that rely on data modeling, tuning, and pre-aggregating data are ill-suited for big data analytics.

Approach                    BI/Reporting               Analytics/Mining
Business Drivers            Questions are known        Questions may not be known
Activity Characterization   Repetitive & predictable   Ad-hoc & highly variant
Data Granularity            Aggregates & cubes         Raw – all of it
Conducive to SQL            Highly                     Minimally

While reporting is typically a “build it once, run it many times” operation, analytics is a “build it many different ways and run it a few times” activity. IT departments must have a melting pot of skill sets in order to run and optimize a rapidly scaling IT infrastructure stack—comprised of applications, databases, servers, storage, networking, etc. The operating costs and convoluted solutions scraped together are still usually too slow to keep up with rapidly evolving business demands as the thirst for real-time information continues to grow. Data scientists now need to incorporate multiple types of source data—both structured and unstructured form factors—to produce comprehensive views. Complex queries developed for current data types may not perform or be easily adaptable for a variety of data structures. In addition, the need to produce complex time series analysis (regressions, scoring, moving averages, etc.) goes against the grain of the popular relational database design.

EMC Greenplum is tackling those limitations with an easy-to-manage, high-performance, massively scalable solution. The ESG Lab team validated not only the DCA’s ability to scale seamlessly as big data gets bigger, but also its performance in loading and analyzing massive data sets. The combination of minimal management and tuning, fast ingest, and analytics performance means data teams can spend more time analyzing problems and delivering insight.

The EMC Greenplum DCA provides a low risk, vertically integrated database appliance that simplifies IT’s job in supporting big data requirements both in performance and scale without introducing the added complexity found in traditional database management solutions. Because data loads and analytics can be executed in the database leveraging a parallel processing architecture, enterprise architects have new opportunities for system and server consolidation. Advanced analytical algorithms and workload management functions are available to run in-database, allowing data architects and developers to continue to program in the languages and applications most suited for their needs and skill sets while again leveraging the MPP where possible.

The EMC Greenplum DCA drives productivity from every angle. From maintenance overhead reductions to rapid delivery of business insight, this is a data computing machine focused on results. IT organizations charged with improving big data analytics performance and efficiency would be well-advised to examine the EMC Greenplum DCA.


Appendix

Table 3. High-level Hardware Configuration Details

GP1000                             Starting (3/4 Rack)   Ending (Full Rack)
Master Servers                     2                     2
Segment Servers                    12                    16
CPUs                               144                   192
Memory                             576 GB                768 GB
Capacity Uncompressed/Compressed   27 TB / 108 TB        36 TB / 144 TB
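The totals in Table 3 follow from the per-server segment specification listed later in this appendix (two six-core processors and 48 GB of memory per segment server). The Python sketch below is a consistency check only. It assumes the "CPUs" row counts processor cores and that both the CPU and memory rows cover segment servers only, excluding the two master servers; that reading is inferred from the arithmetic and is not stated in the report. The names used are hypothetical.

```python
# Per-segment-server specification from the appendix.
SOCKETS_PER_SERVER = 2
CORES_PER_SOCKET = 6
MEMORY_GB_PER_SERVER = 48


def rack_totals(segment_servers: int) -> tuple[int, int]:
    """Return (total cores, total memory in GB) across the segment servers."""
    cores = segment_servers * SOCKETS_PER_SERVER * CORES_PER_SOCKET
    memory_gb = segment_servers * MEMORY_GB_PER_SERVER
    return cores, memory_gb


three_quarter = rack_totals(12)
full = rack_totals(16)

# Capacity figures from Table 3 (uncompressed TB, compressed TB).
capacity = {12: (27, 108), 16: (36, 144)}
per_server_tb = {n: unc / n for n, (unc, _) in capacity.items()}
compression_ratio = {n: comp / unc for n, (unc, comp) in capacity.items()}

print("3/4 rack (cores, GB):", three_quarter)
print("Full rack (cores, GB):", full)
print("Uncompressed TB per segment server:", per_server_tb)
print("Compressed-to-uncompressed ratio:", compression_ratio)
```

Under these assumptions, both configurations work out to 2.25 TB of uncompressed usable capacity per segment server and a 4:1 compressed-to-uncompressed ratio.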

DCA GP1000 Test bed Hardware Configuration Detail (Full Rack)

Master Server Configuration (x2)

• Processors – 2 x Intel X5680 3.33 GHz (6 core)
• Memory – 48 GB DDR3 1333 MHz
• Dual-port converged network adapter – 2 x 10Gb/s
• RAID controller – dual channel 6 Gb/s SAS
• Hard disk – 6 x 600 GB 10k SAS
• OS – RHEL 5.5

Segment Server Configuration (x16)

• Processors – 2 x Intel X5670 2.93 GHz (6 core)
• Memory – 48 GB DDR3 1333 MHz
• Dual-port converged network adapter – 2 x 10Gb/s
• RAID controller – dual channel 6 Gb/s SAS
• Hard disk – 12 x 600 GB 15k SAS
• OS – RHEL 5.5

Interconnect Bus

• 24 x 10 GbE ports
• 8 x Fibre Channel (FC) ports

RAID Configuration Details

Table 4. DCA GP1000 RAID Configuration

Server Type   RAID Group              Physical Disks   Virtual Disks    Function   File System
Master        Group 1, RAID 5 (4+1)   5                Virtual Disk 1   ROOT       ext3
                                                       Virtual Disk 2   SWAP       SWAP
                                                       Virtual Disk 3   DATA       XFS
              Hot Spare               1                n/a              DATA2      n/a
Segment       Group 1, RAID 5 (5+1)   6                Virtual Disk 1   SWAP       SWAP
                                                       Virtual Disk 2   DATA2      XFS
              Group 2, RAID 5 (5+1)   6                Virtual Disk 1   SWAP       SWAP
                                                       Virtual Disk 2   DATA2      XFS


Test Query SQL Code

-- Detail on a single order (test query 1)
SELECT *
FROM retail_demo.order_lineitems
WHERE order_id = '145338570'
  AND order_datetime BETWEEN date '2006-01-27' AND date '2006-01-28';

-- Count order line items (test query 2)
SELECT TO_CHAR(COUNT(*), '999,999,999') AS cnt
FROM retail_demo.order_lineitems
WHERE order_datetime BETWEEN date '2010-11-01' AND date '2010-11-30';

-- Count order line items (test query 3)
SELECT TO_CHAR(COUNT(*), '999,999,999,999') AS cnt
FROM retail_demo.order_lineitems
WHERE order_datetime BETWEEN date '2010-01-01' AND date '2010-06-30';

-- Killer Query (test query 4)
-- Customers who have:
--   Bought DVDs in the 2 previous (2008 & 2009) holiday seasons
--   Bought a Blu-ray player in the last 6 months
--   Have not bought a Blu-ray disc since their Blu-ray player purchase
SELECT customers.customer_id
     , email.email_address
     , customers.Last_BluRay_2010
     , customers.First_Player_2010
     , SUM(RFMT.num_purchases) AS total_purchases
FROM (SELECT oli.customer_id
           , SUM(CASE WHEN cat.category_name = 'DVD'
                       AND oli.order_datetime BETWEEN date '11-01-2008' AND date '12-24-2008'
                      THEN item_quantity ELSE 0 END) AS DVDs_2008
           , SUM(CASE WHEN cat.category_name = 'DVD'
                       AND oli.order_datetime BETWEEN date '11-01-2009' AND date '12-24-2009'
                      THEN item_quantity ELSE 0 END) AS DVDs_2009
           , MAX(CASE WHEN cat.category_name = 'DVD'
                       AND prod.product_name LIKE '%Blu-ray%'
                       AND oli.order_datetime BETWEEN date '05-01-2010' AND date '10-31-2010'
                      THEN order_datetime ELSE NULL END) AS Last_BluRay_2010
           , MIN(CASE WHEN cat.category_name = 'CE'
                       AND prod.product_name LIKE '%Blu-ray%'
                       AND oli.order_datetime BETWEEN date '05-01-2010' AND date '10-31-2010'
                      THEN order_datetime ELSE NULL END) AS First_Player_2010
      FROM order_lineitems oli
         , products_dim prod
         , categories_dim cat
      WHERE oli.product_id = prod.product_id
        AND prod.category_id = cat.category_id
        AND cat.category_name IN ('DVD', 'CE')
        AND (oli.order_datetime BETWEEN date '11-01-2008' AND date '12-24-2008'
          OR oli.order_datetime BETWEEN date '11-01-2009' AND date '12-24-2009'
          OR oli.order_datetime BETWEEN date '05-01-2010' AND date '10-31-2010')
      GROUP BY oli.customer_id
     ) AS customers
   , email_addresses_dim email
   , customer_RFMT_scores RFMT
WHERE customers.customer_id = email.customer_id
  AND customers.customer_id = RFMT.customer_id
  AND DVDs_2008 > 0
  AND DVDs_2009 > 0
  AND Last_BluRay_2010 < First_Player_2010
GROUP BY customers.customer_id
       , email.email_address
       , customers.Last_BluRay_2010
       , customers.First_Player_2010
LIMIT 100;

20 Asylum Street | Milford, MA 01757 | Tel:508.482.0188 Fax: 508.482.0218 | www.enterprisestrategygroup.com