analytics final project

8/9/2019 Analytics Final Project


IE 330

Final Project

12/10/14

Group 2:

Brian Leap

Matthew Murphy

Troy McCrum



Introduction

When people receive coupons in the mail or see advertisements in their favorite magazines,

they often interpret such events as random or maybe they think that companies are just trying to cast a

wide net to drum up business. In some cases, these things could be true; however, randomness usually

has nothing to do with it at all. What prospective customers may not know, or perhaps do not want to

know, is that companies know a lot more about them than they think. And random appearances of

coupons in the mailbox are in fact conscious, targeted efforts from marketing teams to persuade a

certain type of consumer to purchase a certain type of product.

So how do companies know so much about their customers? Any time a customer uses a credit

card, completes a survey, uses a coupon, subscribes to a magazine, etc., information about that

customer is being recorded and saved into a database. All sorts of customer information are valuable to

a company. A customer's transactions, race, income, family size, pets, age, and occupation are all

examples of useful characteristics that companies can utilize in designing a marketing strategy. The

more information a company can gain about a customer, the better that company can market to that

customer. Additionally, as companies accumulate MORE information about MORE customers, insights

can be made about customers as a group rather than individually. In other words, companies can assign

individuals to a group and then simply market to that group, instead of marketing from customer to

customer. This process of grouping similar customers together is known as clustering. Clustering is an

essential technique because the time and resources spent in marketing to each customer individually

would far outweigh any sort of potential monetary gains.

As companies accrue more and more customers, the thought of poring over customer data

by hand becomes impossible, and even somewhat laughable. The solution to this problem is a database.

A database of customer information to a marketing team is similar to the Periodic Table of Elements to a chemist. Both a database and the periodic table are simply collections of information that are organized

in a certain way. To the average person, they are nothing more than that. However, for someone with

the necessary knowledge and tools, the information can be manipulated to create exciting new ideas.

With the help of analytics, marketing teams are able to query, or intelligently extract, data contained in

a database in order to make inferences about customers. Extracted data can then be converted into

tables, charts, graphs, or other types of graphical visualizations to be used in customer and marketing

analysis. Without a well-organized database, a technique like clustering would be impossible. Therefore,

using analytics to gain insight into customer and market behavior all starts with a good database.

Problem Description

The problem presented in this project is that a supermarket has an enormous amount of

transaction and demographic data pertaining to its customers, but the marketing team needs help in

finding ways to use this data to develop marketing and advertising strategies.

The project is divided into two parts:



The first part of the project deals with predictive analytics. This requires substantial use of

Microsoft Excel. Using Excel the team must calculate moving averages and exponential smoothing,

construct time series graphs, utilize regression lines, use k-means clustering to group similar customers

together, and use demographics data to form insights about the customer base. Using this information,

sales quantities of top-selling items must be predicted and methods of promoting and advertising

products must be developed.

The second part of the project deals primarily with databases. Part 2 requires the team to use

Microsoft Access to create database tables for analysis and perform queries to find interesting trends in

the data. Also, additional graphical visuals must be created outside of Access or Excel by using R. Again,

the team must make inferences and marketing recommendations based on the analysis.

Project Objectives

Part 1 (Excel)

o Calculate prediction error: Mean Squared Error, Mean Absolute Deviation, and Tracking Signal

o Construct time series graphs

o Use OLS Regression to predict future sales

o Use k-means clustering to:

    Group customers with similar buying habits,

    Use analytics to find demographic commonalities within those groups,

    Make marketing recommendations based on analysis.

o Use demographic data to derive insights into the customer base

Part 2 (Access)

o Create database tables for analysis

o Identify relationships between tables

o Perform queries on various topics to gain insights

o Create advanced graphical visuals to display data



Part 1: Analytics

Moving Average and Exponential Smoothing

1)

Forecast method                     Mean Squared Error   Mean Absolute Deviation   Tracking Signal

Item Type 3
4 period moving average             1235.22              24.32                     -1.03
2 period weighted moving average    1481.34              28.71                     -0.56
Exponential Smoothing               1192.31              23.9                      -0.83

Item Type 17
4 period moving average             868.05               25.18                     5.58
2 period weighted moving average    801.37               23.29                     4.59
Exponential Smoothing               783.55               25.35                     6.63

Figure 1: Prediction error for weeks 10 through 20 for item types 3 and 17. The table shows

the Mean Squared Error, Mean Absolute Deviation, and Tracking Signal.

Regression

2)

Figure 2: Time series for the sales of snacks (Item 17) over the 104 week period

Also included are the trend line, regression equation, and R² value


Figure 3: Time series for the sales of eggs (Item 12) over the 104 week period

Figure 4: Time series for the sales of butter (Item 3) over the 104 week period

Figure 5: Time series for the sales of cookies (Item 8) over the 104 week period

Figure 6: Time series for the sales of cereal (Item 5) over the 104 week period

Figure 7: Time series for the sales of BBQ (Item 2) over the 104 week period

Figure 8: Time series for the sales of cat food (Item 4) over the 104 week period

Figure 9: Time series for the sales of ice cream (Item 13) over the 104 week period

Figure 10: Time series for the sales of crackers (Item 9) over the 104 week period

3)

Table 2: Regression Equations for Top 3 Selling Items Using Only First 60 Weeks of Data

Item Type   Item Description   Regression Equation
17          Snacks             y = 0.0423x + 85.113
12          Eggs               y = -0.1012x + 76.178
3           Butter             y = -0.4956x + 81.59

Figure 12: Time Series for the Sales of Snacks (Item 17) over the 104 Week Period

Regression line displayed is based on the results from the first 60 weeks

Figure 13: Time Series for the Sales of Eggs (Item 12) over the 104 Week Period

Figure 14: Time Series for the Sales of Butter (Item 3) over the 104 Week Period

4)

Table 3: Projected Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80 using

Regression Equations obtained from first 60 Weeks of Data

Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
61     87.6933            70.0048          51.3584
62     87.7356            69.9036          50.8628
63     87.7779            69.8024          50.3672
64     87.8202            69.7012          49.8716
65     87.8625            69.6000          49.3760
66     87.9048            69.4988          48.8804
67     87.9471            69.3976          48.3848
68     87.9894            69.2964          47.8892
69     88.0317            69.1952          47.3936
70     88.0740            69.0940          46.8980
71     88.1163            68.9928          46.4024
72     88.1586            68.8916          45.9068
73     88.2009            68.7904          45.4112
74     88.2432            68.6892          44.9156
75     88.2855            68.5880          44.4200
76     88.3278            68.4868          43.9244
77     88.3701            68.3856          43.4288
78     88.4124            68.2844          42.9332
79     88.4547            68.1832          42.4376
80     88.4970            68.0820          41.9420

Table 4: Actual Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80

Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
61     59                 104              38
62     44                 36               24
63     49                 36               41
64     71                 39               37
65     75                 47               35
66     64                 46               41
67     52                 34               41
68     54                 45               79
69     47                 38               40
70     72                 58               24
71     53                 43               39
72     46                 71               74
73     55                 95               49
74     48                 27               53
75     77                 50               128
76     91                 63               68
77     71                 147              71
78     43                 61               43
79     62                 62               82
80     52                 94               167



Clustering

5)

Figure 15: Graphical Representation of K-Means Clustering using K=3 for Units of Crackers

Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

Figure 19: Graphical Representation of K-Means Clustering using K=3 for Units of Eggs

Purchased vs. Units of Cereal Purchased by Each Customer over the 104 Week Span

Figure 20: Graphical Representation of K-Means Clustering using K=3 for Units of Hotdogs

Purchased vs. Units of Pizza Purchased by Each Customer over the 104 Week Span

General

6)

Figure 21: Graphical representation of race of store customers.

Figure 22: Chart representing total number of customers falling into each income bracket.

[Figure 21 chart: Race (%), with categories White, Black, Hispanic, and Oriental]

[Figure 22 chart: number of customers (axis from 0 to 90) by Customer Income bracket: 0 - 10k, 10 - 11.9k, 12 - 14.9k, 15 - 19.9k, 20 - 24.9k, 25 - 34.9k, 35 - 44.9k, 45 - 54.9k, 55 - 64.9k, 65 - 74.9k, 75+k]



Figure 23: Chart representing total number of customers in each occupation field, shown separately for male and female customers.

Part 1 Discussion and Overall Recommendations:

1)

Problem #1 builds on the solution to question 5 from homework 7, which asks for a sales forecast of the two highest selling items, item types 3 and 17. Three forecasts were completed in homework 7: a 4-period moving average, a 2-period weighted moving average, and exponential smoothing. From these forecasts, the mean squared error, mean absolute deviation, and tracking signal were computed using the equations shown in Figure 1A in the Appendix.

According to the results, Exponential Smoothing produced the lowest Mean Squared Error for both item types (1192.31 for item type 3 and 783.55 for item type 17). The error values are nevertheless rather large, which suggests that these forecasts may not predict sales reliably. We would recommend that the store owner avoid relying on these sales forecasting methods, but choose exponential smoothing if a forecast is needed.
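The three forecasting methods and the three error measures discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet used in the project: the weights `(0.3, 0.7)`, the smoothing constant `alpha=0.3`, and the `sales` list are hypothetical, and the error formulas are the common textbook definitions (error = actual - forecast, tracking signal = running sum of errors divided by MAD), which are assumed to match Figure 1A in the Appendix.

```python
from statistics import mean

def moving_average(sales, n=4):
    """Forecast each period as the mean of the previous n actual values."""
    return [mean(sales[t - n:t]) for t in range(n, len(sales))]

def weighted_moving_average(sales, weights=(0.3, 0.7)):
    """Forecast each period as a weighted sum of the previous len(weights) values.
    The last weight applies to the most recent period (weights are hypothetical)."""
    n = len(weights)
    return [sum(w * s for w, s in zip(weights, sales[t - n:t]))
            for t in range(n, len(sales))]

def exponential_smoothing(sales, alpha=0.3):
    """F[t+1] = alpha * A[t] + (1 - alpha) * F[t], seeded with F[0] = A[0]."""
    forecasts = [sales[0]]
    for actual in sales[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts

def forecast_errors(actuals, forecasts):
    """Return (MSE, MAD, tracking signal) for paired actuals and forecasts."""
    errors = [a - f for a, f in zip(actuals, forecasts)]
    mse = mean(e * e for e in errors)
    mad = mean(abs(e) for e in errors)
    tracking_signal = sum(errors) / mad if mad else 0.0
    return mse, mad, tracking_signal

# Hypothetical weekly unit sales, used only to exercise the functions.
sales = [10, 12, 14, 16, 18, 20]
ma_forecasts = moving_average(sales, n=4)           # forecasts for the last two weeks
ma_errors = forecast_errors(sales[4:], ma_forecasts)
print(ma_forecasts, ma_errors)
```

A tracking signal far from zero means the forecast errors are mostly of one sign, so the method is consistently over- or under-forecasting.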

2)

Problem #2 provides some interesting insight into the buying patterns of certain items. With the help of a pivot table, it was easy to condense the data to determine the ten most frequently purchased items. A time series graph was then created for each of these items, and a regression line (trend line) was added to each graph. The equation of this trend line allows the user to plug in any week number to get an estimate of how many units of a particular item will be purchased during that week. To rate how well the trend line fits the actual number of units sold, the coefficient of determination (R²) was also computed.

No item by itself showed a particularly interesting trend; however, the ten plots had several interesting features in common. The first was a sharp decrease in sales of each item during week 57. This could have been caused by a variety of factors. For example, maybe week 57 was a holiday week, so customers were spending their time at home with family instead of at the store. Another possibility is that the store was closed for a few days that week, thus decreasing its total sales. No conclusive answer can be obtained from the data given, but something surely happened during week 57 to cause this decrease in sales.

Another feature that all ten items had in common was a negatively sloping regression line. This means that the sales of every one of these items are projected to decrease as time goes on, which is a problem for the store. If sales of each of its ten most popular products continue to decrease as the regression model suggests, the store will bring in less and less money. If this trend goes on for long enough, the decrease in sales could even lead to the store going out of business.

Finally, one last notable feature that all ten items shared was a very small R² value. R² shows how well the regression line fits the observed data. The value falls between 0 and 1, with 0 meaning the line explains none of the variation in sales and 1 meaning a perfect fit. Out of all ten items, the largest R² was 0.2358, and one value was as low as 0.002. These extremely low values indicate that the regression line is not very accurate in predicting the actual number of units of an item sold given the week number.
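For readers who want to reproduce a trend line and its R² value outside of Excel, the sketch below fits an ordinary least squares line and computes R² as 1 - SS_res / SS_tot. The function names and the small `weeks`/`units` sample are hypothetical; the project's own values came from Excel's trend line feature.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: week number vs. units sold.
weeks = [1, 2, 3, 4]
units = [2, 4, 3, 5]
slope, intercept = fit_line(weeks, units)
print(slope, intercept, r_squared(weeks, units, slope, intercept))
```

An R² near zero, as seen for the ten items above, means the week number alone explains almost none of the week-to-week variation in sales.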

3)

Problem #3 dives further into regression and the use of a regression line to predict future response. In this problem, only the three most frequently purchased items were considered. A regression line was fitted using only the sales data from the first 60 weeks and then plotted on top of the time series plot of all 104 weeks of data for the corresponding item type. This created a visual that made it easy to see how accurately each regression line predicted future response.

The most notable result was how much the regression equations changed from those obtained when all 104 weeks of data were considered. The initial equations were y = -0.0593x + 84.136 for item 17, y = -0.0168x + 72.545 for item 12, and y = -0.1308x + 72.062 for item 3. Once the data were restricted to the first 60 weeks, the equations became y = 0.0423x + 85.113 for item 17, y = -0.1012x + 76.178 for item 12, and y = -0.4956x + 81.59 for item 3. The slope for item 17 changed sign: it was negative when all 104 weeks were used and positive when only the first 60 weeks were used. This means that sales of item 17 are projected to gradually increase based on the first 60 weeks, but to decrease based on all 104 weeks.

Another noticeable change was the much steeper negative slope for items 12 and 3 when only the first 60 weeks of data are considered. The slope is about six times as steep for item 12 (-0.1012 vs. -0.0168) and almost four times as steep for item 3 (-0.4956 vs. -0.1308). In both cases, the intercept is also higher for the 60-week regression line, which somewhat compensates for the steeper negative slope. Overall, the large changes in the regression equations when more data are used indicate that the regression lines are not accurate in predicting the response (units sold).

4)

The goal of problem #4 is to predict the number of units that will sell each week from week 61 through week 80. These predictions were made using the regression lines fitted to only the first 60 weeks of data, by plugging the week number into the regression equation as the x value. The predicted values and the actual values can be seen in the Results Section.

It is evident that the predictions obtained through regression are not very accurate for predicting future sales. For item 17, actual sales from weeks 61 through 80 are significantly lower than the predicted values. The predicted values are also noticeably larger than actual sales for item 12: there are a few weeks in which actual units sold exceed the projections, but for the most part actual sales are lower. Like the other two items, the projections for item 3 are not accurate either. The regression line indicates a quick decrease in units sold, while the actual values seem to be gradually increasing.

Overall, after comparing the projections from the regression lines to the actual number of units of each item sold each week, it is clear that there is a lot of error in this prediction method. If the owner of the store wants to better predict the number of items that will be sold in the future, forecasting methods such as Moving Average or Exponential Smoothing may give more accurate results.
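The comparison described above can be checked with a short script. The sketch below plugs weeks 61 through 80 into the 60-week regression equations reported in this section and compares the results with the actual units sold for the same weeks. The equations and sales figures are copied from the tables in this report; the dictionary layout and the `compare` helper are illustrative assumptions.

```python
from statistics import mean

# (slope, intercept) of the regression lines fitted on the first 60 weeks.
EQUATIONS = {17: (0.0423, 85.113), 12: (-0.1012, 76.178), 3: (-0.4956, 81.59)}

# Actual units sold in weeks 61 through 80, keyed by item type.
ACTUAL = {
    17: [59, 44, 49, 71, 75, 64, 52, 54, 47, 72,
         53, 46, 55, 48, 77, 91, 71, 43, 62, 52],
    12: [104, 36, 36, 39, 47, 46, 34, 45, 38, 58,
         43, 71, 95, 27, 50, 63, 147, 61, 62, 94],
    3: [38, 24, 41, 37, 35, 41, 41, 79, 40, 24,
        39, 74, 49, 53, 128, 68, 71, 43, 82, 167],
}
WEEKS = range(61, 81)

def compare(item):
    """Summarize predicted vs. actual sales for one item type."""
    slope, intercept = EQUATIONS[item]
    predicted = [slope * week + intercept for week in WEEKS]
    errors = [a - p for a, p in zip(ACTUAL[item], predicted)]
    return {
        "mean_predicted": mean(predicted),
        "mean_actual": mean(ACTUAL[item]),
        "mean_error": mean(errors),            # negative: line over-predicts
        "mad": mean(abs(e) for e in errors),   # mean absolute deviation
    }

for item in EQUATIONS:
    print(item, compare(item))
```

For item 17, for example, the mean of the actual values is 59.25 units per week against a mean prediction of about 88.1, while for item 3 the actual mean is higher than the predicted mean, which matches the discussion above.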

5)

Problem #5 dealt with clustering customers by their purchases of two different items. Clustering is a helpful strategy for breaking down and studying data, and customers are an example of something that is often clustered. In this case, clustering separates the customers into groups so that different recommendations can be made for different groups based on the common preferences of the customers in each group. K-means clustering is a very widely used clustering technique. This method breaks the data down into k separate groups, each with its own centroid. The defining rule is that every point in a cluster is closer to its own centroid than to the centroid of any other group.
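The k-means procedure described above alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below is a minimal standard-library version for illustration only; it is not the Excel workflow used in the project. The starting centroids (simply the first k points) and the sample `purchases` data are assumptions.

```python
import math

def kmeans(points, k, iterations=100):
    """Minimal k-means: returns (centroids, clusters).
    Assumption: the first k points are used as the starting centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (units of crackers, units of cookies) pairs per customer.
purchases = [(1, 1), (1, 2), (9, 9), (10, 8), (2, 1), (9, 10)]
centroids, clusters = kmeans(purchases, k=2)
print(centroids)
print(clusters)
```

Because the result depends on the starting centroids, practical tools usually run the algorithm several times from different starting points and keep the best result.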



For problem 5, the k-means clustering algorithm was used to break the data down into k separate groups. The first two items compared were items 8 and 9 (cookies and crackers). These items were clustered using k values of 3, 4, 5, and 6, and the resulting graphs were judged to determine which value of k produced the best fit for the data. All of the graphs that were created can be seen in the Results Section. K=6 appeared to create the best fit because it produced the most precise groups, with points that seem to have the smallest deviation from their centroid. After this, two more pairs of items were clustered; however, k=3 was the only value of k used in those cases. The first graph compared cereal and eggs, while the second compared hotdogs and pizza.

The reason for carrying out this clustering procedure is to help the store manager promote products based on specific customer attributes. After the customers were assigned to clusters, the background information on each customer was examined to determine whether there were any noticeable similarities among the customers within a cluster. This background information was given in the Demographics Data.

The cluster analysis with k=6 groups for crackers (Item 9) purchased vs. cookies (Item 8) purchased was examined first to determine the buying patterns of the customers in each of the clusters. The other three graphs for crackers vs. cookies, using k=3, 4, and 5, were not considered here because it was decided that k=6 created the best and most accurate results. In each of the graphs, clusters are referred to as Series 1, Series 2, etc. Series 1 contains the customers who buy very little of both products, Series 2 contains customers who buy a lot of crackers and a substantial amount of cookies, Series 3 contains customers who buy some cookies but not a lot of crackers, Series 4 contains customers who buy some crackers but not a lot of cookies, Series 5 contains customers who buy a lot of cookies and a substantial amount of crackers, and Series 6 contains the customers who buy a substantial amount of cookies and some crackers.

To get a better understanding of the customers who belonged to each of these groups, Family Size was useful in differentiating between clusters. The two groups that were focused on were Series 2 and Series 5, because those groups contained the customers who bought a large amount of crackers and a large amount of cookies, respectively. In Series 2, the most common family size by far is 2 people, accounting for 63% of the people in that group. In Series 5, however, three family sizes make up the majority of the group: a family size of 2 makes up 33%, a family size of 3 makes up 25%, and a family size of 4 makes up 25%. Note that all of these Demographic Data statistics can be found in the Appendix in table form.

Now that the company has some background on the customers belonging to each of the clusters, it must decide how to allocate its advertising resources to try to increase the sales of cookies and crackers. Based on this data, the type of customer who buys a lot of crackers belongs to Series 2 and, more often than not, has a family size of 2 people. The customers who buy a lot of cookies are those who have families of size 2, 3, or 4. In order to maximize profit from sales of both crackers and cookies, crackers should be marketed mainly towards people with a family size of 2, and cookies should be marketed towards people with families of 2 to 4 people.

TV would probably be the best way to advertise to these groups: of the 272 people who responded, 264 owned at least one television. However, if a television advertisement is not economically feasible, other methods, such as a newspaper or magazine advertisement, could work well. 50% of the people in Series 2 and 58.33% of the people in Series 5 read the newspaper. If the store prefers a magazine advertisement, its best course of action would be to put an ad for crackers in Better H&G, because 37.5% of the people who buy a lot of crackers also subscribe to Better H&G. To reach the customers who buy a lot of cookies, an advertisement should be placed in Good House, because one third of the people who buy a lot of cookies also subscribe to Good House.
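The within-cluster percentages quoted above (for example, the share of each family size in a cluster) are simple frequency shares. A minimal sketch of that calculation follows; the `share_by_value` helper and the sample family sizes are hypothetical and are not the project's data.

```python
from collections import Counter

def share_by_value(values):
    """Return the percentage of records taking each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: 100 * n / total for value, n in counts.items()}

# Hypothetical family sizes for the customers assigned to one cluster.
family_sizes_in_cluster = [2, 2, 2, 3, 4, 2, 1, 2]
shares = share_by_value(family_sizes_in_cluster)
for size in sorted(shares):
    print(f"family size {size}: {shares[size]:.1f}%")
```

The same helper can be applied to any demographic attribute (income bracket, newspaper subscription, magazine subscription) once the customers have been labeled with their cluster.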

The next cluster analysis compared the number of eggs (Item 12) purchased vs. the amount of cereal (Item 5) purchased by each customer. This cluster analysis used only k=3 groups. Series 1 contained the customers who did not buy a lot of either product, Series 2 contained the customers who bought a lot of cereal, and Series 3 contained the customers who bought a lot of eggs. When the demographics data of all of the customers were studied by cluster, Annual Income produced some interesting statistics. Over 60% of the people in Series 2 make more than $35,000 per year. In Series 3, on the other hand, fewer than 19% of people make more than $35,000 per year. Since there appears to be some correlation between annual income and product purchased, the store should take this into account for advertising purposes. To increase sales of both of these items, the best course of action would be to advertise cereal to people who make more than $35,000 per year and eggs to people who make less than $35,000 per year. Once again, the most efficient yet most costly way to do this advertising would be a television commercial. If the store would like to advertise in the newspaper instead, 57.89% of the people who buy a lot of cereal and 60.53% of the people who buy a lot of eggs subscribe to the newspaper. Finally, if the store chooses a magazine ad, an ad aimed at the people who buy a lot of cereal should be placed in Reader's Digest, because 26.32% of them subscribe. To reach the people who buy a lot of eggs, a magazine ad should be placed in either Reader's Digest or Better H&G, because 26.32% of people who buy a lot of eggs subscribe to each of these magazines.

The final cluster analysis compared the number of units of pizza (Item 16) purchased vs. the number of units of hotdogs (Item 11) purchased by each customer. Again, k=3 was the number of groups used. Series 1 contained people who purchased very few hotdogs and pizzas, Series 3 contained the people who purchased a lot of hotdogs, and Series 2 contained everyone who did not fit either of those criteria. In this case, race was a factor that led to some interesting results. Though white customers were the most common in the study, the data show some notable evidence about the preferences of African American customers. In Series 1 and 2, African Americans make up just above 10% of each cluster. In Series 3, however, the share of African Americans jumps to above 23%, and the share of white customers drops by more than 10% relative to both Series 1 and 2.

The store should take this correlation between race and product purchased into account when advertising pizza and hotdogs. Because African Americans make up only a little more than 10% of the customers outside the heavy hotdog-buying cluster, pizza advertising is likely to be more effective when aimed at white customers. Hotdog advertising is the better choice for reaching African American customers, who are roughly twice as common among heavy hotdog buyers. Note that these percentages describe the racial makeup of each cluster, not the share of each racial group that buys a given product.

Once again, the statistics show that the most efficient way to reach consumers is through a television advertisement. In this case, however, a magazine or newspaper ad may not be a good option. In Series 3, which is the main group the store would want to advertise to because its members buy such a large amount of hotdogs, only 38.46% of people subscribe to the newspaper, which is very small compared to the other groups. Also, no magazine is subscribed to by more than 16% of this group, making a magazine advertisement somewhat ineffective as well. To reach the people in Series 3, the best method may be to save up money to produce a TV commercial, because that seems to be the only really effective way to reach them.

6)

When developing a marketing strategy, it is essential to know and understand one's customer base. Using the Demographics data, several statistics were pulled in order to develop an understanding of the area where the store is located and the people who live there. The percentages of customers by ethnicity are shown in Figure 21. This graph indicates that the store's customers are predominantly white, with very low minority percentages. This suggests that the store is located in a more rural area, as opposed to a larger city. Further potential evidence for a rural, blue-collar town is that 34% of customers own at least one dog; this number would likely not be so high in a city where most people live in apartment buildings. Because so many customers own dogs, the team recommends that the store promote its dog food and other dog-related products to its customers. Annual income also points to a rural, blue-collar town: the vast majority of customers fall into the middle-class range, and only a few customers earn more than $75k per year. Surprisingly, the most common occupation category among all the customers is retired: more customers are retired than fall into any other occupation category. This suggests that elderly citizens make up a large share of the store's customer base. By using clustering and other analytic techniques, the store can use all of the aforementioned statistics to gain an advantage in its sales and marketing strategies.



Part 2: Databases

1)

Figure 1: Access Database Table Schema

Table 1: Transaction: Transaction Information Table

Table 2: CouponLookup: Coupon Description Table

Table 3: Product: Item Description Table

Table 4: Vendor: Vendor Description Table

Table 5: Customer: Customer Demographic Table

Table 6: RaceLookup: Race Description Table

Table 7: IncomeLookup: Income Table

Table 8: FemaleOccLookup: Female Occupation Table

Table 9: MaleOccLookup: Male Occupation Table

Table 10: MaleAgeLookup: Male Age Table

2)

Query 1: Crosstab query showing all female occupations by family size and coupon origin.

SQL code:

TRANSFORM Count([AllOccFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]
SELECT [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc, Count([AllOccFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]
FROM [AllOccFemale-CouponOrig]
GROUP BY [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc
PIVOT [AllOccFemale-CouponOrig].CouponDesc;

Query 2: Origins of all coupon transactions.

SQL code:

TRANSFORM Count(CouponAmounts.[Customer ID]) AS [CountOfCustomer ID]
SELECT CouponAmounts.CouponDesc, Count(CouponAmounts.[Customer ID]) AS [Total Of Customer ID]
FROM CouponAmounts
GROUP BY CouponAmounts.CouponDesc
PIVOT CouponAmounts.CouponID;

Query 3: Number of coupons used with a value greater than $0.99, by item type.

SQL code:

TRANSFORM Count(CouponItemInteraction.[Customer ID]) AS [CountOfCustomer ID]
SELECT CouponItemInteraction.Description, Count(CouponItemInteraction.[Customer ID]) AS [Total Of Customer ID]
FROM CouponItemInteraction
WHERE (((CouponItemInteraction.[Coupon Value (Cents)])>99))
GROUP BY CouponItemInteraction.Description
PIVOT CouponItemInteraction.[Coupon Value (Cents)];

Query 4: Total items bought by ethnicity.

SQL code:

TRANSFORM Count(EthnicItemType.[Customer ID]) AS [CountOfCustomer ID]
SELECT EthnicItemType.Desc, Count(EthnicItemType.[Customer ID]) AS [Total Of Customer ID]
FROM EthnicItemType
GROUP BY EthnicItemType.Desc
PIVOT EthnicItemType.Description;
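The TRANSFORM ... PIVOT statements in this section are Access crosstab queries: the GROUP BY field becomes the row heading, the PIVOT field becomes the column heading, and each cell holds the Count. For readers without Access, the same kind of table can be sketched in Python as below. The `crosstab` helper and the sample records are hypothetical and are not taken from the project database.

```python
from collections import Counter

def crosstab(records):
    """Count (row_value, column_value) pairs and lay them out as a table.
    Returns {row_value: {column_value: count}} with zeros for empty cells."""
    counts = Counter(records)
    row_keys = sorted({row for row, _ in counts})
    col_keys = sorted({col for _, col in counts})
    return {row: {col: counts.get((row, col), 0) for col in col_keys}
            for row in row_keys}

# Hypothetical (race description, item description) pairs, one per transaction.
records = [
    ("White", "snack"), ("White", "eggs"), ("Black", "snack"),
    ("White", "snack"), ("Hispanic", "eggs"),
]
table = crosstab(records)
for race, row in table.items():
    print(race, row)
```

Each printed row corresponds to one row of the crosstab result, with one count per item description.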



Query 5: Units bought compared to family size.

SQL code:

TRANSFORM Count([Family-Units Bought].[Customer ID]) AS [CountOfCustomer ID]
SELECT [Family-Units Bought].[Units Bought], Count([Family-Units Bought].[Customer ID]) AS [Total Of Customer ID]
FROM [Family-Units Bought]
GROUP BY [Family-Units Bought].[Units Bought]
PIVOT [Family-Units Bought].[Family Size];

Query 6: Coupon origins based on female occupations.

SQL code:

TRANSFORM Count([FemaleOcc-Coupons].[Customer ID]) AS [CountOfCustomer ID]
SELECT [FemaleOcc-Coupons].Desc, Count([FemaleOcc-Coupons].[Customer ID]) AS [Total Of Customer ID]
FROM [FemaleOcc-Coupons]
GROUP BY [FemaleOcc-Coupons].Desc
PIVOT [FemaleOcc-Coupons].[Coupon Origin];

Query 7: Ice Cream by Day of Week

SQL code:

Query 8: Coupon origin and family size of unemployed females.

SQL code:

TRANSFORM Count([IceCream-ByDay].[Item Type]) AS [CountOfItem Type]

SELECT [IceCream-ByDay].Description, Count([IceCream-ByDay].[Item Type]) AS [Total Of

Item Type]

FROM [IceCream-ByDay]

GROUP BY [IceCream-ByDay].Description

PIVOT [IceCream-ByDay].Day;

TRANSFORM Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]

SELECT [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc,

Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]

FROM [NonEmployFemale-CouponOrig]

GROUP BY [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc

PIVOT [NonEmployFemale-CouponOrig].CouponDesc;



Query 9: Item types bought by cable TV subscribers, by male occupation.

SQL code:

TRANSFORM Count([TVAds-MenOcc].[Cable TV]) AS [CountOfCable TV]

SELECT [TVAds-MenOcc].Description, Count([TVAds-MenOcc].[Cable TV]) AS [Total Of Cable TV]

FROM [TVAds-MenOcc]

GROUP BY [TVAds-MenOcc].Description

PIVOT [TVAds-MenOcc].Desc;



Query 10: Item types bought by income amounts.

SQL code:

TRANSFORM Count([VolumeBought-Income].[Customer ID]) AS [CountOfCustomer ID]

SELECT [VolumeBought-Income].Incomeamt, Count([VolumeBought-Income].[Customer ID]) AS [Total Of Customer ID]

FROM [VolumeBought-Income]

GROUP BY [VolumeBought-Income].Incomeamt

PIVOT [VolumeBought-Income].Description;



Query 11: Item types bought by week.

SQL code:

TRANSFORM Count(WeeklyProducts.[Customer ID]) AS [CountOfCustomer ID]

SELECT WeeklyProducts.Description, Count(WeeklyProducts.[Customer ID]) AS [Total Of Customer ID]

FROM WeeklyProducts

GROUP BY WeeklyProducts.Description

PIVOT WeeklyProducts.Week;



Query 12: Family size, income, and children for Newsweek subscribers.

SQL code:

SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Income, Customer.Children
FROM Customer
WHERE (((Customer.[Subscription to Newsweek])=Yes))
ORDER BY Customer.Children DESC, Customer.[Family Size] DESC;

Query 13: Which customers bought snacks from which vendor on day 5.

SQL code:

SELECT Transaction.[Customer ID], Product.Description, Vendor.Vendor, Transaction.Day
FROM Vendor INNER JOIN (Product INNER JOIN [Transaction] ON Product.[Item Type] = Transaction.[Item Type]) ON Vendor.ItemType = Transaction.[Item Type]
WHERE (((Product.Description)="snack") AND ((Transaction.Day)=5));



Query 14: Number of coupons from origin 23, item type 10, with a coupon value greater than $0.99.

SQL code:

SELECT Transaction.[Coupon Origin], Product.[Item Type], Transaction.[Coupon Value (Cents)]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
WHERE (((Transaction.[Coupon Origin])=23) AND ((Product.[Item Type])=10) AND ((Transaction.[Coupon Value (Cents)])>99));



Query 15: Hot dog purchases by customers with more than one dog, with regard to family size, number of dogs, day and week.

SQL code:

SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Dogs, Product.Description, Transaction.Week, Transaction.[Units Bought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
WHERE (((Customer.Dogs)>1) AND ((Product.Description)="dogs"))
ORDER BY Customer.[Customer ID], Transaction.Week;

Query 16: Number of female cleaners who bought cleansers and number of units bought.

SQL code:

SELECT Customer.[Customer ID], Customer.[Female Occupation], Transaction.[Item Type], Transaction.[Units Bought]
FROM Customer INNER JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]
WHERE (((Customer.[Female Occupation])=8) AND ((Transaction.[Item Type])=6));



Query 17: Number of duplicate vendors and their item numbers.

SQL code:

SELECT First(Vendor.[Vendor]) AS [Vendor Field], First(Vendor.[ItemNumber]) AS [ItemNumber Field], Count(Vendor.[Vendor]) AS NumberOfDups
FROM Vendor
GROUP BY Vendor.[Vendor], Vendor.[ItemNumber]
HAVING (((Count(Vendor.[Vendor]))>1) AND ((Count(Vendor.[ItemNumber]))>1));
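
The duplicate check groups rows by vendor and item number and keeps only the groups that occur more than once. A minimal Python sketch of the same idea is shown here. The vendor names, item numbers, and the function name find_duplicates are made up for illustration and do not come from the Vendor table.

```python
from collections import Counter


def find_duplicates(rows):
    """Return {(vendor, item_number): count} for pairs that occur more than once."""
    counts = Counter((r["Vendor"], r["ItemNumber"]) for r in rows)
    # Same role as the HAVING clause: keep only groups with a count above 1.
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical vendor rows, not taken from the project data.
vendors = [
    {"Vendor": "Acme", "ItemNumber": 10},
    {"Vendor": "Acme", "ItemNumber": 10},
    {"Vendor": "Acme", "ItemNumber": 11},
    {"Vendor": "Bolt", "ItemNumber": 10},
]

dups = find_duplicates(vendors)
print(dups)
```

Only the repeated (vendor, item number) pair is returned, together with the number of times it appears, which corresponds to the NumberOfDups column.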



Query 18: Females subscribed to Better Homes & Gardens, and their occupations.

SQL code:

SELECT Customer.[Customer ID], Customer.[Subscription to Better H&G], FemaleOccLookup.Desc

FROM FemaleOccLookup INNER JOIN Customer ON FemaleOccLookup.FemaleOccID =

Customer.[Female Occupation]

WHERE (((Customer.[Subscription to Better H&G])=Yes));



Query 19: Male and female education with regard to income.

SQL code:

SELECT Customer.[Customer ID], Customer.Income, Customer.[Male Education],

Customer.[Female Education]

FROM IncomeLookup INNER JOIN Customer ON IncomeLookup.IncomeID = Customer.Income

WHERE (((Customer.Income)=6));



Query 20: Coupon origin of snacks bought by African Americans.

SQL code:

SELECT Customer.[Customer ID], RaceLookup.Desc, Product.Description, CouponLookup.CouponDesc

FROM CouponLookup INNER JOIN (RaceLookup INNER JOIN (Product INNER JOIN (Customer INNER

JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] =

Transaction.[Item Type]) ON RaceLookup.RaceID = Customer.Ethnicity) ON CouponLookup.CouponOrig

= Transaction.[Coupon Origin]

WHERE (((RaceLookup.Desc)="black") AND ((Product.Description)="snack"));



Query 21: How often customers purchased hot dogs, along with family size and units bought.

SQL code:

SELECT Customer.[Customer ID], First(Customer.[Family Size]) AS [Family Size], First(Customer.Dogs) AS Dogs, Count(Transaction.Week) AS CountOfWeek, First(Transaction.[Units Bought]) AS [UnitsBought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
GROUP BY Customer.[Customer ID]
HAVING (((First(Customer.Dogs))>1));

3) Graphics:

Figure 1: Ice Cream units sold by day of week.



Figure 2: The total coupons used and the origin of those coupons.

Figure 3: Sum of total items bought by ethnicity.



Figure 4: Distribution of coupons used valued greater than $0.99.

Figure 5: Total coupons used by female occupation.



Part 2 Discussion and Overall Recommendations:

Building the database in Access revealed many important insights into the transaction data. Through tables and queries, information buried in the bulk of the data can be brought out. The database was set up by first creating the necessary tables. The demographic data was imported as the Customer table. A Transaction table was also created, with Customer ID shared between the Transaction and Customer tables (a relationship). To make results easier to read, a Product table was created that lists the description of each item type instead of the corresponding numerical IDs. As queries were being written, many of the numerical codes for the demographics proved difficult to interpret. The following lookup tables were therefore created to show descriptions when called upon in a query: CouponLookup, IncomeLookup, RaceLookup, FemaleOccLookup, MaleAgeLookup, and MaleOccLookup. By using the actual descriptions from the demographic data, the store owner can easily identify anything he or she may be searching for during data analysis.

As queries were being written, certain trends became apparent. The grocery store depends on people buying its products, so finding relationships between types of customers and the types of products they buy can be very beneficial. Linking transaction data to customer demographics is crucial to the marketing and sales of any company.

To analyze some of the findings from the queries, we will start with the most general query: the quantity of items bought over the 102-week period. A total of 49,576 items were bought. Of these, 213 were bought by Hispanic customers, 255 by customers listed as other, 4,373 by African American customers, and 44,735 by Caucasian customers. The top 3 items bought were snacks at 6,379 items, eggs at 5,059, and cereal at 4,859. This data already shows that the store is located where the primary customers are of Caucasian ethnicity.
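
As a quick check of the figures above, the four ethnicity counts can be summed and converted to shares of the total. The short Python sketch below uses only the numbers reported in this paragraph; the variable names are illustrative.

```python
# Purchase counts by ethnicity, as reported in the discussion.
purchases = {
    "Hispanic": 213,
    "Other": 255,
    "African American": 4373,
    "Caucasian": 44735,
}

total = sum(purchases.values())
shares = {group: count / total for group, count in purchases.items()}

print(total)
for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```

The counts add up to the reported total of 49,576 items, and Caucasian customers account for roughly 90 percent of them.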

Coupons can be an integral part of a grocery store's sales. Analyzing coupon data shows which types of items coupons are used with most often, where the coupons come from, and what types of people use them. The queries CouponItemInteraction, CouponAmounts, and FemaleOcc-Coupons answer a few of these questions. The top 3 items bought with coupons valued greater than $0.99 were cereal with 397 instances, detergents with 147, and coffee with 139. With nearly five times as many coupons as the next-largest origin, the Sunday supplement accounted for 60 percent of all coupons used at this store, with 2,014 of the 3,376 coupons used. Newspapers were the next largest source of coupons with 424 transactions. Females accounted for the bulk of coupon use, and the most common occupation among these coupon users was retired. The next two female occupations with the highest coupon use were unemployed and clerical, respectively. From this data, there is probably a large population of retirees who shop at this store. For targeted marketing, we recommend advertising to Caucasian females in the Sunday supplement for items the store wants to sell more of.
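
The coupon figures quoted in this paragraph can be checked with simple arithmetic. The Python sketch below uses only the three counts reported in the text (2,014 Sunday supplement coupons, 424 newspaper coupons, and 3,376 coupons in total); the variable names are illustrative.

```python
# Coupon counts as reported in the discussion.
total_coupons = 3376
sunday_supplement = 2014
newspaper = 424

# Share of all coupons that came from the Sunday supplement.
share = sunday_supplement / total_coupons
# How many times larger the Sunday supplement count is than the newspaper count.
ratio = sunday_supplement / newspaper

print(f"Sunday supplement share: {share:.1%}")
print(f"Sunday supplement vs. newspaper: {ratio:.2f}x")
```

With these counts, the Sunday supplement share rounds to 60 percent, and its count is 4.75 times the newspaper count.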

This is just one example of what can be pulled from this transaction data. Many more inferences can be made by using different queries. We found that there were many lessons to be learned from this project. Databases can be a huge advantage when sorting through large amounts of data to make inferences. The ability to select on certain criteria from a large data set can return a small, comprehensible set of data. In the transaction data, we found strong evidence of the demographic prevalence in the store's region. Combined with this demographic data, buying tendencies made it easy to infer which items sell and which do not. This project proved very beneficial for learning hands-on the importance of data analytics and databases. This experience will prove to be extremely useful in our future fields.

Group Roles

Matt Murphy: Part 1 - #(2,3,4,5)
Brian Leap: Part 2 - #(1,2,3)
Troy McCrum: Part 1 - #(1,6) and Part 2 - #(2)

Appendix

Part 1:

Figure 1: Equations used for question 1, prediction errors.



      C1        C2        C3
0     0.018315  0         0
1     0.087912  0.105263  0.078947
2     0.069597  0         0.052632
3     0.058608  0.078947  0.105263
4     0.120879  0.052632  0.184211
5     0.087912  0.105263  0.157895
6     0.164835  0.052632  0.236842
7     0.153846  0.210526  0.078947
8     0.128205  0.184211  0.078947
9     0.03663   0.052632  0.026316
10    0.029304  0.052632  0
11    0.043956  0.105263  0

      C1        C2        C3
0     0         0         0
1     0.876364  0.885246  0.769231
2     0.105455  0.114754  0.230769
3     0.010909  0         0
4     0         0         0
5     0.007273  0         0

      C1    C2    C3    C4    C5    C6
0     0.00  0.00  0.00  0.00  0.00  0.00
1     0.12  0.06  0.14  0.07  0.17  0.14
2     0.36  0.63  0.34  0.30  0.33  0.46
3     0.24  0.00  0.34  0.23  0.25  0.03
4     0.15  0.19  0.06  0.33  0.25  0.19
5     0.08  0.00  0.09  0.03  0.00  0.05
6     0.05  0.13  0.02  0.03  0.00  0.14

# of TVs      # of Customers
0             8
1             63
2             126
3             75
No Response   77

Table 1: Percentage of people from each series (or cluster) grouped by how many family members they have (see key in Data Dictionary)

Table 2: Percentage of people from each series (cluster) grouped by how much they make annually (see key in Data Dictionary)

Table 3: Percentage of people from each series (cluster) grouped by race (see key in Data Dictionary)

Table 4: Number of TVs owned by customers



                                  C1      C2      C3
Newspaper subscriber              0.4652  0.5789  0.6053
Subscription to Better H&G        0.2491  0.1579  0.2632
Subscription to Good House        0.1062  0.1842  0.1316
Subscription to Ladies HJ         0.1136  0.1316  0.2105
Subscription to McCalls           0.1465  0.0526  0.1316
Subscription to Redbook           0.0733  0.0789  0.0526
Subscription to Reader's Digest   0.2601  0.2632  0.2632
Subscription to Cosmopolitan      0.0256  0.0000  0.0263
Subscription to TV Guide          0.1575  0.1053  0.1842
Subscription to People            0.0293  0.0000  0.0526
Subscription to Glamour           0.0183  0.0000  0.0263
Subscription to Time              0.0806  0.0526  0.0000
Subscription to Newsweek          0.0623  0.0263  0.0000

                                  C1      C2      C3      C4      C5      C6
Newspaper                         0.4895  0.5000  0.5000  0.5667  0.5833  0.4054
Subscription to Better H&G        0.2632  0.3750  0.1875  0.1667  0.1667  0.2432
Subscription to Good House        0.1158  0.1250  0.1250  0.1000  0.3333  0.0541
Subscription to Ladies HJ         0.1211  0.1875  0.1406  0.1000  0.0833  0.1351
Subscription to McCalls           0.1263  0.1250  0.1719  0.0333  0.2500  0.1622
Subscription to Redbook           0.0895  0.1250  0.0313  0.0333  0.1667  0.0270
Subscription to Reader's Digest   0.2263  0.1875  0.3281  0.3667  0.0833  0.3243
Subscription to Cosmopolitan      0.0211  0.0625  0.0000  0.0000  0.0833  0.0541
Subscription to TV Guide          0.1684  0.0625  0.1563  0.1333  0.0833  0.1622
Subscription to People            0.0105  0.0000  0.0781  0.0333  0.1667  0.0000
Subscription to Glamour           0.0158  0.0625  0.0000  0.0333  0.0833  0.0000
Subscription to Time              0.0789  0.0625  0.0625  0.0000  0.0833  0.0811
Subscription to Newsweek          0.0579  0.0000  0.0469  0.0667  0.0833  0.0270

Table 5: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Crackers vs. Cookies

Table 6: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Eggs vs. Cereal



                                  C1      C2      C3
Newspaper                         0.4909  0.5246  0.3846
Subscription to Better H&G        0.2436  0.2459  0.1538
Subscription to Good House        0.0982  0.2295  0.0000
Subscription to Ladies HJ         0.1091  0.2295  0.0000
Subscription to McCalls           0.1200  0.1967  0.1538
Subscription to Redbook           0.0655  0.0984  0.0769
Subscription to Reader's Digest   0.2582  0.3115  0.0769
Subscription to Cosmopolitan      0.0218  0.0328  0.0000
Subscription to TV Guide          0.1455  0.1967  0.1538
Subscription to People            0.0218  0.0656  0.0000
Subscription to Glamour           0.0182  0.0164  0.0000
Subscription to Time              0.0764  0.0492  0.0000
Subscription to Newsweek          0.0545  0.0492  0.0000

Table 7: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Pizza vs. Hotdogs
