analytics final project
8/9/2019 Analytics Final Project
IE 330
Final Project
12/10/14
Group 2:
Brian Leap
Matthew Murphy
Troy McCrum
Introduction
When people receive coupons in the mail or see advertisements in their favorite magazines,
they often interpret such events as random or maybe they think that companies are just trying to cast a
wide net to drum up business. In some cases, these things could be true; however, randomness usually
has nothing to do with it at all. What prospective customers may not know, or perhaps do not want to
know, is that companies know a lot more about them than they think. And random appearances of
coupons in the mailbox are in fact conscious, targeted efforts from marketing teams to persuade a
certain type of consumer to purchase a certain type of product.
So how do companies know so much about their customers? Any time a customer uses a credit
card, completes a survey, uses a coupon, subscribes to a magazine, etc., information about that
customer is being recorded and saved into a database. All sorts of customer information are valuable to
a company. A customer's transactions, race, income, family size, pets, age, and occupation are all
examples of useful characteristics that companies can utilize in designing a marketing strategy. The
more information a company can gain about a customer, the better that company can market to that
customer. Additionally, as companies accumulate MORE information about MORE customers, insights
can be made about customers as a group rather than individually. In other words, companies can assign
individuals to a group and then simply market to that group, instead of marketing from customer to
customer. This process of grouping similar customers together is known as clustering. Clustering is an
essential technique because the time and resources spent in marketing to each customer individually
would far outweigh any sort of potential monetary gains.
As companies accrue more and more customers, the thought of poring over customer data
by hand becomes impossible, and even somewhat laughable. The solution to this problem is a database.
A database of customer information to a marketing team is similar to the Periodic Table of Elements to a
chemist. Both a database and the periodic table are simply collections of information that are organized
in a certain way. To the average person, they are nothing more than that. However, for someone with
the necessary knowledge and tools, the information can be manipulated to create exciting new ideas.
With the help of analytics, marketing teams are able to query, or intelligently extract, data contained in
a database in order to make inferences about customers. Extracted data can then be converted into
tables, charts, graphs, or other types of graphical visualizations to be used in customer and marketing
analysis. Without a well-organized database, a technique like clustering would be impossible. Therefore,
using analytics to gain insight into customer and market behavior all starts with a good database.
Problem Description
The problem presented in this project is that a supermarket has an enormous amount of
transaction and demographic data pertaining to its customers, but the marketing team needs help in
finding ways to use this data to develop marketing and advertising strategies.
The project is divided into two parts:
The first part of the project deals with predictive analytics. This requires substantial use of
Microsoft Excel. Using Excel the team must calculate moving averages and exponential smoothing,
construct time series graphs, utilize regression lines, use k-means clustering to group similar customers
together, and use demographics data to form insights about the customer base. Using this information,
sales quantities of top-selling items must be predicted and methods of promoting and advertising
products must be developed.
The second part of the project deals primarily with databases. Part 2 requires the team to use
Microsoft Access to create database tables for analysis and perform queries to find interesting trends in
the data. Also, additional graphical visuals must be created outside of Access or Excel by using R. Again,
the team must make inferences and marketing recommendations based on the analysis.
Project Objectives
Part 1 (Excel)
o Calculate prediction error: Mean Squared Error, Mean Absolute Deviation, and Tracking Signal
o Construct time series graphs
o Use OLS regression to predict future sales
o Use k-means clustering to:
    - Group customers with similar buying habits
    - Find demographic commonalities within those groups
    - Make marketing recommendations based on the analysis
o Use demographic data to derive insights into the customer base
Part 2 (Access)
o Create database tables for analysis
o Identify relationships between tables
o Perform queries on various topics to gain insights
o Create advanced graphical visuals to display data
Part 1: Analytics
Moving Average and Exponential Smoothing
1)
Figure 1: Prediction error for weeks 10 through 20 for item types 3 and 17. The figure shows
the Mean Squared Error, Mean Absolute Deviation, and Tracking Signal.

Item Type 3                        Mean Squared Error   Mean Absolute Deviation   Tracking Signal
4-period moving average            1235.22              24.32                     -1.03
2-period weighted moving average   1481.34              28.71                     -0.56
Exponential smoothing              1192.31              23.9                      -0.83
Item Type 17
4-period moving average            868.05               25.18                     5.58
2-period weighted moving average   801.37               23.29                     4.59
Exponential smoothing              783.55               25.35                     6.63

Regression
2)
Figure 2: Time series for the sales of snacks (Item 17) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 3: Time series for the sales of eggs (Item 12) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 4: Time series for the sales of butter (Item 3) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 5: Time series for the sales of cookies (Item 8) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 6: Time series for the sales of cereal (Item 5) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 7: Time series for the sales of BBQ (Item 2) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 8: Time series for the sales of cat food (Item 4) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 9: Time series for the sales of ice cream (Item 13) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
Figure 10: Time series for the sales of crackers (Item 9) over the 104-week period.
Also included are the trend line, regression equation, and R² value.
3)
Table 2: Regression Equations for Top 3 Selling Items Using Only First 60 Weeks of Data
Item Type   Item Description   Regression Equation
17          Snacks             y = 0.0423x + 85.113
12          Eggs               y = -0.1012x + 76.178
3           Butter             y = -0.4956x + 81.59
Figure 12: Time Series for the Sales of Snacks (Item 17) over the 104 Week Period
Regression line displayed is based on the results from the first 60 weeks
Figure 13: Time Series for the Sales of Eggs (Item 12) over the 104 Week Period
Regression line displayed is based on the results from the first 60 weeks
Figure 14: Time Series for the Sales of Butter (Item 3) over the 104 Week Period
Regression line displayed is based on the results from the first 60 weeks
4)
Table 3: Projected Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80 using
Regression Equations obtained from first 60 Weeks of Data
Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
61     87.6933            70.0048          51.3584
62     87.7356            69.9036          50.8628
63     87.7779            69.8024          50.3672
64     87.8202            69.7012          49.8716
65     87.8625            69.6000          49.3760
66     87.9048            69.4988          48.8804
67     87.9471            69.3976          48.3848
68     87.9894            69.2964          47.8892
69     88.0317            69.1952          47.3936
70     88.0740            69.0940          46.8980
71     88.1163            68.9928          46.4024
72     88.1586            68.8916          45.9068
73     88.2009            68.7904          45.4112
74     88.2432            68.6892          44.9156
75     88.2855            68.5880          44.4200
76     88.3278            68.4868          43.9244
77     88.3701            68.3856          43.4288
78     88.4124            68.2844          42.9332
79     88.4547            68.1832          42.4376
80     88.4970            68.0820          41.9420
Table 4: Actual Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80
Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
61     59                 104              38
62     44                 36               24
63     49                 36               41
64     71                 39               37
65     75                 47               35
66     64                 46               41
67     52                 34               41
68     54                 45               79
69     47                 38               40
70     72                 58               24
71     53                 43               39
72     46                 71               74
73     55                 95               49
74     48                 27               53
75     77                 50               128
76     91                 63               68
77     71                 147              71
78     43                 61               43
79     62                 62               82
80     52                 94               167
Clustering
5)
Figure 15: Graphical Representation of K-Means Clustering using K=3 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 16: Graphical Representation of K-Means Clustering using K=4 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 17: Graphical Representation of K-Means Clustering using K=5 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 18: Graphical Representation of K-Means Clustering using K=6 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 19: Graphical Representation of K-Means Clustering using K=3 for Units of Eggs
Purchased vs. Units of Cereal Purchased by Each Customer over the 104 Week Span
Figure 20: Graphical Representation of K-Means Clustering using K=3 for Units of Hotdogs
Purchased vs. Units of Pizza Purchased by Each Customer over the 104 Week Span
General
6)
Figure 21: Graphical representation of race of store customers. (Chart of Race (%) with categories
White, Black, Hispanic, and Oriental.)
Figure 22: Chart representing total number of customers falling into each income bracket. (Axis label:
Customer Income; brackets: 0-10k, 10-11.9k, 12-14.9k, 15-19.9k, 20-24.9k, 25-34.9k, 35-44.9k,
45-54.9k, 55-64.9k, 65-74.9k, and 75+k.)
Figure 23: Chart representing total number of customers in occupation by field. (Legend: Male, Female.)
Part 1 Discussion and Overall Recommendations:
1)
Problem #1 involved using the solution to question 5 from homework 7. This question asks for the sales
forecast of the two highest selling items, which are item types 3 and 17. The following forecasts were
completed in homework 7: 4-period moving average, 2-period weighted moving average, and
exponential smoothing. Using these solutions and the equations shown in Figure 1A in the Appendix, the
mean squared error, mean absolute deviation, and tracking signal were computed for each forecast.
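As a rough illustration of the three error measures named above, the following Python sketch uses the standard textbook definitions, with error taken as actual minus forecast. The report's own equations are in Figure 1A of the Appendix, which is not reproduced here, so these definitions are an assumption. The function names and the sales numbers are hypothetical and are not the supermarket data.

```python
def moving_average_forecast(sales, window=4):
    """Forecast each period as the mean of the previous `window` actual values."""
    return [sum(sales[t - window:t]) / window for t in range(window, len(sales))]


def error_metrics(actual, forecast):
    """Return (MSE, MAD, tracking signal), with error = actual - forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n          # mean squared error
    mad = sum(abs(e) for e in errors) / n         # mean absolute deviation
    # Tracking signal: running sum of forecast errors divided by MAD.
    tracking_signal = sum(errors) / mad if mad else 0.0
    return mse, mad, tracking_signal


# Hypothetical weekly unit sales (NOT the project data).
sales = [80, 90, 70, 100, 85, 95, 75, 105]
forecast = moving_average_forecast(sales, window=4)
actual = sales[4:]  # only periods that have a forecast
mse, mad, ts = error_metrics(actual, forecast)
print(f"MSE={mse:.2f}  MAD={mad:.2f}  tracking signal={ts:.2f}")
```

Under this sign convention, a tracking signal well above zero means the forecast tends to run below actual sales, and one well below zero means it tends to run above them.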
According to the results, exponential smoothing produced the lowest Mean Squared Error for both item
types. However, the error values are rather large, which suggests that these forecasts of the sales data
may not be very reliable. We recommend that the store owner avoid relying on these sales forecasting
methods, but choose exponential smoothing if one is needed.
2)
Problem #2 provides some interesting insight into the buying patterns of certain items. With the help of
a pivot table, it was easy to condense the data to determine the top ten most frequently purchased
items. After determining what these top ten items were, a time series graph was created for each
individual item. The last aspect of the time series that was implemented was the regression line (trend
line). The equation of this trend line allows the user to plug in any value for week number to get an
estimate of how many units of a particular item will be purchased during that given week. To rate the
accuracy of this trend line in predicting the actual number of units sold, the coefficient of determination
(R²) was also computed.
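The trend line and its R² value were produced in Excel. As a minimal sketch of the same calculation, the Python below fits an ordinary least squares line and computes R² from the residuals. The function names and the five data points are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def fit_trend_line(weeks, units):
    """Ordinary least squares fit of units = slope * week + intercept."""
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(units) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, units))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(weeks, units, slope, intercept):
    """Share of the variation in units that the fitted line explains."""
    mean_y = sum(units) / len(units)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(weeks, units))
    ss_tot = sum((y - mean_y) ** 2 for y in units)
    return 1 - ss_res / ss_tot


# Hypothetical weekly unit sales (NOT the project data).
weeks = [1, 2, 3, 4, 5]
units = [82, 79, 85, 76, 78]
slope, intercept = fit_trend_line(weeks, units)
r2 = r_squared(weeks, units, slope, intercept)
print(f"y = {slope:.4f}x + {intercept:.3f}, R^2 = {r2:.3f}")
```

An R² near 0 means the week number explains almost none of the week-to-week variation in sales, even if the slope itself is clearly negative or positive.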
No items by themselves had any particularly interesting trends; however, there were a couple of
interesting aspects that each of these ten plots had in common. The first noticeable common
occurrence was the sharp decrease in sales of each item during week 57. This could have been caused
by a variety of factors. For example, maybe week 57 was a holiday week, so customers were busy
spending their time at home with family instead of at the store. Another possibility is that the
store was closed for a few days that week, thus decreasing its total sales. No conclusive answer can be
obtained from the data given, but something likely happened during week 57 to cause this
decrease in sales.
Another interesting feature that all ten items had in common was a negatively sloping regression line.
This means that the sales of every one of these items are projected to decrease continually as time
goes on. Basically, this means that the store has a problem. If sales of each of its ten most
popular products continue to decrease as the regression model suggests, the store will
bring in less and less money. In the long run, if this trend goes on for long enough, the
projected decrease in sales could even lead to the store going out of business.
Finally, one last notable feature that all ten items shared was a very small R² value. The
R² value (the coefficient of determination) essentially shows how well the regression line fits the
observed data. This value falls somewhere between 0 and 1, with 0 representing no explanatory power
and 1 representing a perfect fit of the regression line. Out of all ten items, the largest
R² value was 0.2358. One of the values was even as low as 0.002. These extremely low
R² values indicate that the regression line is not very accurate in predicting the
actual number of units of an item sold given the week number.
3)
Problem #3 dives further into the concept of regression and using a regression line to predict future
response. In this problem, only the top three most frequently purchased items were considered. A
regression line was formed using only the sales data from the first 60 weeks. After this regression line
was calculated, it was plotted on top of the time series plot of all
104 weeks of data for its corresponding item type. This created a visual that made it easy to see just
how accurate each individual regression line was in predicting future response.
After all of these steps were carried out, the most notable occurrence was the great change in the
equations for the regression lines from what they were when all 104 weeks of data were considered. The
initial equations were y = -0.0593x + 84.136 for item 17, y = -0.0168x + 72.545 for item 12, and
y = -0.1308x + 72.062 for item 3. However, once the parameters were changed to only include the first 60
weeks, the new regression equations were y = 0.0423x + 85.113 for item 17, y = -0.1012x + 76.178 for
item 12, and y = -0.4956x + 81.59 for item 3. One notable occurrence was the change in slope for item
17. When all 104 weeks of data were used, the slope of the regression line was negative. However,
when only the first 60 weeks of data were considered, the slope of the regression line became positive.
This means that sales are projected to gradually increase when only the first 60 weeks of data are used,
but when all 104 weeks are considered, sales of item 17 are projected to decrease as time goes on.
Another noticeable change was the much steeper negative slope for items 12 and 3 when only the first
60 weeks of data are considered. The fitted line falls about six times as fast for item 12 (-0.1012 vs.
-0.0168 units per week) and almost four times as fast for item 3 (-0.4956 vs. -0.1308 units per week)
when only the first 60 weeks are used in forming the regression line as opposed to all
104 weeks. However, it should be noted that in both of these cases, the intercept starts at a higher
value for the regression line from only the first 60 weeks, thus somewhat compensating for the
steeper negative slope. Overall, the great changes that take place in the equations for the
regression lines when more data is used indicate the lack of accuracy of the regression lines in
predicting the response (units sold).
4)
The goal of problem #4 is to predict the number of units that will sell each week, starting from
week 61 and stopping at week 80. These predictions were made using the regression lines found from
considering only the first 60 weeks of data. To obtain these predictions, the week number was
plugged into the regression equation as the x value. The predicted values and the actual values
can be seen in the Results Section.
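The plug-in step can be sketched in a few lines of Python. The slopes and intercepts below are the first-60-week regression equations listed in Table 2; the dictionary layout and the `predict` helper are illustrative names, not part of the original Excel workbook.

```python
# First-60-week regression equations from Table 2: units = slope * week + intercept.
equations = {
    "Item 17 - Snacks": (0.0423, 85.113),
    "Item 12 - Eggs": (-0.1012, 76.178),
    "Item 3 - Butter": (-0.4956, 81.59),
}


def predict(week, slope, intercept):
    """Plug the week number into the regression equation as the x value."""
    return slope * week + intercept


# Projected units for weeks 61 through 80, rounded to four decimals as in Table 3.
projections = {
    item: {week: round(predict(week, slope, intercept), 4) for week in range(61, 81)}
    for item, (slope, intercept) in equations.items()
}

for item, by_week in projections.items():
    print(item, "week 61:", by_week[61], "week 80:", by_week[80])
```

Because each equation is a straight line, the projections change by the same amount (the slope) every week, which is why the projected columns in Table 3 move so smoothly compared with actual sales.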
It is evident that the predictions obtained through regression are not very accurate for predicting future
sales. For item 17, the actual sales from weeks 61 through 80 are significantly lower than the predicted
values. The predicted values are also noticeably larger than actual sales for item 12. There are a few
instances where actual units sold exceed the projections, but for the most part, the actual
sales are less than the predicted values. Like the other two items, the projections for item 3 are not
accurate either. The regression line indicates a quick decrease in units sold, while in actuality the
values seem to be gradually increasing.
Overall, after comparing the projections from the regression lines to the actual number of units of each
item sold each week, it is clear that there is a lot of error in this prediction method. If the owner of the
store wants to better predict the number of items that will be sold in the future, forecasting methods
such as Moving Average or Exponential Smoothing may create more accurate results.
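To put a number on the gap for one item, the sketch below compares the Item 17 (snacks) projections with the actual weekly units listed in Table 4. The regression coefficients come from Table 2; the variable names and the choice of mean absolute error as the summary are assumptions made for illustration.

```python
# Actual units of snacks (Item 17) sold in weeks 61-80, as listed in Table 4.
actual_snacks = [59, 44, 49, 71, 75, 64, 52, 54, 47, 72,
                 53, 46, 55, 48, 77, 91, 71, 43, 62, 52]
weeks = list(range(61, 81))

# Regression line for Item 17 fitted on the first 60 weeks (from Table 2).
slope, intercept = 0.0423, 85.113
predicted_snacks = [slope * week + intercept for week in weeks]

mean_actual = sum(actual_snacks) / len(actual_snacks)
mean_predicted = sum(predicted_snacks) / len(predicted_snacks)
mean_abs_error = sum(
    abs(a - p) for a, p in zip(actual_snacks, predicted_snacks)
) / len(weeks)

print(f"mean actual:    {mean_actual:.2f}")
print(f"mean predicted: {mean_predicted:.2f}")
print(f"mean abs error: {mean_abs_error:.2f}")
```

On these figures the regression line projects roughly 88 units per week while actual sales average about 59, so the line overshoots by close to 29 units in a typical week.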
5)
Problem #5 dealt with clustering customers by their purchases of two different items. Clustering is a very
helpful strategy when it comes to breaking down and studying data, and customers are an example of
something that is often clustered. In this case, clustering allows the user to separate the customers into
groups so that different recommendations can be given to different groups based on the common
preferences of the customers in that group. K-means clustering is a very widely used clustering
technique. This method breaks the data down into k separate groups, with each group having its own
centroid. Each point is assigned to the cluster whose centroid is nearest to it, so every point ends up
closer to its own centroid than to the centroid of any other group.
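The project ran k-means in Excel. The Python below is only a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) that the description above refers to. The customer points, the starting centroids, and all function names are hypothetical; real implementations typically choose starting centroids at random and rerun several times.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
    return new_centroids


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its own centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical (crackers, cookies) purchase counts per customer (NOT the project data).
customers = [(1, 2), (2, 1), (2, 3), (10, 11), (11, 9), (12, 10), (1, 12), (2, 11)]
labels, centroids = k_means(customers, initial_centroids=[(0, 0), (10, 10), (0, 10)])
wcss = within_cluster_ss(customers, labels, centroids)
print(labels, centroids, round(wcss, 3))
```

One caution when comparing values of k: the within-cluster sum of squares generally shrinks as k grows, so a smaller deviation from the centroid does not by itself show that a larger k is the better grouping.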
For problem 5, the k-means clustering algorithm was used to break the data down into k separate
groups. The first two items that were compared were items 8 and 9 (cookies and crackers). These items
were clustered using k values of 3, 4, 5, and 6. The graphs were created and then judged to determine
which value of k produced the best fit for the data. All of the graphs that were created can be seen in
the Results Section. K=6 appeared to create the best fit for the data on the graph because it creates the
most precise groups with points that seem to have the smallest deviation from their centroid. After this,
two more sets of items were clustered; however, k=3 was the only value of k that was used in those
cases. The first graph compared cereal and eggs, while the second graph
focused on hotdogs and pizza.
The reason that this clustering procedure was carried out is to help the store manager promote
products based on specific customer attributes. After the customers were assigned to a specific cluster,
the background information on each customer was examined to determine if there were any noticeable
similarities in the characteristics of customers within clusters. This background information was given in
the Demographics Data.
The cluster analysis using k=6 groups for crackers (Item 9) purchased vs. cookies (Item 8)
purchased was examined first to determine the buying patterns of the customers in each of the clusters.
The other three graphs for crackers vs. cookies using k=3, 4, and 5 were not considered here because it
was decided that k=6 created the best and most accurate results. In each of the graphs, clusters are
referred to as Series 1, Series 2, etc. Series 1 contains the customers who buy very little of both
products, Series 2 contains customers who buy a lot of crackers and a substantial amount of cookies,
Series 3 contains customers who buy some cookies but not a lot of crackers, Series 4 contains customers
who buy some crackers but not a lot of cookies, Series 5 contains customers who buy a lot of cookies
and a substantial amount of crackers, and Series 6 contains the customers who buy a substantial
amount of cookies and some crackers. To get a better understanding of the customers who belonged to
each of these groups, Family Size was useful in differentiating between clusters. The two groups that
were focused on were Series 2 and Series 5 because those were the groups that contained the
customers who bought a large amount of crackers and a large amount of cookies, respectively. In Series 2,
the most common family size by far is 2 people, accounting for 63% of the people in that group.
However, in Series 5, there are three family sizes that make up the majority of this group. A family size
of 2 makes up 33%, a family size of 3 makes up 25%, and a family size of 4 makes up 25%. Note that all
of these Demographic Data statistics can be found in the Appendix in table form. Now that the
company has some background on the customers belonging to each of the clusters, it must
decide how to allocate its advertising resources to try to increase the sales of cookies and crackers.
Based on this data, the type of customer who buys a lot of crackers belongs to Series 2 and, more
often than not, has a family size of 2 people. The type of customers who buy a lot of cookies are those
who have families of size 2, 3, or 4. In order to maximize profit from sales of both crackers and
cookies, crackers should be marketed mainly towards people who have family sizes of 2 people and
cookies should be marketed towards people who have families ranging from 2-4 people. In order to
advertise to these groups of people, TV would probably be the best way to get the word out. Out of the
people who responded, 264 out of the 272 people owned at least one television. However, if a
television advertisement is not economically feasible, other methods, such as a newspaper or magazine
advertisement, could work well. 50% of the people in Series 2 and 58.33% of the people in Series 5 read
the newspaper. If the store chooses a magazine advertisement, its best course of
action would be to put an ad for crackers in Better H&G because 37.5% of people that buy a lot of
crackers also subscribe to Better H&G. To reach the customers who buy a lot of cookies, an
advertisement should be placed in Good House because one third of the people who buy a lot of cookies
also subscribe to Good House.
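The per-cluster percentages quoted above were tabulated from the Demographics Data. A minimal Python sketch of that kind of tabulation is shown below. The record layout, field names, and sample rows are hypothetical stand-ins, not the project's customer table.

```python
from collections import Counter


def share_by_category(records, cluster_id, field):
    """Percentage of customers in one cluster that fall into each value of `field`."""
    values = [r[field] for r in records if r["cluster"] == cluster_id]
    counts = Counter(values)
    return {value: 100 * n / len(values) for value, n in counts.items()}


# Hypothetical customer records (NOT the project data).
records = [
    {"customer_id": 1, "cluster": 2, "family_size": 2},
    {"customer_id": 2, "cluster": 2, "family_size": 2},
    {"customer_id": 3, "cluster": 2, "family_size": 3},
    {"customer_id": 4, "cluster": 2, "family_size": 2},
    {"customer_id": 5, "cluster": 5, "family_size": 4},
    {"customer_id": 6, "cluster": 5, "family_size": 3},
]

profile = share_by_category(records, cluster_id=2, field="family_size")
print(profile)
```

The same helper could be pointed at any other demographic field, such as newspaper or magazine subscriptions, to build the comparison across clusters.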
The next cluster analysis compared the amount of eggs (Item 12) purchased
vs. the amount of cereal (Item 5) purchased by each customer. This cluster analysis only used k=3
groups. Series 1 contained the customers who did not buy a lot of either product, Series 2 contained the
customers who bought a lot of cereal, and Series 3 contained the customers who bought a lot of eggs.
When the demographics data of all of the customers was studied by cluster, there were some
interesting results. Annual Income was one factor that produced some interesting statistics.
Out of all of the people in Series 2, over 60% make more than $35,000 per year. In
Series 3, on the other hand, less than 19% of people make more than $35,000 per year. Since there is
clearly some association between annual income and product purchased, the store should take this into
account for advertising purposes. For the store to increase sales of both of these items, the best course
of action would be to advertise cereal to people who make more than $35,000 per year and
advertise eggs to the people who make less than $35,000 per year. Once again, the most efficient
yet most costly way to do this advertising would be to create a television commercial. If the
store would like to advertise in the newspaper instead, 57.89% of the people who buy a lot of cereal and
60.53% of the people who buy a lot of eggs subscribe to the newspaper. Finally, if the store chooses
a magazine ad, an ad aimed at the people who buy a lot of cereal should
be placed in Reader's Digest because 26.32% of the people who buy a lot of cereal subscribe. To reach
the people who buy a lot of eggs, a magazine ad should be placed in either Reader's Digest or Better
H&G because 26.32% of people who buy a lot of eggs subscribe to each of these magazines.
The final cluster analysis compared the number of units of pizza (Item 16)
purchased vs. the number of units of hotdogs (Item 11) purchased by each customer. Again, k=3 was
the value used for number of groups. Series 1 contained people who purchased very few hotdogs and
pizzas, Series 3 contained the people who purchased a lot of hotdogs, and Series 2
contained everyone who did not fit either of those criteria. In this case, race was a factor that led to
some interesting results. Though white customers were the most common in the
study, the data shows some notable evidence pertaining to the preferences of African Americans. In
Series 1 and 2, the percentage of African Americans belonging to each cluster is just above 10%.
However, in Series 3, the percentage of African Americans jumps to above 23%. The
percentage of white customers belonging to Series 3 also drops more than 10% from both Series 1 and 2.
The store should take this association between race and product purchased into account
when trying to advertise pizza and hotdogs. Because African Americans make up only a little more
than 10% of the clusters outside the heavy hotdog buyers, the data suggests that pizza advertising
aimed at African American customers is less likely to pay off, and that white customers would be a
better marketing target for pizza. If an item is to be marketed to African Americans, it should be
hotdogs, because they make up a much larger share of the heavy hotdog buyers.
Once again, the statistics show that the most efficient way to reach consumers is
through a television advertisement. However, in this case a magazine or newspaper ad may not be a
good way to advertise. In Series 3, which would be the main group that the store
would want to advertise to because they buy such a large amount of hotdogs, only 38.46% of people
subscribe to the newspaper, which is very small compared to the other groups. Also, out of this group,
no magazine is subscribed to by more than 16% of the people, making this advertisement possibility also
somewhat ineffective. In order to reach the people in Series 3, the best method may
be to save up money to produce a TV commercial, because that seems to be the only really effective way
to reach them.
6)
When developing a marketing strategy, it is essential to know and understand one's customer base.
Using the Demographics data, some statistics were pulled in order to
develop such an understanding of the area of the store's location and the people who live there. The
percentages of ethnicities of the customers were found, as seen in Figure 21. This graph indicates that
the population of the town is predominantly white and has very low minority population
percentages. This suggests that the store is located in a more rural area, as opposed to a larger
city. More potential evidence for a rural, blue-collar town is that 34% of customers own at least one
dog. This number would likely not be so high in a city with people living mostly in apartment buildings.
The fact that so many people own dogs leads the team to recommend that the store
promote its dog food and other dog-related products to its customers. This area also seems like a rural
town with blue-collar people because of the annual income. The vast majority of customers fall into the
middle-class range, whereas only a few customers earn more than $75k per year. Surprisingly, the most
common occupation category among all the customers is retired: more customers are retired
than fall into any other category. This suggests that the store has a
customer base with a large share of elderly citizens. By using clustering and other analytic techniques,
the store has the capability to use all aforementioned statistics to gain an advantage with its sales and
marketing strategies.
Part 2: Databases
1)
Figure 1: Access Database Table Schema
Table 1: Transaction: Transaction Information Table
Table 2: CouponLookup: Coupon Description Table
Table 3: Product: Item Description Table
Table 4: Vendor: Vendor Description Table
Table 5: Customer: Customer Demographic Table
Table 6: RaceLookup: Race Description Table
Table 7: IncomeLookup: Income Table
Table 8: FemaleOccLookup: Female Occupation Table
Table 9: MaleOccLookup: Male Occupation Table
Table 10: MaleAgeLookup: Male Age Table
2)
Query 1: Crosstab query showing all female occupations, along with family size and
coupon origin.
SQL code:
TRANSFORM Count([AllOccFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]
SELECT [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc,
Count([AllOccFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]
FROM [AllOccFemale-CouponOrig]
GROUP BY [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc
PIVOT [AllOccFemale-CouponOrig].CouponDesc;
Query 2: Origins of all coupon transactions.
SQL code:
TRANSFORM Count(CouponAmounts.[Customer ID]) AS [CountOfCustomer ID]
SELECT CouponAmounts.CouponDesc, Count(CouponAmounts.[Customer ID]) AS [Total Of Customer ID]
FROM CouponAmounts
GROUP BY CouponAmounts.CouponDesc
PIVOT CouponAmounts.CouponID;
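The TRANSFORM ... PIVOT statements in this part are Access crosstab queries: the GROUP BY field supplies the row headings, the PIVOT field supplies the column headings, and each cell holds the Count. For readers less familiar with that syntax, the Python sketch below mimics the counting step. The rows and their coupon descriptions are hypothetical stand-ins for the CouponAmounts query, and the sketch assumes Customer ID is never empty, so counting rows matches counting Customer ID values.

```python
from collections import defaultdict


def crosstab_count(rows, row_field, column_field):
    """Count rows for every (row_field, column_field) pair, like a crosstab query."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_field]][row[column_field]] += 1
    return {row_key: dict(columns) for row_key, columns in table.items()}


# Hypothetical rows standing in for the CouponAmounts query (NOT the project data).
coupon_rows = [
    {"Customer ID": 1, "CouponDesc": "Newspaper", "CouponID": 1},
    {"Customer ID": 2, "CouponDesc": "Newspaper", "CouponID": 1},
    {"Customer ID": 3, "CouponDesc": "Mailer", "CouponID": 2},
    {"Customer ID": 1, "CouponDesc": "Mailer", "CouponID": 2},
    {"Customer ID": 4, "CouponDesc": "Mailer", "CouponID": 2},
]

# Rows keyed by CouponDesc (the GROUP BY field), columns keyed by CouponID (the PIVOT field).
result = crosstab_count(coupon_rows, row_field="CouponDesc", column_field="CouponID")
print(result)
```

The [Total Of Customer ID] column in the Access output corresponds to summing each row of this table across its columns.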
Query 3: Total coupons used with a value greater than $0.99, by item type.
SQL code:
TRANSFORM Count(CouponItemInteraction.[Customer ID]) AS [CountOfCustomer ID]
SELECT CouponItemInteraction.Description, Count(CouponItemInteraction.[Customer ID]) AS [Total Of Customer ID]
FROM CouponItemInteraction
WHERE (((CouponItemInteraction.[Coupon Value (Cents)])>99))
GROUP BY CouponItemInteraction.Description
PIVOT CouponItemInteraction.[Coupon Value (Cents)];
Query 4: Total items bought by ethnicity.
SQL code:
TRANSFORM Count(EthnicItemType.[Customer ID]) AS [CountOfCustomer ID]
SELECT EthnicItemType.Desc, Count(EthnicItemType.[Customer ID]) AS [Total Of Customer ID]
FROM EthnicItemType
GROUP BY EthnicItemType.Desc
PIVOT EthnicItemType.Description;
Query 5: Units bought compared to family size.
SQL code:
TRANSFORM Count([Family-Units Bought].[Customer ID]) AS [CountOfCustomer ID]
SELECT [Family-Units Bought].[Units Bought], Count([Family-Units Bought].[Customer ID]) AS [Total Of Customer ID]
FROM [Family-Units Bought]
GROUP BY [Family-Units Bought].[Units Bought]
PIVOT [Family-Units Bought].[Family Size];
Query 6: Coupon origins based on female occupations.
SQL code:
TRANSFORM Count([FemaleOcc-Coupons].[Customer ID]) AS [CountOfCustomer ID]
SELECT [FemaleOcc-Coupons].Desc, Count([FemaleOcc-Coupons].[Customer ID]) AS [Total Of Customer ID]
FROM [FemaleOcc-Coupons]
GROUP BY [FemaleOcc-Coupons].Desc
PIVOT [FemaleOcc-Coupons].[Coupon Origin];
Query 7: Ice Cream by Day of Week
SQL code:
TRANSFORM Count([IceCream-ByDay].[Item Type]) AS [CountOfItem Type]
SELECT [IceCream-ByDay].Description, Count([IceCream-ByDay].[Item Type]) AS [Total Of Item Type]
FROM [IceCream-ByDay]
GROUP BY [IceCream-ByDay].Description
PIVOT [IceCream-ByDay].Day;
Query 8: Coupon origin and family size of unemployed females.
SQL code:
TRANSFORM Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]
SELECT [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc,
Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]
FROM [NonEmployFemale-CouponOrig]
GROUP BY [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc
PIVOT [NonEmployFemale-CouponOrig].CouponDesc;
Query 9: Item types bought by cable TV subscribers, by male occupation.
SQL code:
TRANSFORM Count([TVAds-MenOcc].[Cable TV]) AS [CountOfCable TV]
SELECT [TVAds-MenOcc].Description, Count([TVAds-MenOcc].[Cable TV]) AS [Total Of Cable TV]
FROM [TVAds-MenOcc]
GROUP BY [TVAds-MenOcc].Description
PIVOT [TVAds-MenOcc].Desc;
Query 10: Item types bought by income amounts.
SQL code:
TRANSFORM Count([VolumeBought-Income].[Customer ID]) AS [CountOfCustomer ID]
SELECT [VolumeBought-Income].Incomeamt, Count([VolumeBought-Income].[Customer ID]) AS [Total Of Customer ID]
FROM [VolumeBought-Income]
GROUP BY [VolumeBought-Income].Incomeamt
PIVOT [VolumeBought-Income].Description;
Query 11: Item types bought by week.
SQL code:
TRANSFORM Count(WeeklyProducts.[Customer ID]) AS [CountOfCustomer ID]
SELECT WeeklyProducts.Description, Count(WeeklyProducts.[Customer ID]) AS [Total Of Customer ID]
FROM WeeklyProducts
GROUP BY WeeklyProducts.Description
PIVOT WeeklyProducts.Week;
Query 12: Family size and income with children, for Newsweek subscribers.
SQL code:
SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Income, Customer.Children
FROM Customer
WHERE (((Customer.[Subscription to Newsweek])=Yes))
ORDER BY Customer.Children DESC , Customer.[Family Size] DESC;
Query 13: Which customers bought snacks from which vendor on day 5.
SQL code:
SELECT Transaction.[Customer ID], Product.Description, Vendor.Vendor, Transaction.Day
FROM Vendor INNER JOIN (Product INNER JOIN [Transaction] ON Product.[Item Type] = Transaction.[Item Type]) ON Vendor.ItemType = Transaction.[Item Type]
WHERE (((Product.Description)="snack") AND ((Transaction.Day)=5));
Query 14: Coupons from origin 23 for item type 10 with a coupon value greater than $0.99.
SQL code:
SELECT Transaction.[Coupon Origin], Product.[Item Type], Transaction.[Coupon Value (Cents)]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
WHERE (((Transaction.[Coupon Origin])=23) AND ((Product.[Item Type])=10) AND ((Transaction.[Coupon Value (Cents)])>99));
Query 15: Hot dog purchases by customers with more than one dog, with family size, number of dogs, week, and units bought.
SQL code:
SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Dogs,
Product.Description, Transaction.Week, Transaction.[Units Bought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
WHERE (((Customer.Dogs)>1) AND ((Product.Description)="dogs"))
ORDER BY Customer.[Customer ID], Transaction.Week;
Query 16: Number of female cleaners who bought cleansers, and the units bought.
SQL code:
SELECT Customer.[Customer ID], Customer.[Female Occupation], Transaction.[Item Type], Transaction.[Units Bought]
FROM Customer INNER JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]
WHERE (((Customer.[Female Occupation])=8) AND ((Transaction.[Item Type])=6));
Query 17: Duplicate vendors and their item numbers, with the number of duplicates.
SQL code:
SELECT First(Vendor.[Vendor]) AS [Vendor Field], First(Vendor.[ItemNumber]) AS [ItemNumber Field], Count(Vendor.[Vendor]) AS NumberOfDups
FROM Vendor
GROUP BY Vendor.[Vendor], Vendor.[ItemNumber]
HAVING (((Count(Vendor.[Vendor]))>1) AND ((Count(Vendor.[ItemNumber]))>1));
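The duplicate check groups rows on the candidate key and keeps only groups with more than one row. A minimal sketch of the same idea, run against an in-memory SQLite database from Python, is shown here. The rows are hypothetical, and SQLite's dialect differs from Access (for example, it has no First() function), so this is an analogy rather than the project's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Vendor (Vendor TEXT, ItemNumber INTEGER)")
# Hypothetical rows; only the pair ('Acme', 1) appears more than once.
conn.executemany(
    "INSERT INTO Vendor VALUES (?, ?)",
    [("Acme", 1), ("Acme", 1), ("Acme", 2), ("Bolt", 1)],
)
# Group on both columns and keep only groups that contain duplicates.
duplicates = conn.execute(
    """
    SELECT Vendor, ItemNumber, COUNT(*) AS NumberOfDups
    FROM Vendor
    GROUP BY Vendor, ItemNumber
    HAVING COUNT(*) > 1
    """
).fetchall()
conn.close()
```

With these sample rows, `duplicates` holds a single entry for the repeated vendor and item number pair along with its count.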
Query 18: Females subscribed to Better Homes & Gardens, and their occupations.
SQL code:
SELECT Customer.[Customer ID], Customer.[Subscription to Better H&G], FemaleOccLookup.Desc
FROM FemaleOccLookup INNER JOIN Customer ON FemaleOccLookup.FemaleOccID =
Customer.[Female Occupation]
WHERE (((Customer.[Subscription to Better H&G])=Yes));
Query 19: Male and female education for customers at income level 6.
SQL code:
SELECT Customer.[Customer ID], Customer.Income, Customer.[Male Education],
Customer.[Female Education]
FROM IncomeLookup INNER JOIN Customer ON IncomeLookup.IncomeID = Customer.Income
WHERE (((Customer.Income)=6));
Query 20: Coupon origin of snacks bought by African Americans.
SQL code:
SELECT Customer.[Customer ID], RaceLookup.Desc, Product.Description, CouponLookup.CouponDesc
FROM CouponLookup INNER JOIN (RaceLookup INNER JOIN (Product INNER JOIN (Customer INNER
JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] =
Transaction.[Item Type]) ON RaceLookup.RaceID = Customer.Ethnicity) ON CouponLookup.CouponOrig
= Transaction.[Coupon Origin]
WHERE (((RaceLookup.Desc)="black") AND ((Product.Description)="snack"));
Query 21: How often customers purchased hot dogs, along with family size and units bought.
SQL code:
SELECT Customer.[Customer ID], First(Customer.[Family Size]) AS [Family Size], First(Customer.Dogs)
AS Dogs, Count(Transaction.Week) AS CountOfWeek, First(Transaction.[Units Bought]) AS [UnitsBought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] =
Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
GROUP BY Customer.[Customer ID]
HAVING (((First(Customer.Dogs))>1));
3)
Graphics:
Figure 1: Ice Cream units sold by day of week.
Figure 2: The total coupons used and the origin of those coupons.
Figure 3: Sum of total items bought by ethnicity.
Figure 4: Distribution of coupons used valued greater than $0.99.
Figure 5: Total coupons used by female occupation.
Part 2 Discussion and Overall Recommendations:
Building the database in Access revealed many important insights into the transaction data.
Through tables and queries, information hidden within the bulk of the data can be found.
The database was set up by first creating the necessary tables. The demographic data was
imported as the Customer table. A Transaction table was also created, with Customer ID
shared between the Transaction and Customer tables (the relationship). To make things easier to visualize,
a Product table was created that lists the description of each item type instead of the corresponding
numerical IDs. As queries were written, many of the numerical codes for the
demographics proved difficult to interpret. Thus the following lookup tables were created to show
descriptions when called in a query: CouponLookup, IncomeLookup, RaceLookup, FemaleOccLookup,
MaleAgeLookup, and MaleOccLookup. By using the actual descriptions from the demographic data, the
store owner can easily identify anything he or she may be searching for in the data analysis.
As queries were written, certain trends became apparent. The grocery store depends on people
buying its products, so finding relationships between types of customers and the types of products
they buy can be very beneficial. Linking transaction data to customer demographics is crucial to
any company's marketing and sales.
To analyze the findings from the queries, we start with the most general query: the quantity
of items bought over the 102-week period. A total of 49,576 items were bought. Of these,
213 purchases were made by Hispanic customers, 255 by customers listed as other, 4,373 by African
American customers, and 44,735 by Caucasian customers. The top 3 items bought were
snacks at 6,379, eggs at 5,059, and cereal at 4,859. This data already shows that the store is located
in an area where the primary customers are Caucasian.
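As a quick arithmetic check on the figures in this paragraph, the short Python sketch below sums the four reported purchase counts and computes each group's share of the total. The counts are the ones quoted in the text; the variable names are illustrative only.

```python
# Purchase counts by ethnicity, as reported in the discussion.
purchases = {
    "Hispanic": 213,
    "Other": 255,
    "African American": 4373,
    "Caucasian": 44735,
}

total = sum(purchases.values())  # the text reports 49,576 items in total
shares = {group: count / total for group, count in purchases.items()}

for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```

The four counts add up to the reported total, and Caucasian customers account for roughly nine out of every ten purchases.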
Coupons can be an integral part of a grocery store's sales. Analyzing coupon data shows
which types of items coupons are most often used with, where the coupons come from, and
what type of people use them. The queries CouponItemInteraction, CouponAmounts,
and FemaleOcc-Coupons answer some of these questions. The top 3 items bought with
coupons valued greater than $0.99 were cereal with 397 instances, detergents with 147, and coffee
with 139. The Sunday supplement accounted for about 60 percent of all coupons used at this store,
2,014 of the 3,376, nearly five times as many as any other origin.
Newspapers were the next largest source of coupons with 424 transactions. Females accounted for
the bulk of coupon use, and the most common occupation among these coupon users was retired. The next two
female occupations with the highest coupon use were unemployed and clerical, respectively. From
this data, there is probably a large population of retirees shopping at this store. To
target marketing, we would recommend targeting Caucasian females through the Sunday supplement for items
the store wants to sell more of.
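The coupon proportions quoted in this paragraph can be checked with a few lines of Python. The three counts come from the text; the variable names are illustrative only.

```python
total_coupons = 3376      # all coupons used, as reported
sunday_supplement = 2014  # coupons originating from the Sunday supplement
newspapers = 424          # next-largest origin, as reported

sunday_share = sunday_supplement / total_coupons
newspaper_share = newspapers / total_coupons
# How many times larger the Sunday supplement count is than the runner-up.
ratio_to_next = sunday_supplement / newspapers

print(f"Sunday supplement share: {sunday_share:.1%}")
print(f"Newspaper share: {newspaper_share:.1%}")
print(f"Sunday supplement vs. newspapers: {ratio_to_next:.2f}x")
```

On these figures the Sunday supplement share rounds to about 60 percent, and its count is a little under five times the newspaper count.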
This is just one example of what can be pulled from the transaction data. Many more
inferences can certainly be made by using different queries. We found
that there were many lessons to be learned from this project. Databases can be a huge
advantage when sorting through large amounts of data to make inferences. The ability to select
on specific criteria can reduce a large data set to a small, comprehensible one. In the transaction
data, we found strong evidence of the demographic makeup of the store's region. Combined with
this demographic data, buying tendencies made it easy to see which items sell
and which do not. This project proved very beneficial for learning hands-on the
importance of data analytics and databases. This experience will prove extremely useful in our
future fields.
Group Roles
Matt Murphy: Part 1 - #(2,3,4,5)
Brian Leap: Part 2 - #(1,2,3)
Troy McCrum: Part 1 - #(1,6) and Part 2 - #(2)
Appendix
Part 1:
Figure 1: Equations used for question 1, prediction errors.
C1 C2 C3
0 0.018315 0 0
1 0.087912 0.105263 0.078947
2 0.069597 0 0.052632
3 0.058608 0.078947 0.105263
4 0.120879 0.052632 0.184211
5 0.087912 0.105263 0.157895
6 0.164835 0.052632 0.236842
7 0.153846 0.210526 0.078947
8 0.128205 0.184211 0.078947
9 0.03663 0.052632 0.026316
10 0.029304 0.052632 0
11 0.043956 0.105263 0
C1 C2 C3
0 0 0 0
1 0.876364 0.885246 0.769231
2 0.105455 0.114754 0.230769
3 0.010909 0 0
4 0 0 0
5 0.007273 0 0
C1 C2 C3 C4 C5 C6
0 0.00 0.00 0.00 0.00 0.00 0.00
1 0.12 0.06 0.14 0.07 0.17 0.14
2 0.36 0.63 0.34 0.30 0.33 0.46
3 0.24 0.00 0.34 0.23 0.25 0.03
4 0.15 0.19 0.06 0.33 0.25 0.19
5 0.08 0.00 0.09 0.03 0.00 0.05
6 0.05 0.13 0.02 0.03 0.00 0.14
# of TVs # of Customers
0 8
1 63
2 126
3 75
No Response 77
Table 1: Percentage of people from each series (or cluster) grouped by how many family members they
have (see key in Data Dictionary)
Table 2: Percentage of people from each series (cluster) grouped by how much they make annually (see
key in Data Dictionary)
Table 3: Percentage of people from each series (cluster) grouped by Race (see key in Data Dictionary)
Table 4: Number of TVs owned by Customers
C1 C2 C3
Newspaper subscriber 0.4652 0.5789 0.6053
Subscription to Better H&G 0.2491 0.1579 0.2632
Subscription to Good House 0.1062 0.1842 0.1316
Subscription to Ladies HJ 0.1136 0.1316 0.2105
Subscription to McCalls 0.1465 0.0526 0.1316
Subscription to Redbook 0.0733 0.0789 0.0526
Subscription to Reader's Digest 0.2601 0.2632 0.2632
Subscription to Cosmopolitan 0.0256 0.0000 0.0263
Subscription to TV Guide 0.1575 0.1053 0.1842
Subscription to People 0.0293 0.0000 0.0526
Subscription to Glamour 0.0183 0.0000 0.0263
Subscription to Time 0.0806 0.0526 0.0000
Subscription to Newsweek 0.0623 0.0263 0.0000
C1 C2 C3 C4 C5 C6
Newspaper 0.4895 0.5000 0.5000 0.5667 0.5833 0.4054
Subscription to Better H&G 0.2632 0.3750 0.1875 0.1667 0.1667 0.2432
Subscription to Good House 0.1158 0.1250 0.1250 0.1000 0.3333 0.0541
Subscription to Ladies HJ 0.1211 0.1875 0.1406 0.1000 0.0833 0.1351
Subscription to McCalls 0.1263 0.1250 0.1719 0.0333 0.2500 0.1622
Subscription to Redbook 0.0895 0.1250 0.0313 0.0333 0.1667 0.0270
Subscription to Reader's Digest 0.2263 0.1875 0.3281 0.3667 0.0833 0.3243
Subscription to Cosmopolitan 0.0211 0.0625 0.0000 0.0000 0.0833 0.0541
Subscription to TV Guide 0.1684 0.0625 0.1563 0.1333 0.0833 0.1622
Subscription to People 0.0105 0.0000 0.0781 0.0333 0.1667 0.0000
Subscription to Glamour 0.0158 0.0625 0.0000 0.0333 0.0833 0.0000
Subscription to Time 0.0789 0.0625 0.0625 0.0000 0.0833 0.0811
Subscription to Newsweek 0.0579 0.0000 0.0469 0.0667 0.0833 0.0270
Table 5: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from
Crackers vs. Cookies
Table 6: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Eggs
vs. Cereal
C1 C2 C3
Newspaper 0.4909 0.5246 0.3846
Subscription to Better H&G 0.2436 0.2459 0.1538
Subscription to Good House 0.0982 0.2295 0.0000
Subscription to Ladies HJ 0.1091 0.2295 0.0000
Subscription to McCalls 0.1200 0.1967 0.1538
Subscription to Redbook 0.0655 0.0984 0.0769
Subscription to Reader's Digest 0.2582 0.3115 0.0769
Subscription to Cosmopolitan 0.0218 0.0328 0.0000
Subscription to TV Guide 0.1455 0.1967 0.1538
Subscription to People 0.0218 0.0656 0.0000
Subscription to Glamour 0.0182 0.0164 0.0000
Subscription to Time 0.0764 0.0492 0.0000
Subscription to Newsweek 0.0545 0.0492 0.0000
Table 7: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Pizza
vs. Hotdogs