analytics final project

Upload: bleap817

Post on 01-Jun-2018


  • 8/9/2019 Analytics Final Project

    1/49

    IE 330

    Final Project

    12/10/14

    Group 2:

    Brian Leap

    Matthew Murphy

    Troy McCrum


    Introduction

    When people receive coupons in the mail or see advertisements in their favorite magazines,

    they often interpret such events as random or maybe they think that companies are just trying to cast a

    wide net to drum up business. In some cases, these things could be true; however, randomness usually

    has nothing to do with it at all. What prospective customers may not know, or perhaps do not want to

    know, is that companies know a lot more about them than they think. And random appearances of

    coupons in the mailbox are in fact conscious, targeted efforts from marketing teams to persuade a

    certain type of consumer to purchase a certain type of product.

    So how do companies know so much about their customers? Any time a customer uses a credit

    card, completes a survey, uses a coupon, subscribes to a magazine, etc., information about that

    customer is being recorded and saved into a database. All sorts of customer information are valuable to

    a company. A customer's transactions, race, income, family size, pets, age, and occupation are all

    examples of useful characteristics that companies can utilize in designing a marketing strategy. The

    more information a company can gain about a customer, the better that company can market to that

    customer. Additionally, as companies accumulate MORE information about MORE customers, insights

    can be made about customers as a group rather than individually. In other words, companies can assign

    individuals to a group and then simply market to that group, instead of marketing from customer to

    customer. This process of grouping similar customers together is known as clustering. Clustering is an

    essential technique because the time and resources spent in marketing to each customer individually

    would far outweigh any sort of potential monetary gains.

    As companies accrue more and more customers, the thought of poring over customer data

    by hand becomes impossible, and even somewhat laughable. The solution to this problem is a database.

    A database of customer information to a marketing team is similar to the Periodic Table of Elements to a

    chemist. Both a database and the periodic table are simply collections of information that are organized

    in a certain way. To the average person, they are nothing more than that. However, for someone with

    the necessary knowledge and tools, the information can be manipulated to create exciting new ideas.

    With the help of analytics, marketing teams are able to query, or intelligently extract, data contained in

    a database in order to make inferences about customers. Extracted data can then be converted into

    tables, charts, graphs, or other types of graphical visualizations to be used in customer and marketing

    analysis. Without a well-organized database, a technique like clustering would be impossible. Therefore,

    using analytics to gain insight into customer and market behavior all starts with a good database.

    Problem Description

    The problem presented in this project is that a supermarket has an enormous amount of

    transaction and demographic data pertaining to its customers, but the marketing team needs help in

    finding ways to use this data to develop marketing and advertising strategies.

    The project is divided into two parts:


    The first part of the project deals with predictive analytics. This requires substantial use of

    Microsoft Excel. Using Excel the team must calculate moving averages and exponential smoothing,

    construct time series graphs, utilize regression lines, use k-means clustering to group similar customers

    together, and use demographics data to form insights about the customer base. Using this information,

    sales quantities of top-selling items must be predicted and methods of promoting and advertising

    products must be developed.
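    The three forecasting methods named above can be sketched in a few lines of Python. This is only
    an illustration, not the team's Excel workbook: the weights (0.4, 0.6), the smoothing constant
    alpha, and the function names are assumptions, since the report does not state the values it used.

```python
def moving_average(series, window=4):
    """One-step-ahead forecast: mean of the previous `window` observations."""
    forecasts = [None] * len(series)  # None where no forecast can be made yet
    for t in range(window, len(series)):
        forecasts[t] = sum(series[t - window:t]) / window
    return forecasts


def weighted_moving_average(series, weights=(0.4, 0.6)):
    """Weights are ordered oldest to newest and should sum to 1 (assumed values)."""
    n = len(weights)
    forecasts = [None] * len(series)
    for t in range(n, len(series)):
        recent = series[t - n:t]
        forecasts[t] = sum(w * x for w, x in zip(weights, recent))
    return forecasts


def exponential_smoothing(series, alpha=0.3):
    """F[t] = alpha * A[t-1] + (1 - alpha) * F[t-1], seeded with the first observation."""
    forecasts = [None] * len(series)
    if len(series) > 1:
        forecasts[1] = series[0]
    for t in range(2, len(series)):
        forecasts[t] = alpha * series[t - 1] + (1 - alpha) * forecasts[t - 1]
    return forecasts


# Hypothetical weekly unit sales, for illustration only.
sales = [10, 20, 30, 40, 50]
ma = moving_average(sales, window=4)
wma = weighted_moving_average(sales, weights=(0.4, 0.6))
es = exponential_smoothing(sales, alpha=0.5)
```

    Each function returns a list aligned with the input, so forecast and actual values for the same
    week can be compared directly.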

    The second part of the project deals primarily with databases. Part 2 requires the team to use

    Microsoft Access to create database tables for analysis and perform queries to find interesting trends in

    the data. Also, additional graphical visuals must be created outside of Access or Excel by using R. Again,

    the team must make inferences and marketing recommendations based on the analysis.

    Project Objectives

    Part 1 (Excel)

    o Calculate prediction error; find:
        Mean Squared Error,
        Mean Absolute Deviation,
        Tracking Signal.

    o Construct time series graphs

    o Use OLS Regression to predict future sales

    o Use k-means clustering to:
        Group customers with similar buying habits,
        Use analytics to find demographic commonalities within those groups,
        Make marketing recommendations based on analysis.

    o Use demographic data to derive insights to the customer base

    Part 2 (Access)

    o Create database tables for analysis

    o Identify relationships between tables

    o Perform queries on various topics to gain insights

    o Create advanced graphical visuals to display data


    Part 1: Analytics

    Moving Average and Exponential Smoothing

    1)

    Figure 1: Prediction error for weeks 10 through 20 for item types 3 and 17. The graph shows
    the Mean Squared Error, Mean Absolute Deviation, and Tracking Signal.

    Forecast method                     Mean Squared Error   Mean Absolute Deviation   Tracking Signal
    Item Type 3
    4 period moving average             1235.22              24.32                     -1.03
    2 period weighted moving average    1481.34              28.71                     -0.56
    Exponential Smoothing               1192.31              23.9                      -0.83
    Item Type 17
    4 period moving average             868.05               25.18                     5.58
    2 period weighted moving average    801.37               23.29                     4.59
    Exponential Smoothing               783.55               25.35                     6.63

    Regression

    2)

    Figure 2: Time series for the sales of snacks (Item 17) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 3: Time series for the sales of eggs (Item 12) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 4: Time series for the sales of butter (Item 3) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 5: Time series for the sales of cookies (Item 8) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 6: Time series for the sales of cereal (Item 5) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 7: Time series for the sales of BBQ (Item 2) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 8: Time series for the sales of cat food (Item 4) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 9: Time series for the sales of ice cream (Item 13) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    Figure 10: Time series for the sales of crackers (Item 9) over the 104 week period
    Also included are the trend line, regression equation, and R² value

    3)

    Table 2: Regression Equations for Top 3 Selling Items Using Only First 60 Weeks of Data

    Item Type   Item Description   Regression Equation
    17          Snacks             y = 0.0423x + 85.113
    12          Eggs               y = -0.1012x + 76.178
    3           Butter             y = -0.4956x + 81.59

    Figure 12: Time Series for the Sales of Snacks (Item 17) over the 104 Week Period
    Regression line displayed is based off of the results from the first 60 weeks

    Figure 13: Time Series for the Sales of Eggs (Item 12) over the 104 Week Period
    Regression line displayed is based off of the results from the first 60 weeks

    Figure 14: Time Series for the Sales of Butter (Item 3) over the 104 Week Period
    Regression line displayed is based off of the results from the first 60 weeks

    4)

    Table 3: Projected Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80 using
    Regression Equations obtained from first 60 Weeks of Data

    Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
    61     87.6933            70.0048          51.3584
    62     87.7356            69.9036          50.8628
    63     87.7779            69.8024          50.3672
    64     87.8202            69.7012          49.8716
    65     87.8625            69.6000          49.3760
    66     87.9048            69.4988          48.8804
    67     87.9471            69.3976          48.3848
    68     87.9894            69.2964          47.8892
    69     88.0317            69.1952          47.3936
    70     88.0740            69.0940          46.8980
    71     88.1163            68.9928          46.4024
    72     88.1586            68.8916          45.9068
    73     88.2009            68.7904          45.4112
    74     88.2432            68.6892          44.9156
    75     88.2855            68.5880          44.4200
    76     88.3278            68.4868          43.9244
    77     88.3701            68.3856          43.4288
    78     88.4124            68.2844          42.9332
    79     88.4547            68.1832          42.4376
    80     88.4970            68.0820          41.9420

    Table 4: Actual Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80

    Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
    61     59                 104              38
    62     44                 36               24
    63     49                 36               41
    64     71                 39               37
    65     75                 47               35
    66     64                 46               41
    67     52                 34               41
    68     54                 45               79
    69     47                 38               40
    70     72                 58               24
    71     53                 43               39
    72     46                 71               74
    73     55                 95               49
    74     48                 27               53
    75     77                 50               128
    76     91                 63               68
    77     71                 147              71
    78     43                 61               43
    79     62                 62               82
    80     52                 94               167


    Clustering

    5)

    Figure 15: Graphical Representation of K-Means Clustering using K=3 for Units of Crackers
    Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

    Figure 16: Graphical Representation of K-Means Clustering using K=4 for Units of Crackers
    Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

    Figure 17: Graphical Representation of K-Means Clustering using K=5 for Units of Crackers
    Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

    Figure 18: Graphical Representation of K-Means Clustering using K=6 for Units of Crackers
    Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

    Figure 19: Graphical Representation of K-Means Clustering using K=3 for Units of Eggs
    Purchased vs. Units of Cereal Purchased by Each Customer over the 104 Week Span

    Figure 20: Graphical Representation of K-Means Clustering using K=3 for Units of Hotdogs
    Purchased vs. Units of Pizza Purchased by Each Customer over the 104 Week Span

    General

    6)

    Figure 21: Graphical representation of race of store customers.
    [Chart: Race (%); categories: White, Black, Hispanic, Oriental]

    Figure 22: Chart representing total number of customers falling into each income bracket.
    [Chart: x-axis Customer Income with brackets 0 - 10k, 10 - 11.9k, 12 - 14.9k, 15 - 19.9k,
    20 - 24.9k, 25 - 34.9k, 35 - 44.9k, 45 - 54.9k, 55 - 64.9k, 65 - 74.9k, 75+k;
    y-axis number of customers, 0 to 90]

    Figure 23: Chart representing total number of customers in occupation by field.
    [Chart: series Male and Female; y-axis 0 to 140]

    Part 1 Discussion and Overall Recommendations:

    1)

    Problem #1 builds on the solution to question 5 from homework 7, which asks for the sales
    forecast of the two highest selling items, item types 3 and 17. The following forecasts were
    completed in homework 7: 4-period moving average, 2-period weighted moving average, and
    exponential smoothing. Using these solutions and the equations shown in Figure 1A in the
    Appendix, the mean squared error, mean absolute deviation, and tracking signal were computed.

    According to the results, Exponential Smoothing produced the lowest Mean Squared Error for
    both item types. The error values are rather large, showing that these sales forecasts may not
    be entirely reliable. The team would recommend that the store owner stay away from these sales
    forecasting methods, but choose exponential smoothing if a forecast is needed.
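    The three error measures can be sketched as follows. The report's own equations are in its
    Appendix (Figure 1A) and are not reproduced here, so this sketch assumes the common textbook
    definitions: error = actual - forecast, and tracking signal = running sum of errors divided by
    the mean absolute deviation. The function name and sample numbers are hypothetical.

```python
def forecast_errors(actual, forecast):
    """Return (MSE, MAD, tracking signal) over periods that have a forecast."""
    errors = [a - f for a, f in zip(actual, forecast) if f is not None]
    n = len(errors)
    mse = sum(e * e for e in errors) / n      # mean squared error
    mad = sum(abs(e) for e in errors) / n     # mean absolute deviation
    # Assumed definition: cumulative error divided by MAD.
    tracking_signal = sum(errors) / mad if mad else 0.0
    return mse, mad, tracking_signal


# Hypothetical values, for illustration only.
mse, mad, ts = forecast_errors(actual=[10, 12, 14], forecast=[8, 12, 17])
```

    A tracking signal far from zero, such as the values above 4 reported for item type 17, suggests
    that a method is consistently forecasting in one direction rather than scattering around the
    actual sales.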

    2)

    Problem #2 provides some insight into the buying patterns of certain items. A pivot table was
    used to condense the data and determine the ten most frequently purchased items. A time series
    graph was then created for each of these items, and a regression line (trend line) was added to
    each graph. The equation of this trend line allows the user to plug in any week number to get an
    estimate of how many units of a particular item will be purchased during that week. To rate how
    well the trend line fits the actual number of units sold, the coefficient of determination (R²)
    was also computed.

    No item by itself showed a particularly interesting trend; however, all ten plots had a couple
    of aspects in common. The first was a sharp decrease in sales of every item during week 57.
    This could have been caused by a variety of factors. For example, week 57 may have been a
    holiday week, so customers were spending their time at home with family instead of at the
    store. Another possibility is that the store was closed for a few days that week, thus
    decreasing its total sales. No conclusive answer can be obtained from the data given, but
    something evidently happened during week 57 to cause this decrease in sales.

    Another feature that all ten items had in common was a negatively sloping regression line.
    This means that sales of every one of these items are projected to keep decreasing over time,
    which is a problem for the store. If sales of each of its ten most popular products continue
    to decrease as the regression model suggests, the store will bring in less and less money. If
    this trend goes on for long enough, the projected decrease in sales could even lead to the
    store going out of business.

    Finally, all ten items shared a very small R² value. R², the coefficient of determination,
    shows how well the regression line fits the observed data. This value falls between 0 and 1,
    with 0 meaning the line explains none of the variation in sales and 1 meaning a perfect fit.
    Out of all ten items, the largest R² value was 0.2358, and one value was as low as 0.002.
    These extremely low values indicate that the regression line is not very accurate in
    predicting the actual number of units of an item sold given the week number.
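    For readers who want to see where a trend line and its R² value come from, the sketch below fits
    an ordinary least squares line and computes R² as 1 - SS_res / SS_tot. It is an illustration
    only: the team used Excel's trend line feature, and the function name and sample points here are
    hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares the line's squared error with that of a flat line at the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical points: a perfect linear trend and a series with no linear trend.
perfect = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
no_trend = fit_line([1, 2, 3, 4], [3, 1, 4, 2])
```

    The first sample lies exactly on a line, so its R² is 1. The second has no linear relationship
    with x, so its R² is 0, which is close to the situation the report describes for weekly sales.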

    3)

    Problem #3 dives further into the concept of regression and using a regression line to predict
    future response. In this problem, only the three most frequently purchased items were
    considered. A regression line was fitted using only the sales data from the first 60 weeks and
    then plotted on top of the time series plot of all 104 weeks of data for the corresponding item
    type. This created a visual that made it easy to see how accurate each regression line was in
    predicting future response.

    After these steps were carried out, the most notable result was how much the regression
    equations changed from what they were when all 104 weeks of data were considered. The initial
    equations were y = -0.0593x + 84.136 for item 17, y = -0.0168x + 72.545 for item 12, and
    y = -0.1308x + 72.062 for item 3. Once the data were restricted to the first 60 weeks, the new
    regression equations were y = 0.0423x + 85.113 for item 17, y = -0.1012x + 76.178 for item 12,
    and y = -0.4956x + 81.59 for item 3. One notable change was the sign of the slope for item 17.
    When all 104 weeks of data were used, the slope of the regression line was negative. However,
    when only the first 60 weeks of data were considered, the slope became positive. This means
    that sales of item 17 are projected to gradually increase when only the first 60 weeks of data
    are used, but to decrease over time when all 104 weeks are considered.

    Another noticeable change was the much steeper decline for items 12 and 3 when only the first
    60 weeks of data are considered. The slope is about six times as steep for item 12 and almost
    four times as steep for item 3 when only the first 60 weeks are used in forming the regression
    line as opposed to all 104 weeks. However, in both of these cases the intercept is higher for
    the 60-week regression line, which somewhat compensates for the more negative slope. Overall,
    the large changes in the regression equations when more data are used indicate the lack of
    accuracy of the regression lines in predicting the response (units sold).

    4)

    The goal of problem #4 is to predict the number of units that will sell each week from week 61
    through week 80. These predictions were made using the regression lines found from only the
    first 60 weeks of data. To obtain each prediction, the week number was plugged into the
    regression equation as the x value. The predicted values and the actual values can be seen in
    the Results Section.
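    As a small illustration of this step, the sketch below plugs a week number into the three
    60-week regression equations reported in Table 2. The dictionary and function names are
    hypothetical; the slopes and intercepts are the ones stated in the report.

```python
# (slope, intercept) of each 60-week regression line, keyed by item type (from Table 2).
EQUATIONS = {
    17: (0.0423, 85.113),   # snacks
    12: (-0.1012, 76.178),
    3: (-0.4956, 81.59),    # butter
}


def predict(item, week):
    """Projected units sold for `item` in `week`: y = slope * x + intercept."""
    slope, intercept = EQUATIONS[item]
    return slope * week + intercept


# Projections for weeks 61 through 80, as in Table 3.
projections = {item: [predict(item, w) for w in range(61, 81)] for item in EQUATIONS}
```

    For week 61 this gives roughly 87.69, 70.00, and 51.36 units for items 17, 12, and 3, which
    agrees with the first row of Table 3.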

    It is evident that the predictions obtained through regression are not very accurate for
    predicting future sales. For item 17, actual sales in weeks 61 through 80 are well below the
    predicted values. The predicted values are also noticeably too high for item 12: there are a
    few weeks where actual units sold exceed the projections, but for the most part actual sales
    are lower than predicted. The projections for item 3 are not accurate either. The regression
    line indicates a quick decrease in units sold, while the actual values seem to be gradually
    increasing.

    Overall, after comparing the projections from the regression lines to the actual number of
    units of each item sold each week, it is clear that there is a lot of error in this prediction
    method. If the owner of the store wants to better predict the number of items that will be
    sold in the future, forecasting methods such as Moving Average or Exponential Smoothing may
    create more accurate results.
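    The size of the gap can be checked directly from the reported numbers. The sketch below combines
    the 60-week regression equations with the actual sales for weeks 61 through 80 and compares the
    averages; it also computes a mean absolute deviation, which is an added measure that the report
    itself does not calculate for this problem. All names are hypothetical.

```python
# 60-week regression lines (slope, intercept) and actual weekly sales for
# weeks 61-80, both copied from the report's tables, keyed by item type.
EQUATIONS = {17: (0.0423, 85.113), 12: (-0.1012, 76.178), 3: (-0.4956, 81.59)}
ACTUAL = {
    17: [59, 44, 49, 71, 75, 64, 52, 54, 47, 72, 53, 46, 55, 48, 77, 91, 71, 43, 62, 52],
    12: [104, 36, 36, 39, 47, 46, 34, 45, 38, 58, 43, 71, 95, 27, 50, 63, 147, 61, 62, 94],
    3: [38, 24, 41, 37, 35, 41, 41, 79, 40, 24, 39, 74, 49, 53, 128, 68, 71, 43, 82, 167],
}
WEEKS = range(61, 81)


def summarize(item):
    """Return (mean actual, mean predicted, mean absolute deviation) for one item."""
    slope, intercept = EQUATIONS[item]
    predicted = [slope * week + intercept for week in WEEKS]
    actual = ACTUAL[item]
    n = len(actual)
    mean_actual = sum(actual) / n
    mean_predicted = sum(predicted) / n
    mad = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return mean_actual, mean_predicted, mad


summary = {item: summarize(item) for item in EQUATIONS}
```

    Average actual sales are about 59 units per week for all three items, against average
    projections of roughly 88, 69, and 47 units for items 17, 12, and 3, which matches the
    over-prediction for items 17 and 12 and the under-prediction for item 3 described above.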

    5)

    Problem #5 dealt with clustering customers by their purchases of two different items.
    Clustering is a very helpful strategy for breaking down and studying data, and customers are
    often clustered. In this case, clustering separates the customers into groups so that
    different recommendations can be given to different groups based on the common preferences of
    the customers in each group. K-means clustering is a very widely used clustering technique.
    This method breaks the data down into k separate groups, with each group having its own
    centroid. Each point is assigned to the cluster whose centroid is nearest to it, so every
    point ends up closer to its own centroid than to the centroid of any other group.
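    The assign-then-recompute loop behind k-means can be sketched in plain Python. This is a minimal
    illustration, not the procedure the team ran: the initialization (the first k points), the
    function name, and the sample points are assumptions chosen to keep the example deterministic.

```python
def kmeans(points, k, max_iterations=100):
    """Basic k-means (Lloyd's algorithm) for 2-D points.

    Returns (centroids, assignments), where assignments[i] is the cluster
    index of points[i].
    """
    # Naive initialization (an assumption): use the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments


# Hypothetical (crackers, cookies) unit counts per customer, for illustration only.
customers = [(1, 1), (1, 2), (9, 9), (10, 9), (2, 1), (9, 10)]
centroids, assignments = kmeans(customers, k=2)
```

    On this sample the light buyers and heavy buyers end up in separate clusters. Results in general
    depend on the starting centroids, so practical implementations usually try several random
    initializations.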

    For problem 5, the k-means clustering algorithm was used to break the data down into k
    separate groups. The first two items that were compared were items 8 and 9 (cookies and
    crackers). These items were clustered using k values of 3, 4, 5, and 6. The graphs were
    created and then judged to determine which value of k produced the best fit for the data. All
    of the graphs can be seen in the Results Section. K=6 appeared to create the best fit because
    it produces the tightest groups, with points that seem to have the smallest deviation from
    their centroid. After this, two more pairs of items were clustered; however, k=3 was the only
    value of k used in those cases. The first graph compared cereal and eggs, while the second
    focused on hotdogs and pizza.

    The reason this clustering procedure was carried out is to help the store manager promote
    products based on specific customer attributes. After the customers were assigned to a
    cluster, the background information on each customer was examined to determine whether there
    were any noticeable similarities in the characteristics of customers within clusters. This
    background information was given in the Demographics Data.

    The cluster analysis with k=6 groups of crackers (Item 9) purchased vs. cookies (Item 8)
    purchased was examined first to determine the buying patterns of the customers in each
    cluster. The other three graphs for crackers vs. cookies using k=3, 4, and 5 were not
    considered here because it was decided that k=6 created the best results. In each of the
    graphs, clusters are referred to as Series 1, Series 2, etc. Series 1 contains the customers
    who buy very little of both products, Series 2 contains customers who buy a lot of crackers
    and a substantial amount of cookies, Series 3 contains customers who buy some cookies but not
    a lot of crackers, Series 4 contains customers who buy some crackers but not a lot of cookies,
    Series 5 contains customers who buy a lot of cookies and a substantial amount of crackers, and
    Series 6 contains the customers who buy a substantial amount of cookies and some crackers.

    To get a better understanding of the customers in each of these groups, Family Size was useful
    in differentiating between clusters. The two groups that were focused on were Series 2 and
    Series 5 because those groups contained the customers who bought a large amount of crackers
    and a large amount of cookies, respectively. In Series 2, the most common family size by far
    is 2 people, accounting for 63% of the people in that group. In Series 5, however, three
    family sizes make up the majority of the group: a family size of 2 makes up 33%, a family size
    of 3 makes up 25%, and a family size of 4 makes up 25%. All of these Demographic Data
    statistics can be found in the Appendix in table form.

    Now that the company has some background on the customers belonging to each cluster, it must
    decide how to allocate its advertising resources to try to increase the sales of cookies and
    crackers. Based on this data, the type of customer who buys a lot of crackers belongs to
    Series 2 and, more often than not, has a family size of 2 people. The customers who buy a lot
    of cookies tend to have families of size 2, 3, or 4. So, in order to maximize profit from
    sales of both products, crackers should be marketed mainly towards people with a family size
    of 2, and cookies should be marketed towards people with families of 2 to 4 people. TV would
    probably be the best way to reach these groups: of the 272 people who responded, 264 owned at
    least one television. However, if a television advertisement is not economically feasible,
    other methods, such as a newspaper or magazine advertisement, could work well. 50% of the
    people in Series 2 and 58.33% of the people in Series 5 read the newspaper. If the store
    prefers a magazine advertisement, its best course of action would be to put an ad for crackers
    in Better H&G because 37.5% of people who buy a lot of crackers also subscribe to Better H&G.
    To reach the customers who buy a lot of cookies, an advertisement should be placed in Good
    House because one third of the people who buy a lot of cookies also subscribe to Good House.

    The next cluster analysis compared the amount of eggs (Item 12) purchased vs. the amount of
    cereal (Item 5) purchased by each customer. This cluster analysis only used k=3 groups.
    Series 1 contained the customers who did not buy a lot of either product, Series 2 contained
    the customers who bought a lot of cereal, and Series 3 contained the customers who bought a
    lot of eggs. When the demographics data of the customers was studied by cluster, Annual Income
    stood out. Over 60% of the people in Series 2 make more than $35,000 per year. In Series 3, on
    the other hand, less than 19% of people make more than $35,000 per year. Since there is
    clearly some correlation between annual income and product purchased, the store should take
    this into account for advertising purposes. To increase sales of both of these items, the best
    course of action would be to advertise cereal to people who make more than $35,000 per year
    and advertise eggs to people who make less than $35,000 per year. Once again, the most
    efficient yet most costly way to do so would be a television commercial. If the store would
    like to advertise in the newspaper instead, 57.89% of the people who buy a lot of cereal and
    60.53% of the people who buy a lot of eggs subscribe to the newspaper. Finally, if the store
    chooses a magazine ad, an ad aimed at people who buy a lot of cereal should be placed in
    Reader's Digest because 26.32% of them subscribe. To reach the people who buy a lot of eggs, a
    magazine ad should be placed in either Reader's Digest or Better H&G because 26.32% of people
    who buy a lot of eggs subscribe to each of these magazines.

    The final cluster analysis compared the number of units of pizza (Item 16) purchased vs. the
    number of units of hotdogs (Item 11) purchased by each customer. Again, k=3 was the value used
    for the number of groups. Series 1 contained people who purchased very few hotdogs and pizzas,
    Series 3 contained the people who purchased a lot of hotdogs, and Series 2 contained everyone
    who did not fit either of those criteria. In this case, race was a factor that led to some
    interesting results. Though white customers were the most common group in the study, the data
    shows a notable pattern in the preferences of African American customers. In Series 1 and 2,
    the percentage of African Americans in each cluster is just above 10%. In Series 3, however,
    the percentage of African Americans jumps to above 23%, and the percentage of white customers
    drops by more than 10% compared with both Series 1 and 2.

    The store should take this relationship between race and product purchased into account when
    advertising pizza and hotdogs. These percentages describe the makeup of each cluster rather
    than the share of any group that buys a product, but they do show that African American
    customers are far more heavily represented among heavy hotdog buyers than in the other two
    clusters. Hotdog promotions are therefore the better fit for African American customers, while
    pizza advertising would mostly reach white customers.

    Once again, the statistics show that the most efficient way to reach consumers is through a
    television advertisement. In this case, however, a magazine or newspaper ad may not be a good
    choice. In Series 3, the main group the store would want to reach because its members buy such
    a large amount of hotdogs, only 38.46% of people subscribe to the newspaper, which is very
    small compared to the other groups. Also, no magazine is subscribed to by more than 16% of the
    people in this group, making a magazine advertisement somewhat ineffective as well. So, in
    order to reach the people in Series 3, the best method may be to save up money to produce a TV
    commercial because that seems to be the only really effective way to reach them.

    6)

    When developing a marketing strategy it is essential to know and understand one's customer
    base. Using the Demographics data, some statistics were pulled in order to develop such an
    understanding of the area around the store's location and the people who live there. The
    percentages of ethnicities of the customers were found, as seen in Figure 21. This graph
    indicates that the population of the town is predominantly white, with very low minority
    population percentages. This suggests that the store is located in a more rural area, as
    opposed to a larger city. More potential evidence for a rural, blue-collar town is that 34% of
    customers own at least one dog. This number would likely not be so high in a city with people
    living mostly in apartment buildings. Because so many people own dogs, the team recommends
    that the store promote its dog food and other dog-related products to its customers. Annual
    income also points to a rural town with blue-collar residents: the vast majority of customers
    fall into the middle-class range, whereas only a few customers earn more than $75k per year.
    Surprisingly, the most common occupation category among all the customers is retired; there
    are more retired customers than customers in any other category. This suggests that elderly
    citizens make up a large share of the store's customer base. By using clustering and other
    analytic techniques, the store can use all of the aforementioned statistics to gain an
    advantage with its sales and marketing strategies.


    Part 2: Databases

    1)

    Figure 1: Access Database Table Schema


    Table 1: Transaction: Transaction Information Table

    Table 2: CouponLookup: Coupon Description Table

    Table 3: Product: Item Description Table

    Table 4: Vendor: Vendor Description Table


    Table 5: Customer: Customer Demographic Table

    Table 6: RaceLookup: Race Description Table

    Table 7: IncomeLookup: Income Table

    Table 8: FemaleOccLookup: Female Occupation Table

    Table 9: MaleOccLookup: Male Occupation Table


    Table 10: MaleAgeLookup: Male Age Table

    2)

    Query 1: Crosstab query showing all female occupations, along with family size and

    coupon origin.

    SQL code:

    TRANSFORM Count([AllOccFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]

    SELECT [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc,

    Count([AllOccFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]

    FROM [AllOccFemale-CouponOrig]

    GROUP BY [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc

    PIVOT [AllOccFemale-CouponOrig].CouponDesc;
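    A TRANSFORM ... PIVOT statement is an Access crosstab query: the GROUP BY fields become row headings, the distinct PIVOT values become column headings, and each cell holds the aggregate (here a count). The following is only a rough Python sketch of that idea. The sample rows and the function name crosstab are hypothetical and are not taken from the project data.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like the output of the AllOccFemale-CouponOrig query:
# (family_size, occupation_desc, coupon_desc, customer_id). Illustrative values only.
rows = [
    (2, "Retired", "Sunday supplement", 101),
    (2, "Retired", "Newspaper", 101),
    (2, "Retired", "Sunday supplement", 102),
    (4, "Clerical", "Sunday supplement", 103),
]


def crosstab(rows):
    """Count rows per (family size, occupation) row key and coupon-description column."""
    table = defaultdict(Counter)
    for family_size, occupation, coupon_desc, _customer_id in rows:
        table[(family_size, occupation)][coupon_desc] += 1
    return table


table = crosstab(rows)

# Column headings are the distinct PIVOT values.
columns = sorted({coupon for counts in table.values() for coupon in counts})

for key in sorted(table):
    counts = table[key]
    total = sum(counts.values())  # plays the role of the "Total Of Customer ID" column
    print(key, total, [counts[c] for c in columns])
```

    Like Count([Customer ID]) in the query, the sketch counts rows rather than distinct customers, so a customer with several coupon transactions is counted once per transaction.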


    Query 2: Origins of all coupon transactions.

    SQL code:

    TRANSFORM Count(CouponAmounts.[Customer ID]) AS [CountOfCustomer ID]

    SELECT CouponAmounts.CouponDesc, Count(CouponAmounts.[Customer ID]) AS [Total Of Customer ID]

    FROM CouponAmounts

    GROUP BY CouponAmounts.CouponDesc

    PIVOT CouponAmounts.CouponID;


    Query 3: Total coupons used with a value greater than $0.99, by item type.

    SQL code:

    TRANSFORM Count(CouponItemInteraction.[Customer ID]) AS [CountOfCustomer ID]

    SELECT CouponItemInteraction.Description, Count(CouponItemInteraction.[Customer ID]) AS [Total Of Customer ID]

    FROM CouponItemInteraction

    WHERE (((CouponItemInteraction.[Coupon Value (Cents)])>99))

    GROUP BY CouponItemInteraction.Description

    PIVOT CouponItemInteraction.[Coupon Value (Cents)];


    Query 4: Total items bought by ethnicity.

    SQL code:

    TRANSFORM Count(EthnicItemType.[Customer ID]) AS [CountOfCustomer ID]

    SELECT EthnicItemType.Desc, Count(EthnicItemType.[Customer ID]) AS [Total Of Customer ID]

    FROM EthnicItemType

    GROUP BY EthnicItemType.Desc

    PIVOT EthnicItemType.Description;


    Query 5: Units bought compared to family size.

    SQL code:

    TRANSFORM Count([Family-Units Bought].[Customer ID]) AS [CountOfCustomer ID]

    SELECT [Family-Units Bought].[Units Bought], Count([Family-Units Bought].[Customer ID]) AS [Total Of Customer ID]

    FROM [Family-Units Bought]

    GROUP BY [Family-Units Bought].[Units Bought]

    PIVOT [Family-Units Bought].[Family Size];

    Query 6: Coupon origins based on female occupations.

    SQL code:

    TRANSFORM Count([FemaleOcc-Coupons].[Customer ID]) AS [CountOfCustomer ID]

    SELECT [FemaleOcc-Coupons].Desc, Count([FemaleOcc-Coupons].[Customer ID]) AS [Total Of Customer ID]

    FROM [FemaleOcc-Coupons]

    GROUP BY [FemaleOcc-Coupons].Desc

    PIVOT [FemaleOcc-Coupons].[Coupon Origin];


    Query 7: Ice Cream by Day of Week

    SQL code:

    TRANSFORM Count([IceCream-ByDay].[Item Type]) AS [CountOfItem Type]

    SELECT [IceCream-ByDay].Description, Count([IceCream-ByDay].[Item Type]) AS [Total Of Item Type]

    FROM [IceCream-ByDay]

    GROUP BY [IceCream-ByDay].Description

    PIVOT [IceCream-ByDay].Day;

    Query 8: Coupon origin and family size of unemployed females.

    SQL code:

    TRANSFORM Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]

    SELECT [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc,

    Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]

    FROM [NonEmployFemale-CouponOrig]

    GROUP BY [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc

    PIVOT [NonEmployFemale-CouponOrig].CouponDesc;


    Query 9: Item types bought by cable TV subscribers, by male occupation.

    SQL code:

    TRANSFORM Count([TVAds-MenOcc].[Cable TV]) AS [CountOfCable TV]

    SELECT [TVAds-MenOcc].Description, Count([TVAds-MenOcc].[Cable TV]) AS [Total Of Cable TV]

    FROM [TVAds-MenOcc]

    GROUP BY [TVAds-MenOcc].Description

    PIVOT [TVAds-MenOcc].Desc;


    Query 10: Item types bought by income amounts.

    SQL code:

    TRANSFORM Count([VolumeBought-Income].[Customer ID]) AS [CountOfCustomer ID]

    SELECT [VolumeBought-Income].Incomeamt, Count([VolumeBought-Income].[Customer ID]) AS [Total Of Customer ID]

    FROM [VolumeBought-Income]

    GROUP BY [VolumeBought-Income].Incomeamt

    PIVOT [VolumeBought-Income].Description;


    Query 11: Item types bought by week.

    SQL code:

    TRANSFORM Count(WeeklyProducts.[Customer ID]) AS [CountOfCustomer ID]

    SELECT WeeklyProducts.Description, Count(WeeklyProducts.[Customer ID]) AS [Total Of Customer ID]

    FROM WeeklyProducts

    GROUP BY WeeklyProducts.Description

    PIVOT WeeklyProducts.Week;


    Query 12: Family size and income with children.

    SQL code:

    SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Income,

    Customer.Children

    FROM Customer

    WHERE (((Customer.[Subscription to Newsweek])=Yes))

    ORDER BY Customer.Children DESC , Customer.[Family Size] DESC;

    Query 13: Which customers bought snacks from which vendor on day 5.

    SQL code:

    SELECT Transaction.[Customer ID], Product.Description, Vendor.Vendor, Transaction.Day

    FROM Vendor INNER JOIN (Product INNER JOIN [Transaction] ON Product.[Item Type] =

    Transaction.[Item Type]) ON Vendor.ItemType = Transaction.[Item Type]

    WHERE (((Product.Description)="snack") AND ((Transaction.Day)=5));


    Query 14: Amount of coupons from origin 23, item type 10, with a coupon value greater than

    $0.99.

    SQL code:

    SELECT Transaction.[Coupon Origin], Product.[Item Type], Transaction.[Coupon Value (Cents)]

    FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID]

    = Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]

    WHERE (((Transaction.[Coupon Origin])=23) AND ((Product.[Item Type])=10) AND

    ((Transaction.[Coupon Value (Cents)])>99));


    Query 15: Hot Dog purchases >1 in regards to family size, number of dogs, day and week.

    SQL code:

    SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Dogs,

    Product.Description, Transaction.Week, Transaction.[Units Bought]

    FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID]

    = Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]

    WHERE (((Customer.Dogs)>1) AND ((Product.Description)="dogs"))

    ORDER BY Customer.[Customer ID], Transaction.Week;

    Query 16: Amount of female cleaners that bought cleansers and amount of units bought.

    SQL code:

    SELECT Customer.[Customer ID], Customer.[Female Occupation], Transaction.[Item Type],

    Transaction.[Units Bought]

    FROM Customer INNER JOIN [Transaction] ON Customer.[Customer ID] =

    Transaction.[Customer ID]

    WHERE (((Customer.[Female Occupation])=8) AND ((Transaction.[Item Type])=6));


    Query 17: Amount of duplicates of vendors and their item numbers.

    SQL code:

    SELECT First(Vendor.[Vendor]) AS [Vendor Field], First(Vendor.[ItemNumber]) AS [ItemNumber Field], Count(Vendor.[Vendor]) AS NumberOfDups

    FROM Vendor

    GROUP BY Vendor.[Vendor], Vendor.[ItemNumber]

    HAVING Count(Vendor.[Vendor])>1 AND Count(Vendor.[ItemNumber])>1;
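    The duplicate check groups rows by vendor and item number and keeps only the groups that occur more than once. As a rough illustration of the same logic outside Access, here is a small Python sketch. The vendor names, item numbers, and the function name find_duplicates are hypothetical and are not taken from the Vendor table.

```python
from collections import Counter

# Hypothetical (vendor, item_number) pairs standing in for rows of the Vendor table.
vendor_rows = [
    ("Acme", 10),
    ("Acme", 10),
    ("Acme", 12),
    ("Bolt", 7),
    ("Bolt", 7),
    ("Bolt", 7),
]


def find_duplicates(pairs):
    """Return each (vendor, item_number) pair that occurs more than once, with its count."""
    counts = Counter(pairs)  # one bucket per distinct pair, like GROUP BY
    return {pair: n for pair, n in counts.items() if n > 1}  # like HAVING Count(...) > 1


duplicates = find_duplicates(vendor_rows)
for (vendor, item_number), n in sorted(duplicates.items()):
    print(vendor, item_number, n)
```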


    Query 18: Females subscribed to Better Homes & Gardens, and their occupations.

    SQL code:

    SELECT Customer.[Customer ID], Customer.[Subscription to Better H&G], FemaleOccLookup.Desc

    FROM FemaleOccLookup INNER JOIN Customer ON FemaleOccLookup.FemaleOccID =

    Customer.[Female Occupation]

    WHERE (((Customer.[Subscription to Better H&G])=Yes));


    Query 19: Male and female education in regards to income.

    SQL code:

    SELECT Customer.[Customer ID], Customer.Income, Customer.[Male Education],

    Customer.[Female Education]

    FROM IncomeLookup INNER JOIN Customer ON IncomeLookup.IncomeID = Customer.Income

    WHERE (((Customer.Income)=6));


    Query 20: Coupon origin of snacks bought by African Americans.

    SQL code:

    SELECT Customer.[Customer ID], RaceLookup.Desc, Product.Description, CouponLookup.CouponDesc

    FROM CouponLookup INNER JOIN (RaceLookup INNER JOIN (Product INNER JOIN (Customer INNER

    JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] =

    Transaction.[Item Type]) ON RaceLookup.RaceID = Customer.Ethnicity) ON CouponLookup.CouponOrig

    = Transaction.[Coupon Origin]

    WHERE (((RaceLookup.Desc)="black") AND ((Product.Description)="snack"));


    Query 21: How often customers purchased hot dogs along with family size and units bought.

    SQL code:

    SELECT Customer.[Customer ID], First(Customer.[Family Size]) AS [Family Size], First(Customer.Dogs)

    AS Dogs, Count(Transaction.Week) AS CountOfWeek, First(Transaction.[Units Bought]) AS [UnitsBought]

    FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] =

    Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]

    GROUP BY Customer.[Customer ID]

    HAVING (((First(Customer.Dogs))>1))

    3)

    Graphics:

    Figure 1: Ice Cream units sold by day of week.


    Figure 2: The total coupons used and the origin of those coupons.

    Figure 3: Sum of total items bought by ethnicity.


    Figure 4: Distribution of coupons used valued greater than $0.99.

    Figure 5: Total coupons used by female occupation.


    Part 2 Discussion and Overall Recommendations:

    Building the database in Access revealed many important insights into the transaction data.

    Through the creation of tables and queries, information buried within the bulk of the

    data can be found. The database was set up by first creating the necessary tables. The demographic data was

    imported as the Customer table. A Transaction table was also created, with Customer ID

    shared between the Transaction and Customer tables (a relationship). To make things easier to visualize,

    a Product table was created that lists the description of each item type in place of the corresponding

    numerical IDs. As queries were being written, many of the numerical codes for the

    demographics proved difficult to interpret. Thus the following tables were created to supply descriptions when

    referenced in a query: CouponLookup, IncomeLookup, RaceLookup, FemaleOccLookup,

    MaleAgeLookup, and MaleOccLookup. By using the actual descriptions from the demographic data, the

    store owner can easily identify anything he or she may be searching for during data analysis.
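    The lookup-table idea described above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module rather than Access, so it is an assumption-laden stand-in and not the project's actual database. The table and column names (RaceLookup.RaceID, RaceLookup.Desc, Customer.[Customer ID], Customer.Ethnicity) follow the queries in this report, while the code values and customer rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the Access file (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RaceLookup (RaceID INTEGER PRIMARY KEY, [Desc] TEXT);
CREATE TABLE Customer ([Customer ID] INTEGER PRIMARY KEY, Ethnicity INTEGER);
""")

# Hypothetical numeric codes and customers, for illustration only.
conn.executemany("INSERT INTO RaceLookup VALUES (?, ?)", [(1, "white"), (2, "black")])
conn.executemany("INSERT INTO Customer VALUES (?, ?)", [(101, 1), (102, 2), (103, 1)])

# Joining on the shared code replaces each numeric value with its description.
labeled = conn.execute("""
SELECT Customer.[Customer ID], RaceLookup.[Desc]
FROM RaceLookup INNER JOIN Customer ON RaceLookup.RaceID = Customer.Ethnicity
ORDER BY Customer.[Customer ID]
""").fetchall()

for customer_id, description in labeled:
    print(customer_id, description)
```

    Keeping the codes in the Customer table and the descriptions in a separate lookup table means a description only has to be stored, and corrected, in one place.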

    As queries were being written, certain trends became apparent. The grocery store depends on people

    buying its products, so finding relationships between types of customers and the types of products they buy can

    prove very beneficial. Linking transaction data to customer demographics is crucial to

    the marketing and sales of any company.

    To analyze some of the findings from the queries, we start with the most general query: the quantity

    of items bought over the 102-week period. A total of 49,576 items were bought. Of these,

    213 were bought by Hispanic customers, 255 by customers listed as other, 4,373 by African

    American customers, and 44,735 by Caucasian customers. Within these purchases, the top 3 items bought were

    snacks at 6,379, eggs at 5,059, and cereal at 4,859 items. This data already shows that the store is located

    where the primary customers are of Caucasian ethnicity.
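    As a quick check on the figures quoted above, the four group counts can be summed and converted to percentage shares. The short Python sketch below uses only the numbers stated in this paragraph; the variable names are our own.

```python
# Purchase counts by customer ethnicity, as quoted in the discussion above.
purchases = {
    "Hispanic": 213,
    "Other": 255,
    "African American": 4373,
    "Caucasian": 44735,
}

total = sum(purchases.values())  # should match the reported total of items bought

# Percentage share of all items bought, per group.
shares = {group: 100 * count / total for group, count in purchases.items()}

print("Total items:", total)
for group, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {share:.1f}%")
```

    The counts sum to the reported 49,576 items, and Caucasian customers account for roughly nine out of every ten items bought.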

    Coupons can be an integral part of a grocery store's sales. Analyzing coupon data reveals

    which types of items coupons are used with most often, where these coupons come from, and

    what type of people use them. The queries CouponItemInteraction, CouponAmounts,

    and FemaleOcc-Coupons answer a few of these questions. The top 3 items bought with

    coupons valued at more than $0.99 were cereal with 397 instances, detergents with 147, and coffee

    with 139. With nearly five times as many coupons as any other origin, the Sunday supplement accounted

    for 60 percent of all coupons used at this store, with 2,014 of the 3,376 coupons used.

    Newspapers were the next largest source of coupons with 424 transactions. Females accounted for

    the bulk of coupon use, and the most common occupation among these coupon users was retired. The next two

    female occupations with the highest coupon use were unemployed and clerical, respectively. Drawing

    conclusions from this data, there is probably a large population of retirees that shop at this store. To

    target the market, we recommend advertising to Caucasian females in the Sunday supplement for items that

    the store wants to sell more of.
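    The coupon figures quoted above can be recomputed in a few lines. This Python sketch uses only the three counts stated in the paragraph (Sunday supplement, newspaper, and all coupons used); the variable names are our own.

```python
# Coupon counts as quoted in the discussion above.
sunday_supplement = 2014
newspaper = 424
all_coupons = 3376

# Share of all coupons that came from the Sunday supplement.
share = 100 * sunday_supplement / all_coupons

# How many times larger the Sunday supplement count is than the next largest origin.
ratio = sunday_supplement / newspaper

print(f"Sunday supplement share of coupons: {share:.1f}%")
print(f"Sunday supplement vs. newspaper: {ratio:.2f}x")
```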

    This is just one example of something pulled from this transaction data. There are certainly many more

    inferences that can be made by using different queries to find and make these conclusions. We found

    that there were many lessons to be learned from this project. The use of databases can be a huge


    advantage when sorting through large amounts of data in order to make inferences. The ability to select

    records matching certain criteria from a large collection can return a small, comprehensible set of data. In the transaction

    data, we were able to find strong evidence of the demographic makeup of the store's region. From

    this demographic data, buying tendencies made it easy to infer which items sell

    and which do not. This project proved very beneficial for learning hands-on the

    importance of data analytics and databases. This experience will be extremely useful in our

    future fields.

    Group Roles

    Matt Murphy: Part 1 - #(2,3,4,5)

    Brian Leap: Part 2 - #(1,2,3)

    Troy McCrum: Part 1 - #(1,6) and Part 2 - #(2)

    Appendix

    Part 1:

    Figure 1: Equations used for question 1, prediction errors.


    C1 C2 C3

    0 0.018315 0 0

    1 0.087912 0.105263 0.078947

    2 0.069597 0 0.052632

    3 0.058608 0.078947 0.105263

    4 0.120879 0.052632 0.184211

    5 0.087912 0.105263 0.157895

    6 0.164835 0.052632 0.236842

    7 0.153846 0.210526 0.078947

    8 0.128205 0.184211 0.078947

    9 0.03663 0.052632 0.026316

    10 0.029304 0.052632 0

    11 0.043956 0.105263 0

    C1 C2 C3

    0 0 0 0

    1 0.876364 0.885246 0.769231

    2 0.105455 0.114754 0.230769

    3 0.010909 0 0

    4 0 0 0

    5 0.007273 0 0

    C1 C2 C3 C4 C5 C6

    0 0.00 0.00 0.00 0.00 0.00 0.00

    1 0.12 0.06 0.14 0.07 0.17 0.14

    2 0.36 0.63 0.34 0.30 0.33 0.46

    3 0.24 0.00 0.34 0.23 0.25 0.03

    4 0.15 0.19 0.06 0.33 0.25 0.19

    5 0.08 0.00 0.09 0.03 0.00 0.05

    6 0.05 0.13 0.02 0.03 0.00 0.14

    # of TVs # of Customers

    0 8

    1 63

    2 126

    3 75

    No Response 77

    Table 1: Percentage of people from each series (or cluster) grouped by how many family members they

    have (see key in Data Dictionary)

    Table 2: Percentage of people from each series (cluster) grouped by how much they make annually (see

    key in Data Dictionary)

    Table 3: Percentage of people from each series (cluster) grouped by race (see key in Data Dictionary)

    Table 4: Number of TVs owned by customers


    C1 C2 C3

    Newspaper subscriber 0.4652 0.5789 0.6053

    Subscription to Better H&G 0.2491 0.1579 0.2632

    Subscription to Good House 0.1062 0.1842 0.1316

    Subscription to Ladies HJ 0.1136 0.1316 0.2105

    Subscription to McCalls 0.1465 0.0526 0.1316

    Subscription to Redbook 0.0733 0.0789 0.0526

    Subscription to Reader's Digest 0.2601 0.2632 0.2632

    Subscription to Cosmopolitan 0.0256 0.0000 0.0263

    Subscription to TV Guide 0.1575 0.1053 0.1842

    Subscription to People 0.0293 0.0000 0.0526

    Subscription to Glamour 0.0183 0.0000 0.0263

    Subscription to Time 0.0806 0.0526 0.0000

    Subscription to Newsweek 0.0623 0.0263 0.0000

    C1 C2 C3 C4 C5 C6

    Newspaper 0.4895 0.5000 0.5000 0.5667 0.5833 0.4054

    Subscription to Better H&G 0.2632 0.3750 0.1875 0.1667 0.1667 0.2432

    Subscription to Good House 0.1158 0.1250 0.1250 0.1000 0.3333 0.0541

    Subscription to Ladies HJ 0.1211 0.1875 0.1406 0.1000 0.0833 0.1351

    Subscription to McCalls 0.1263 0.1250 0.1719 0.0333 0.2500 0.1622

    Subscription to Redbook 0.0895 0.1250 0.0313 0.0333 0.1667 0.0270

    Subscription to Reader's Digest 0.2263 0.1875 0.3281 0.3667 0.0833 0.3243

    Subscription to Cosmopolitan 0.0211 0.0625 0.0000 0.0000 0.0833 0.0541

    Subscription to TV Guide 0.1684 0.0625 0.1563 0.1333 0.0833 0.1622

    Subscription to People 0.0105 0.0000 0.0781 0.0333 0.1667 0.0000

    Subscription to Glamour 0.0158 0.0625 0.0000 0.0333 0.0833 0.0000

    Subscription to Time 0.0789 0.0625 0.0625 0.0000 0.0833 0.0811

    Subscription to Newsweek 0.0579 0.0000 0.0469 0.0667 0.0833 0.0270

    Table 5: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from

    Crackers vs. Cookies

    Table 6: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Eggs

    vs. Cereal


    C1 C2 C3

    Newspaper 0.4909 0.5246 0.3846

    Subscription to Better H&G 0.2436 0.2459 0.1538

    Subscription to Good House 0.0982 0.2295 0.0000

    Subscription to Ladies HJ 0.1091 0.2295 0.0000

    Subscription to McCalls 0.1200 0.1967 0.1538

    Subscription to Redbook 0.0655 0.0984 0.0769

    Subscription to Reader's Digest 0.2582 0.3115 0.0769

    Subscription to Cosmopolitan 0.0218 0.0328 0.0000

    Subscription to TV Guide 0.1455 0.1967 0.1538

    Subscription to People 0.0218 0.0656 0.0000

    Subscription to Glamour 0.0182 0.0164 0.0000

    Subscription to Time 0.0764 0.0492 0.0000

    Subscription to Newsweek 0.0545 0.0492 0.0000

    Table 7: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Pizza

    vs. Hotdogs