SET-TOP BOX ANALYTICS
Srinivasan Sivaramakrishnan, Dell
Amarendra Tummala, Dell
Luciano Tozato, Dell
Wei Lin, Dell
2016 EMC Proven Professional Knowledge Sharing 2
Table of Contents
Overview
Introduction
    Key Findings and Benefits
    Methodology
Data Sources and Discovery
Boston Demographics and TV Viewing Patterns
    Total Duration Watched by Ethnicity
    Breakdown of Duration Watched and Size of Household by Region
    Breakdown of Duration Watched by Channel and Gender
    Box Plot of Total Duration
    Breakdown of Duration Watched by Age, Household Size and Income group
TV Commercial Analysis
    Commercial Types
    Commercial Popularity
Channel Analysis
    Primetime TV View
    Daily Channel Popularity
    Most Watched
    Subscriber Watch Pattern
Subscriber Propensity Index
Findings I: Predict likely to Watch Commercials
Findings II: Subscriber Viewership Profiling
    Exploring High Performers of Cluster 5
Findings III: Predict and Classify Subscribers
Return on Investment
Conclusion
References
Overview
This article discusses applying data science to TV subscriber behavior and viewership patterns. It
illustrates how the TV broadcasting industry can benefit from using data analytics on TV audience
behavioral patterns, and describes how a TV viewer’s channel-switching behavior can be analyzed to
generate numerous analytics results. Generally, the TV broadcasting and advertising industries use
Nielsen ratings as the audience measurement to determine the audience size and composition of
television programming in the United States. Instead of relying on Nielsen ratings, TV broadcasters
can now generate their own metrics by applying data analytics to TV viewers’ channel-switching
behavior. This is just one example; there are many other uses for this type of data analytics.
The recent digital revolution has expanded TV viewing from traditional TV sets to other internet-based
devices. Correspondingly, TV signal transmission has also expanded from the traditional set-top box
(STB) to other streaming devices such as IPTV devices and Over-the-Top (OTT) services. Regardless of
how the TV program signal is transmitted or on what device it is watched, there are always underlying
watching patterns. The basic behavior of the TV subscriber remains the same: they switch channels to
watch programming of their liking. This data science can be applied to all types of TV watching models
as long as there is a way to capture subscriber click streams. This article is a study that focuses on
STB data and the analytics applied to it.
Introduction
Set-Top-Box (STB) Analytics applies data science to TV subscriber behavior and viewership patterns. In
recent years, the media industry has embraced digital technology in big – and many – ways. One of the
most noticeable changes is that the TV and cable industries now send their broadcast signals in digital
form. On the receiving end, households are being equipped with STB devices capable of two-way
communication. Not only do these devices receive broadcast signals, they also enable TV viewers to
request on-demand programming. Additionally, these devices are capable of collecting the clicking
behavior of viewers. This has opened up opportunities to collect and analyze second-by-second channel
clicking behavior from millions of households. Combining this data with detailed TV broadcaster airing
logs provides a wealth of insights into TV audience behavior [8], a veritable goldmine of data on the
TV audience. Applying data science to this data opens up many opportunities for TV broadcasters.
Key Findings and Benefits
STB Analytics will help TV broadcasters change their business models from broad audiences to individual/localized content consumers.
Provided valuable insights into subscriber viewing patterns:
o Derived a new metric called the Viewer Propensity Index (PI) that measures an
uninterrupted TV viewing pattern
o Predicted and classified the subscriber population into Avid and Normal viewers based
on propensity index
o Created subscriber profiles and segments based on demographic attributes and TV
viewing patterns through clustering
o Used collaborative filtering to promote subscribers from a lower to a higher
propensity index within the same cluster
o Analyzed popularity of programming content and commercials by timeslots
o Built a subscriber – program – commercial value chain to create a viewership behavior
profile for subscribers
o Predicted future programming and recommended ‘likely to watch’ commercials
These insights enabled targeted programming and commercials based on viewer segments and
behavior, thus enabling better campaign management.
TV broadcasters can generate their own metrics instead of relying on Nielsen’s ratings. These
metrics help reduce cost and increase revenue by enabling them to:
o Negotiate lower content/programming fees and thus reduce the overall cost
o Negotiate higher advertisement rates on popular and likely-to-watch programs and thus
increase revenue
This analysis can help realize a potential return on investment above 100% over a
five-year period, with increased advertisement revenue and decreased programming cost.
Methodology
Various analytics models can be used to analyze STB second-by-second clicking behavior from millions
of households, combined with detailed TV broadcaster airing logs and viewer demographic
information.
Using a Decision Tree [9] based model, the analysis can classify subscribers into Avid and Normal
viewers based on the viewer propensity index, which in turn helps TV broadcasters generate
their own metrics and match them against Nielsen ratings when negotiating lower programming fees.
Using Association Rules [5] and confidence metrics, the analysis can recommend ‘likely to
watch’ commercials with an associated probability. For example, subscribers who have
watched “Old Navy Store” are likely to watch “Rolex” with 88% probability. These
recommendations, along with a higher propensity index, can be used to negotiate higher
advertisement rates.
Using K-Means clustering [4], the population was segmented into cluster groups based on
demographic attributes, propensity index and viewing duration. These clusters help in
profiling subscribers based on the above characteristics, which supports better campaign
management and targeted commercials.
Using Link Analysis [10], customized subscriber – program – commercial value chain segments
can be created to understand the watch preferences of Avid viewers.
By applying collaborative filtering [11] to Avid subscribers’ key Association Rules from each
cluster, TV broadcasters can target the Normal viewers within that cluster and promote them to
potential “Avid viewers”. Targeting this untapped potential increases revenue.
Data Sources and Discovery
The input data for this STB Analytics study comes from both TV broadcasting processes and STB device
click stream data. The simulated primary input data sets are listed below.
Subscriber Data: Subscriber demographic information, including socioeconomic attributes [1] of
TV viewers. This helps in understanding more about subscribers.
TV Playlist: The TV broadcasters’ internal TV guide [3], which includes programs, commercials,
promos, etc. and their airing details. This helps us understand the types of content viewers
are exposed to.
STB Data: Click stream viewership activity, i.e. channel watching behavior details for
each subscriber. This data can then be joined with the TV playlist to see which subscribers have
watched which content.
Content: Program and commercial [2] details such as Type, Genre, etc.
Below is a snapshot of each simulated input data set.
Figure 1: Sample Input Data
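As a rough illustration of how STB data might be joined with the TV playlist, the sketch below matches viewing intervals to airings on the same channel by time overlap. The record layouts and field names (subscriber_id, channel, start, end, content) and the sample rows are assumptions made for this example; the paper does not give the actual schema.

```python
from datetime import datetime

# Hypothetical record layouts and sample rows; the real schemas are not given in the paper.
playlist = [
    {"channel": "CBS", "content": "48 Hours",
     "start": datetime(2015, 8, 8, 21, 0), "end": datetime(2015, 8, 8, 21, 12)},
    {"channel": "CBS", "content": "Red Lobster",
     "start": datetime(2015, 8, 8, 21, 12), "end": datetime(2015, 8, 8, 21, 13)},
    {"channel": "NBC", "content": "Old Navy Store",
     "start": datetime(2015, 8, 8, 21, 5), "end": datetime(2015, 8, 8, 21, 6)},
]
stb_events = [
    {"subscriber_id": 906, "channel": "CBS",
     "start": datetime(2015, 8, 8, 21, 5), "end": datetime(2015, 8, 8, 21, 13)},
]


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals (0 if disjoint)."""
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max(0.0, (earliest_end - latest_start).total_seconds())


def join_viewership(events, airings):
    """Return (subscriber_id, content, seconds_watched) for every overlapping pair."""
    rows = []
    for ev in events:
        for air in airings:
            if ev["channel"] != air["channel"]:
                continue  # the subscriber was tuned to a different channel
            secs = overlap_seconds(ev["start"], ev["end"], air["start"], air["end"])
            if secs > 0:
                rows.append((ev["subscriber_id"], air["content"], secs))
    return rows


viewership = join_viewership(stb_events, playlist)
```

The nested loop keeps the idea visible; on data sets of the size listed in this paper, an indexed or sorted interval join would be the more practical choice.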
This analysis is done on the viewing area in Boston, using that city’s demographic and economic profiles.
Analysis was performed on a small set of sample data created to the specifications below:
5000 TV Subscribers
4 TV Channels – CBS, NBC, FOX and ABC
100+ Commercials in 30+ different categories
8500+ TV Playlist incidents for one month of primetime TV schedule
225,000 TV Viewing incidents from STB
4.7M+ rows of TV Commercial viewership
900K+ rows of TV Program viewership
Time frame: August 3 to August 30, 2015
Primetime is considered to be 8 pm to 10 pm
STB Metrics
The sections below describe the STB Analytics process and its findings in more detail. Before we get
there, it is important to understand the following STB metrics terminology.
Total Duration – The amount of time a program or commercial is watched on a STB.
Click count – The number of times a subscriber has switched channels back and forth.
Minutes/Click – Amount of Minutes Watched / Count of clicks.
Popularity – Number of times the AD/Commercial has been aired.
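The metric definitions above can be sketched in a few lines. The event list, the commercial names used as sample data, and the choice to count every viewing event as one click are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical viewing events for one subscriber: (channel, minutes watched).
# Assumption: each event corresponds to one channel switch (one click).
events = [("CBS", 30.0), ("NBC", 5.0), ("CBS", 45.0), ("FOX", 10.0)]

total_duration = sum(minutes for _, minutes in events)  # Total Duration, in minutes
click_count = len(events)                               # Click count
minutes_per_click = total_duration / click_count if click_count else 0.0

# Popularity: number of times each commercial appears in the airing log.
aired_commercials = ["Old Navy Store", "Rolex", "Old Navy Store", "Red Lobster"]
popularity = Counter(aired_commercials)
```

A subscriber who stays on a channel for long stretches gets a high Minutes/Click value, while a frequent channel hopper gets a low one.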
Boston Demographics and TV Viewing Patterns
Neighborhoods of Boston: The picture shows the different neighborhoods of Boston. The subscribers
from the following neighborhoods were considered for analysis.
Figure 2: Neighborhoods of Boston [6]
Total Duration Watched by Ethnicity
The pie chart shows the percentage of total TV duration watched by Boston subscribers of different
ethnicities. White subscribers account for the largest share, close to 47%. African American
subscribers account for 23.15% and Asian subscribers for 9% of net duration for August.
Figure 3: Duration Watched by Ethnicity
Breakdown of Duration Watched and Size of Household by Region
Figure 4: Duration Watched and Size of Household by Region
Figure 4 above shows the breakdown of net TV duration watched across all regions. The upper part
represents the household size for each Boston neighborhood by region. The visual shows that
Dorchester has more people (household size) and, correspondingly, a high total viewing duration.
Breakdown of Duration Watched by Channel and Gender
Figure 5: Duration Watched by Channel and Gender
In Figure 5, we see the breakdown of TV viewership by gender and channel for all the
neighborhoods/regions of Boston. Compared to men, considerably more women watch CBS.
Dorchester again has a higher proportion of female viewers watching CBS than the other
regions for the month of August.
Box Plot of Total Duration
Figure 6 shows the box plot of total duration by region. Dorchester again stands out as an outlier,
with a substantially larger share of viewers who watched more TV.
Figure 6: Box Plot of Total Duration
Breakdown of Duration Watched by Age, Household Size and Income group
The next three charts show the breakdown of TV Duration watched by Age, Household Size and Income
group.
Figure 7 shows the sum of duration watched across all ages. The distribution is somewhat
skewed toward younger viewers, who account for a larger share of the duration.
Figure 7: Duration by age
Figure 8: Duration by Household size
From Figure 8, we see that a household size of 2 has the highest amount of TV duration compared to
other household groups. Meanwhile, Figure 9 shows the breakdown of Total Duration by all income
groups. Income group “Less than $20,000” had a high share of total duration watched compared to
other income groups.
Figure 9: Duration by income group
TV Commercial Analysis
The next set of analyses focuses on the different types of commercials and their popularity.
Commercial Types
Figure 10 is a bubble chart in which each bubble is a commercial category, sized by the number of
times it was aired during the month of August: the bigger the bubble, the more often the category
was aired. Health and Beauty and Beverage are two of the most frequently aired commercial types
in August 2015.
Figure 10: Commercial categories sized by popularity
Commercial Popularity
Drilling deeper into the commercials reveals the specific ads that were popular during the month of
August, in decreasing order of ad count. Marriott and Princeton University were aired the most
times in August.
Figure 11: Commercials by popularity
The chart below shows the links between different commercial categories.
Figure 12: Links between commercials
Channel Analysis
Four channels are considered for TV viewership in August 2015: CBS, NBC, FOX and ABC.
Figure 13 shows the number of times viewers tuned in to each of the four channels. CBS is the
most sought-after channel, having been selected more than 60,000 times during primetime in
August 2015. It is followed by NBC, ABC and FOX.
Figure 13: Clicker count by channel
Primetime TV View
Figure 14 shows that more content was aired between 8 pm and 9 pm than between 9 pm and 10 pm.
Content here refers to either a program or a commercial. The subsequent trend chart also shows that
only FOX aired more content during the 9 pm to 10 pm segment, whereas all other channels reduced
their primetime content as they moved to the 9 pm segment.
Figure 14: Primetime channel popularity
As seen in Figure 15, apart from FOX, the number of content segments (programs or commercials) has
gone down at 9 pm primetime when compared to 8 pm for all other channels.
Figure 15: Primetime channel popularity trend
Daily Channel Popularity
Figure 16 shows the count of commercials aired on each day in August for each channel. The chart
shows that CBS had upward blips around August 12th and 18th, when most of the other channels aired
relatively fewer commercials. Overall, all four channels aired a similar number of commercials over
the course of August. NBC is relatively flat, whereas ABC, CBS and FOX show more variation.
Figure 16: Daily channel popularity
Figure 17 depicts a tree map, which is similar to a heat map. The darker the color, the
greater the popularity in terms of the number of commercials played. On Mondays, CBS
and FOX are more popular than NBC and ABC. Similarly, ABC is more popular than the other
channels on Sundays. Thursday CBS primetime is the least popular.
Figure 17: Heat map of commercial popularity
Most Watched
Figure 18: Heat map of program viewership
Figure 18 shows the most watched programs/commercials across the entire subscriber population in
Boston for the month of August. We can infer that America’s Got Talent is the most watched in terms
of viewing duration across all subscribers, followed by 48 Hours and American Ninja Warrior.
Subscriber Watch Pattern
The two visuals below show the subscriber viewing pattern based on minutes watched for the month of
August 2015. The shaded portion on the right represents the forecasted duration for the month of
September.
Subscriber 1
Figure 19: Subscriber 1 watch pattern
Subscriber 2
Figure 20: Subscriber 2 watch pattern
Subscriber Propensity Index
The Propensity Index quantifies the uninterrupted TV viewership of a subscriber, which is
usually not measurable [7]. It is a measure of a subscriber’s viewing behavior, calculated as
the weighted sum of Age (Age Propensity), Household Size (House Propensity), and Minutes per Click,
giving one standardized value for each subscriber. The Propensity Index ranges from 0 to 1. Thus a
subscriber with a value of 0.99 has among the best uninterrupted TV viewing behavior.
Propensity Index = Age Propensity + House Propensity + Minutes per Click
Summing these three weighted terms gives the value of the Propensity Index. The index is very
useful in the current digital TV era, where a TV or set-top box being turned on does not always
translate to viewership, for example when a subscriber simply flips channels during commercials.
It penalizes, or flags, subscribers who hop to other channels during commercial breaks and are
therefore not effectively viewers of commercials.
The snapshot below classifies subscribers into Avid or Normal Watchers based on their Propensity
Index score. A score greater than 0.5 indicates an Avid Watcher, while a score less than 0.5
indicates a Normal Watcher.
Figure 21: Subscribers by watch category
The above bar chart shows each subscriber colored by which category they fall into. There are 1556 Avid
Watchers and 3444 Normal Watchers out of a 5000 subscriber sample population overall.
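A minimal sketch of how a weighted-sum propensity index and the Avid/Normal split could be computed is given below. The weights, the min-max scaling, the value ranges and the direction of each term are all assumptions chosen for illustration; the paper does not publish the actual formula terms. A score of exactly 0.5, which the text leaves unspecified, is treated as Normal here.

```python
def min_max(value, low, high):
    """Scale value into [0, 1] over the range [low, high], clamping outliers."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


# Hypothetical weights that sum to 1; the actual weights are not given in the paper.
WEIGHTS = {"age": 0.4, "household": 0.2, "minutes_per_click": 0.4}


def propensity_index(age, household_size, minutes_per_click,
                     age_range=(18, 90), household_range=(1, 8),
                     mpc_range=(0.0, 60.0)):
    """Weighted sum of three normalized terms, giving a value between 0 and 1."""
    age_p = min_max(age, *age_range)                   # assumed "Age Propensity"
    house_p = min_max(household_size, *household_range)  # assumed "House Propensity"
    mpc_p = min_max(minutes_per_click, *mpc_range)
    return (WEIGHTS["age"] * age_p
            + WEIGHTS["household"] * house_p
            + WEIGHTS["minutes_per_click"] * mpc_p)


def watch_category(pi, threshold=0.5):
    """Avid above the threshold, Normal otherwise."""
    return "Avid" if pi > threshold else "Normal"


pi = propensity_index(age=54, household_size=1, minutes_per_click=30.0)
category = watch_category(pi)
```

Because the weights sum to 1 and each term is scaled to the 0 to 1 range, the result stays within the 0 to 1 range described for the index.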
Correlation with other attributes
The correlation matrix below shows an extremely high positive correlation between Age and
Propensity Index, compared with the correlation between Duration Watched and Propensity Index.
In the chart, a stronger correlation is represented by a larger filled area of the pie.
Figure 22: Correlation graph
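For readers who want to reproduce this kind of comparison, the following sketch computes a Pearson correlation coefficient for two attribute pairs. It assumes the matrix reports Pearson correlation, which the paper does not state, and the sample values are made up for illustration rather than taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for five subscribers.
ages = [22, 35, 48, 61, 74]
propensity = [0.21, 0.33, 0.47, 0.58, 0.72]
durations = [900, 400, 1500, 700, 1100]  # minutes watched

r_age = pearson(ages, propensity)            # strong positive relationship
r_duration = pearson(durations, propensity)  # much weaker relationship
```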
Findings I: Predict likely to Watch Commercials
We first map each subscriber’s watch time to the commercials aired on that date, as shown in the
table snapshot below. This gives us the viewership data for each subscriber.
Table 1: Subscriber STB click stream data
We then use Association Rule Mining [5] to find key associations across commercials. Table 2 shows
some of the top association rules, ranked by confidence, for commercials on CBS primetime Saturdays
in August 2015. For example, one key rule indicates an 87.5% chance that subscribers who watched a
Holland America Line commercial will also watch the Nissan Motor Corp commercial.
Table 2: Key association rules
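To make the support and confidence measures behind such rules concrete, here is a small sketch over hand-made transactions. The transactions are hypothetical, and the exhaustive pairwise search merely stands in for whichever rule-mining tool was actually used, which the paper does not name.

```python
from itertools import permutations

# Hypothetical transactions: the set of commercials each subscriber watched.
transactions = [
    {"Holland America Line", "Nissan Motor Corp"},
    {"Holland America Line", "Nissan Motor Corp", "Red Lobster"},
    {"Holland America Line", "Red Lobster"},
    {"Nissan Motor Corp"},
]


def support(itemset, txns):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in txns if itemset <= t) / len(txns)


def confidence(antecedent, consequent, txns):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(antecedent, txns)
    return support(antecedent | consequent, txns) / base if base else 0.0


def pair_rules(txns, min_confidence=0.6):
    """All single-item rules A -> B meeting min_confidence, highest confidence first."""
    items = sorted(set().union(*txns))
    rules = []
    for a, b in permutations(items, 2):
        conf = confidence({a}, {b}, txns)
        if conf >= min_confidence:
            rules.append((a, b, support({a, b}, txns), conf))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


rules = pair_rules(transactions)
```

Each returned tuple holds the antecedent, the consequent, the rule's support and its confidence, which mirrors the columns one would expect in a rules table.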
Findings II: Subscriber Viewership Profiling
Through K-means Clustering [4], we created cluster segments to profile the subscribers into ten
different target groups as shown below in Figure 23. The cluster segments were driven by the attributes
duration watched, propensity index and other demographic attributes. The scatter plot of Propensity
Index (PI) Vs Duration Watched below shows each cluster segment in different colors. Each point
denotes a subscriber with labels for Age, PI and Ethnicity.
Figure 23: Scatter plot of subscriber cluster segments
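The clustering step can be illustrated with a bare-bones k-means on two scaled features. This is only a sketch: the sample points, the use of two features instead of the full attribute set, and the deterministic choice of starting centroids are assumptions, and the study itself used ten clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50):
    """Plain k-means; returns (assignments, centroids)."""
    # Assumption: evenly spaced points serve as starting centroids so the run is
    # reproducible. Random restarts or k-means++ are more common in practice.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical subscribers: (propensity index, duration watched scaled to 0..1).
subscribers = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15),
               (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
assignments, centroids = kmeans(subscribers, k=2)
```

Scaling duration to the same range as the propensity index keeps one feature from dominating the distance calculation.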
Figure 24 below shows the number of subscribers in each cluster, colored by Watch Category. Compared
to other clusters, Cluster 5 has an almost equal proportion of Avid and Normal watchers.
Figure 24: Number of subscribers in each cluster
Figure 25: Zooming on low performers of cluster 5
The above snapshot shows subscribers in cluster 5 who have a high watching duration but a low
propensity index.
Hence, within cluster 5, these subscribers become the new focus group that we can
promote to a high propensity index.
The link graph [10] below shows the commercial category links among top performers
in Cluster 5.
Figure 26: Link graph of top subscriber commercial categories
Table 3 lists the rules generated from running association rule mining on high performers of cluster 5.
Table 3: Association rules from cluster 5 high performers
High Performers in Cluster 5
With this information, we can see the top associations between programs and commercials.
There is a 90% chance that subscribers who have watched the program “48 Hours” will also
watch the “Red Lobster” commercial, within 47% of these two-item transactions.
Top Rules:
We can use the recommendation rules of the high performers to treat, incubate and promote the low
performers within the same cluster. By repeating this process for the other 9 clusters, all viewers
with a low PI score can be moved to a higher level within their corresponding clusters. This
technique is very similar to collaborative filtering [11], since it bases each recommendation on
the behavior and habits of the subscriber’s peers.
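One way to picture this promotion step is to apply the rules mined from a cluster's high performers to the watch history of its low performers. In the sketch below, only the “48 Hours” to “Red Lobster” rule comes from the text; the second rule, the subscriber IDs and the watch histories are hypothetical.

```python
# Rules mined from high performers in one cluster: (antecedent, consequent, confidence).
# The second rule is a made-up example.
cluster_rules = [
    ("48 Hours", "Red Lobster", 0.90),
    ("Extant", "Rolex", 0.75),
]

# Hypothetical watch history of low-PI subscribers in the same cluster.
low_performers = {
    101: {"48 Hours"},
    102: {"Extant", "Rolex"},
    103: {"Bachelor in Paradise"},
}


def recommend(history, rules, min_confidence=0.7):
    """Suggest unseen items whose rule antecedent the subscriber has already watched."""
    picks = [
        (consequent, conf)
        for antecedent, consequent, conf in rules
        if antecedent in history and consequent not in history and conf >= min_confidence
    ]
    # Highest-confidence suggestions first.
    return [item for item, _ in sorted(picks, key=lambda pick: pick[1], reverse=True)]


recommendations = {sid: recommend(seen, cluster_rules)
                   for sid, seen in low_performers.items()}
```

Restricting the rules to the subscriber's own cluster is what gives the approach its peer-based, collaborative-filtering flavor.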
Figure 27 shows the support and confidence for the key rules.
Figure 27: Association rules
Exploring High Performers of Cluster 5
The link graph [10] below shows the interaction between TV programs and high
performers from cluster 5.
Each program on the left is colored differently, and the pie chart shows the subscriber watching
pattern for every node.
Subscriber 3148 only watches “Extant” and “Bachelor in Paradise”, whereas Subscriber 906
watches all 4 shows.
Figure 28: High performers Subscriber-Program link graph
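A subscriber-program link graph of this kind can be represented as a simple bipartite edge list. The sketch below uses the two subscribers mentioned above; “Program C” and “Program D” are placeholder names, since the other two shows are not named in the text.

```python
from collections import defaultdict

# Edges of a bipartite subscriber-program graph.
# "Program C" and "Program D" are placeholders for the two unnamed shows.
edges = [
    (3148, "Extant"), (3148, "Bachelor in Paradise"),
    (906, "Extant"), (906, "Bachelor in Paradise"),
    (906, "Program C"), (906, "Program D"),
]

subscriber_to_programs = defaultdict(set)
program_to_subscribers = defaultdict(set)
for subscriber, program in edges:
    subscriber_to_programs[subscriber].add(program)
    program_to_subscribers[program].add(subscriber)

# Programs both subscribers watch, and how many subscribers link to each program.
shared = subscriber_to_programs[3148] & subscriber_to_programs[906]
degree = {program: len(subs) for program, subs in program_to_subscribers.items()}
```

Adding commercial nodes and program-commercial edges in the same way would extend this to a subscriber-program-commercial chain.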
Figure 29 shows some of the interactions between subscribers and commercial types.
Figure 29: High performers Subscriber-Program-Commercial link graph
Findings III: Predict and Classify Subscribers
One can predict and classify subscribers as Avid or Normal Watchers based on key attributes using a
decision tree model. In the classification tree [9] below, total duration and clicker count are the
key splitting attributes that decide the Propensity Index; Age and Household Size are set aside
because they were used directly in creating the PI metric. The aim is to predict the Propensity
Index for a new set of subscribers when we have no information on their demographics (Age,
Household Size, etc.). Hence, a fully mature model with an exhaustive training set can predict and
classify subscribers into Avid or Normal Watchers from just the total duration they have watched TV
and their number of clicks.
In Figure 30, the major split is at a total duration of 1622 minutes. The next split is based on
whether the clicker count falls above or below 39 or 45. These criteria decide whether one is going
to be an Avid or Normal Watcher. In the first bin (bottom left of the chart), a subscriber with
less than 142 minutes of total duration and a clicker count of less than 39 will be a Normal
Watcher 75% of the time (probability of 0.75) and an Avid Watcher only 25% of the time.
Similarly, the other 14 bins are constructed based on the splits.
Figure 30: Classification tree
Training set: 4000 subscribers
Test set: 1000 subscribers
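To show how a classification tree arrives at a split such as the one on total duration, the sketch below searches for the threshold that minimizes weighted Gini impurity on a single feature. Gini impurity is an assumption (the paper does not name its splitting criterion), and the durations and labels are made up.

```python
def gini(labels):
    """Gini impurity of a list of 'Avid' / 'Normal' labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_avid = sum(1 for label in labels if label == "Avid") / n
    return 1.0 - p_avid ** 2 - (1.0 - p_avid) ** 2


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value < threshold' split."""
    best = (None, float("inf"))
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [lab for val, lab in zip(values, labels) if val < threshold]
        right = [lab for val, lab in zip(values, labels) if val >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up total durations (minutes) and watch categories for eight subscribers.
total_duration = [300, 800, 1200, 1500, 1700, 2100, 2600, 3000]
labels = ["Normal", "Normal", "Normal", "Normal", "Avid", "Avid", "Avid", "Avid"]
threshold, impurity = best_split(total_duration, labels)
```

A full tree repeats this search within each resulting bin, for example on clicker count, until a stopping rule is met.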
Classification Matrix
From the above matrix we see that 43 Avid Watchers and 636 Normal Watchers have been classified
correctly in the test set. Only 321 (278+43) subscribers have been classified incorrectly.
Table 4: Test set after classification
Test Misclassification rate: 321/1000 = 0.321
Model Accuracy with sample test data : 67.9%
The model can be matured as the training set increases
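The accuracy and misclassification figures above follow directly from the counts quoted in the text, as this short calculation shows. The two misclassified counts are kept as reported, without assigning them to a particular class, since the matrix itself is not reproduced here.

```python
# Counts quoted in the text for the 1000-subscriber test set.
correct_avid = 43
correct_normal = 636
misclassified = 278 + 43  # the two off-diagonal cells of the matrix

test_size = correct_avid + correct_normal + misclassified
accuracy = (correct_avid + correct_normal) / test_size
misclassification_rate = misclassified / test_size
```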
Return on Investment
There are measurable returns on this STB Analytics investment. These returns can be realized across
the board, from small to large TV broadcasters. Along with the operational benefits, there is a
substantial financial return on investment. For example, with an investment of $8 million over five
years, we can predict additional revenue of $20 million with a net return of $11 million. This is
based on a 1% increase every year on the existing $200 million ad revenue and, similarly, a
reduction of $1 per viewer in programming cost on a 2 million viewer base every year.
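The arithmetic behind these figures can be checked with a simple, undiscounted calculation. This sketch assumes the 1% uplift is a flat $2 million per year and that the per-viewer saving recurs each year. Read this way, the benefit comes to $20 million and the net return to $12 million, slightly above the $11 million quoted; the source does not explain the difference, so the sketch should be treated as an approximation.

```python
YEARS = 5
investment = 8_000_000       # total investment over five years, in dollars
ad_revenue = 200_000_000     # existing annual ad revenue
ad_uplift_rate = 0.01        # assumed flat 1% of existing ad revenue per year
viewers = 2_000_000
saving_per_viewer = 1.0      # dollars of programming cost saved per viewer per year

ad_gain = ad_revenue * ad_uplift_rate * YEARS
cost_saving = viewers * saving_per_viewer * YEARS
total_benefit = ad_gain + cost_saving
net_return = total_benefit - investment
roi = net_return / investment  # fraction of the investment returned as net gain
```

Under either net figure, the return exceeds 100% of the investment over the five-year period.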
Conclusion
This paper showcases various data science and analytical methods that can be leveraged to address
some of the most common challenges in the set-top box industry. With the newly devised Propensity
Index metric, it is now possible to quantify the uninterrupted TV viewership pattern. Through
clustering, we were able to profile and build customer segments based on subscriber preferences and
habits. Using association rules, we can treat, incubate and promote subscribers with the right set
of programs and commercials to turn them into Avid Subscribers. In this way, we can tap into the
potential hidden inside each cluster segment. By further dissecting and analyzing the
subscriber-program-commercial linkage value chain, it is now possible to build personalized offers
and recommendations for every subscriber. These analytics can change the business model from
broader content management to more customized content creation and marketing. This not only results
in better campaign management but also helps broadcasters negotiate higher advertising rates, thus
increasing revenue. The analytics can be enriched further by bringing in additional external data
sources such as Census and Zillow, as well as social media sources like Facebook, Twitter and Yelp,
in order to build a holistic analytic solution.
References
[1] http://www.bostonredevelopmentauthority.org/research-maps/research/overview
[2] http://www.ispot.tv/browse
[3] http://www.tvguide.com/listings/
[4] https://en.wikipedia.org/wiki/K-means_clustering
[5] https://en.wikipedia.org/wiki/Association_rule_learning
[6] http://www.overdosesolutions.net/?page_id=19
[7] Consumer Micro-Behavior and TV Viewership Patterns: Data Analytics for the Two-Way Set-Top Box,
ICEC_2012_CATV_viewership_analytics, by Ray M. Chang, Robert J. Kauffman and Insoo Son
[8] Evaluating TV Ad Campaigns Using Set-Top Box Data, Google, Inc., by Sundar Dorai-Raj, Yannet
Interian, and Dan Zigmond
[9] Decision Trees for Predictive Modeling, SAS Institute Inc., by Padraic G. Neville
[10] http://www.csc.ncsu.edu/faculty/samatova/practical-graph-mining-with-R/sample/chapter_5_LinkAnalysis.pdf
[11] https://en.wikipedia.org/wiki/Collaborative_filtering
Dell EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Use, copying and distribution of any Dell EMC software described in this publication requires an applicable software license.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.