Predicting Email Duration Using SAS Text Miner

Table of Contents

1. Introduction

2. The Barbaric Beginnings

3. SAS Text Miner Introduction

4. SAS Text Miner Tools

5. Results

6. Additional Results

7. Conclusion and Future Studies

8. References

9. Appendix


Part 1: Introduction

This project started out as a text mining exploration to identify emails related to one specific type

of problem within the company’s software. These emails were coming into the customer service

department of Lithium Technologies. The clients sending the emails could be describing a problem or error they were experiencing with the product, or they might simply be asking a question. They could also be writing about other things, such as a request or an action they needed performed. My initial research goal was to identify how often this specific issue occurred, so the company could decide whether it should invest time in developing more efficient methods of preventing this problem in the future.

My journey began with an initial data set of emails from the years 2008 to 2014. My initial goal

was to look for certain keywords to best predict whether an email belonged to a specific type of

problem. This problem related to a software bug that some clients were experiencing.

Discriminant analysis was used to see which keywords did the best job of segregating the emails

into their respective category (whether they related to the specific software bug or not). Later on,

I realized this method was highly inefficient.

My next step was to use a program called SAS Text Miner for my analysis. This allowed me to

increase the significance of my research question. Instead of predicting whether an email

belonged in a specific category, I would now be able to predict an individual email’s time until

resolution. In order to do so, I looked into three types of models: linear regression, logistic

regression, and a decision tree. Some of these models required a categorical response variable,

but my data contained a continuous variable for the time it takes an email to resolve. In order to

accommodate this, I created a cutoff at the 75th percentile of my continuous total time variable. Emails at or below the 75th percentile were classified as a 0 (not taking a long time), while emails above it (the top 25 percent) were classified as a 1 (taking a long time). This part of my project will

be discussed further in the SAS Text Miner Intro and Tools sections.

The entire data set contained about 65,000 emails in total (over the years 2008 - 2014). During my analyses, I discovered the best predictor of an email's duration was when it was sent; on average, emails from 2008 took much longer to resolve than emails from 2014. Because of this, I ended up using only emails from 2014 in my final analysis, since I wanted my model to be as relevant and accurate as possible.


Part 2: The Barbaric Beginnings

At the start of my project, my goal was to predict whether an email related to a specific software

bug or not. To begin, I manually sorted through about 1,000 emails and determined whether or

not they related to the specific problem I was looking for. I then used a frequency word counter

to determine which keywords came up most frequently in the description portion of emails and

also in the subject header. The next step was to run a discriminant analysis on these keywords to

see how well they determined which bucket an email falls into.

There were certain high-frequency words that weren't necessarily meaningful for predicting whether an email falls into a certain category. For example, a word like “twitter” would come up very often; however, it is not a good word for separating the categories because it is likely to show up in all emails coming into the company's customer service department. It is equally likely to show up in emails that fall under the category as in other emails.

Figure 1: Jittered scatterplot of the Canonical1 scores of each email

Figure 1 shows a visual of the discriminant analysis. In this particular example, the blue dots

represent emails that describe the specific software bug I was interested in, while the red dots

represent everything else. An email’s classification is based on how close its Canonical1 score

lies to the average of the Canonical1 scores (X axis). These averages are represented by the

larger red and blue circles on the plot. The Canonical1 scores are calculated as a linear

combination of indicator variables which showed whether an email contained a specific word or

not.
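For reference, this kind of discriminant analysis can be run with PROC DISCRIM once the keywords have been coded as 0/1 indicator variables. The sketch below is only illustrative; the data set and variable names are placeholders, not the ones from my data.

/* Minimal sketch of one discriminant analysis run (illustrative names) */
proc discrim data=emails canonical out=scored;
	class bug_related;    /* 1 = relates to the software bug, 0 = everything else */
	var has_community has_issue has_respond has_forum has_reproduce;
run;

The OUT= data set from such a run contains each email's Canonical1 score, which is the quantity plotted in Figure 1, along with its posterior probability of belonging to each category.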


Figure 2: Discriminant analysis scoring coefficients example

Figure 2 shows a sample of the scoring coefficients for these indicator variables. Each variable

name represents an indicator for whether or not that word is included in the email. For example,

if we’re using the preceding coefficients to represent the entire data set, an email with the words

“community” and “issue”, but without the words “respond”, “forum”, and “reproduce” would

have a Canonical1 score of about .624 + .363 + 0 + 0 + 0 = .987. This example email’s

Canonical1 score would fall closer to the blue circle and would be predicted to be an email

relating to the specific type of software bug I was looking for.

It is important to note that in Figure 1 there are some red dots that fall closer to the blue dot

average as represented by the blue circle; these are the incorrectly predicted emails. The same is

true for the blue dots that fall closer to the red circle. In the score summaries shown in Figure 1,

the percentage of emails that were misclassified is approximately 7.6%.

The method used to determine the proportions of emails relating to the specific software bug is not very “traditional”. I did not check the Canonical1 score of an email and then determine which category’s average Canonical1 score was closer. Keep in mind, my initial goal was not to predict

whether individual emails related to the specific software bug I was looking for. Instead, I only

wanted to predict the overall percentage of these emails within a given data set.

In order to do this I examined the probabilities provided for an individual email relating to the

software bug. I would then take the average of these probabilities and use that value as the

overall percentage of emails relating to the specific problem. For example, suppose that we have

a data set of 4 emails. The probabilities of each email relating to the specific problem I want are:

.3, .05, .6, and .7. Traditionally, we would predict an email to relate to the software bug if its

probability is greater than .5, so in this example we would predict 2 out of the 4 emails to be

talking about the software bug. But using my strategy, the proportion of software bug emails

would be predicted as .41 (the average of the four probabilities). I found this method to be much

more accurate when predicting the overall proportions of the large data set I was working with.
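As a concrete sketch of this calculation, suppose the per-email probability of relating to the software bug has been stored in a variable called bug_prob (an illustrative name). The overall proportion is then just its mean:

/* Average of the per-email probabilities = predicted overall proportion */
proc means data=scored mean;
	var bug_prob;
run;

In the four-email example above, this mean is (.3 + .05 + .6 + .7)/4 ≈ .41.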

In order to perform this discriminant analysis, I manually selected words of my choosing. I did

this by examining about 1,000 emails and determining whether or not they fell into the category I

was looking for. Once this was completed, I would use a frequency word counter to show how

often certain words would appear within the emails I was interested in. Finally, I would choose

the words I thought did a good job of segregating these emails from all others. However, I

wouldn’t always choose the best keywords. After running individual discriminant analyses I

would change the words with coefficient scores close to zero (this means they do a poor job of

segregating the two groups of emails). I would repeat this process until I obtained a small enough

misclassification rate on my sample of emails.


Part 3: SAS Text Miner Intro

This project began with the goal of predicting the overall percentage of emails describing a

specific type of problem. This information could be used by the company in order to decide

whether they should invest time in discovering a more efficient method to prevent this type of

problem in order to save time in the long run.

My overall goal was to save the company as much time as possible. The method with which I had begun my project was not the ideal way to approach this issue. It required me to sort

through an initial sample of emails by hand, it only allowed me to examine one specific type of

email I wanted to look for, and it forced me to run multiple analyses in order to obtain the best

set of keywords to segregate the two groups. In order to further investigate how to save as

much time as possible, I would need to find a more efficient strategy.

Rather than limiting my research to just examining one specific type of problem, I decided to

find a way to predict how long an individual email will take to resolve. Could the length of time

an email takes to resolve be predicted by a model whose only predictor variable is the physical

text of that email? In order to build such a model I had to improve upon my previous strategy by

using SAS Text Miner to examine the unstructured text buried in the emails.

Figure 3 shows the basic/clean structure of the email data I worked with. It contains columns for

the subject line, body of the email (description), and total time till resolution (in hours).

Figure 3: Sample data set of emails


Part 4: SAS Text Miner Tools

Figure 4: Sample of SAS Text Miner Diagram

Figure 4 shows a simple example of the SAS Text Miner diagram for these text mining tools.

The diagram begins with the node on the left representing the data set. The following text

parsing node is similar to a word frequency counter; it counts the frequencies of nouns, pronouns, interjections, verbs, and other parts of speech. This information is then passed through the text filter node to correct for misspellings and plural forms. It also allows you to import a custom list of synonyms specific to your data, a technique that I used for this project.

After the text filter node, the data were run through three tools: text topics, text clusters, and the text rule builder. Each of these tools provided me with predictor variables to use in my analysis. A text

topic/cluster is a collection of words that describe and characterize a main theme or idea within

each email. For example, if I create two text topics for my data set of emails: one text topic may

describe emails where the client is talking about experiencing a certain bug in the software, while

the other text topic may describe emails where the client isn’t describing a problem, but asking a

question. The text topic about the software bug may contain words such as “issue”, “resolve”, or

“fix”. Similarly, the text topic about questions may contain words such as “ask”, “wondering”, or

“curious”. SAS Text Miner allowed me to specify how many topics/clusters I wanted to create

with my data. It would then scan through every single email in the data set and create the desired

number of text topics/clusters.

Figure 5: Text Topic output


The above Figure 5 shows sample output of some of the words that each text topic contained.

The column labeled # Docs represents the number of documents (emails) that fall under the

corresponding text topic. Some of the text topic words include URL links or email addresses. I

refer to these topics as “garbage” topics. I threw these topics out because they are likely to only

exist within this specific data set used to create the list of text topics. In other words, I am unable

to generalize this information to other emails outside of the data set. The output from the text

clusters is not shown because they operate in a similar manner by grouping words that describe a

group of emails. The main difference between the topics and clusters is that an individual email

can be categorized into multiple text topics, but only one text cluster.

The most significant tool of SAS Text Miner for my analysis is called the text rule builder. The

text rule builder operates slightly differently than the text topics/clusters. Text rules require a

categorical response variable. Unfortunately, I am dealing with a quantitative response variable

(total time in hours). Therefore, I created a binary response variable using a cutoff value at the 75th percentile of total time till resolution. If an individual email was in the top 25 percent of total time, it was categorized as taking a long time (this could be a type of email worth figuring out how to respond to more efficiently in the future). All other emails were categorized as not taking a long time. A value of 1 was used if the email fell in the top 25 percent and a value of 0 otherwise.
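A condensed sketch of this cutoff step is shown below; the full code I actually used, which assigns the cutoff within each half-year, appears in the Appendix.

/* Find the 75th percentile (Q3) of total time and flag emails above it */
proc means data=totaltime noprint;
	var total_time;
	output out=cutoff q3=p75;
run;

data binary;
	if _n_ = 1 then set cutoff(keep=p75);
	set totaltime;
	ind75_time = (total_time > p75); /* 1 = top 25 percent (long), 0 otherwise */
run;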

A text rule consists of 1 to 3 words. If an email contains all of the words of the rule, it will

qualify for the rule. If an email qualifies for a rule, it can either be predicted as taking a long time

(value of 1) or not (value of 0).

Figure 6: Text Rule Builder Output

Figure 6 shows output for the text rules. The rule column contains the word(s) for each

individual rule, and if an email contains these words, it qualifies for the rule. The 4th rule shown

in Figure 6 contains the words “search” and “response”, its target value is 1, and it has a true positive/total value of 8/8. This means that out of all the emails, 8 of them contained the words “search” and “response”. Of those, all 8 were categorized in the top 25 percent of total time till resolution, meaning that they took a long time. There were also “garbage” rules created in this process, such as rule number 10 in the output.

In order to determine whether an email was classified as taking a long time or not, I combined all

3 of the aforementioned tools. Because the data contains a text based variable for the subject of


the email and another variable for the body of the email, I had to run 2 separate node trails for

each one. I would later have to combine them into a single data set.

Combining the text topics and clusters was very straightforward because their format allowed

them to be merged easily. However, the text rules were a different story. The format they were

exported in didn’t allow merging them into the data. To work around this problem, I created

separate variables which indicated if an email qualified for a meaningful rule. This allowed me to

pick and choose which rules would get passed in. This way, “garbage” rules could be dropped

and only the rules that contained a high percentage of the target value would be included. I

classified the new variables into 8 separate categories:

one_topPredict

one_topSubPredict

one_medPredict

one_medSubPredict

zero_topPredict

zero_topSubPredict

zero_medPredict

zero_medSubPredict

The first variable on the list (one_topPredict) is a top predictor of emails that are categorized as a

1 (taking a long time). In order to create the one_topPredict variable, I examined all the rules

built for the description of the email that did a good job of predicting whether an email took a

long time or not. If an email qualified for one of these rules, I classified it as having a value of 1

under the one_topPredict variable, otherwise it would have a value of 0. The same logic applies

to the other variables. When the variable name contains “Sub”, this variable relates to rules

looking at the subject of the email. The difference between “top” and “med” in the variable

names refers to how high a percentage of emails was accurately predicted. For example, if an

email qualified under the one_topPredict variable, it would have a 97.7% chance of being

categorized as a one (this percentage is the total number of emails which qualified for this

variable and were categorized as a 1, divided by the total number of emails which qualified for

this variable). If an email qualified for one_medPredict it would only have a 63.4% chance of

being categorized as a 1.
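As a quick sketch, a rule variable's accuracy can be checked with a cross-tabulation, assuming the rule variables have been merged back with the 75th-percentile indicator as described in the Appendix (the data set name follows the Appendix code):

/* Row percentages show, of the emails qualifying for one_topPredict,
   what share were actually categorized as 1 */
proc freq data=mydata.TopicsWithRules_2014;
	tables one_topPredict*ind75_time / nocol nopercent;
run;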

The final analytic data set contained several hundred variables. Each email was assigned

indicator variables of whether or not it belonged to the respective text topic, and it was also

assigned a raw score variable for each text topic. If the raw score value for an email was high

enough for a certain text topic, it would qualify to fit into that text topic. This resulted in 270 text

topics for the body of the email and 114 text topics for the subject of an email, each with their

own respective indicator and raw score variables. In terms of text clusters, each email was

assigned singular value decomposition (SVD) scores. SVD scores of an email are used to

determine the probability of an email falling into a certain cluster. My final data set contained 50

SVD values and 190 text clusters for the body of an email along with 30 SVD values and 11 text

clusters for the subject of an email.

All of these aforementioned variables were used as potential explanatory variables in a decision

tree model, a logistic regression model, and a linear regression model. For the sake of ease of


interpretation from a non-statistical perspective, I will mainly discuss the findings of the decision

tree model in this report. A decision tree is made up of a series of blocks and branches. The first

block at the top of the decision tree represents the entire data set I was using. Thus, the first

block had 25% of the emails categorized as a 1 (taking a long time), and 75% of the emails

categorized as a 0 (not taking a long time).

Figure 7: Beginning segment of my decision tree model

Figure 7 shows the beginning of the decision tree for this analysis (only part of the total tree).

The first block is split into two separate blocks by a condition. Each condition is selected based on what will segregate the groups of emails the most; in other words, the condition that puts the highest possible percentage of emails categorized as 1 in one group and the highest possible percentage of emails categorized as 0 in the other group. The condition for the very first block is

whether an email falls under the one_topPredict variable or not. If an email falls under the

one_topPredict variable it will be classified into the block on the left, otherwise it will be

classified into the block to the right. The blocks continue to split upon various conditions of the

predictor variables until they reach their respective bottom rows.

The conditions for the decision tree branches were assigned manually; I did not have the

computer automatically assign conditions for me. I wanted to avoid assigning “garbage”


conditions to the decision tree branches because these conditions only exist within this specific

data set and cannot be generalized to external emails. However, I did examine the recommended

conditions (generated by SAS Text Miner) that did the best job of segregating emails based on

my categorical response. This process allowed me to manually look at these conditions and only

use the ones I wanted. Once the decision tree reaches the bottom row of blocks, the emails would

be as segregated as possible. In the final blocks, an email would now be predicted to have a

certain probability of being categorized as a 1 or 0. This probability is based on the block’s

percentage of emails within each category.


Part 5: Results

The results of the decision tree can be used to predict my categorical variable of time duration.

The decision tree sorted through the hundreds of predictor variables and separated the data based

on whether they met certain conditions of these predictor variables. For example, if an email

qualified for a certain text topic, and was also part of a specific cluster, and contained a positive

value for one of the custom rule variables, the decision tree would predict a specific probability

of that email falling into the upper 25th percentile of total time duration.

Certain branches of the decision tree, referred to as “money makers”, classify a high percentage

of emails as equal to 1 (upper 25th percentile) or 0 (lower 75th percentile). These nodes also

contain a decent number of emails within them that meet the specified conditions. Figure 8

shows one of the “money maker” branches for predicting if an email does not take a long time

(value of 0):

Figure 8: Money maker for predicting emails not taking a long time

Out of the 5,518 emails (just the emails from the year 2014), 265 fell onto this node. Of these 265 emails, 98.5% were categorized as 0 and 1.5% were

categorized as 1. The conditions for this node can be seen on the right hand side. The top

condition represents the very first condition emails had to meet to be classified into this node

followed by the subsequent conditions in sequential order. The variable name of the second condition, “DescCluster_prob136”, represents the probability of the email’s description falling into cluster 136.


Figure 9: Money maker for predicting emails taking a long time

Figure 9 represents an example of a “money maker” that predicts a high percentage of emails

taking a long time. Notice how the number of emails that fall under this node is lower than in the previous one. This is to be expected because emails taking a long time make up only 25% of the data set, as opposed to 75%.

There were several “money maker” blocks for each category of email (long versus not long

duration) which were combined into a final model for the decision tree. From these combined

results, 179 of these emails were classified as having a 97.7% probability of taking a long time

and can be considered the top tier of prediction. The second tier of prediction consisted of a

separate 232 emails having a 62.9% probability of taking a long time. While 62.9% is not an

extremely high probability, this may be due to the nature of the data set with emails taking a long

time only represented by 25% of the data. So I was able to find 232 emails that are more than

twice as likely to take a long time.

In regard to predicting whether an email will not take a long time, there were 464 emails classified as having a 99.1% probability of not taking a long time. I didn’t create a second tier for predicting emails not taking a long time because they already represented 75% of the data. When all 3 of these “prediction” groups are combined, they only represent about 15.8% of the data set. In other words, I was only able to find meaningful predictions for 15.8% of the data I was looking at. While this is not as high as I hoped, it is certainly better than nothing.


Part 6: Additional Results

A total of 3 models were produced for my analyses: a decision tree, a logistic regression, and a

linear regression. Of these 3 models, only the decision tree was discussed in this project because

of its utility. The decision tree allowed me to create groups of emails that had an extremely high

percentage of belonging to one category or the other, while the same can’t be said for the other

two models.

The logistic regression model predicted the probability of an email taking a long time

(categorized with a value of 1). This is similar to the purpose of my decision tree model; however, it is not as easy to present to a company. It can be difficult to explain concepts

such as log odds and odds ratios to people with little statistical knowledge. It’s much easier to

explain to a company that if an email meets certain conditions, it will have a certain probability

of taking a long time.
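For completeness, a hedged sketch of an equivalent logistic fit written outside of the Enterprise Miner regression node is shown below; the data set name follows the Appendix and the short predictor list is purely illustrative.

/* Model the probability that an email falls in the top 25 percent of total time */
proc logistic data=mydata.TopicsWithRules_2014;
	model ind75_time(event='1') = one_topPredict one_medPredict DescCluster_prob136;
run;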

While the decision tree is easier to explain, I was only able to significantly predict length of time

for about 16% of all the emails in the data set. The remaining emails are less straightforward in

terms of predictions of which category they belong in, but we may use a linear regression model

to get a slightly better prediction for these remaining emails.

Linear regression is an easy concept to explain to someone with no statistical knowledge. The

linear regression model used about 70 predictor variables which were selected from the same list

of several hundred variables used in the decision tree. Linear regression needs to use a

quantitative response variable. I ended up using the total time till resolution (the initial response

variable I started with).

The variable selection process for linear regression was automated by the computer in order to

come up with the most useful model. Unlike the decision tree, I was not able to select which

predictors I wanted to leave out of the model. This means that the computer would automatically

select “garbage” predictors to insert into my model! However, I could work around this by

removing the unwanted predictor variables from the initial data set before they were passed into

the linear regression node.
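A hedged sketch of this kind of automated selection, as it might be written outside the Enterprise Miner node, is shown below. The data set name and the short predictor list are illustrative; in practice the several hundred topic, cluster, and rule variables (minus the unwanted ones) would be supplied.

/* Stepwise selection of predictors for the quantitative response total_time */
proc glmselect data=quantdata;
	model total_time = one_topPredict one_medPredict DescCluster_prob136
		/ selection=stepwise;
run;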


Figure 10: Mean Predicted values vs. Mean Target values on Total Time

Figure 10 represents a graph of predicted values vs. target values for average total time across

the depths of the data set. The data are measured at every depth interval of 5 as shown on the

horizontal axis. Depth can be thought of as a percentage of the data. The predicted values shown

in the preceding figure seem like a good fit for the actual target values of total time, but this only

displays the average predicted value for every 5% interval of the data. Meaning this is the

average predicted value for groups of about 250 emails! When the predicted values for the

emails are examined individually they have high residual values overall. This means that the

linear regression model performed well over each average of 5% within my data, but when each

email is examined individually, the model isn't very trustworthy. However, some knowledge is

better than none.


Part 7: Conclusion and Future Studies

What are the actual uses of this information? The company has a certain number of people in the

customer service department who respond to these emails. Some of these employees have a lot of

experience, while others are more inexperienced. The most time efficient method of responding

to emails would be to have the less experienced employees respond to emails that do not take a

long time, while the more experienced employees would respond to all the other emails.

The main reasoning for the preceding idea is that the biggest time sink for the company would be

when an employee who is less experienced attempts to respond to an email which takes a long

time. Because the employee is less experienced, it will take an even longer amount of time for

the email to resolve. In order to save time, the text mining algorithm based on specific words

found in the subject line and body of the email could be run on every new email that comes into

the customer service department. If the email is predicted to not take a long time, it would be

assigned to a less experienced employee. If an email is predicted to take a long time, it would be

assigned to a more experienced employee. All of the “in-between” emails can be assigned to

whoever is free at the given time.

While the results from the decision tree only classify approximately 16% of the data, anything

helps when it comes to saving the company as much time as possible while responding to

customer service emails. This is why I believe it is a profitable strategy to classify emails with the decision tree first and then run the linear regression model on emails that aren't

highly predicted to belong in a certain category. Depending on the predicted value from the

linear regression model, we could assign the email to a newer employee or a more experienced

employee.

Thus, up to this point we have successfully established an efficient method for assigning emails

to employees of a company based on topics and clusters of text embedded in the email. But what

if we could identify which types of emails are taking a long time to resolve? This would allow

the company to develop more efficient methods for resolving the subject matter of these emails

in order to save time in the long run. Therefore, a next step for this project would be to examine

the emails with a high probability of being predicted to take a long time. Are there any patterns

in these emails? Are there any specific types of problems these emails are discussing?

In order to find this out, we must delve deep into the various text topics/clusters found in the

email text. What are the words used in the text topics/clusters into which these high-probability emails fall?

What do those text topics/clusters mean in context? Do the majority of emails in this specific

node relate to a specific type of problem? In order to do this, we will need to identify the words

associated with each text topic, cluster, and rule. The format for the topics and rules easily allows

this, but clusters are more difficult. The visualization of the decision tree doesn’t allow you to

see the words associated with the clusters. Instead, it only lists the cluster number. To work

around this, you must manually go back into the cluster node output to see the words associated

with the respective cluster number. These findings would be even more useful for a company in

order to discover methods for more efficiently resolving these types of emails in the future.


Part 8: References

Sarma, Kattamuri S. Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications. Cary, NC: SAS Institute, 2007. Print.

Ville, Barry De, and Padraic Neville. Decision Trees for Analytics: Using SAS Enterprise Miner. Cary, NC: SAS Institute, 2013. Print.

SAS Certification Prep Guide: Advanced Programming for SAS 9. Cary, NC: SAS Institute, 2011. Print.

SAS Certification Prep Guide: Base Programming for SAS 9. Cary, NC: SAS Institute, 2011. Print.

Cohen, K. Bretonnel, and Lawrence Hunter. "Getting Started in Text Mining." PLoS Computational Biology 4.1 (2008): n. pag. Web.

"Getting Started with SAS Enterprise Guide: Main Menu." Getting Started with SAS Enterprise Guide: Main Menu. N.p., n.d. Web.


Part 9: Appendix

The purpose of this section is to provide reproducible steps in order to achieve the results I did. I

will start by providing the SAS code used to import and clean the data until it was in the format I

could work with. In order to read the data in, I had to import many CSV files separately, because the large data sets caused my computer to crash.
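Since the same PROC IMPORT pattern repeats for every file, the calls below could equivalently be generated with a small macro. The sketch that follows is only an alternative form of the same step, not the code I actually ran.

/* Optional shorthand for the repeated imports below (same pattern, same options) */
%macro readcsv(file, out);
	proc import
		datafile="E:\Lithium\Case History Status\csv\&file..csv"
		dbms=csv
		out=&out
		replace;
		guessingrows=2000;
	run;
%mend readcsv;

%readcsv(1 2010, sample12010)
%readcsv(2 2010, sample22010)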

/* data from year 2010 */

proc import

datafile='E:\Lithium\Case History Status\csv\1 2010.csv'

dbms=csv

out= sample12010

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\2 2010.csv'

dbms=csv

out= sample22010

replace;

guessingrows=2000;

run;

/* data from year 2011 */

proc import

datafile='E:\Lithium\Case History Status\csv\1 2011.csv'

dbms=csv

out= sample12011

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\2 2011.csv'

dbms=csv

out= sample22011

replace;

guessingrows=2000;

run;

/* data from year 2012 */

proc import

datafile='E:\Lithium\Case History Status\csv\1 2012.csv'

dbms=csv

out= sample12012

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\3 2012.csv'


dbms=csv

out= sample32012

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\4 2012.csv'

dbms=csv

out= sample42012

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\5 2012.csv'

dbms=csv

out= sample52012

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\6 2012.csv'

dbms=csv

out= sample62012

replace;

guessingrows=2000;

run;

/* data from year 2013 */

proc import

datafile='E:\Lithium\Case History Status\csv\1 2013.csv'

dbms=csv

out= sample12013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\2 2013.csv'

dbms=csv

out= sample22013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\3 2013.csv'

dbms=csv

out= sample32013

replace;

guessingrows=2000;

run;

proc import


datafile='E:\Lithium\Case History Status\csv\4 2013.csv'

dbms=csv

out= sample42013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\5 2013.csv'

dbms=csv

out= sample52013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\6 2013.csv'

dbms=csv

out= sample62013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\7 2013.csv'

dbms=csv

out= sample72013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\8 2013.csv'

dbms=csv

out= sample82013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\9 2013.csv'

dbms=csv

out= sample92013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\10 2013.csv'

dbms=csv

out= sample102013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\11 2013.csv'


dbms=csv

out= sample112013

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\12 2013.csv'

dbms=csv

out= sample122013

replace;

guessingrows=2000;

run;

/* data from year 2014 */

proc import

datafile='E:\Lithium\Case History Status\csv\2 2014.csv'

dbms=csv

out= sample22014

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\3 2014.csv'

dbms=csv

out= sample32014

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\4 2014.csv'

dbms=csv

out= sample42014

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\5 2014.csv'

dbms=csv

out= sample52014

replace;

guessingrows=2000;

run;

proc import

datafile='E:\Lithium\Case History Status\csv\7 2014.csv'

dbms=csv

out= sample72014

replace;

guessingrows=2000;

run;


/* Combining all of the data together from all years */

data alldata;

set sample12010 sample22010 sample12011 sample22011 sample12012

sample32012 sample42012 sample52012 sample62012 sample12013 sample22013

sample32013 sample42013 sample52013 sample62013 sample72013 sample82013

sample92013 sample102013 sample112013 sample122013 sample22014 sample32014

sample42014 sample52014 sample72014;

run;

proc export data=alldata

outfile='E:\Senior Project\Data\data.csv'

dbms=csv

replace;

run;

/* Sorting the Data appropriately */

libname loc "Libraries\Documents";

proc import

datafile='E:\Senior Project\Data\data.csv'

dbms=csv

out=totaldata

replace;

guessingrows=2000;

run;

proc sort data=totaldata;

by Date_Time_Opened Subject Description Case_History_Status;

run;

proc export data=totaldata

outfile='E:\Senior Project\Data\alldatasorted.csv'

dbms=csv

replace;

run;

/* Combining the status durations of the email's

in_progress status */

proc import

datafile='E:\Senior Project\Data\alldatasorted.csv'

dbms=csv

out=sorted

replace;

guessingrows=2000;

run;


/* assigning an ID variable to each individual email */

data sorted1;

set sorted;

by Date_Time_Opened Subject Description Case_History_Status;

dateopened=datepart(Date_Time_Opened);

retain ID 0;

if first.Description then ID=ID+1;
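/* TotDuration accumulates Duration across consecutive rows with the same
   status; it resets at the first row of each status, and only the final,
   fully accumulated row for each status is output. */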

TotDuration+Duration;

if first.Case_History_Status then TotDuration=Duration;

if last.Case_History_Status then output;

run;

/* transposing the status durations in order to combine later */

proc transpose data=sorted1

out=sorted2;

ID Case_History_Status;

by ID dateopened Subject Description;

var TotDuration;

run;

/* combining the status durations for in_progress status */

data sorted4;

set sorted2;

In_Progress=sum(In_Progress,In_Progress__Engineering_,

In_Progress__Support_,In_Progress__Internal_,

In_Progress__TechOps_,In_Progress__Social_Dynamx_,

In_Progress__DATA_);

Delay=sum(Delay,Delayed);

run;

proc export data=sorted4

outfile='E:\Senior Project\Data\CombinedInProgress.csv'

dbms=csv

replace;

run;

/* assigning half year variables to my emails to

assist with organizing by time. This would allow me

to analyze emails within the year they were sent.

Also summing all statuses to retrieve total time variable. */

proc import

datafile='E:\Senior Project\Data\CombinedInProgress.csv'

dbms=csv

out=combined

replace;

guessingrows=35000;

run;

libname mylib "Desktop";


/* assigning half year values */
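/* The numeric cutoffs below are SAS date values (days since 01JAN1960);
   for example, 18444 corresponds to 01JUL2010 and 19724 to 01JAN2014, so
   the halfyear bins fall on six-month boundaries (the first and last bins
   are open-ended). */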

data mylib.halfyears (encoding=asciiany);

set combined;

if dateopened<18444 then

halfyear=1;

if dateopened>=18444 and dateopened<18628 then

halfyear=2;

if dateopened>=18628 and dateopened<18809 then

halfyear=3;

if dateopened>=18809 and dateopened<18993 then

halfyear=4;

if dateopened>=18993 and dateopened<19175 then

halfyear=5;

if dateopened>=19175 and dateopened<19359 then

halfyear=6;

if dateopened>=19359 and dateopened<19540 then

halfyear=7;

if dateopened>=19540 and dateopened<19724 then

halfyear=8;

if dateopened>=19724 then

halfyear=9;

/* summing all status durations to retrieve total time */

total_time=sum(In_Progress,New,Updated_by_Customer,Waiting_for_Fix,

Work_Complete,Pending_Customer_Response,Scheduled_for_Production_Deploym,

Waiting_for_Upgrade,Awaiting_Customer_Approval,Delay,ER_Planned_for_Roadmap,

Waiting_for_Enhancement,Preparing_for_Production_Deploym,Delayed__Misc_,

Delayed__Production_Freeze_);

keep ID dateopened subject description halfyear total_time;

run;

proc export data=mylib.halfyears

outfile='E:\Senior Project\Data\total_time.csv'

dbms=csv

replace;

run;

/* this is where I created my categorical response variable.

I actually created 3 separate response variables: one for the

50th percentile, the 75th percentile and the 90th percentile.

I assigned these cutoff values within each half year in order

to adjust for the effect of time on the email's duration. In

my final analysis I ended up only using the 75th percentile. */

libname mylib "Desktop";

proc import

datafile='F:\Senior Project\Data\total_time.csv'

dbms=csv

out=totaltime

replace;

guessingrows=35000;

run;


/* Identifying my cutoff values */

proc means data=totaltime mean median q3 p90 n;

by halfyear;

var total_time;

run;

/* assigning cutoff values for each of my 3 categorical responses */

data mylib.total_time_ (encoding=asciiany);

set totaltime;

ind50_time=0;

if halfyear=1 and total_time>164.7

then ind50_time=1;

if halfyear=2 and total_time>136.86

then ind50_time=1;

if halfyear=3 and total_time>99.56

then ind50_time=1;

if halfyear=4 and total_time>76.17

then ind50_time=1;

if halfyear=5 and total_time>45.4

then ind50_time=1;

if halfyear=6 and total_time>46.47

then ind50_time=1;

if halfyear=7 and total_time>46.96

then ind50_time=1;

if halfyear=8 and total_time>80.85

then ind50_time=1;

if halfyear=9 and total_time>70.18

then ind50_time=1;

ind75_time=0;

if halfyear=1 and total_time>822.3

then ind75_time=1;

if halfyear=2 and total_time>534

then ind75_time=1;

if halfyear=3 and total_time>382.5

then ind75_time=1;

if halfyear=4 and total_time>287.4

then ind75_time=1;

if halfyear=5 and total_time>160.9

then ind75_time=1;

if halfyear=6 and total_time>144.8

then ind75_time=1;


if halfyear=7 and total_time>161

then ind75_time=1;

if halfyear=8 and total_time>236.1

then ind75_time=1;

if halfyear=9 and total_time>185.6

then ind75_time=1;

ind90_time=0;

if halfyear=1 and total_time>2082.1

then ind90_time=1;

if halfyear=2 and total_time>1554

then ind90_time=1;

if halfyear=3 and total_time>1579.8

then ind90_time=1;

if halfyear=4 and total_time>977

then ind90_time=1;

if halfyear=5 and total_time>479.9

then ind90_time=1;

if halfyear=6 and total_time>402.8

then ind90_time=1;

if halfyear=7 and total_time>450.7

then ind90_time=1;

if halfyear=8 and total_time>625

then ind90_time=1;

if halfyear=9 and total_time>507.5

then ind90_time=1;

run;


/* I exported 3 total data sets. One for the entire

span from 2008 to 2014, one from 2012 to 2014, and

one of just 2014 emails. In my final analysis I ended

up just looking at the 2014 email data set */

data mylib.total_time_since2012 (encoding=asciiany);

set mylib.total_time_;

if halfyear>4;

run;

data mylib.total_time_2014 (encoding=asciiany);

set mylib.total_time_;

if halfyear=9;

run;

proc export data=mylib.halfyears

outfile='E:\Senior Project\Data\total_time.csv'

dbms=csv

replace;

run;

This concludes the section of data cleaning/manipulation. The next section will refer to code

done within SAS Text Miner: creating my custom synonym data set, merging my topics/clusters,

creating my rule variables, and merging them all together. Figure 11 below shows my final SAS Text Miner diagram to provide an idea of what everything looked like. I will go on

to explain areas of the diagram in more detail along with the SAS code used in these areas.

Figure 11: SAS Text Miner Final Diagram

The top right of the diagram is the area in which I performed my analysis. You can see a node

for decision tree, logistic regression, and linear regression. I had to use two data sets in this


section in order to use my 2 response variables separately; one data set contained my categorical

response and one data set contained my quantitative response. I compare all 3 models with a

model comparison node at the bottom of this area. This allowed me to identify the

misclassification rates of the categorical response models as well as compare ROC curves

between the models.

The top left of the diagram represents the node trail used to create my custom set of synonyms

specifically for the jargon of the emails. Below is the code within the SAS code node to create

the data set of synonyms.

/* Creating my custom synonyms */

%textsyn( termds=emws2.textfilter_terms

, docds=&em_import_data

, outds=&em_import_transaction

, textvar=description

, mnpardoc=8

, mxchddoc=10

, synds=mydata.halfyearextsyns

, dict=mydata.engdict2

, maxsped=15

) ;

The middle portion of the diagram is the area where I created my topics/clusters and rules for the

body and subject line of the emails. The node trail on the left is for the body of the emails and the

node trail on the right is for the subject line of the emails. The second from the bottom SAS code

node is where I merge the text topics/clusters together. The SAS code node on the bottom of the

middle section is where I create my rule variables and merge them with my entire data set. The

code for both nodes is displayed below.

/* merging my topics/clusters */

proc sort data=emws2.texttopic_train;

by subject;

run;

proc sort data=emws2.texttopic2_train;

by subject;

run;

proc sort data=emws2.textcluster_train;

by subject;

run;

proc sort data=emws2.textcluster2_train;

by subject;

run;

libname mydata "/home/msanregret/sasuser.v94";


data mydata.bigmergedtopics;

merge emws2.texttopic_train

emws2.texttopic2_train

emws2.textcluster_train

emws2.textcluster2_train;

by subject;

run;

/* Separately creating my custom rule variables.

I will merge them all together later. */

proc sort data = EMWS2.TextRule_Train;

by subject;

run;

/* rule variables for the description of the email */

data description (keep = subject zero_topPredict zero_medPredict

one_topPredict);

set EMWS2.TextRule_Train;

zero_topPredict = 0;

zero_medPredict = 0;

one_topPredict = 0;

if w_ind75_time = 37 then

zero_topPredict = 1;

else if w_ind75_time >= 40 and w_ind75_time <= 44 then

zero_topPredict = 1;

else if w_ind75_time = 47 or w_ind75_time = 48 then

zero_medPredict = 1;

if w_ind75_time = 1 then

one_topPredict = 1;

else if w_ind75_time >= 3 and w_ind75_time <= 8 then

one_topPredict = 1;

else if w_ind75_time = 17 then

one_topPredict = 1;

else if w_ind75_time >= 12 and w_ind75_time <= 15 then

one_topPredict = 1;

else if w_ind75_time = 27 or w_ind75_time = 29 then

one_topPredict = 1;

run;

proc sort data = EMWS2.TextRule2_Train;

by subject;

run;


/* rule variables for the subject of the email */

data subject (keep = subject zero_topSubPredict zero_medSubPredict

one_topSubPredict one_medSubPredict);

set EMWS2.TextRule2_Train;

zero_topSubPredict = 0;

zero_medSubPredict = 0;

one_topSubPredict = 0;

one_medSubPredict = 0;

if w_ind75_time = 42 or w_ind75_time = 43 then

zero_topSubPredict = 1;

else if w_ind75_time = 45 or w_ind75_time = 47 then

zero_topSubPredict = 1;

else if w_ind75_time >= 48 then

zero_medSubPredict = 1;

if w_ind75_time = 1 or w_ind75_time = 3 then

one_topSubPredict = 1;

else if w_ind75_time = 4 or w_ind75_time = 6 then

one_topSubPredict = 1;

else if w_ind75_time >= 10 and w_ind75_time <= 14 then

one_topSubPredict = 1;

else if w_ind75_time >= 20 and w_ind75_time <= 24 then

one_topSubPredict = 1;

else if w_ind75_time = 18 or w_ind75_time = 25 then

one_medSubPredict = 1;

else if w_ind75_time >= 31 and w_ind75_time <= 37 then

one_medSubPredict = 1;

run;

libname mydata "/home/msanregret/sasuser.v94";

/* merging the rules together */

data mydata.rules;

merge description

subject;

by subject;

run;

/* merging the rules with my dataset */

data mydata.TopicsWithRules_2014;

merge mydata.bigmergedtopics

mydata.rules;

by subject;

run;