Predicting online user behaviour using deep learning algorithms

Post on 14-Apr-2017

  • Predicting online user behaviour using deep learning algorithms

    Armando Vieira, DataAI.uk, London, UK

    armando@dataai.uk

    November 15, 2015

    Abstract

    We propose a robust classifier to predict buying intentions based on user behaviour within a large e-commerce website. In this work we compare traditional machine learning techniques with the most advanced deep learning approaches. We show that both Deep Belief Networks and Stacked Denoising Auto-encoders achieve a substantial improvement by extracting features from high-dimensional data during the pre-training phase. They also prove to be better suited to dealing with severe class imbalance.

    Artificial Intelligence, Auto-encoders, Deep Belief Networks, Deep Learning, e-commerce, optimisation

    1 Introduction

    Predicting user intentionality towards a certain product, or category, based on interactions within a website is crucial for e-commerce sites and ad display networks, especially for retargeting. By keeping track of the search patterns of consumers, online merchants can gain a better understanding of their behaviours and intentions [4].

    In mobile e-commerce a rich set of data is available, and potential consumers search for product information before making purchasing decisions, thus reflecting consumers' purchase intentions. Users show different search patterns, i.e., time spent per item, search frequency and returning visits [1].

    Clickstream data can be used to quantify search behaviour using machine learning techniques [5], mostly focused on purchase records. While purchasing indicates consumers' final preferences in the same category, search is also an essential component to measure intentionality towards a specific category.

    We will use a probabilistic generative process to model user exploratory and purchase history, in which a latent context variable is introduced to capture the simultaneous influence of both time and location. By identifying the search patterns of consumers, we can predict their click decisions in specific contexts and recommend the right products.

    Modern search engines use machine learning approaches to predict user activity within web content. Popular models include logistic regression (LR) and boosted decision trees. Neural networks have an advantage over LR because they are able to capture non-linear relationships between the input features, and their deeper architecture has inherently greater modelling strength. On the other hand, decision trees - albeit popular in this domain - face additional challenges with high-dimensional and sparse data [3].

    The advantage of probabilistic generative models inspired by deep neural networks is that they can mimic the process of a consumer's purchase behaviour and capture the latent variables that explain the data.

    The goal of this paper is to identify activity patterns of certain users that lead to buy sessions and then extrapolate them as templates to predict a high probability of purchase in related websites. The data used consists of about 1 million sessions containing the click data of users. However, only 3% of the training data consists of buy sessions, making it a very unbalanced dataset.

    The rest of this paper is organized as follows: Section 2 describes the data used in our study, the pre-processing methods, and Non-negative Matrix Factorization for dimensionality reduction. Section 3 presents the classification algorithms. Section 4 describes in detail the deep learning algorithms (Deep Belief Networks and Stacked Denoising Auto-encoders) and Section 5 presents the results.

    2 Data Description

    The data consists of six months of records of user interaction with an e-commerce website. Events have a userid, a timestamp, and an event type. There are 5 categories of events: pageview of a product, basketview, buy, adclick and adview. There are around 25 000 different types of products. In the case of a buy or a basketview we have information about the price and extra details. We ignore adview and adclick events as they are not relevant for the present purpose.

    The data is very sparse and high dimensional. There are two obvious ways to reduce the dimensionality of the data: either by marginalizing the time (aggregating pageviews per user over the period) or the product pageviews (aggregating products viewed per time frame). In this work we follow the first approach, as most shopping (about 87%) occurs within 5 days of the first visit.

    The training data is composed of a set of sessions $s \in S$, and each session contains a set of items $i \in I$ that were displayed to the user. The items that have been bought in session $s$ are denoted by $B_s$. There are two types of sessions: $S_b$ (the sessions that end in buying) and $S_{nb}$ (the sessions that do not end in a transaction).

    Given the set of sessions $S_t$, the task is to find all the sessions $S_{tb}$ which have at least one buy event. If a session $s$ contains a buy event, we want to predict the items $B_s$ bought. Therefore we have two broad objectives: 1) classification and 2) order prediction.

    The data is highly unbalanced for the two classes considered (buy and non-buy), so we face a serious class imbalance problem. Furthermore, only about 1% of products (around 250) have a full category identification. However, this fraction corresponds to about 85% of pageviews and 92% of buys, so we have a very skewed distribution. Initially we consider only interactions with this subset of products. The data is about 10 GB and cannot be loaded into memory, so we first took a subsample of the first 100 000 events just to have a snapshot of the interactions. We found:

    78 360 pageview events (about 78.4% of total events) from 13 342 unique users.

    16 409 basketview events (about 16.4%) from 3 091 unique users.

    2 430 sales events (about 2.5%) from 2 014 unique users (around 1.2 sales per user).

    If we restrict to the 257 labelled product categories, we find 39 561 pageviews from 7 469 distinct users, which is about half of the population. In this work we did not consider time directly, as the data is very sparse; we aggregate it on several temporal bases (see Table 2).

    Table 1: Constructed parameters based on clickstream data

    Symbol       Description
    Ds           Duration of session before purchase
    C/B          Click-to-buy ratio for users
    SB           Median number of sessions before buy
    Desc         Description
    Price        Price of an item
    Duration     The total time spent on an item over all the sessions
    Hour         Hour of the day when the session occurred
    Nc           Number of clicks in a session
    Price        Average price of items purchased in a session
    Views24h     Number of page views in the last 24 hours
    Viewsweek    Number of page views in the last week

    2.1 Data preprocessing

    Each session has a unique id, and a timestamp is recorded for each activity in the website, so that we can order users' clicks on the items in a session. The duration of a click can be found by subtracting the time of that click from the time of the next click. For each distinct item in a session, summing the durations of the clicks in which the item appears gives the duration of the item in that session. After sorting by timestamp we append itemDuration (the time an item is inspected in a session) to each click record. We extract other properties that are specific to an item and append them to each click record - see Table 1. We build a click-buy ratio for users by averaging the click-buy ratio of all the items in a session.
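    The item-duration computation can be illustrated with a minimal sketch. The record layout (timestamp in seconds, item id) and the function name `item_durations` are assumptions made for illustration; the paper does not specify them. The last click of a session has no successor, so this sketch assigns it a duration of zero.

```python
from collections import defaultdict


def item_durations(clicks):
    """Total viewing time per item for one session.

    `clicks` is a list of (timestamp_seconds, item_id) tuples (assumed layout).
    The duration of a click is the gap to the next click; the final click
    has no successor and therefore contributes 0.
    """
    ordered = sorted(clicks)  # sort by timestamp
    totals = defaultdict(float)
    for (t, item), (t_next, _) in zip(ordered, ordered[1:]):
        totals[item] += t_next - t
    if ordered:
        totals[ordered[-1][1]] += 0.0  # make sure the last item is present
    return dict(totals)


# Toy session: item A is viewed twice, B once, C is the final click.
session = [(0, "A"), (30, "B"), (45, "A"), (100, "C")]
durations = item_durations(session)
```

    In a real pipeline the final click would need a convention of its own (for example a capped default duration); that choice is left open here.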

    We also used the description of the item bought, in the form of a short text. To handle textual data we convert the words of descriptions into 50-dimensional vectors using word2vec [12] and use the arithmetic average of the vectors.

    Table 2: Different datasets used for testing the models

    Data         Size      Description
    Dataset 1    3 000     Sales weekly aggregated
    Dataset 2    10 000    Same as 1 but more data
    Dataset 3    30 000    Same as 1 but more data
    Dataset 4    10 000    Same as 2 but semi-weekly aggregated
    Dataset 5    10 000    Same as 1 with 2000 categories
    Dataset 6    30 000    Same as 3 with 2000 categories

    To build the data set we first restrict to the set of 257 product categories. Data was aggregated at the week level per product category and at the semi-week level (two time buckets). In this first iteration we do not add basketview events, as most of them occur in the same session/day as the sales events and the objective is to predict sales with at least one day of delay. We will consider this in the next iteration. Users with fewer than 10 clicks on the website were removed. All data sets were balanced: same number of sales events and non-sales events. Due to the large size of the data, we essentially study the importance of sample size and the efficiency of the algorithms in dealing with the dimensionality of the data. Since we want to predict purchases within a time window of 24h, we excluded events in this period. Table 2 describes the tests done with the 6 datasets considered. The size refers to the number of buying sessions. All datasets were balanced by subsampling the non-buying session data.
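    Balancing by subsampling the majority class can be sketched as follows. The session representation (a pair of session id and buy flag) and the name `balance_by_subsampling` are hypothetical; the paper only states that non-buying sessions were subsampled to match the number of buying sessions.

```python
import random


def balance_by_subsampling(sessions, seed=0):
    """Keep every buy session and an equally sized random sample of non-buy ones.

    `sessions` is a list of (session_id, is_buy) pairs (assumed layout).
    """
    buys = [s for s in sessions if s[1]]
    non_buys = [s for s in sessions if not s[1]]
    rng = random.Random(seed)
    sampled = rng.sample(non_buys, min(len(buys), len(non_buys)))
    balanced = buys + sampled
    rng.shuffle(balanced)  # avoid having all buy sessions first
    return balanced


# Toy data: 3 buy sessions out of 100, i.e. the 3% rate quoted in the paper.
toy = [(i, i < 3) for i in range(100)]
balanced = balance_by_subsampling(toy)
```

    Subsampling discards most of the non-buy data; it is shown here only because it is the approach the text describes.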

    Data was provided in JSON format and we sorted all the click and buy sessions by sessionId. The number of sessions in our own test data was 1 506 453. We kept 54 510 buy sessions in our test data and according to scoring.

    The click data of a buy session contains a set of items bought ($B_s$). For each item $i \in B_s$ we extract both session-based and item-based features.

    2.2 Non-Negative Matrix Factorization

    In order to test the impact of excluding some product categories we consider Dataset 5 with the top 2000 most visited product categories. Since this is a very high-dimensional search space, we used Non-Negative Matrix Factorization (NMF) to reduce the dimensionality. NMF belongs to a class of unsupervised learning algorithms [9], along with Principal Components Analysis (PCA) and learning vector quantization (LVQ), that factorize a data matrix subject to constraints. Although PCA is a widely used algorithm it has some drawbacks, like its linearity and poor performance on factors. Furthermore, it enforces a weak orthogonality constraint. LVQ uses a winner-take-all constraint that results in clustering the data into mutually exclusive prototypes, but it performs poorly on high-dimensional correlated data. Given a non-negative matrix $V$ (containing the training data), NMF learns non-negative matrix factors $W$ and $H$ such that $V \approx WH$.

    Each data vector (a column of $V$) can be approximated by a linear combination of the columns of $W$, weighted by the corresponding column of the patterns matrix $H$. Therefore, $W$ can be regarded as containing a basis for the linear approximation of the data in $V$. Since relatively few basis vectors are used to represent many data vectors, a good approximation can only be achieved if the basis vectors discover the structure that is latent in the data.

    NMF has been successfully applied to high-dimensional problems with sparse data, like image recognition and text analysis. In our case we used NMF to compress the data into a feature subset. The major issue with NMF is the lack of an optimal method to compute the factor matrices and of a stopping criterion to find the ideal number of features to be selected.
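    To make the factorization concrete, the sketch below uses the classic multiplicative update rules for the squared (Frobenius) error. This solver is an assumption: the paper does not say which NMF algorithm was used. The code is dependency-free plain Python for illustration only; the helper names (`matmul`, `transpose`, `nmf`, `frobenius_error`) are hypothetical, and the fixed iteration count reflects the missing stopping criterion noted above.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative V (n x m) by W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for vr, xr in zip(V, WH) for v, x in zip(vr, xr)) ** 0.5


# Toy matrix whose rows are non-negative combinations of two base rows.
V = [[1, 2, 3], [2, 4, 6], [1, 1, 1], [3, 5, 7]]
W, H = nmf(V, rank=2)
error = frobenius_error(V, W, H)
```

    Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, and the reconstruction error does not increase from one iteration to the next.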

    3 Classifiers

    Our task is divided into two subtasks: i) predicting the outcome of a session and ii) predicting the set of items that should be bought in that session. Two sets of classifiers are involved: binary and ranking prediction. Building a single classifier is not advisable due to the large dimensionality of the problem.

    Based on the data sets, we test the performance of two classifiers: Logistic Regression and Random Forest. The first is an industry standard and serves as a baseline; the second is more robust and in general produces better results. It has the disadvantage that its predictions are not easy to understand (black box). We used the algorithms without any optimization of the parameters (number of trees, number of variables to consider in each split, split level, etc.). As a KPI to measure performance we use the standard Area Under the ROC Curve (AUC). An AUC of 0.5 means a random (useless) classifier and an AUC of 1 a perfect one. For all runs we used 10-fold cross validation.
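    For readers unfamiliar with the AUC, the following sketch computes it through its rank interpretation: the probability that a randomly chosen buy session receives a higher score than a randomly chosen non-buy session, with ties counted as one half. The function name `auc` and the toy scores are illustrative; the quadratic pairwise loop is written for clarity, not for data of the size used in the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
uninformative = auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

    Because the AUC depends only on the ranking of the scores, it is unaffected by the class ratio, which makes it a convenient metric for comparing models across balanced and unbalanced samples.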

    3.1 Decision Trees

    Decision trees possess several inherent advantages over other classification methods such as support vector machines, neural networks, linear regression and logistic regression. Decision trees are:

    Extremely easy to visualize and interpret: a decision tree can be represented graphically, allowing the user to actually see the structure of the classifier;

    White-box models: by observing a decision tree, one can clearly understand all the intermediate steps of the classification process, such as which variables are used, in what order, etc. This is not true for other methods such as neural networks, whose parameters cannot be directly interpreted;

    Extremely fast: decision trees are trained in a relatively short time and are particularly fast in classifying new data.

    However, decision trees possess several drawbacks. The problem of building an optimal decision tree is NP-hard, so in practice trees are grown with greedy heuristics that do not guarantee a globally optimal tree. Decision trees will often overfit the data unless some regularization methods, such as pruning or imposing a minimum number of training samples per leaf, are used. Also, because of the characteristics of the cost function used to determine the best split at a node, trees tend to prefer categorical variables with more categories over other variables. This may cause the classifier to incorrectly consider these variables as more important than those with fewer categories.

    3.2 Random Forest

    The Random Forest (RF) algorithm creates an ensemble of decision trees using randomization. When an input is to be classified, each tree classifies the input individually. The final classification is then decided by majority vote over all the trees. The likelihood of a certain input belonging to each class is computed by averaging the probabilities at the leaves of each tree.

    Each tree is grown in an independent, random way. The set used to train a given tree is a bootstrap sample of the original training data: each training example is selected at random (with replacement) from the original data set. At each node of the tree, rather than testing the best split among all the attributes, only a randomly chosen subset of the attributes (usually much smaller than the full set) is used to determine the best split. Each tree is grown to its full extent, meaning that no pruning occurs.

    The final classifier is efficient and capable of dealing with large data sets (i.e., data that contains a large number of variables), missing data, and outliers. In the present problem, there is a large amount of information available for each client. To avoid deleting possibly significant variables in order to reduce the data to a manageable size - something which would be mandatory if neural networks were used, for example - random forest is the algorithm of choice.

    Random forests retain the strengths of decision trees while countering some of their disadvantages. Even if the trees in the forest are grown without pruning, the classifier's output depends on the whole set of trees and not on a single tree, so the risk of overfitting is considerably reduced. The randomness that is introduced in the creation of each tree also prevents the classifier from memorizing all the examples in the training set. The regularization techniques mentioned in Section 3.1 can also be applied to the trees in the forest, further reducing the risk of overfitting. However, random forests have the same bias towards variables with many categories as decision trees.

    4 Deep Learning Methods

    Deep learning refers to a wide class of machine learning techniques and architectures, with the hallmark of using many layers of non-linear processing that are hierarchical in nature [13]. The concept of deep learning originated from artificial neural network research - feed-forward neural networks or MLPs with many hidden layers are referred to as deep neural networks (DNNs). These networks are generally trained by a gradient descent algorithm called back-propagation (BP). However, for deep networks, BP alone has several problems: local optima traps in the non-convex objective function and vanishing gradients (the learning signal vanishes exponentially as it is back-propagated through the layers).

    In this section we introduce two deep learning approaches to handle the high dimensionality of the search space and compare their performance with the logistic regression and random forest algorithms.

    4.1 Deep Belief Networks

    In 2006 Hinton proposed an unsupervised learning algorithm for a class of deep generative models, called deep belief networks (DBN) [13]. A DBN is composed of a stack of restricted Boltzmann machines (RBMs). A core component of the DBN is a greedy, layer-by-layer learning algorithm which optimizes the DBN weights. Separately, initializing the weights of an MLP with a correspondingly configured DBN often produces much better results than initializing with random weights.

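    DBNs belong to the class of energy-based models. In this case the algorithm runs as follows:

    For a given RBM, we relate the units to the energy function

    $$\mathrm{Energy}(v, h) = -b^\top h - c^\top v - h^\top W v,$$

    where $b$, $c$ are offsets/biases and $W$ comprises the weights connecting the units. The joint probability of the visible ($v$) and hidden ($h$) units is

    $$P(v, h) = \frac{1}{Z} e^{-\mathrm{Energy}(v, h)},$$

    where $Z$ is the normalization term. We obtain the free energy form by marginalizing over $h$:

    $$P(v) = \frac{\sum_h e^{-\mathrm{Energy}(v, h)}}{Z} = \frac{e^{-\mathrm{FreeEnergy}(v)}}{Z}.$$

    Taking advantage of the free energy form makes it easier to compute gradients with visible units only.

    We rewrite the energy function in the form

    $$\mathrm{Energy}(v, h) = -\beta(v) - \sum_i \gamma_i(v, h_i).$$

    Then we factorize $P(v)$:

    $$
    \begin{aligned}
    P(v) &= \frac{\sum_h e^{-\mathrm{Energy}(v, h)}}{Z} = \frac{e^{-\mathrm{FreeEnergy}(v)}}{Z} \\
         &= \frac{1}{Z} \sum_{h_1} \sum_{h_2} \cdots \sum_{h_k} e^{\beta(v) + \sum_i \gamma_i(v, h_i)}
          = \frac{1}{Z} \sum_{h_1} \sum_{h_2} \cdots \sum_{h_k} e^{\beta(v)} \prod_i e^{\gamma_i(v, h_i)} \\
         &= \frac{e^{\beta(v)}}{Z} \sum_{h_1} e^{\gamma_1(v, h_1)} \sum_{h_2} e^{\gamma_2(v, h_2)} \cdots \sum_{h_k} e^{\gamma_k(v, h_k)} \\
         &= \frac{e^{\beta(v)}}{Z} \prod_i \sum_{h_i} e^{\gamma_i(v, h_i)}.
    \end{aligned}
    $$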
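    The practical benefit of this factorization can be checked numerically. For binary hidden units, each hidden unit contributes an independent factor, so the sum over all hidden configurations collapses to a product and the free energy has a closed form. The sketch below assumes binary units, takes W with one row per hidden unit, and uses arbitrary toy parameter values; the function names are illustrative.

```python
import itertools
import math


def energy(v, h, W, b, c):
    """Energy(v, h) = -b'h - c'v - h'Wv, with W of shape (hidden, visible)."""
    bh = sum(bi * hi for bi, hi in zip(b, h))
    cv = sum(cj * vj for cj, vj in zip(c, v))
    hWv = sum(hi * sum(w * vj for w, vj in zip(row, v)) for hi, row in zip(h, W))
    return -bh - cv - hWv


def free_energy(v, W, b, c):
    """Closed form for binary hidden units: -c'v - sum_i log(1 + exp(b_i + W_i v))."""
    cv = sum(cj * vj for cj, vj in zip(c, v))
    per_unit = sum(
        math.log1p(math.exp(bi + sum(w * vj for w, vj in zip(row, v))))
        for bi, row in zip(b, W)
    )
    return -cv - per_unit


def free_energy_brute_force(v, W, b, c):
    """-log of the explicit sum over all 2^k hidden configurations."""
    total = sum(
        math.exp(-energy(v, h, W, b, c))
        for h in itertools.product([0, 1], repeat=len(b))
    )
    return -math.log(total)


# Toy RBM with 3 visible and 2 hidden units (arbitrary values).
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b = [0.1, -0.3]
c = [0.2, 0.0, -0.1]
v = [1, 0, 1]
closed = free_energy(v, W, b, c)
brute = free_energy_brute_force(v, W, b, c)
```

    The brute-force sum grows as 2^k in the number of hidden units, while the closed form is linear in k, which is why the free energy form is used when computing gradients.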

    Figure 1: Structure of Deep Belief Network.

    DBNs have been used for a large variety of problems, ranging from image recognition to recommendation algorithms and topic modelling. In addition to supplying good initialization points for a multilayer network, the DBN comes with other attractive properties: i) the learning algorithm makes effective use of unlabeled data; ii) it can be interpreted as a probabilistic generative model; and iii) the over-fitting problem, which is often observed in models with millions of parameters such as DBNs, can be effectively alleviated by the generative pre-training step. The downside of DBNs is that they are hard to train and very sensitive to learning parameters like weight initialisation.

    4.2 Auto-encoders

    Autoencoders are a representation learning technique that uses unsupervised pre-training to learn good representations of the data, transforming it and reducing the dimensionality of the problem in order to facilitate the supervised learning stage.

    An autoencoder is a neural network with a single hidden layer in which the output layer and the input layer have the same size. Suppose that the input $x \in \mathbb{R}^m$ and that the hidden layer has $n$ nodes. Then we have a weight matrix $W \in \mathbb{R}^{n \times m}$ and bias vectors $b$ and $b'$ in $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. Let $s(x) = 1/(1 + e^{-x})$ be the sigmoid (logistic) transfer function.

    Figure 2: Structure of an autoencoder. The weights of the decoder are the transpose of the encoder.

    Then we have a neural network as shown in Fig. 2. When using an autoencoder to encode data, we calculate the vector $y = s(Wx + b)$; correspondingly, when we use an autoencoder to decode and reconstruct the original input, we calculate $z = s(W^\top y + b')$. The weight matrix of the decoding stage is the transpose of the weight matrix of the encoding stage in order to reduce the number of parameters to learn. We want to optimize $W$, $b$, and $b'$ so that the reconstruction is as similar to the original input as possible with respect to some loss function. The loss function used is the least squares loss:

    $$E(t, z) = \frac{1}{2} \lVert t - z \rVert_2^2 \qquad (1)$$

    where $t$ is the original input. After an autoencoder is trained, its decoding stage is discarded and the encoding stage is used to transform the training input examples as a preprocessing step.
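    The encode, decode and loss steps can be written out in a few lines. This is a forward pass only, under the assumption that W stores one row per hidden unit and that the decoder reuses the same matrix transposed (tied weights); the weights and input values are arbitrary and no training is performed.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def encode(x, W, b):
    """y = s(Wx + b); W has one row per hidden unit."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


def decode(y, W, b_prime):
    """z = s(W^T y + b'), reusing the encoder weights (tied weights)."""
    return [sigmoid(sum(W[j][i] * y[j] for j in range(len(y))) + b_prime[i])
            for i in range(len(b_prime))]


def squared_error(t, z):
    """E(t, z) = 0.5 * ||t - z||^2."""
    return 0.5 * sum((ti - zi) ** 2 for ti, zi in zip(t, z))


# m = 4 inputs compressed to n = 2 hidden units (arbitrary values).
W = [[0.2, -0.4, 0.1, 0.5], [-0.3, 0.6, 0.2, -0.1]]
b = [0.0, 0.1]
b_prime = [0.0, 0.0, 0.0, 0.0]
x = [1.0, 0.0, 1.0, 0.0]
y = encode(x, W, b)        # hidden representation, length n
z = decode(y, W, b_prime)  # reconstruction, length m
loss = squared_error(x, z)
```

    After training, only `encode` would be kept and applied to the inputs as a preprocessing step, as described above.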

    Once an autoencoder layer has been trained, a second autoencoder can be trained using the output of the first autoencoder layer. This procedure can be repeated indefinitely to create stacked autoencoder layers of arbitrary depth. It has been shown that each subsequently trained layer learns a better representation of the output of the previous layer. Using deep neural networks such as stacked autoencoders to do representation learning is also called deep autoencoders - a subfield of machine learning.

    For ordinary autoencoders, we usually want $n < m$ so that the learned representation of the input exists in a lower-dimensional space than the input. This is done to ensure that the autoencoder does not learn a trivial identity transformation. A variant is the denoising autoencoder, which uses a different reconstruction criterion to learn representations [10]. This is achieved by corrupting the input data and training the autoencoder to reconstruct the original uncorrupted data. By learning how to denoise, the autoencoder is forced to understand the true structure of the input data and learn a good representation of it. When trained with a denoising criterion, a deep autoencoder is also a generative model. Although the loss function $E(t, z)$ for neural networks in general is non-convex, stochastic gradient descent (SGD) is sufficient for most problems and we use it in this work.

    4.3 Autoencoders formulation

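    The derivative of the output error $E$ with respect to an output-layer weight $W^O_{ij}$ is as follows, where $n_j$ denotes the net input of output unit $j$ (so that $z_j = s(n_j)$) and $y_i$ is the activation of hidden unit $i$:

    $$
    \begin{aligned}
    \frac{\partial E}{\partial W^O_{ij}} &= \frac{\partial E}{\partial z_j} \frac{\partial z_j}{\partial W^O_{ij}} \\
    &= (z_j - t_j)\, s'(n_j)\, \frac{\partial n_j}{\partial W^O_{ij}} \\
    &= (z_j - t_j)\, s(n_j)\,(1 - s(n_j))\, y_i \\
    &= (z_j - t_j)\, z_j (1 - z_j)\, y_i
    \end{aligned}
    \qquad (2)
    $$

    Now that we have the gradient for the error associated with a single training example, we can compute the updates, with $\eta$ the learning rate:

    $$
    \begin{aligned}
    \delta^O_j &= (z_j - t_j)\, z_j (1 - z_j) \\
    W^O_{ij} &\leftarrow W^O_{ij} - \eta\, \delta^O_j\, y_i \\
    b^O_j &\leftarrow b^O_j - \eta\, \delta^O_j
    \end{aligned}
    \qquad (3)
    $$

    The gradient for the weight matrix between hidden layers is similarly easy to compute. Here $x_i$ is the $i$-th input, and in the last two factors $n_j$ denotes the net input of hidden unit $j$ (so that $y_j = s(n_j)$):

    $$
    \begin{aligned}
    \frac{\partial E}{\partial W^H_{ij}} &= \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial W^H_{ij}} \\
    &= \left( \sum_{k=1}^{m} \frac{\partial E}{\partial z_k} \frac{\partial z_k}{\partial n_k} \frac{\partial n_k}{\partial y_j} \right) \frac{\partial y_j}{\partial n_j} \frac{\partial n_j}{\partial W^H_{ij}} \\
    &= \left( \sum_{k=1}^{m} (z_k - t_k)(1 - z_k)\, z_k\, W^O_{jk} \right) y_j (1 - y_j)\, x_i
    \end{aligned}
    \qquad (4)
    $$

    Using the computed gradient we can then define the updates to be used for the hidden layers:

    $$
    \begin{aligned}
    \delta^H_j &= \left( \sum_{k=1}^{m} (z_k - t_k)(1 - z_k)\, z_k\, W^O_{jk} \right) y_j (1 - y_j) \\
    W^H_{ij} &\leftarrow W^H_{ij} - \eta\, \delta^H_j\, x_i \\
    b^H_j &\leftarrow b^H_j - \eta\, \delta^H_j
    \end{aligned}
    \qquad (5)
    $$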

    In general, for a neural network we may have different output error functions, and these will result in different update rules. We also give the updates for the cross-entropy error function with softmax activation in the final layer. The cross-entropy error function is given by

    $$E(x, t) = -\sum_{i=1}^{n} \bigl( t_i \ln z_i + (1 - t_i) \ln(1 - z_i) \bigr)$$

    and the softmax function is given by $\sigma(x_j) = e^{x_j} / \sum_k e^{x_k}$. Following the same procedure as above for computing the gradient and the updates, we find that for the hidden/output layer, with learning rate $\eta$,

    $$
    \begin{aligned}
    \frac{\partial E}{\partial W^O_{ij}} &= (z_j - t_j)\, y_i \\
    \delta^O_j &= (z_j - t_j) \\
    W^O_{ij} &\leftarrow W^O_{ij} - \eta\, \delta^O_j\, y_i \\
    b^O_j &\leftarrow b^O_j - \eta\, \delta^O_j .
    \end{aligned}
    \qquad (6)
    $$

    Algorithm 1: Algorithm for Auto-encoders

    for $t = T, \ldots, 1$ do
        to compute $\partial E / \partial net_t$, initialize the real-valued error signal variable $\delta_t$ to 0;
        if $x_t$ is an input event then continue with the next iteration;
        if there is an error $e_t$ then $\delta_t := x_t - d_t$;
        add to $\delta_t$ the value $\sum_{k \in out_t} w_{v(t,k)} \delta_k$;
        multiply $\delta_t$ by $f_t'(net_t)$;
        for all $k \in in_t$ add to $\Delta_{w_{v(k,t)}}$ the value $x_k \delta_t$
    end for
    change each $w_i$ in proportion to $\Delta_i$ and a small real-valued learning rate

    We find that the updates for the hidden layer are the same as for the squared error loss function with sigmoid activation.

    The algorithm and derivations for the auto-encoder are a slight variation on the above derivations for a more general neural network. The weight matrix of the output layer (decoding stage) is the transpose of the weight matrix of the hidden layer (encoding stage). Thus $z = s(W^O s(W^H x + b) + b')$, $(W^H)^\top = W^O$, and $W^H_{ij} = W^O_{ji}$. For training denoising autoencoders in particular, $z = s(W^O s(W^H x_{corr} + b) + b')$, where $x_{corr}$ is a randomly corrupted version of the original input data $x_{orig}$ and the loss function is defined as $E(x_{orig}, z)$. In other words, we are trying to learn an autoencoder that takes in corrupted input and reconstructs the original uncorrupted version. Once we have trained a single autoencoder layer, we can stack another autoencoder layer on top of the first one for further training. This second autoencoder takes the corrupted output of the hidden layer (encoding stage) of the first autoencoder as input and is again trained to minimize the loss function $E(x_{orig}, z)$.
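    The corruption step can be sketched with masking noise, where each input component is set to zero with a given probability. The choice of masking noise is an assumption: the text only says that the input is randomly corrupted. The function name `corrupt` and the toy input are illustrative.

```python
import random


def corrupt(x, noise_level, rng):
    """Masking noise: zero each component of x with probability `noise_level`."""
    return [0.0 if rng.random() < noise_level else xi for xi in x]


rng = random.Random(42)
x_orig = [1.0] * 1000
x_corr = corrupt(x_orig, 0.2, rng)  # roughly 20% of the entries are zeroed
zeroed = x_corr.count(0.0)
```

    During training the network would receive `x_corr` as input while the loss is still measured against `x_orig`.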

    4.4 Regularization

    Avoiding overfitting is especially crucial for deep neural nets, which typically have millions of parameters. DBNs can generate large and expressive models capable of representing complex dependencies between inputs and outputs. Generative unsupervised pre-training [14] is a powerful data-dependent regularizer, while dropout is the most commonly used.

    L2 regularization shifts the weights towards zero, which may not be desirable. Dropout penalizes large weights that result in uncertain predictions or hidden unit activations. Another way to view dropout is as approximate model averaging over the exponentially many different neural nets produced by pruning random subsets of hidden units and inputs. In this work we used dropout regularization.
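    As an illustration of the mechanism, the sketch below applies dropout to a vector of activations. It uses the "inverted" variant, in which surviving units are rescaled at training time so that nothing needs to change at test time; this variant is an assumption, since the paper does not state which form was used. The function name `dropout` is illustrative and `p_drop` must be strictly below 1.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Zero each unit with probability p_drop and rescale survivors by 1/(1 - p_drop)."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, 0.3, rng)
mean_activation = sum(dropped) / len(dropped)  # stays close to 1.0 in expectation
```

    Each call samples a different sub-network, which is the sense in which dropout approximates averaging over many pruned networks.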

    5 Results

    First we ran the algorithms using only the aggregated variables from Table 1 to predict buying events. Since this data is low dimensional, we only consider the LR and RF algorithms. Note, however, that most buying events occur within the 24h time-frame. We found an AUC of 0.58 for LR and of 0.61 for RF.

    Then we used all the data, combining Table 1 with Table 2 and the product descriptions represented as a 50-dimensional vector composition (excluding a set of stop words). Results are presented in Table 3.

    Data set      LR      RF
    Data 1        0.67    0.71
    Data 2        0.69    0.76
    Data 3        0.70    0.80
    Data 4        0.68    0.82
    Data 5-100    0.62    0.67
    Data 5-200    0.64    0.69
    Data 5-300    0.64    0.72

    Table 3: Results for AUC with Random Forest and Logistic Regression.

    We conclude that sample size is an important factor in the performance of the classifier, though Logistic Regression does not show the same gains as the Random Forest (RF) algorithm.

    From data set 4 we also conclude that the time of events is an important factor to take into account: although we increase the dimensionality of the search space, we still have a net gain even using fewer training examples.

    From data set 5, we conclude that the NMF algorithm is doing some compression of the data but not in a very efficient way (only the data with 300 features improved the accuracy over the initial subset of products). In the next section we suggest using auto-encoders to reduce the dimensionality of the data for all the 25 000 categories.

    Quite surprisingly, we found that the use of detailed information about which products the user visited does not bring much gain to the logistic regression accuracy (in some cases it even decreases, probably due to the increase in dimensionality), while RF can reach higher accuracies.

    5.1 Deep Learning results

    One of the main advantages of DBN or SdA is that we can use all the available data (even if unlabeled) to pre-train the model in an unsupervised, or generative, way. In DBN this is intended to capture high-order correlations of the observed or visible data for pattern analysis or synthesis purposes when no information about target class labels is available. Then, we can jointly characterize the statistical distributions of the visible data and their associated classes, when available. Finally, the use of Bayes' rule can turn this type of generative network into a discriminative machine.

    We used all one million sessions (pageviews of products aggregated per user and per week) together with the composed parameters described in Table 1.

    For each assay, we held out at random 25% of the data to use as a test set, leaving the remaining 75% as a training set. We split the training set into four folds and trained each model four times, each time with a different fold held out as validation data. We average the test set AUCs of the four models when reporting test set results. We used performance on the validation data to select the best particular model in each family of models. To the extent that the baseline models required metaparameter tuning (e.g. selecting the number of trees in the ensemble), we performed that tuning by hand using validation performance.
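    The evaluation protocol above can be sketched as an index-splitting routine. The function name `holdout_and_folds` and the strided fold assignment are illustrative choices, not taken from the paper; only the proportions (25% test, four validation folds over the remaining 75%) come from the text.

```python
import random


def holdout_and_folds(n_items, test_fraction=0.25, n_folds=4, seed=0):
    """Return (test_idx, folds); each fold is a (train_idx, valid_idx) pair."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_items * test_fraction))
    test_idx, train_pool = idx[:n_test], idx[n_test:]
    folds = []
    for k in range(n_folds):
        valid = train_pool[k::n_folds]  # every n_folds-th item of the pool
        valid_set = set(valid)
        train = [i for i in train_pool if i not in valid_set]
        folds.append((train, valid))
    return test_idx, folds


test_idx, folds = holdout_and_folds(1000)
# A model is trained once per fold; the four test-set AUCs are then averaged.
```

    The test indices never appear in any training or validation fold, so the averaged test AUC is not influenced by model selection.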

    Neural networks have many metaparameters: architectural, such as layer sizes and hidden unit transfer functions; optimization, such as learning rates and momentum values; and regularization, such as the dropout probabilities for each layer. Deep neural networks can have a very large number of parameters, in our case between one and 4 million weights. All neural net metaparameters were set using Bayesian optimization to maximize the validation AUC. Bayesian optimization is well suited for globally optimizing black-box, noisy functions while being parsimonious in the number of function evaluations.

    Deep neural networks require careful tuning of numerous hyperparameters, which is one of the hardest and most time-consuming tasks in implementing these solutions. It requires a thorough exploration of architectures and hyperparameters such as regularisation parameters and weight initialisation. We used the constrained version of Spearmint of Snoek et al. [15] with warping enabled and labeled training runs that diverged as constraint violations. We let Spearmint optimize the metaparameters listed below with a budget of 20 trials. The ranges were picked based on iteration on the first single hidden layer.

    The metaparameters considered to train the networks were:

    dropout fraction [0 : 0.3]

    number of training epochs: [10 : 100] for nets with a single hidden layer and [10 : 150] for nets with two or more hidden layers

    number of hidden units in each layer. No hidden layer was allowed more than 500 units. The minimum number of hidden units in a layer for a single-task neural net was 16 in our first single-hidden-layer network and 64 at all other times.

    annealing delay fraction [0 : 1]: the fraction of the training iterations that must complete before we start annealing the learning rate

    initial learning rate applied to the average gradient over a mini-batch [0.001 : 0.25]

    momentum [0 : 0.95]

    L2 weight cost [0 : 0.01]

    hidden unit activation function: either logistic sigmoids or rectified linear units. All hidden units in a network use the same activation function.

    noise level applied to the input layer (only for the SdA) [0 : 0.2]
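
    The ranges above can be written down as a search space, for example as in the following sketch. The dictionary keys and function names are illustrative, the upper learning-rate bound is read as 0.25, uniform sampling is only a placeholder for the optimizer's proposals, and the linear decay after the annealing delay is an assumption, since the text defines only the delay itself.

```python
import random

# Ranges copied from the list of metaparameters; the keys are illustrative names.
SEARCH_SPACE = {
    "dropout": (0.0, 0.3),
    "annealing_delay": (0.0, 1.0),
    "learning_rate": (0.001, 0.25),
    "momentum": (0.0, 0.95),
    "l2_weight_cost": (0.0, 0.01),
    "input_noise": (0.0, 0.2),  # only meaningful for the SdA
}
ACTIVATIONS = ("sigmoid", "relu")


def sample_config(rng, n_hidden_layers=1):
    """Draw one configuration uniformly at random from the stated ranges."""
    config = {name: rng.uniform(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}
    max_epochs = 100 if n_hidden_layers == 1 else 150
    config["epochs"] = rng.randint(10, max_epochs)
    # 64 is used as the lower bound; the text allows 16 only for the very
    # first single-hidden-layer experiments.
    config["hidden_units"] = [rng.randint(64, 500) for _ in range(n_hidden_layers)]
    config["activation"] = rng.choice(ACTIVATIONS)
    return config


def annealed_learning_rate(initial_rate, delay_fraction, step, total_steps):
    """Hold the rate constant during the delay, then decay linearly to zero.

    The linear decay is an assumption; the text only defines the delay.
    """
    delay_steps = delay_fraction * total_steps
    if step < delay_steps or total_steps <= delay_steps:
        return initial_rate
    remaining = (total_steps - step) / (total_steps - delay_steps)
    return initial_rate * remaining


config = sample_config(random.Random(0), n_hidden_layers=2)
print(config)
print(annealed_learning_rate(0.1, 0.5, step=75, total_steps=100))
```

    With a delay fraction of 0.5 the rate stays at its initial value for the first half of training and, under the linear-decay assumption, has halved by three quarters of the way through.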

    Figure 3: Distribution of pageview events for a subset of the data.

    To run the network models we used Keras, which is built on the Theano library (see http://keras.io/). A softmax layer was attached to the last hidden layer of the pre-trained network for the supervised phase. We optimized the metaparameters of the networks on dataset 3 (30 000 purchase transactions) and used the same parameters on the other datasets. We found that dropout was always retained for ReLU nets, while for sigmoid transfer functions it was rarely greater than zero. For the unsupervised phase we used all the data available.
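
    What attaching a softmax layer to the last hidden layer computes can be illustrated with a small, library-free sketch. This is not the Keras code used in the experiments: `softmax_head`, the three-unit hidden vector and all numbers are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_head(hidden, weights, biases):
    """Map the last hidden layer's activations to class probabilities.

    weights[k] holds the weights of output class k, one per hidden unit.
    """
    logits = [
        sum(w * h for w, h in zip(class_weights, hidden)) + b
        for class_weights, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical 3-unit hidden representation and a 2-class output
# (no purchase / purchase); the numbers are made up for illustration.
hidden = [0.2, 0.9, 0.1]
weights = [[0.5, -1.0, 0.3], [-0.5, 1.0, -0.3]]
biases = [0.0, 0.0]
probs = softmax_head(hidden, weights, biases)
print(probs)
```

    During the supervised phase the weights of this output layer, and optionally those of the pre-trained layers beneath it, are adjusted by backpropagation.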

    The results are presented in Table 4. Networks pre-trained with stacked denoising autoencoders reach the highest AUC, which may be due to the fact that we have a very sparse data set. The improvements compared with traditional methods are notable.

    Data set   DBN    SdA
    Data 3     0.82   0.83
    Data 6     0.84   0.86

    Table 4: AUC results for the classification of purchase likelihood using DBN and SdA.
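
    For readers unfamiliar with the metric reported in Table 4, the sketch below computes AUC as the fraction of positive/negative pairs that the classifier ranks correctly. The function is a plain illustration rather than the evaluation code behind the table, and the labels and scores are invented.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a random positive is scored above a random
    negative; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, not data from the paper.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

    Because the score depends only on how positives are ranked relative to negatives, it does not change with the ratio of buyers to non-buyers, which makes it a convenient metric under class imbalance.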

    6 Conclusions and future work

    In this paper we applied different machine learning algorithms, including deep neural networks, to purchase prediction. We showed that tree-ensemble methods like random forest improve performance over linear models like logistic regression.

    To our knowledge this is the first time that deep architectures like DBN and DAE have been applied to an e-commerce platform for modelling user behaviour. Finally, we compared deep neural networks with existing algorithms and reported the improvements.

    Further research can include testing on real-time data to see the effect on performance in a real-time setting. However, more work would need to be done on improving the time efficiency of the [...]. In terms of scalability, the data is extremely sparse and the class of algorithms we used does not parallelize well across multiple cores. As the results clearly show performance improving with larger data size, it would be interesting to see the effect of using much larger training data. Moreover, since many of the ID-based features are in the form of words, it may be useful to initialize the neural network as an RBM trained with unsupervised contrastive divergence on a large volume of unlabeled examples, and then fine-tune it as a discriminative model with backpropagation. It could also prove useful to train multiple networks in parallel and feed all of their outputs individually to MatrixNet as feature vectors, instead of just a single average.
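
    The RBM initialisation suggested above rests on contrastive divergence. The following is a minimal sketch of one CD-1 update for a binary RBM on a single example, written without external libraries. The toy data, layer sizes and learning rate are assumptions, and the subsequent discriminative fine-tuning is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.1):
    """Apply one CD-1 update in place for a binary RBM and a single example.

    W[i][j] connects visible unit i to hidden unit j.
    """
    n_vis, n_hid = len(b_vis), len(b_hid)
    # Positive phase: hidden probabilities given the data, then a binary sample.
    h0_prob = [sigmoid(b_hid[j] + sum(v0[i] * W[i][j] for i in range(n_vis)))
               for j in range(n_hid)]
    h0 = [1.0 if rng.random() < p else 0.0 for p in h0_prob]
    # Negative phase: reconstruct the visible units, then recompute the hiddens.
    v1 = [sigmoid(b_vis[i] + sum(h0[j] * W[i][j] for j in range(n_hid)))
          for i in range(n_vis)]
    h1_prob = [sigmoid(b_hid[j] + sum(v1[i] * W[i][j] for i in range(n_vis)))
               for j in range(n_hid)]
    # Move parameters towards data statistics and away from reconstructions.
    for i in range(n_vis):
        for j in range(n_hid):
            W[i][j] += lr * (v0[i] * h0_prob[j] - v1[i] * h1_prob[j])
        b_vis[i] += lr * (v0[i] - v1[i])
    for j in range(n_hid):
        b_hid[j] += lr * (h0_prob[j] - h1_prob[j])
    return sum((a - b) ** 2 for a, b in zip(v0, v1))  # reconstruction error


rng = random.Random(0)
n_vis, n_hid = 6, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
b_vis, b_hid = [0.0] * n_vis, [0.0] * n_hid
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]  # toy unlabeled examples
for epoch in range(200):
    for v in data:
        error = cd1_update(v, W, b_vis, b_hid, rng)
print(round(error, 4))
```

    After such pre-training, `W` and `b_hid` would serve as the initial weights and biases of the first hidden layer of the discriminative network.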

    References

    [1] Dembczynski, Krzysztof, Kotlowski, W, and Weiss, Dawid. Predicting ads click-through rate with decision rules. In Workshop on Targeting and Ranking in Online Advertising, 2008.

    [2] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097-1105, 2012.

    [3] McMahan, H Brendan, Holt, Gary, Sculley, David, Young, Michael, Ebner, Dietmar, Grady, Julian, Nie, Lan, Phillips, Todd, Davydov, Eugene, Golovin, Daniel, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222-1230. ACM, 2013.

    [4] Curme, C., Preis, T., Stanley, H.E., Moat, H.S.: Quantifying the semantics of search behavior before stock market moves. Proceedings of the National Academy of Sciences, 111(32), 11600-11605 (2014)

    [5] Kim, J.B., Albuquerque, P., Bronnenberg, B.J.: Online Demand Under Limited Consumer Search. Marketing Science 29(6), 1001-1023 (2010)

    [6] Mingyue Zhang, Guoqing Chen and Qiang Wei. Discovering Consumers' Purchase Intentions Based on Mobile Search Behaviors. Advances in Intelligent Systems and Computing 400 (2015).

    [7] Trofimov, Ilya, Kornetova, Anna, and Topinskiy, Valery. Using boosted trees for click-through rate prediction for sponsored search. In Proceedings of the Sixth International Workshop on Data Mining for Online Advertising and Internet Economy, pp. 2. ACM, 2012.

    [8] Banerjee, A., Ghosh, J. Clickstream clustering using weighted longest common subsequences. In: Proc of the Workshop on Web Mining, SIAM Conference on Data Mining, pp. 33-40 (2001)

    [9] Zhang, Yuyu, Dai, Hanjun, Xu, Chang, Feng, Jun, Wang, Taifeng, Bian, Jiang, Wang, Bin, and Liu, Tie-Yan. Sequential click prediction for sponsored search with recurrent neural networks. arXiv preprint arXiv:1404.5772, 2014.

    [10] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103, Helsinki, Finland, 2008. ACM.

    [11] Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

    [12] Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. URL http://arxiv.org/abs/1301.3781.

    [13] Geoffrey E. Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504, 2006.

    [14] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. The Computing Research Repository (CoRR), http://arxiv.org/abs/1207.0580.

    [15] Jasper Snoek, Hugo Larochelle, and Ryan Prescott Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960-2968, 2012.
