Thank you, Next: Using NLP Techniques to Predict Song Skips on Spotify based on Sequential User and Acoustic Data

Alex Hurtado 1 Markie Wagner 1 Surabhi Mundada 1

Abstract

Music consumption habits have changed dramatically in the past decade with the rise of streaming services such as Spotify, Apple Music, and Tidal. Today, these services are ubiquitous. These music providers are also incentivized to provide songs that users like in order to create a better user experience and increase time spent on their platform. Thus, an important challenge in music streaming is determining what kind of music the user would like to hear. The skip button is one feature of these music services that plays a large role in the user's experience and helps identify what a user likes, since with this button users are free to abandon songs as they choose. In this project, we build a machine learning model that predicts whether a user will skip a song, given information about the user's previous actions during a listening session along with acoustic features of the previous songs.

1. Introduction

As streaming services like Spotify become the go-to providers for music, it is increasingly important that these platforms are able to recommend the right music to their users. Machine learning has long been leveraged to build recommender systems for music. This can be seen in improvements made to music streaming platforms, such as customized playlists and suggestions of new releases that a user might like. However, there has not been much research on how a user interacts with music sequentially, over the course of one listening session. Skipping behavior serves as a powerful signal about what the user does and doesn't like. For instance, in the afternoon, a user might be looking for classical study music, and thus skip hip-hop. Later that evening, the same user might skip classical for hip-hop. Being able to use skip behavior in the context of an entire listening session is key to recommending relevant content, and yet there has been a lack of research into this specific recommendation task.

The focus of our project is to build a sequential skip prediction model that predicts whether a user will skip a track based on the user's interactions with previous songs and the song's musical qualities in an individual music listening session. The input is an embedding of audio features and user behavior, and the output is a prediction of whether a track is 'skipped' or 'not skipped'. We implemented simpler sequence-based models such as GBTs, LSTMs, and Bi-LSTMs, and applied NLP techniques to our task by developing two versions of the Transformer.

2. Related Works

The Spotify Sequential Skip Prediction Challenge was open to the public, so other groups have developed their own approaches to this sequential prediction task. Chang et al. (Chang et al., 2019) experimented with concatenating the acoustic and behavior features of the first 10 tracks with the acoustic features of the second 10 and sending this concatenation through several convolutional layers as well as a self-attention layer to output the predictions for the second 10 tracks. They did not use an embedding layer, but directly sent in the concatenated feature vectors. This team explored both metric learning and sequence learning methods, finding sequence learning to be more successful. Beres et al. (Beres et al., 2019) took a different approach to the challenge, choosing instead to use a Bi-LSTM autoencoder as well as feature selection to create their model. This use of LSTMs prompted us to further explore the applications of encoders, leading us to eventually leverage other models typically used in NLP use cases. In "Preliminary Investigation of Spotify Sequential Skip Prediction Challenge", Charles Tremlett (Tremlett, 2019) simply ensembled several Gradient Boosted Trees to predict skips.

For our novel methods, we were interested in leveraging the cutting-edge models in NLP that perform well on sequential prediction tasks. Ilya Sutskever et al. (Sutskever et al., 2014) first proposed the encoder-decoder architecture for sequence-to-sequence learning in 2014. These sequence-to-sequence models encode an input sequence into a fixed-dimensional vector and then decode the target sequence from that vector. Further investigation led us to the Transformer model proposed by Vaswani et al. (Vaswani et al., 2017). The Transformer revolutionized the world of NLP by relying entirely on attention mechanisms, replacing recurrences and convolutions.

Many of these NLP approaches for generating a sequence from a sequence use an encoder and/or decoder. In our project, we also use this encoder and decoder structure, as well as attention mechanisms, to generate context-aware predictions for skip behavior in a given Spotify listening session.

3. Dataset and Features

We use the Spotify Sequential Skip Prediction dataset for our project (Brost et al., 2019), which consists of roughly 130 million listening sessions. Each session has at most 20 music tracks. All the user behavior features and track ids are provided for the first half of the session, but only the track ids are provided for the second half. The track ids are associated with the track's acoustic features, which can be extracted via the Spotify API. The user behavior features include whether the user paused, the hour of day, etc., and the acoustic features include danceability, tempo, etc.

Figure 1: Dataset Structure

In order to pre-process the data for training, we merged user behavior and acoustic features using track ids for each track in the first half of a listening session. We then converted categorical features into one-hot representations and normalized the features (using either z-score or min/max normalization). This process created an input track embedding for our model. For the tracks used in the testing phase (tracks from the second half of the session), we only used the pre-processed acoustic feature embeddings.
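The pre-processing steps above can be sketched in plain Python. This is a minimal illustration, not our actual pipeline: the field names (`context`, `hour_of_day`), the choice of which feature gets which normalization, and the fact that statistics are computed over the rows passed in are all assumptions made for the example.

```python
from statistics import mean, pstdev


def one_hot(value, categories):
    """Return a one-hot list for `value` over the ordered `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]


def z_score(column):
    """Standardize a numeric column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [0.0] * len(column) if sigma == 0 else [(x - mu) / sigma for x in column]


def min_max(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [0.0] * len(column) if hi == lo else [(x - lo) / (hi - lo) for x in column]


def build_embeddings(behavior_rows, acoustic_by_id, contexts):
    """Merge behavior and acoustic features by track id, then encode them."""
    merged = [{**row, **acoustic_by_id[row["track_id"]]} for row in behavior_rows]
    tempo = z_score([m["tempo"] for m in merged])
    hour = min_max([m["hour_of_day"] for m in merged])
    return [
        one_hot(m["context"], contexts) + [hour[i], m["danceability"], tempo[i]]
        for i, m in enumerate(merged)
    ]


# Hypothetical two-track session; all names and values are illustrative.
session = [
    {"track_id": "t1", "context": "playlist", "hour_of_day": 9},
    {"track_id": "t2", "context": "radio", "hour_of_day": 21},
]
acoustics = {
    "t1": {"danceability": 0.8, "tempo": 120.0},
    "t2": {"danceability": 0.4, "tempo": 90.0},
}
embeddings = build_embeddings(session, acoustics, ["playlist", "radio"])
```

Each track ends up as one flat vector: the one-hot slots first, followed by the normalized numeric features.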

We chose to use the 'skip 2' feature as our output label for whether a track was skipped or not skipped. This is because 'skip 2' indicates whether a user skipped a track early on (i.e. about a third of the way through the track), whereas the other skip features 'skip 1' and 'skip 4' respectively indicate whether a user skipped the track almost as soon as it played or near the end of the track. 'Skip 2' is therefore a better indicator of a user listening to a song for a little while and deciding whether they like it (since many people tend to skip even songs they like halfway through).

4. Methods

4.1. Preliminary Baseline: Gradient Boosted Trees

We started with the Gradient Boosted Trees (GBT) algorithm as our baseline because the boosting method in GBT generates predictors sequentially and builds off of previous data (Zhang & Haghani, 2015); i.e. the idea of Gradient Boosted Trees is to improve upon the predictions of the first tree in order to build the second tree, and so on. This method of building on previous data aligns with the task we were solving: sequence-based classification, where we want to build a model for each track we are predicting based on the previous track. For the algorithm itself, we decided to use LightGBM, due to its high speed, low memory usage, and capacity to handle large datasets.
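To make the "each tree improves on the previous one" idea concrete, here is a toy sketch of boosting with one-split regression stumps and squared-error loss on a single feature. It is only an illustration of the principle; it is not the LightGBM algorithm and not our baseline code, and the function names and toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single numeric feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # splits that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by all previous rounds."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy data: one feature, label 1.0 meaning "skipped".
model = boost([0, 1, 2, 3], [0.0, 0.0, 1.0, 1.0])
```

On this toy data the residuals halve every round, so `model(0)` approaches 0 and `model(3)` approaches 1.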

4.2. Recurrent Neural Baselines: LSTM and Bi-LSTM

The nature of the task, building a sequential skip prediction model, lent itself to models that can understand sequential behavior. In this sense, recurrent neural models like LSTMs and Bi-LSTMs have been shown to have great capacity to comprehend sequence-based, variable-length data when compared to static deep learning models. We arrange them in an encoder-decoder architecture, which uses two separate recurrent neural models: the first encodes the inputs into a single high-dimensional state, and the second decodes the state passed from the encoder for the defined prediction task.

In our case, we used this encoder-decoder architecture to train and evaluate on our sequence-to-sequence skip prediction task. The track acoustic and user behavior embeddings for the first half of the listening session are passed as input into the encoder model. The resulting hidden state from the encoder is then passed into the decoder model, where it is joined by the input features representing the acoustic attributes of each track from the last half of the listening session. This decoder model determines the most likely label for each track. We chose LSTMs and Bi-LSTMs as our recurrent neural models because of the dominance of LSTMs in the world of sequential data over other recurrent neural models, such as GRUs, which stems from their ability to forget, maintain, and regulate cell memory between states.

4.3. Transformer: Traditional NLP Method

Due to the sequential nature of our data, when investigating what advances in other domains may be applicable to our problem and while experimenting with the LSTM/Bi-LSTMs, we found many parallels to NLP. Like many natural language problems, our challenge required a model that could leverage the sequential structure of the data and understand both long and short-range dependencies. Upon investigating the current advances in NLP, we were drawn to the Transformer and its use of attention mechanisms and positional encodings in the encoders and decoders. Transformers are able to learn long and short-range dependencies in a way that LSTMs and RNNs cannot, which is especially important in our dataset: when trying to predict skip/no skip for a certain track, it is important to take into account the audio features of all the previous tracks as well as historical skip behavior. Thus, we decided to develop the architecture of the traditional Transformer model used in NLP tasks (Figure 2) and apply it to our task.

Figure 2: Architecture of Traditional NLP Transformer

Figure 3: Our Simplified Diagram of Traditional Transformer

In order to develop this model, we started by creating a meaningful embedding of the user behavior and acoustic features for each track in a session (i.e. 20 tracks). Since we are predicting on the last 10 tracks of a session, we padded these tracks and passed only their acoustic features into the embedding. The embedding is then passed into the encoder, creating a hidden state, which is passed into the decoder along with the output target prediction sequence for the last 10 tracks. The output target sequence that is passed into the decoder is right-shifted by one so that the decoder uses the output prediction of the previous track and predicts the skip behavior of tracks in the correct order. Finally, we added a linear layer at the end of our Transformer to map our predictions into our label space (skip/no skip).
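The input layout described above can be sketched as follows. The feature widths, the use of zeros as padding, the ordering of behavior before acoustic features, and the start-token id are all assumptions made for illustration; the text only states that the target tracks are padded and that the target sequence is shifted right by one.

```python
BEHAVIOR_DIM = 3   # hypothetical sizes, chosen only for illustration
ACOUSTIC_DIM = 2
START_TOKEN = 2    # hypothetical start id, distinct from the 0/1 skip labels


def pad_target_track(acoustic):
    """Zero-fill the behavior slots so target tracks match the context width."""
    return [0.0] * BEHAVIOR_DIM + list(acoustic)


def encoder_sequence(context_tracks, target_acoustics):
    """Context tracks carry behavior + acoustic features; targets are padded."""
    padded = [pad_target_track(a) for a in target_acoustics]
    return [list(t) for t in context_tracks] + padded


def shift_right(labels):
    """Decoder input: start token, then every label except the last."""
    return [START_TOKEN] + list(labels[:-1])


context = [[1.0, 0.0, 0.5, 0.8, 0.3]] * 10   # 10 tracks, behavior + acoustic
targets = [[0.6, 0.1]] * 10                  # 10 tracks, acoustic only
sequence = encoder_sequence(context, targets)
decoder_input = shift_right([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
```

With the shift, the decoder input at step i holds the label of track i-1, so the model never sees the label it is asked to predict.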

4.4. Transformer: Feature-Forcing Method

We also decided to take another, novel approach to this challenge: adapting the architecture of the traditional NLP Transformer model to our specific task and data. Thus, the question became "how might we leverage the structure of Transformers to predict skips?". Transformers do well at taking an input sequence and converting it into an output sequence (e.g. language translation). However, our problem is unique in that only the second half of the input sequence (session) needs predictions, while the first half has ground truth labels that we want to leverage in the second half's predictions. With this problem structure in mind, we created our own model: the Feature-Forcing Transformer (Figure 4).

Figure 4: Architecture of our Feature-Forcing Transformer

In the Transformer, the encoder outputs a set of attention vectors K and V, which the decoder utilizes in the encoder-decoder attention layer. This layer allows the decoder to learn which parts of the input sequence it should pay attention to. For our feature-forcing model, we fed only the first 10 tracks (with behavior and audio features) into the encoder, hypothesizing that the encoder would produce an output that meaningfully represents the relevant information from the first 10 tracks.
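As a reminder of how the decoder uses K and V, the sketch below computes scaled dot-product attention (Vaswani et al., 2017) for a single decoder query in plain Python. It omits the learned projections and multiple heads of a full Transformer layer, and the toy vectors are assumptions chosen only to make the weights easy to inspect.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one decoder query.

    Returns the attention weights over encoder positions and the
    weighted sum of the value vectors (the context vector).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
    return weights, context


# Three encoder positions (toy K and V) and one decoder query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([1.0, 1.0], keys, values)
```

The weights sum to one, and the encoder position whose key best matches the query (here the third) receives the largest share.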

Typically, a Transformer decoder takes in a start token as the first input for target output 0, and then, for each following decoding step k, the decoder's output from step k-1 is fed back in as input. One alternative method applied when training the decoder is called "teacher forcing". Teacher forcing helps the model by passing the true labels of a sequence into the decoder, as opposed to looping back and sending in the labels generated by the decoder. Our "feature-forcing" method similarly disregards the decoder output and sends in predetermined features instead. For a target track i, instead of sending in the previous prediction of 'skip 2' for target track (i-1), we send in the audio features of track i. Thus, in the decoder, the model combines the audio features of track i with the output of the encoder (which encodes audio and behavior for the first half of the session).
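The difference between the two regimes comes down to what the decoder receives at each step, which the following sketch makes explicit. The start-token id and the toy feature values are hypothetical, and the functions only build decoder inputs; they do not run a model.

```python
START_TOKEN = 2  # hypothetical start id, distinct from the 0/1 skip labels


def teacher_forcing_inputs(true_labels):
    """Step i sees the true label of track i-1 (start token at step 0)."""
    return [START_TOKEN] + list(true_labels[:-1])


def feature_forcing_inputs(target_audio):
    """Step i sees the audio features of track i itself; no labels are fed back."""
    return [list(features) for features in target_audio]


# Three target tracks with two illustrative audio features each.
true_labels = [1, 0, 1]
target_audio = [[0.8, 120.0], [0.4, 90.0], [0.6, 100.0]]

teacher_inputs = teacher_forcing_inputs(true_labels)
feature_inputs = feature_forcing_inputs(target_audio)
```

Feature forcing needs no ground truth labels for the second half of the session, so the same decoder inputs are available at training and at test time.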

5. Experiments and Discussion

5.1. Accuracy Metric

The evaluation metric used in the original Spotify Sequential Skip Prediction Challenge is mean Average Accuracy (m-AA or MAA), the mean of Average Accuracy over all sessions. For a single session with T tracks, we can define Average Accuracy as

$$\mathrm{AA} = \frac{1}{T} \sum_{i=1}^{T} A(i) \cdot \mathbb{1}\{y^{(i)} = \hat{y}^{(i)}\}$$

where $y^{(i)}$ is the ground truth label for track i, $\hat{y}^{(i)}$ is the predicted label for track i, and A(i) is the accuracy of label predictions in the sequence up to and including the ith track.

It should be noted that the expected result of Average Accuracy (AA) is much different than that of conventional accuracy. This is because AA acts like a weighted sum of accuracy that places emphasis on correct classifications earlier in the sequence. Unlike conventional accuracy, AA cannot achieve a value above 0.6 even if just the first label is misclassified. The table below demonstrates this.

Table 1: Demonstrates how Average Accuracy can vary greatly given different prediction sequences.
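The definition of Average Accuracy can be checked with a few lines of Python. This is a direct transcription of the formula in Section 5.1, not the official challenge scorer; the 5-track label sequences are an assumption chosen for illustration and are not taken from Table 1.

```python
def average_accuracy(y_true, y_pred):
    """AA for one session: mean over tracks of A(i) * 1{y_i == y_hat_i}."""
    correct = 0
    total = 0.0
    for i, (truth, pred) in enumerate(zip(y_true, y_pred), start=1):
        hit = int(truth == pred)
        correct += hit
        total += (correct / i) * hit  # A(i) is the accuracy over tracks 1..i
    return total / len(y_true)


def mean_average_accuracy(sessions):
    """MAA: mean of AA over (y_true, y_pred) pairs, one pair per session."""
    return sum(average_accuracy(t, p) for t, p in sessions) / len(sessions)


truth = [1, 0, 1, 1, 0]
all_correct = average_accuracy(truth, [1, 0, 1, 1, 0])   # 1.0
first_wrong = average_accuracy(truth, [0, 0, 1, 1, 0])   # about 0.543
last_wrong = average_accuracy(truth, [1, 0, 1, 1, 1])    # 0.8
```

Both imperfect predictions have a conventional accuracy of 0.8, yet missing the first track costs far more than missing the last. The exact ceiling after a first-track miss depends on the number of tracks T.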

5.2. Loss Functions

For our experiments, we leveraged two kinds of loss functions to train our models. The first loss function explored was binary cross-entropy loss. For a single session with T tracks, we can define cross-entropy loss as

$$\text{CE Loss} = -\frac{1}{T} \sum_{i=1}^{T} \left[ y^{(i)} \log p(y^{(i)}) + (1 - y^{(i)}) \log\left(1 - p(y^{(i)})\right) \right]$$

where $y^{(i)}$ is the ground truth label for track i and $p(y^{(i)})$ is the predicted probability of track i being labeled positive.

Cross-entropy loss, while useful for maximizing accuracy generally, does not adequately maximize Average Accuracy. As a result, we defined an experimental loss function that combines minimizing cross-entropy loss with maximizing Average Accuracy. For a single session with T tracks, this new loss function can be defined as

$$\text{CEAA Loss} = -\frac{1}{T} \sum_{i=1}^{T} \left[ y^{(i)} \log p(y^{(i)}) + (1 - y^{(i)}) \log\left(1 - p(y^{(i)})\right) + A(i) \cdot \mathbb{1}\{y^{(i)} = \hat{y}^{(i)}\} \right]$$

where $y^{(i)}$ is the ground truth label for track i, $\hat{y}^{(i)}$ is the predicted label for track i, $p(y^{(i)})$ is the predicted probability of track i being labeled positive, and A(i) is the accuracy of label predictions in the session up to and including track i. (We also refer to this loss as Cross Entropy + MAA in our paper.)
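Both losses can be evaluated numerically with the short sketch below, which transcribes the two formulas for a single session. It assumes that the predicted label is obtained by thresholding the probability at 0.5 and that every probability lies strictly between 0 and 1; neither detail is stated in the text, and this is not our training code.

```python
import math


def cross_entropy(y_true, probs):
    """Binary cross-entropy averaged over the tracks of one session."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, probs)
    )
    return -total / len(y_true)


def ceaa_loss(y_true, probs, threshold=0.5):
    """Cross-entropy plus the Average Accuracy term, under one minus sign."""
    correct = 0
    total = 0.0
    for i, (y, p) in enumerate(zip(y_true, probs), start=1):
        hit = int((p >= threshold) == bool(y))  # 1{y == y_hat}, assumed threshold
        correct += hit
        total += y * math.log(p) + (1 - y) * math.log(1 - p) + (correct / i) * hit
    return -total / len(y_true)


labels = [1, 0, 1]
probabilities = [0.9, 0.2, 0.6]   # every thresholded prediction is correct
ce = cross_entropy(labels, probabilities)
ceaa = ceaa_loss(labels, probabilities)
```

Because the AA term sits inside the negated sum, the combined loss equals the cross-entropy minus the session's Average Accuracy, so a higher AA lowers the loss.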

5.3. Discussion and Error Analysis

Starting with the results of our baseline, we see that the Gradient Boosted Trees (GBT) model achieves 0.51620 subset accuracy on our data. It is important to note that the way we passed input into this model differs from our other models, since GBT was meant to be an initial, simple test. An accuracy of around 50% is what random guessing about whether a track is skipped would produce, so we believe the GBT model may be doing little better than guessing.

Below is a datatable with the rest of our models’ accuracies.

We see that the LSTM and Bi-LSTM both overfit to the data, since the validation MAA is higher than the test MAA. This could be fixed by more rigorous hyperparameter tuning, which we focused on carrying out for our best models, the Transformers. Furthermore, we can see that the Bi-LSTM performs slightly better than the LSTM, perhaps due to the more detailed hidden encoder representation of the track sequence in the Bi-LSTM.

Finally, we can make some important observations aboutthe developed transformers:

Transformer (Traditional NLP): Based on the output, we see that this Transformer did not help improve accuracy, dropping to an accuracy lower than the LSTM. Mapping from an input size of 20 tracks to an output size of 10 tracks may not be the best representation of our problem, which is sequence-based classification for the last 10 tracks. The separation between the first and second half of a session is less pronounced in this model, as all tracks are sent together into the encoder.

Transformer (Feature-Forcing): In this model, the encoder represented meaningful information about the first half of a session, and the decoder paid attention to both the representation of the first half and the individual audio features of each track that needs to be predicted. We see from the results that this architecture proved to be a better model for the problem at hand. We also see that this Transformer does not overfit and achieves a higher accuracy than the traditional NLP Transformer, LSTM, and Bi-LSTM models.

We have also done some feature analysis on this feature-forcing Transformer. Since the attention mechanism is used to weight specific encoder outputs of the input sequence, we can visualize what the Transformer is focusing on at each time step:

Figure 4: The Encoder-Decoder Attention Weights Visualized

In the image above, we see the attention mechanism as a matrix. The columns correspond to the input steps, and the rows correspond to the output steps. Interestingly, we notice that each of the decoder outputs focuses especially on the last few tracks from the encoder sequence. One possible reason is that the skip/no skip behavior on the last track is often a strong indicator of future behavior, more so than skipping behavior on earlier tracks.

Furthermore, we explored a variety of hyperparameters on this successful Transformer, especially the size of the hidden state. After trying 64, 128, and 256, we found that the optimal hidden state size was 128. Perhaps 64 was not large enough to fully express the input sequence, while 256 was too large and also was not able to convey meaning. We also explored batch size, but after several epochs we found no significant difference in accuracy. We settled on a batch size of 64 to speed up training without sacrificing accuracy. With more hyperparameter tuning, we likely could have increased the MAA.

In the end, this novel reformulation of the Transformer proved to be a moderate success. Our feature-forcing Transformer achieved an MAA of 0.458, which places us at 29th place in the online submissions for the Spotify Sequential Skip Prediction Challenge.

5.4. Feature Analysis: LSTMs

We can also learn about the relative importance of each input feature by examining a heatmap of their weights in a simple logistic regression model. In particular, we trained a logistic regression model to predict the skip label of the 11th track in a session, where the input is a concatenated vector of the 67 acoustic and user behavior features for each of the first 10 tracks and the 29 acoustic features of the 11th track.

Figure 5: Heatmap of relative weight importance of each feature across increasing L1 regularization penalty

By analyzing the above heatmap, we can immediately notice that as the L1 regularization penalty increases, the weights of the features related to the first 9 tracks (feature indices 0 to 602) greatly diminish. Features from the 10th track, however, are maintained across regularization penalties. We can also note that the features relating to the acoustics of the target track (indices 670 to 698) zero out as regularization increases, indicating that even a track's acoustic attributes are not entirely important to its skip label prediction. In fact, at a regularization penalty of 1 and above, the most important features relate to the user's behavior on the previous track. These features are all heavily related to track playing behavior, i.e. whether the user ended the 10th track by pressing forward or letting it finish, whether the user briefly played the 10th track before skipping, and whether the user skipped the 10th track altogether. This indicates that the most important signal for predicting the skip label of any given track is the user's behavior on the immediately preceding track.
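The index ranges quoted above follow from the layout of the concatenated input vector. The sketch below recomputes them from the feature counts given in this section (67 features for each of the 10 context tracks, 29 acoustic features for the target track); the dictionary keys are hypothetical names used only for this illustration.

```python
N_CONTEXT_TRACKS = 10
FEATURES_PER_CONTEXT_TRACK = 67   # acoustic + user behavior features
TARGET_ACOUSTIC_FEATURES = 29     # acoustic features of the 11th track


def feature_slices():
    """Map each block of the concatenated vector to its index range."""
    slices = {}
    for t in range(N_CONTEXT_TRACKS):
        start = t * FEATURES_PER_CONTEXT_TRACK
        slices[f"track_{t + 1}"] = range(start, start + FEATURES_PER_CONTEXT_TRACK)
    offset = N_CONTEXT_TRACKS * FEATURES_PER_CONTEXT_TRACK
    slices["target_acoustic"] = range(offset, offset + TARGET_ACOUSTIC_FEATURES)
    return slices


slices = feature_slices()
total_features = slices["target_acoustic"][-1] + 1
```

Tracks 1 to 9 occupy indices 0 to 602, the 10th track occupies 603 to 669, and the target track's acoustic features occupy 670 to 698, for 699 features in total.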

6. Future Work

Given more time, we want to explore the Transformer architecture further. Based on the Transformers implemented successfully in NLP applications and the performance of our feature-forcing Transformer, we could combine aspects of the traditional NLP Transformer with the feature-forcing Transformer in order to create a better model. We could dynamically append the output prediction for each track from the decoder to the track's audio features and inject these concatenated embeddings into our decoder for prediction. This architecture would perhaps better follow the Transformers that are known to work well in NLP, and would better represent our problem by considering the audio features that are input into the decoder. Finally, exploring variable-length sessions and improved feature analysis would help us better understand and improve our model.

Acknowledgements and Contributions

We would like to thank Andrew Ng, Moses Charikar, and Christopher Re for providing us the opportunity to do this project and for providing us the knowledge necessary to carry out the research. We would additionally like to thank Spotify for providing the datasets that we used for our project.

In terms of contributions:

Alex: Developed the data pre-processing and LSTMs. Wrote the paper and poster.

Markie: Developed both of the Transformers (focus on feature-forcing), Bi-LSTM, and GBTs. Wrote the paper and poster.

Surabhi: Developed both of the Transformers and GBTs. Wrote the paper and poster.

Code Repository

The code for our implementation can be found at the link below. More details of the code are included in the ReadMe: https://github.com/alexanderjhurtado/cs229-spotify/tree/master/data

References

Beres, F., Kelen, D. M., Benczur, A., et al. Sequential skip prediction using deep learning and ensembles. 2019.

Brost, B., Mehrotra, R., and Jehan, T. The music streaming sessions dataset. In Proceedings of the 2019 Web Conference. ACM, 2019.

Chang, S., Lee, S., and Lee, K. Sequential skip prediction with few-shot in streamed music contents. CoRR, abs/1901.08203, 2019. URL http://arxiv.org/abs/1901.08203.

Schedl, M., Zamani, H., Chen, C.-W., Deldjoo, Y., and Elahi, M. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval, 7, Jun 2018.

Sutskever, I., Vinyals, O., and Le, Q. Sequence to sequence learning with neural networks. Advances in NIPS, 2014.

Tremlett, C. Preliminary investigation of spotify sequential skip prediction challenge. 2019.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.

Zhang, Y. and Haghani, A. A gradient boosting method to improve travel time prediction. 2015. doi: 10.1016/j.trc.2015.02.019.
