AC209a Data Science Group 77 presents...
Spotify Song Recommendation Project


Can we recommend similar songs that can be added to a given playlist?

Problem Statement

Spotify is an online streaming service that provides over 30 million tracks to 83 million active subscribers, as of June 2018 (Ref. 1). One of the reasons Spotify has gained wide popularity is that it allows users to create their own playlists and also get song recommendations. In order to further enhance user engagement, the aim of our project is to recommend similar songs that could be added to any given playlist.

Genre is an informative song feature that users rely on, either explicitly or subconsciously, when grouping songs into playlists. However, about 7% of the songs in the Spotify database have not yet been classified into a specific genre, inhibiting Spotify's ability to properly recommend similar songs for a user's playlist. Our first goal is therefore to create a model that predicts the genre of each song that has not yet been classified. We then plan to build a recommendation system that uses all song features, including the imputed genres, to suggest similar songs that can be added to a given playlist.

Data Description

We used the Million Playlist Dataset (in JSON format) from the Spotify RecSys Challenge 2018, and the spotipy Python library to query the Spotify API for song features. In total, we collected feature information for 119,064 songs across 1,812 playlists.

For each song in a playlist, we collected the following features (Ref. 2):

Variable | Type | Description
Genre | string, categorical (y-variable) | Genre of the song. If multiple genres were associated with a song, we retrieved the first one.
Valence | float | How positive the track sounds, from 0.0 to 1.0: 1.0 for positive (happy, cheerful) songs, 0.0 for negative (sad, depressed, angry) songs.
Popularity | integer | Popularity of the song, from 0 to 100 (most popular), scored from the total number of plays and how recent those plays were.
Tempo | float | Average speed or pace of the song, in beats per minute (BPM).
Loudness | float | Overall average loudness of the song, in decibels (dB).
Energy | float | Measure of intensity and activity, from 0.0 to 1.0; songs with high values feel fast, loud, and noisy.
Danceability | float | How suitable the song is for dancing, based on tempo, rhythm stability, beat strength, and overall regularity; ranges from 0.0 to 1.0, with 1.0 most danceable.
Acousticness | float | Confidence that the song is acoustic, from 0.0 to 1.0, with 1.0 the highest confidence.
Liveness | float | Detects the presence of an audience, from 0.0 to 1.0; 1.0 indicates a strong likelihood the song was performed live.
Mode | integer, categorical | 0 if the song is in a minor key, 1 if in a major key, and -1 for no result.
Speechiness | float | Detects spoken words, from 0.0 to 1.0. Recordings such as talk shows and audiobooks score closer to 1.0: values above 0.66 are probably made entirely of spoken words; values between 0.33 and 0.66 contain both music and speech (e.g., rap); values below 0.33 are most likely songs.
Instrumentalness | float | Predicts whether the recording contains no vocal content, from 0.0 to 1.0; the closer the value is to 1.0, the greater the likelihood the recording has no vocals (rap and spoken-word tracks are clearly vocal).


Although we wanted to collect as much data as possible, the rate limits of the Spotify Web API restricted how much data we could access in a given time. For example, roughly every 2 to 3 hours, the API session would lose its connection after collecting feature information for 100 to 200 songs. We therefore distributed the data collection across multiple Spotify API tokens to gather as much data as possible.

Within our dataset, there were a total of 981 song genres. Most genre labels were so specific (e.g., 'aussietronica') that, without the entire Spotify song database, very few songs would belong to each of them, which would inhibit our ability to build accurate classification models. Given our relatively small song dataset, we therefore grouped each genre into a more generic category (e.g., 'rock'); genres that were difficult to assign to a generic category were labeled 'other'. After data cleaning, the songs spanned a total of 30 genres. The 8,235 songs (<7%) that lacked genre information were removed, leaving 119,064 − 8,235 = 110,829 songs in our dataset.
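The consolidation step described above can be sketched as a simple lookup. The mapping below is a small, hypothetical excerpt for illustration, not our full 981-genre map:

```python
# Illustrative excerpt of the specific-to-generic genre mapping (hypothetical
# keys/values); unmapped genres fall back to the 'other' category.
GENRE_MAP = {
    "aussietronica": "electronic",
    "dance folk": "folk",
    "country folk": "folk",
    "album rock": "rock",
}

def to_generic_genre(spotify_genre: str) -> str:
    """Map a specific Spotify genre label to one of our generic categories."""
    return GENRE_MAP.get(spotify_genre, "other")

# Apply the mapping to a toy list of song records.
songs = [{"name": "track A", "genre": "dance folk"},
         {"name": "track B", "genre": "aussietronica"}]
for s in songs:
    s["genre"] = to_generic_genre(s["genre"])
```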

Exploratory Data Analysis

Figure 1. Frequency of genre types in songs dataset

Most of the songs in our dataset had genres such as 'dance pop' and 'rock'; in contrast, few songs had genres such as 'a capella' and 'new age'. To account for this imbalance, we stratified the train/test split by genre so that each genre is represented in the same proportion in the training and test datasets.

Figure 2. Distribution of feature values for each genre

In general, the values in these graphs make sense: for instance, on average, classical songs have the highest acousticness and lowest energy values, whereas comedy recordings have the highest liveness and speechiness values. The median values of features such as instrumentalness, acousticness, and energy varied widely across genres, which implies that these features may be helpful in distinguishing song genres.

Figure 3. Correlation heatmap between predictor variables

There were multiple pairs of features that were highly correlated. For instance, loudness and energy were highly positively correlated (0.75), whereas energy and acousticness were highly negatively correlated (-0.65). This result suggests that we would need to use regularization techniques to reduce potential multicollinearity issues among predictor variables.


A. Genre Prediction

All predictor variables were standardized before analysis so that they fall on the same scale and contribute equally to the models. The dataframe was then split into training and test datasets, stratified by genre so that each genre is proportionally represented in both. Because some models took a very long time to train on the full dataset, we used a 10% subset of the training dataset to train our models.
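A minimal sketch of this preprocessing, using toy stand-in data (the column names and sizes here are illustrative, not our actual dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for our song-feature dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tempo": rng.normal(120, 20, 200),
    "loudness": rng.normal(-8, 3, 200),
    "genre": rng.choice(["rock", "dance pop"], 200),
})

X, y = df.drop(columns="genre"), df["genre"]
# stratify=y keeps each genre's proportion the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training set only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.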

We used principal component analysis (PCA) to reduce the dimensionality of the training set features (for the original feature set, the feature set with interaction terms, and feature sets with polynomial terms of varying degrees). We trained on PCA components whenever training on the original feature set took prohibitively long (for instance, it took over 18 hours to train a logistic regression model with interaction terms). For those models, we used the minimum number of PCA components that explained at least 90% of the variance in the training dataset, as seen in the graph below.
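One way to pick that minimum component count, shown here on random stand-in data in place of our standardized feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the standardized training features.
X_train_s = np.random.default_rng(1).normal(size=(500, 12))

# Fit a full PCA, then find the smallest n whose cumulative explained
# variance reaches 90%.
pca = PCA().fit(X_train_s)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.90) + 1)

# Refit with exactly that many components for the reduced training set.
X_train_pca = PCA(n_components=n_components).fit_transform(X_train_s)
```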

Figure 4. Cumulative percentage of variance explained versus number of PCA components

Naive baseline model

We created a very naive baseline model so that we could better gauge the success of more complex models. This naive baseline predicts every song's genre to be the most common genre in the training dataset, 'dance pop'. The resulting accuracy on the training set is 0.153, which tells us that 15.3% of the songs in the training set have the genre 'dance pop'.
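This majority-class baseline is equivalent to scikit-learn's `DummyClassifier` with the 'most_frequent' strategy; the labels below are a toy stand-in:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: the majority class makes up 60% of the data.
X = np.zeros((10, 3))                       # features are ignored by this strategy
y = ["dance pop"] * 6 + ["rock"] * 4

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))                 # training accuracy = majority-class share, 0.6
```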

A more robust baseline model

Considering how uninformative the naive baseline model is, we chose to create a more robust baseline classification model using logistic regression. Logistic regression is suitable as a baseline because it is an intuitive linear classifier: it estimates the probability that a song belongs to each genre and chooses the genre with the highest probability. We ran several logistic regressions, varying the combination of L1/L2 regularization (at varying strengths), interaction terms added to the feature set, and polynomial terms of varying degrees. For logistic regression with interaction terms, we used 40 PCA components; for polynomial terms we used 9 PCA components up to degree 2, 4 components up to degree 3, and 3 components up to degree 4. The accuracy score of each configuration on the training set, along with the cross-validation scores, is reported below.

Baseline Model | Training accuracy | Cross-validation accuracy | Best Parameters
Logistic Regression | 0.3411 | 0.3361 ± 0.0041 | penalty=l1
Logistic Regression with interaction terms (+PCA) | 0.3145 | 0.3155 ± 0.0038 | penalty=l2
Logistic Regression with polynomial terms (+PCA) | 0.2781 | 0.2744 ± 0.0064 | polynomial degree=2

Using PCA to reduce dimensionality intrinsically loses information, which explains why the logistic regression model that does not use PCA has the highest cross-validation accuracy. We therefore chose as our robust baseline the logistic regression without PCA and without additional features. Although L1 regularization produced the highest cross-validation accuracy within this model, the scores for L2 and no regularization fell within the standard error of the L1 score, so we cannot definitively state that L1 regularization performs significantly better. The fact that regularization did not significantly improve cross-validation accuracy suggests that the dataset has enough samples, and few enough features, that the model is not prone to overfitting.
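The regularization comparison can be sketched with `cross_val_score`; synthetic data stands in for our song features, and the exact dimensions are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic multi-class data standing in for the 11 song features.
X, y = make_classification(n_samples=400, n_features=11, n_informative=6,
                           n_classes=4, random_state=0)

# Compare L1 and L2 penalties via 5-fold cross-validation
# (liblinear is used for L1 since lbfgs does not support it).
for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
    clf = LogisticRegression(penalty=penalty, solver=solver, max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{penalty}: {scores.mean():.4f} ± {scores.std():.4f}")
```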

Comparing model performance

We trained several classification models for genre prediction and compared their cross-validation accuracy scores to select the best model.

Due to time constraints, we were unable to perform cross-validation on the multi-layer perceptron model. However, this is an action we would like to perform in the future.

Model | Training accuracy | Cross-validation accuracy | Best Parameters
Naive Baseline Model | 0.1530 | 0.1543 ± 0.0036 | NA
Robust Baseline Model | 0.3411 | 0.3361 ± 0.0041 | penalty=l1
Random Forest | 0.9986 | 0.5387 ± 0.0052 | max_depth=100
Linear Discriminant Analysis | 0.3158 | 0.3087 ± 0.0097 | NA
Quadratic Discriminant Analysis | 0.3068 | 0.2265 ± 0.1130 | NA
AdaBoost | 0.2640 | 0.2704 ± 0.0047 | base_estimator=DecisionTreeClassifier(max_depth=3)
Gradient Boost | 0.9825 | 0.5099 ± 0.0048 | max_depth=3
K-Nearest Neighbors (+PCA) | 0.9985 | 0.4372 ± 0.0079 | k=1
Multi-layer Perceptron | 0.5288 | 0.4228 | optimizer=Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)

Figure 5.

In summary, all models performed better than our naive baseline, which predicts every song's genre as the most frequent genre, 'dance pop'. Below are sources of error that may explain why some models performed better than others.

1) Classification of a large number of genres

Our robust baseline model (logistic regression) had a low cross-validation accuracy compared to other models. The output of a single logistic regression is a sigmoid function indicating the probability that an observation belongs to one class. In a multi-class problem, a single sigmoid cannot express probabilities for every class, so a series of binary logistic regressions must be performed, one per class, an approach known as one-vs-rest. In our genre classification problem, the multiclass logistic regression outputs 30 probabilities, one per genre, and the genre with the highest probability is assigned to the song. With such a large number of genre classes, the probability assigned to any single genre (versus the rest) tends to be low, which helps explain the poor prediction accuracy of our multiclass logistic regression model.
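The one-vs-rest scheme in miniature, on a toy 3-class problem (our real problem has 30 classes): each binary classifier scores one class, and the argmax wins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class data; OneVsRestClassifier fits one binary logistic
# regression per class.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

proba = ovr.predict_proba(X[:1])   # one probability per class, normalized
pred = ovr.predict(X[:1])          # class with the highest probability
```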

2) Assumption of data distributions

LDA and QDA models assume multivariate normality, which means that independent variables exhibit Gaussian distribution within each genre. Furthermore, LDA also assumes that the variance of each feature is the same within each genre. However, Fig.2 shows that most of the distributions of each predictor variable are skewed, and the variance of each feature differs within each genre. Therefore the assumptions for LDA and QDA models do not hold, which explains their poor performance.

3) The curse of high dimensionality

The k-NN model suffers from high dimensionality. The algorithm uses Euclidean distance to find nearest neighbors; however, as dimensionality increases, the Euclidean distances between all vectors approach the same value (Ref. 3), making Euclidean distance a meaningless metric for determining nearest neighbors. To mitigate this problem, we used PCA to reduce dimensionality. Since we still wanted to maintain an accurate representation of the data, we chose the minimum number of PCA components (n=9) that explains at least 90% of the variance; choosing fewer components would sharply decrease the percentage of variance explained (Fig. 4). Consequently, PCA did not significantly reduce dimensionality, so we cannot be confident that the best value of k (nearest neighbors) is indeed 1.

4) Limited number of observations in some genres

QDA relies on estimating a covariance matrix within each genre. As seen in Fig. 1, some genres contain very few songs, a problem exacerbated by our use of a subset of the data. Having few observations inhibits our ability to confidently determine the true distribution of the predictors, resulting in high uncertainty in the covariance estimates for those genres. This explains the large standard error in the QDA cross-validation accuracy.

Choosing the final model

The random forest model achieved the highest cross-validation score, and there are several explanations for this result. Classification problems with a large number of classes intrinsically have large variance in the parameter estimates, and random forests are very effective at reducing variance, which improves generalization and thus cross-validation accuracy. Moreover, unlike the other models, the accuracy of a random forest is not affected by the distribution of the data or by high dimensionality. In addition, random forests do not require the meticulous hyperparameter tuning that boosting methods do.

The importance of each feature in a random forest model is calculated as the average decrease in the Gini index across splits on that feature. The table below shows that speechiness was the most important feature for determining genre. This makes intuitive sense because the presence of spoken words varies widely among genres; for instance, rap music has a higher speechiness value than EDM or classical music.

Feature | Importance
Speechiness | 0.1267 ± 0.0059
Danceability | 0.1096 ± 0.0056
Acousticness | 0.1024 ± 0.0055
Valence | 0.1010 ± 0.0058
Loudness | 0.0988 ± 0.0061
Popularity | 0.0966 ± 0.0051
Energy | 0.0960 ± 0.0056
Tempo | 0.0938 ± 0.0037
Liveness | 0.0865 ± 0.0060
Instrumentalness | 0.0725 ± 0.0055
Mode | 0.0163 ± 0.0046
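A table like this can be produced from a fitted forest's `feature_importances_` (the mean decrease in Gini impurity), with the spread across individual trees as the uncertainty; synthetic data stands in for our songs here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 11 song features.
X, y = make_classification(n_samples=500, n_features=11, random_state=0)
features = ["valence", "popularity", "tempo", "loudness", "energy",
            "danceability", "acousticness", "liveness", "mode",
            "speechiness", "instrumentalness"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Spread of each feature's importance across the individual trees.
std = np.std([t.feature_importances_ for t in rf.estimators_], axis=0)
for name, imp, s in sorted(zip(features, rf.feature_importances_, std),
                           key=lambda t: -t[1]):
    print(f"{name}: {imp:.4f} ± {s:.4f}")
```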

Based on the analysis above, we chose the random forest model for genre prediction. Its test set accuracy was 0.5496, slightly better than its cross-validation accuracy, which shows that the model generalizes well and provides additional evidence that it is not overfit to the training dataset.

B. Song Recommendation

Now that the database is complete with the imputed genre information, the recommendation system can be used to its full capability. The input to the recommendation system is a list of songs (the user's playlist). Since genre is a categorical variable, it is one-hot encoded so that nearest neighbors can be measured. All song features within the playlist are then standardized using the scaler fitted on the training dataset, allowing a proper comparison with the other songs in our database.

It is difficult to define the similarity between a list of song vectors and a single song vector. Therefore, we aggregated all of the playlist's song vectors into one playlist vector by averaging the feature values. This final vector describes what an average song in the playlist would look like.

The averaged playlist vector can now be used as a query vector to find the top k similar songs in the database, where k is an integer. Cosine similarity is used instead of Euclidean distance as the distance metric because it is less vulnerable to high dimensionality (Ref. 5, 6). The k songs most similar to the averaged playlist vector (and not already contained in the playlist) are then recommended to the user.
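A sketch of this retrieval step, with a random matrix standing in for our (standardized, one-hot-encoded) song-feature database:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 40))   # stand-in for the song-feature matrix
playlist_idx = [3, 17, 42]               # songs already in the user's playlist

# Average the playlist's song vectors into a single query vector.
playlist_vec = database[playlist_idx].mean(axis=0, keepdims=True)

# Rank every database song by cosine similarity to the query vector.
sims = cosine_similarity(playlist_vec, database).ravel()
sims[playlist_idx] = -np.inf             # never recommend songs already present

k = 10
top_k = np.argsort(sims)[::-1][:k]       # indices of the k recommended songs
```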

Our song recommendation system also lets the user request a wider variety of recommended songs. For instance, if a user wants 10 song recommendations for a playlist but with a little more variety, our algorithm can first retrieve more than the 10 nearest songs and then randomly select 10 songs from that pool.
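The variety option amounts to sampling from an enlarged candidate pool; the function name and parameters below are illustrative:

```python
import numpy as np

def recommend_with_variety(sims, k=10, pool_size=20, seed=None):
    """Pick k recommendations at random from the pool_size most similar songs.

    sims: 1-D array of similarity scores for every song in the database.
    A larger pool_size relative to k yields more adventurous picks.
    """
    pool = np.argsort(sims)[::-1][:pool_size]   # top pool_size candidates
    rng = np.random.default_rng(seed)
    return rng.choice(pool, size=k, replace=False)

# Toy similarity scores for a 500-song database.
sims = np.random.default_rng(1).random(500)
picks = recommend_with_variety(sims, k=10, pool_size=20, seed=0)
```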

Below is a mock example showing the input and output of our recommendation system. The 20 songs most similar to the average playlist vector were first retrieved, and 10 of them were randomly chosen.

Input of user song playlist
Song | Artist
I Need a Girl Part 2 (feat. Loon, Ginuwine & Mario Winans) | Diddy
California | Niia
Closedloop | Elliot Moss
Something Like Chaos | Water Park
Nobody - Atom Tree Remix | Niia
Bloodstream | Stateless
The Journey | Sol Rising
Love In Bad Company | Gotts Street Park
Miss You - HONNE Remix | James Hersey
Show Me (feat. Madison Ryann Ward) | John Splithoff
Drinkin' Problem | Midland

Output of user song playlist
Song | Artist
It Doesn't Work Like That | Faye Webster
Everything to Me | Lips
Still Waters | Eventide
Talking to Myself | Watsky
So Good To Me - Re-work | Whilk & Misky
Under Blankets | crwn
Meet Me In The Middle - From The "Fifty Shades of Grey" | Jessie Ware
Ocean View 2.0 | Pell
Makin' Ice | Ripe
Guilty - Chill Out Mix | Hamilton


Conclusion

In Spotify's current database, about 7% of songs lack genre labels. Genre is an important feature for recommending similar songs, and any recommendation system is limited without a complete dataset. Therefore, we first aimed to predict genres for the songs missing labels. We created training and test sets containing only songs with genre labels, and built a logistic regression baseline model with a cross-validation accuracy of 0.3361. We then trained additional models on the training data, including random forest, QDA, and AdaBoost. The random forest model performed best, with a cross-validation score of 0.5387 and a test set accuracy of 0.5496; the closeness of these two scores shows that the model generalizes well to unseen data. We would therefore recommend using our random forest model to impute missing genre labels in Spotify's database. Finally, with the completed data, we built a recommendation system that uses cosine similarity to find similar songs. It gives users flexibility: they can request songs very similar to their playlist, or a more diverse set of recommendations.

There are several areas for improvement we would like to address in the future. With our limited dataset, we had to merge 981 song genres into 30, which introduced a source of error. For example, "dance folk" and "country folk" each had few songs, so we merged them into the "folk" genre despite the possibility that they differ in loudness, acousticness, danceability, and so on. To mitigate this problem, we would like to collect more song data, which could further improve the performance of our classification models. Ideally, we would re-create our models using the entire dataset to obtain more accurate estimates of their performance. Finally, we would like to define a success metric for our song recommendation system, potentially by utilizing user feedback; being able to assess the system would let us improve the model based on incoming user data.

Future Work

We would like to make user-specific recommendations, meaning that the recommended songs are tailored to a user's library and listening preferences. One extension of this is time-specific playlist recommendations: for example, if a user tends to listen to relaxing music at night and high-energy music in the morning, we would like to automatically recommend songs that align with those habits. To accomplish this, we would need access to users' playlist libraries.

We would also like to add a user interface to our recommendation system so that the user can input a preference for the playlist (e.g., party, jazz, happy), which would influence which songs are added. Such customized input would give us more user-preference data with which to update our model, so that the resulting recommendations better match what the user expects.


References

  1. Statista provides statistics on the number of Spotify subscribers.
  2. We referred to the Spotify Web API Reference to learn how to retrieve song feature information through the Spotify API. These pages (1, 2, 3) explain the song features in more detail.
  3. Mewada, Pradeep, and Jagdish Patil. "Performance Analysis of k-NN on High Dimensional Datasets." International Journal of Computer Applications 16.2 (2011): 1-5.
  4. In general, we referred to our course's textbook for our model explanations: James, Gareth, et al. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.
  5. Salton, Gerard, Anita Wong, and Chung-Shu Yang. "A Vector Space Model for Automatic Indexing." Communications of the ACM 18.11 (1975): 613-620.
  6. Qian, Gang, et al. "Similarity Between Euclidean and Cosine Angle Distance for Nearest Neighbor Queries." Proceedings of the 2004 ACM Symposium on Applied Computing. ACM, 2004.

Our Team

We are Group #77 for Harvard's AC209a: Introduction to Data Science course.

Eric (Eun Seuk) Choi

M.S. Student in Data Science

Rachel Moon

M.S. Student in Data Science

Nick (Feng) Qian

M.S. Student in Data Science

Cory Williams

M.S. Student in Computational Science