About Me

Abdishakur Yoonis

MSc Data Scientist
MSc Software Engineer
BSc Software Engineer

Data Science

Data Analysis

Artificial intelligence (AI)

Machine Learning

Data Engineering

Business Intelligence

Software Engineering

Software Development

Python, R, SQL, C#, Java, JavaScript and many more

Feel free to reach out or follow

Sparkify: Predicting Churn for a Music Streaming Service

Introduction
Installations
Project Motivation
File Descriptions
Load and Clean Dataset
Exploratory Data Analysis
Feature Engineering
Modelling
Business Impact
Project Reflection
Future Work
Conclusions
Github Pages Blog Post
Licensing, Authors, Acknowledgements
References

Introduction

In this project, I load and manipulate a dataset from a music app similar to Spotify, using Spark to engineer relevant features for predicting churn. Here, churn means a user cancelling their service altogether. By identifying these customers before they churn, the business can offer discounts and incentives to stay, potentially saving revenue. This workspace contains a tiny subset (128MB) of the full dataset (12GB).

Installations

NumPy
Pandas
Seaborn
Matplotlib
PySpark SQL
PySpark ML

No additional installations are required beyond the Anaconda distribution of Python and Jupyter notebooks.

Project Motivation

For this project I was interested in predicting customer churn for a fictional music streaming company: Sparkify.

The project involved:

Loading and cleaning a small subset (128MB) of a full dataset available (12GB)
Conducting Exploratory Data Analysis to understand the data and what features are useful for predicting churn
Feature Engineering to create features that will be used in the modelling process
Modelling using machine learning algorithms such as Logistic Regression, Random Forest, Gradient Boosted Trees, Linear SVM, Naive Bayes

File Descriptions

There is one exploratory notebook, along with an HTML export of it, to showcase my work in predicting churn. Markdown cells are used throughout to explain the process taken.

Load and Clean Dataset

Our mini-dataset file is mini_sparkify_event_data.json. First the dataset must be loaded and cleaned, checking for invalid or missing data - for example, records without userids or sessionids.

We can now create a Spark session.

# create a Spark session
[Image: Spark_Session]
[Image: Spark_Context]

# load in the dataset
df = spark.read.json("mini_sparkify_event_data.json")

# print the schema
df.printSchema()
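The cleaning check described above can be sketched in plain Python. This is only an illustration of the logic, not the notebook's Spark code: the sample records, the helper name `drop_invalid`, and the choice to treat an empty string as a missing ID are all assumptions.

```python
# Hypothetical event records, made up for illustration.
events = [
    {"userId": "30", "sessionId": 29, "page": "NextSong"},
    {"userId": "", "sessionId": 8, "page": "Home"},           # empty userId
    {"userId": "9", "sessionId": None, "page": "NextSong"},   # missing sessionId
    {"userId": "9", "sessionId": 8, "page": "Logout"},
]


def drop_invalid(records):
    """Keep only records that have both a userId and a sessionId.

    Assumption: a missing ID shows up as None or as an empty string.
    """
    return [
        r for r in records
        if r.get("userId") not in (None, "") and r.get("sessionId") is not None
    ]


clean_events = drop_invalid(events)
print(len(events), "->", len(clean_events))
```

Filtering before any aggregation keeps events that cannot be attributed to a user from leaking into the per-user features built later.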

Exploratory Data Analysis

Define Churn

A Churn column will be created to use as the label for our model. Cancellation Confirmation events, which happen for both paid and free users, are used to define churn. We will assign a 1 where a user has churned and a 0 where they have not.
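As a rough plain-Python sketch of this labelling rule (the notebook itself does this in Spark), a user gets a 1 as soon as any of their events is a Cancellation Confirmation. The field names `userId` and `page`, the helper `label_churn`, and the sample events are assumptions made for illustration.

```python
def label_churn(events):
    """Return {userId: 1 or 0}; 1 if the user has any Cancellation Confirmation event."""
    labels = {}
    for e in events:
        churned = 1 if e["page"] == "Cancellation Confirmation" else 0
        # Once a user is flagged as churned, the flag is never reset to 0.
        labels[e["userId"]] = max(labels.get(e["userId"], 0), churned)
    return labels


# Hypothetical events for two users.
events = [
    {"userId": "A", "page": "NextSong"},
    {"userId": "A", "page": "Cancellation Confirmation"},
    {"userId": "B", "page": "NextSong"},
    {"userId": "B", "page": "Thumbs Up"},
]

churn = label_churn(events)
print(churn)
```

The label is attached to the user rather than to the single cancellation event, so all of a churned user's earlier activity can be compared with that of users who stayed.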

Explore Data

Exploratory data analysis will be performed to compare the behaviour of users who stayed with that of users who churned. We start by exploring aggregates for these two groups, observing how often they performed a specific action per unit of time or per number of songs played.

EDA for Users that Stayed vs Users that Churned

Now we can examine behaviour of those who churned vs those who did not churn. First we will visualise those who churned vs those who stayed.

# convert to pandas for visualisation
df_churn = df_churn.toPandas()

# plot the number of users that churned
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=[8, 6])
ax = sns.barplot(data=df_churn, x="churn", y="count")
plt.title("Numbers of Users That Churned");

Now we can do the same process for customers who didn’t churn.

[Image: churn_users]

[Image: hist_curn]

We can see from the above plots that length distribution is very similar for users that churned and those who stayed. This won’t be very useful for predicting customer churn. Let’s try a categorical feature: gender.

# convert to pandas for visualisation

[Image: hist_curn2]

[Image: df_gender]

[Image: df_gender_curn]

[Image: level_show]

[Image: df_gender_curn2]

From the above chart, we can see that the most popular action for both users that stayed and those that churned was to skip to the next song. We can also see that churned users rolled more ads and gave more thumbs down. Those who were more likely to stay performed more thumbs up actions, added more friends and added more songs to playlists.

[Image: df_page]

[Image: df_page2]

[Image: df_page3]

Calculating Songs per Hour

We can now turn our attention to calculating the number of songs listened to by churn and non churn users per hour.
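The hourly count can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's code: it assumes `ts` is a Unix timestamp in milliseconds, that a song play is logged with the page name `NextSong`, and that hours are read in UTC. The sample events are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def songs_per_hour(events):
    """Count song plays by hour of day (0-23) from millisecond timestamps."""
    counts = Counter()
    for e in events:
        if e["page"] == "NextSong":  # assumed page name for a song play
            hour = datetime.fromtimestamp(e["ts"] / 1000, tz=timezone.utc).hour
            counts[hour] += 1
    return counts


BASE = 1538352000000          # hypothetical midnight UTC, in milliseconds
HOUR_MS = 60 * 60 * 1000

events = [
    {"page": "NextSong", "ts": BASE + 15 * HOUR_MS},
    {"page": "NextSong", "ts": BASE + 15 * HOUR_MS + 30 * 60 * 1000},
    {"page": "NextSong", "ts": BASE + 20 * HOUR_MS},
    {"page": "Home", "ts": BASE + 20 * HOUR_MS},  # not a song play, ignored
]

hourly = songs_per_hour(events)
print(sorted(hourly.items()))
```

Running the same function separately on the events of churned users and of users who stayed gives the two distributions to compare.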

[Image: songs_in_hour]

[Image: songs_in_hour2]

From above we can see that there is a peak of songs played between 3pm and 8pm. Next we will examine users who churned by using the same process.

Songs Per Session for Users who Churned vs. Those who Stayed

We can compare the two groups in a simple bar chart by taking the average number of songs per session for users who churned and for those who stayed.
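A minimal plain-Python sketch of that average is shown below. It assumes each event already carries a `churn` flag alongside `userId`, `sessionId` and `page`, and that song plays use the page name `NextSong`. These names and the sample events are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def avg_songs_per_session(events):
    """Average song plays per (user, session), grouped by churn flag.

    Sessions without any song play are not counted in this sketch.
    """
    per_session = defaultdict(int)
    for e in events:
        if e["page"] == "NextSong":  # assumed page name for a song play
            per_session[(e["churn"], e["userId"], e["sessionId"])] += 1

    by_group = defaultdict(list)
    for (churn, _user, _session), n_songs in per_session.items():
        by_group[churn].append(n_songs)

    return {churn: mean(counts) for churn, counts in by_group.items()}


# Hypothetical events: user A churned, user B stayed.
events = (
    [{"userId": "A", "sessionId": 1, "page": "NextSong", "churn": 1}] * 2
    + [{"userId": "A", "sessionId": 2, "page": "NextSong", "churn": 1}]
    + [{"userId": "B", "sessionId": 1, "page": "NextSong", "churn": 0}] * 4
    + [{"userId": "B", "sessionId": 1, "page": "Thumbs Up", "churn": 0}]
)

averages = avg_songs_per_session(events)
print(averages)
```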

[Image: average_songs_table]

[Image: average_songs_table2]

UserAgent: Operating System and Browsers

Now we can extract the Operating System a user is on to understand if this has an effect on churn.

[Image: location_count]

[Image: location_count2]

Most users were based in CA. In MI, KY and OH, more users churned than stayed. It may be difficult to engineer a useful feature from location for modelling, so let’s leave this for now and move on to another column from our dataset: the user agent, which gives us operating systems and browsers.

[Image: web_browsers]

[Image: web_browsers_by_devices]

[Image: web_browsers_by_devices2]

Windows was the most used operating system. Linux users had the highest rate of churn, but this affected very few customers, so it won’t be used in our model.
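One simple way to pull an operating system out of a user agent string is a substring lookup, sketched below. The token list, the labels and the helper `extract_os` are assumptions for illustration, not the notebook's method. Real user agents are messier than this handful of cases.

```python
# (label, token) pairs checked in order. iPhone and iPad come first because
# their user agents also contain "Mac OS X".
OS_TOKENS = [
    ("iPhone", "iPhone"),
    ("iPad", "iPad"),
    ("Windows", "Windows"),
    ("Mac", "Macintosh"),
    ("Linux", "Linux"),
]


def extract_os(user_agent):
    """Return a coarse operating system label for a user agent string."""
    for label, token in OS_TOKENS:
        if token in user_agent:
            return label
    return "Other"


# Example user agent strings (illustrative).
samples = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 "
    "(KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53",
]

detected = [extract_os(ua) for ua in samples]
print(detected)
```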

[Image: browser_count]

[Image: web_browsers2]

Chrome was the most popular browser. Firefox users were the most likely to churn, and Internet Explorer had the fewest users that churned. No browser shows a clear issue that is making users churn, so this won’t be used in our model.

Days Since Registration for Sparkify

Finally, we can look at the number of days since a user had registered.

[Image: df_days]

[Image: df_days2]

Now we need to subtract the registration timestamp from ts and convert the difference into days.
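The arithmetic is a subtraction followed by a unit conversion. The sketch below assumes both `ts` and `registration` are Unix timestamps in milliseconds. The function name and the sample values are hypothetical.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24  # milliseconds in one day


def days_since_registration(ts_ms, registration_ms):
    """Days between registration and an event, both given in milliseconds."""
    return (ts_ms - registration_ms) / MS_PER_DAY


# Hypothetical user: last event 30 days and 12 hours after registering.
registration = 1538352000000
last_event = registration + 30 * MS_PER_DAY + MS_PER_DAY // 2

days = days_since_registration(last_event, registration)
print(days)
```

For a per-user feature, the latest `ts` of each user would be the natural value to pass in, so that the result reflects how long the user has been with the service.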

[Image: df_days3]

# convert to pandas for the boxplot

[Image: boxplot]

Feature Engineering

Now that EDA has been performed, we can build out the features that seem most promising to train our model on.

The features we will build out are:

Categorical:

gender
level

Numerical:

number of songs per session
number of rollads actions
number of thumb down actions
number of thumbs up actions
number of friends added
number of songs added to playlist
number of different artists listened to on Sparkify
number of days since registering

We will also then add a churn label and join these all together. This will create a dataframe where each row holds the information for one individual user. Once we drop the userId, this dataframe can be vectorised, standardised and fed into our different machine learning algorithms.

First we will take our categorical variables and convert these into numeric variables, ready for our model.

Gender

Our first feature is gender which is a categorical one. We will assign a 1 for ‘female’ and a 0 for ‘male’.
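As a small plain-Python sketch of this encoding, the mapping below produces one value per user. It assumes the raw column holds the codes `F` and `M`. The field names and the helper `gender_feature` are hypothetical.

```python
GENDER_CODES = {"F": 1, "M": 0}  # assumed raw values: 1 for female, 0 for male


def gender_feature(events):
    """Return {userId: 0 or 1}, skipping events with an unknown gender value."""
    return {
        e["userId"]: GENDER_CODES[e["gender"]]
        for e in events
        if e.get("gender") in GENDER_CODES
    }


# Hypothetical events; repeated users collapse to a single entry.
events = [
    {"userId": "A", "gender": "F"},
    {"userId": "A", "gender": "F"},
    {"userId": "B", "gender": "M"},
    {"userId": "C", "gender": None},
]

gender = gender_feature(events)
print(gender)
```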

[Image: gender_userId]

[Image: gender_userId2]

[Image: userId_level]

Average Number of songs per session

Our third feature is average number of songs per session for each user.

[Image: userId_level2]

[Image: userId_level3]

Number of rollads actions

The next feature we can consider is the number of roll advert actions. The roll ad count was higher for those who churned, since those who use the app for free are shown ads whereas paid subscribers aren’t.

[Image: userId_level4]

Number of thumbs down actions

The fifth feature we can add to our feature dataframe is thumbs down. Users who churned had performed more thumbs down actions than those who stayed with the service.

[Image: userId_level5]

Number of thumbs up actions

We can do the same for thumbs up actions. Users who stayed with the service had performed more thumbs up actions.

[Image: userId_level6]

Number of friends added

Similarly, the number of friends added can indicate whether a user is likely to churn. In the past, those who added more friends stayed with the app.

[Image: userId_level7]

Number of songs added to playlist

Again, those who added more songs to their playlists had stayed with the service, so this can provide an indication of whether a user is likely to churn.

[Image: userId_level8]

Number of different Artists Listened to on Sparkify

As we discovered in EDA, users that listened to more diverse artists were less likely to churn.

[Image: userId_level9]

Number of Days Since Registering

Number of days since registering also looked useful from our EDA. We saw that users who had registered more recently churned more than those who had used the service for a longer time.

[Image: df_feature]

Now we have a dataframe with all the features we can feed into our model, where each row represents a user. However, first we need to do some preprocessing.

[Image: df_feature2]

[Image: df_feature3]

[Image: df_feature4]

Standardisation

Now that we have our vectors we can standardise our values. This is important for our machine learning models so that the features with the largest values don’t dominate the results. Standardising puts the individual features on a common scale, typically zero mean and unit variance. It does not change the shape of their distributions.
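To make the idea concrete, here is a plain-Python sketch of standardising a single feature column. It subtracts the mean and divides by the population standard deviation, which is one common convention. The scaler used in the notebook may differ in detail (for example, by using the sample standard deviation or by not centring). The function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardise(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical feature values, e.g. thumbs up counts for eight users.
thumbs_up = [2, 4, 4, 4, 5, 5, 7, 9]

scaled = standardise(thumbs_up)
print(scaled)
```

After scaling, a feature counted in the hundreds and one counted in single digits contribute on the same scale to models such as logistic regression and linear SVM.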

[Image: df_feature5]

Train / Test / Validation Split

Let’s check that the total number of records is 225, as it should be.

[Image: df_feature6]

This count is what we would expect. Now we can split our data into train, test and validation sets. Here we will do a 60:20:20 split and include a seed so that the result can be reproduced. I’ve used the same seed for the different machine learning models for the same reason.
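The effect of a seeded 60:20:20 split can be sketched in plain Python. This is not the notebook's Spark call: the helper `split_users` and the seed value 42 are hypothetical, and the sketch cuts exact sizes, whereas Spark's `randomSplit` assigns rows probabilistically and so gives only approximately those proportions.

```python
import random


def split_users(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows with a fixed seed and cut them into train/test/validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed, same order every run
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


rows = list(range(225))  # stand-in for the 225 user rows
train, test, validation = split_users(rows)
print(len(train), len(test), len(validation))
```

With only 225 users, each 20% set holds 45 rows, so metrics measured on them can move noticeably with a different seed.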

[Image: df_feature7]

Modelling

Now we have created our features dataframe with only numeric variables and split it into train, test, and validation sets. We will test out different machine learning classification algorithms including:

Logistic Regression
Random Forest Classifier
Gradient-Boosted Tree Classifier
Linear Support Vector Machine
Naive Bayes

We will use these classification algorithms since churn prediction is a binary classification problem, meaning that customers will either churn (1) or they will stay (0) in a certain period of time.

Metrics

We will evaluate the accuracy of the various models, tuning parameters as necessary, and finally determine our winning model based on test accuracy, reporting results on the validation set. Since the churned users are a fairly small subset, I will use F1 score as the metric to optimise. F1 is the harmonic mean of precision and recall, so it is only high when the model both finds most of the churned users and raises few false alarms. It gives a better picture of the incorrectly classified cases than accuracy, which can look good on imbalanced classes like ours simply by predicting the majority class.
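A small plain-Python sketch shows why accuracy alone can mislead here. It computes precision, recall and F1 for the churn class (label 1) on made-up labels. The helper names and data are hypothetical, and evaluation libraries may report an F1 averaged across both classes instead of this single-class value.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (churn = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Ten hypothetical users, two of whom churned.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_stay = [0] * 10                        # never predicts churn
some_signal = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # one hit, one miss, one false alarm

acc_stay = accuracy(y_true, always_stay)
f1_stay = precision_recall_f1(y_true, always_stay)[2]
acc_signal = accuracy(y_true, some_signal)
f1_signal = precision_recall_f1(y_true, some_signal)[2]
print(acc_stay, f1_stay, acc_signal, f1_signal)
```

Both sets of predictions have the same accuracy, but only the second identifies any churned user, and F1 is the metric that separates them.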

Now we can start modelling. When we identify the model with the best F1 score, accuracy and time we will then tune the model.

The models I have selected are below with the reasons why these have been chosen. Each model that has been chosen is suitable for our binary classification problem of predicting churn.

Logistic Regression: Logistic regression is the first machine learning algorithm we can try. It is a reliable choice for a binary classification problem and provides a model with good explainability. It is also easy to implement and interpret, efficient to train, and less prone to overfitting.
Random Forest: Random Forest is a powerful supervised learning algorithm that can be used for classification. RF is an ensemble method that builds multiple decision trees and takes a majority vote of their predictions, which can help avoid overfitting. RF is also robust and performs well on imbalanced datasets like we have here.
Gradient Boosted Tree Classifier: GBT provides good predictive accuracy. It builds one tree at a time, where each new tree helps correct errors made by the previous ones, whereas RF builds its trees independently. There is a risk of overfitting with GBT, so this needs to be considered. However, GBT performs well with unbalanced data, which we have here.
Linear SVC: SVC is another supervised learning binary classification algorithm. It works well when there is a clear margin of separation between classes and is memory efficient.
Naive Bayes: Finally, we will try Naive Bayes. This is another classifier that is easy to implement and fast.

Training the Models & Evaluating the Model Performance

Steps:

Instantiate
Fit Models on Train
Predicting
Evaluating

[Image: df_feature8]

[Image: df_feature9]

[Image: df_feature10]

Now that we have our results we can choose our best model. Random Forest and Gradient Boosted Trees performed well but random forest was faster so I will choose this one to tune.

Model Tuning for Best Models:

Now we can tune our model using ParamGridBuilder and CrossValidator. I am going to select Random Forest since this is the best compromise between F1 score, accuracy, and time to run. Random Forest had an F1 score of 0.87 and an accuracy of 0.88 and took 2 min 57 s, compared to GBT, which achieved a similar score of 0.88 for both F1 score and accuracy but took 3 min 51 s.

Random Forest

[Image: df_feature11]

Parameters

I will select numTrees and maxDepth for our RF model tuning.

numTrees: I have chosen to go up to 100 trees to improve performance. Since these trees are individual randomised models in an ensemble, there is not a great risk of overfitting with this parameter.
maxDepth: I have chosen a maximum of 15 to reduce the possibility of overfitting. Anything over 15 would greatly increase the risk of overfitting.
numFolds: I originally had numFolds = 5 but had to change it to 3 to speed up the process.
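To see why dropping from 5 folds to 3 speeds things up, it helps to count the model fits a grid search performs. The sketch below is plain Python rather than Spark code, and the specific grid values are hypothetical. The text only fixes the upper ends (100 trees, depth 15).

```python
from itertools import product

# Hypothetical grid: the values below are assumptions, capped as described above.
param_grid = {
    "numTrees": [20, 50, 100],
    "maxDepth": [5, 10, 15],
}


def grid_combinations(grid):
    """Every combination of parameter values, as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combinations = grid_combinations(param_grid)

# Each combination is trained once per fold during cross-validation.
fits_with_3_folds = len(combinations) * 3
fits_with_5_folds = len(combinations) * 5
print(len(combinations), fits_with_3_folds, fits_with_5_folds)
```

Under these assumed values, three folds need 27 fits during cross-validation instead of 45, at the cost of each fold's score resting on fewer held-out users.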

[Image: df_feature12]

Best Model Performance Results:

We can now get the final results for our random forest model.

[Image: df_feature13]

[Image: df_feature14]

Feature Importance:

Finally, we can check the feature importance for our best model and plot this in a chart.

[Image: df_feature15]

[Image: df_feature16]

Business Impact

Now, Sparkify can use this information to target customers who are likely to churn and offer attractive incentives to stay, thereby saving Sparkify revenue and getting the customer a nice deal. Since we found that newer customers are more likely to churn, we could target them with a free trial of the premium service without those pesky ads! Sparkify could also work on a music recommendation system so that it can recommend songs that users will enjoy more and thumbs down less.

Project Reflection

From this project I have learned how to manipulate datasets with Spark to engineer relevant features for predicting churn. I used Spark MLlib to build machine learning models to predict churn. It was interesting to start with a dataset of customers’ user interactions and then use this to predict whether or not they were likely to churn. The best model was the Random Forest classifier, which achieved an accuracy and F1 score of 0.88. It was interesting to build my first churn prediction model in PySpark as opposed to pandas.

Future Work

This project could have been improved by:

doing more feature engineering to select the best features to get a better score
considering overfitting problems in more depth
analysing mispredicted users

Conclusions

We started the project with a small dataset of just 128MB and 225 unique customers. After loading and cleaning our data we explored the dataset for useful features to predict churn and were able to build out the most promising features. We then preprocessed these and used the features with different machine learning algorithms. Random Forest performed the best, so we tuned the model and achieved an accuracy and F1 score of 0.88.

Github Pages Blog Post

The main findings of the code can be found in my GitHub Data Science Portfolio, which explains the technical details of my project. A Random Forest classifier was chosen as the best model by evaluating F1 score and accuracy metrics. The final model achieved an F1 score and accuracy of 0.88.

Licensing, Authors, Acknowledgements, etc.

I’d like to acknowledge Udacity for the project idea and workspace.

References

https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition
https://stackoverflow.com/questions/46921465/extract-substring-from-text-in-a-pandas-dataframe-as-new-column
https://developers.whatismybrowser.com/useragents/explore/layout_engine_name/trident/
https://sparkbyexamples.com/pyspark/pyspark-when-otherwise/
https://stackoverflow.com/questions/52943627/convert-a-pandas-dataframe-to-a-pyspark-dataframe
https://stackoverflow.com/questions/29600673/how-to-delete-columns-in-pyspark-dataframe
https://stackoverflow.com/questions/48738354/having-troubles-joining-3-dataframes-pyspark
https://stackoverflow.com/questions/59886143/spark-dataframe-how-to-keep-only-latest-record-for-each-group-based-on-id-and
https://stackoverflow.com/questions/46956026/how-to-convert-column-with-string-type-to-int-form-in-pyspark-data-frame
https://medium.com/swlh/logistic-regression-with-pyspark-60295d41221
https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa
https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression
https://stackoverflow.com/questions/60772315/how-to-evaluate-a-classifier-with-apache-spark-2-4-5-and-pyspark-python
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
https://stackoverflow.com/questions/32565829/simple-way-to-measure-cell-execution-time-in-ipython-notebook
https://www.silect.is/blog/random-forest-models-in-spark-ml/
https://stackoverflow.com/questions/75440/how-do-i-get-the-string-with-name-of-a-class
