All Projects

Data Science Portfolio


About Me


Abdishakur Yoonis


MSc Data Scientist
MSc Software Engineer
BSc Software Engineer

Data Science
Data Analysis
Artificial intelligence (AI)
Machine Learning
Data Engineering
Business Intelligence
Software Engineering
Software Development
Python, R, SQL, C#, Java, JavaScript and many more

LinkedIn, GitHub
Feel free to reach out or follow

A compilation of personal projects and projects from online courses. This portfolio, as a whole, aims to demonstrate proven experience in Data Science principles, including obtaining and cleaning data, building Extract, Transform, Load (ETL) pipelines, Exploratory Data Analysis (EDA), and building and validating Machine Learning models.

Project Capstone Of Sparkify

Mar 2022 – Apr 2022

In this project, I load and manipulate a music app dataset, similar to Spotify's, with Spark to engineer relevant features for predicting churn. Here, churn means a user cancelling the service altogether. By identifying these customers before they churn, the business can offer them discounts and incentives to stay, potentially saving revenue.

I implement and test several machine learning algorithms. The methods used are:

  • Logistic Regression
  • Random Forest Classifier
  • Gradient-Boosted Tree Classifier
  • Linear Support Vector Machine
  • Naive Bayes

Project Motivation

For this project I was interested in predicting customer churn for a fictional music streaming company: Sparkify.

The project involved:

  • Loading and cleaning a small subset (128 MB) of the full dataset (12 GB)
  • Conducting Exploratory Data Analysis to understand the data and what features are useful for predicting churn
  • Feature Engineering to create features that will be used in the modelling process
  • Modelling using machine learning algorithms such as Logistic Regression, Random Forest, Gradient Boosted Trees, Linear SVM, Naive Bayes
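
The feature-engineering step above can be sketched in plain Python, without Spark, to show the idea. This is only an illustration: the event fields (`userId`, `page`), the sample events, and the use of a "Cancellation Confirmation" page as the churn marker are assumptions, not details taken from the project description.

```python
from collections import defaultdict

# Hypothetical event log: one dict per user action (field names are assumed).
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "1", "page": "NextSong"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Cancellation Confirmation"},
]

CHURN_PAGE = "Cancellation Confirmation"  # assumed marker of a cancelled account


def build_user_features(events):
    """Aggregate raw events into one feature row per user, with a churn label."""
    features = defaultdict(lambda: {"songs_played": 0, "total_events": 0, "churn": 0})
    for event in events:
        row = features[event["userId"]]
        row["total_events"] += 1
        if event["page"] == "NextSong":
            row["songs_played"] += 1
        if event["page"] == CHURN_PAGE:
            row["churn"] = 1
    return dict(features)


user_features = build_user_features(events)
```

Each resulting row (songs played, total events, churn label) is the kind of per-user record a classifier such as Logistic Regression could be trained on.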

Predicting Price Seattle

Feb 2021 – Mar 2021

I was curious about the Airbnb dataset for Seattle and wanted to learn more about pricing patterns, customer feedback, and price forecasting. Some of the questions I’ve looked into are:

Article Recommendations

Apr 2022

This project focuses on analyzing interactions between users and articles on the IBM Watson Studio platform. New article recommendations are made to users based on their interactions with articles. Based on the data available, we can use various methods to make these recommendations. The methods used here are Rank Based, Collaborative Filtering, and Matrix Factorization.
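
As a minimal sketch of the rank-based method, the most popular articles can be recommended to a user after removing the ones they have already read. The interaction pairs and the function names below are hypothetical and only illustrate the idea. They are not the project's actual code.

```python
from collections import Counter

# Hypothetical (user_id, article_id) interaction pairs.
interactions = [
    (1, "a"), (2, "a"), (3, "a"),
    (1, "b"), (2, "b"),
    (3, "c"),
]


def top_articles(interactions, n, exclude=()):
    """Return the n most-interacted-with article ids, skipping any in `exclude`."""
    counts = Counter(article for _, article in interactions)
    ranked = [article for article, _ in counts.most_common() if article not in exclude]
    return ranked[:n]


def recommend_for_user(interactions, user_id, n):
    """Rank-based recommendation: popular articles the user has not seen yet."""
    seen = {article for user, article in interactions if user == user_id}
    return top_articles(interactions, n, exclude=seen)
```

A brand-new user has no history, so `recommend_for_user` simply returns the overall most popular articles, which is why rank-based recommendations are a common fallback when collaborative filtering has nothing to work with.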

Disaster Response Pipeline | Web-App

Feb 2022
  • Used frameworks such as NLTK and Scikit-Learn to perform ETL, build an ML pipeline, and deploy the ML model to a local web application.
  • The ML pipeline processes 26,000 raw text messages using NLTK and Scikit-Learn to build a multi-output classification model.
  • Maximized F1 score through feature engineering and parameter tuning.
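
For reference, the F1 score mentioned above is the harmonic mean of precision and recall. The sketch below computes it per label and averages over the output columns of a multi-output model. The macro-averaging is an assumption, since the text does not say how the per-label scores were combined, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for one binary label: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true_rows, y_pred_rows):
    """Average the per-label F1 over all output columns of a multi-output model."""
    n_labels = len(y_true_rows[0])
    scores = []
    for j in range(n_labels):
        true_col = [row[j] for row in y_true_rows]
        pred_col = [row[j] for row in y_pred_rows]
        scores.append(f1_score(true_col, pred_col))
    return sum(scores) / n_labels
```

Each row here is one message and each column one category, so a message can belong to several categories at once.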

Communicate Data Finding

May 2021
  • I investigate the Ford GoBike System dataset, assess its quality and tidiness, and then clean it, a process known as data wrangling.
  • The Ford GoBike System dataset includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area.
  • Most users are Subscribers.
  • Subscribers make the fewest trips on the weekend (Saturday and Sunday), while Customers make the fewest trips on Wednesday and relatively more on the weekend.
  • Most trips take less than 15 minutes, with the peak at about 10 minutes.
  • Subscribers have a narrower range of trip durations than Customers.
  • Subscribers make more specific trips than casual Customers.
  • Most users are Subscribers, and the dominant gender is male.
  • Users aged between 25 and 35 make the most trips.
  • Most trips take less than 15 minutes, and most last about 10 minutes.
  • The fewest trips occur on the weekend (Saturday and Sunday), while the most trips occur on Thursday.
  • Trips peak during the rush hours of 8 am and 5 pm (hours 8 and 17), while the number of trips is lowest, and falls rapidly, from midnight until dawn.
  • Subscribers tend to rent bikes on working days, while Customers tend to rent more on the weekend (Saturday and Sunday) and for longer durations.
  • Subscribers tend to rent bikes for shorter durations, while Customers tend to rent for longer durations.
  • For both user types, trip duration decreases as age increases; older Subscribers (aged 60-80) have longer trip durations and more trips than Customers.
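
Findings like the ones above come from simple group-by aggregations. The sketch below shows one way to count trips per user type and weekday and to average trip durations. The ride records and field names (`start_time`, `user_type`, `duration_sec`) are assumed for illustration and may not match the actual dataset columns.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical ride records; the field names are assumed for illustration.
rides = [
    {"start_time": "2019-02-28 08:05:00", "user_type": "Subscriber", "duration_sec": 540},
    {"start_time": "2019-02-28 17:10:00", "user_type": "Subscriber", "duration_sec": 600},
    {"start_time": "2019-03-02 11:30:00", "user_type": "Customer", "duration_sec": 1500},
]


def trips_by_weekday(rides):
    """Count trips per (user type, weekday name)."""
    counts = Counter()
    for ride in rides:
        start = datetime.strptime(ride["start_time"], "%Y-%m-%d %H:%M:%S")
        counts[(ride["user_type"], WEEKDAYS[start.weekday()])] += 1
    return counts


def mean_duration_minutes(rides, user_type):
    """Average trip duration in minutes for one user type."""
    durations = [r["duration_sec"] / 60 for r in rides if r["user_type"] == user_type]
    return sum(durations) / len(durations)
```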

Analyze AB Test Results

Apr 2021
  • Remove duplicates or records with missing or mismatched values
  • Handle the rows where the landing_page and group columns don’t align
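
The two cleaning steps above can be sketched as follows. The sample rows, the page values `new_page` and `old_page`, and the `user_id` and `converted` fields are assumptions made for illustration. Only the `landing_page` and `group` column names come from the description.

```python
# Hypothetical rows; the values "new_page"/"old_page" are assumed for illustration.
rows = [
    {"user_id": 1, "group": "treatment", "landing_page": "new_page", "converted": 1},
    {"user_id": 2, "group": "control", "landing_page": "old_page", "converted": 0},
    {"user_id": 3, "group": "treatment", "landing_page": "old_page", "converted": 0},
    {"user_id": 2, "group": "control", "landing_page": "old_page", "converted": 0},
]

# Assumed pairing: the treatment group sees the new page, control sees the old one.
EXPECTED_PAGE = {"treatment": "new_page", "control": "old_page"}


def clean_rows(rows):
    """Drop misaligned rows, then keep only the first record per user_id."""
    seen_users = set()
    cleaned = []
    for row in rows:
        if EXPECTED_PAGE.get(row["group"]) != row["landing_page"]:
            continue  # group and landing_page disagree, so the row is unusable
        if row["user_id"] in seen_users:
            continue  # duplicate user
        seen_users.add(row["user_id"])
        cleaned.append(row)
    return cleaned
```

Rows where the group and page disagree are dropped rather than repaired, because there is no way to tell which page the user actually saw.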

Data Analytics:

  • Compute probabilities of converting:
    • Regardless of page
    • Given that an individual received the treatment
    • Given that an individual received the control page
  • Perform Hypothesis Testing and calculate p-values
  • Conduct Logistic Regression

Based on the statistical tests conducted (the Z-test and the logistic regression model) and the observed difference, the new and old pages have roughly equal chances of converting users. The null hypothesis is not rejected. I advise the e-commerce business to retain the old page, which would save the time and money needed to launch a new one.
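
A two-proportion Z-test of this kind can be sketched with the standard library alone. The one-sided alternative (the new page converts better) and the pooled standard error are assumptions about how the test was set up, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(conv_old, n_old, conv_new, n_new):
    """One-sided z-test of H0: p_new <= p_old against H1: p_new > p_old."""
    p_old = conv_old / n_old
    p_new = conv_new / n_new
    # Under H0 both pages share one conversion rate, so pool the counts.
    pooled = (conv_old + conv_new) / (n_old + n_new)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

A large p-value, as in the conclusion above, means the data give no evidence that the new page converts better, so the null hypothesis is not rejected.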

Wrangle And Analyze Data of WeRateDogs

Jun 2021

The dataset that I wrangle, analyze, and visualize is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people’s dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10: 11/10, 12/10, 13/10, etc. Why? Because “they’re good dogs Brent.” WeRateDogs has over 4 million followers and has received international media coverage. A second file, the tweet image predictions, records what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity’s servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
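
The programmatic download can be sketched as below. `requests.get` and `raise_for_status` are part of the Requests library. The helper names `download_text` and `parse_tsv`, the 30-second timeout, and the assumption that the file has a header row are illustrative choices, not details from the project.

```python
import csv
import io

URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_text(url, timeout=30):
    """Fetch a text file over HTTP with the Requests library."""
    import requests  # third-party; imported here so parse_tsv works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_tsv(text):
    """Turn tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

Calling `parse_tsv(download_text(URL))` would then give one dict per prediction row, ready to be joined with the tweet archive.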