Real-world data rarely comes clean. Using Python and its libraries, I will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. I will document my wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.
The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people’s dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because “they’re good dogs Brent.” WeRateDogs has over 4 million followers and has received international media coverage.
WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for me to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.
My tasks in this project are as follows:
Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:
The WeRateDogs Twitter archive. I was given this file, so I treat it as a file on hand and download it manually by clicking the following link: twitter_archive_enhanced.csv
The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity’s servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
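A minimal sketch of the programmatic download, assuming the Requests library is installed. The helper name download_file and its get parameter are hypothetical and exist only to keep the example easy to try without network access; they are not part of the project brief.

```python
from pathlib import Path

# URL taken from the project description above.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, dest_dir=".", filename=None, get=None):
    """Download `url` and write the response body into `dest_dir`.

    `get` is a hypothetical hook: it defaults to requests.get, but any
    callable with the same shape can be passed in for offline testing.
    """
    if get is None:
        import requests  # imported lazily so the sketch loads without it
        get = requests.get

    response = get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors such as 404

    # Default to the last URL segment, e.g. "image-predictions.tsv".
    name = filename or url.rsplit("/", 1)[-1]
    dest = Path(dest_dir) / name
    dest.write_bytes(response.content)
    return dest


# Usage (performs a real HTTP request):
# path = download_file(IMAGE_PREDICTIONS_URL, filename="image_predictions.tsv")
```

Writing response.content as bytes keeps the file exactly as served, which avoids guessing at the text encoding before pandas reads it.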
I need to be able to create written documents that contain images and I need to be able to export these documents as PDF files.
Data is successfully gathered:
Each piece of data is imported into a separate pandas DataFrame at first.
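As a rough illustration of loading each file into its own DataFrame, the sketch below reads a comma-separated and a tab-separated source with pandas. The inline sample rows and the column names jpg_url and p1 are assumptions made so the example runs on its own; with the real files, the file paths would replace the StringIO objects.

```python
import io

import pandas as pd

# Tiny stand-ins for the real files (sample rows are made up).
archive_csv = io.StringIO(
    "tweet_id,timestamp,text\n"
    "1,2017-08-01 16:23:56,This is Phineas. 13/10\n"
)
predictions_tsv = io.StringIO(
    "tweet_id\tjpg_url\tp1\n"
    "1\thttps://example.com/a.jpg\tgolden_retriever\n"
)

# Read IDs as strings so long tweet IDs are never treated as numbers.
archive_df = pd.read_csv(archive_csv, dtype={"tweet_id": str})

# The predictions file is tab-separated, so the separator must be given.
predictions_df = pd.read_csv(predictions_tsv, sep="\t", dtype={"tweet_id": str})

print(archive_df.shape, predictions_df.shape)
```

Keeping the DataFrames separate at first makes it easier to assess each source on its own before merging.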
We can see that the number of Retweets is higher at certain times of day, such as 4 p.m. (hour 16) and 5 p.m. (hour 17). In contrast, the number of Retweets drops sharply at 3 a.m., 4 a.m., and 1 p.m.
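One way such an hourly comparison could be computed is sketched below. The column names timestamp and retweet_count, the sample values, and the choice of the mean as the summary are assumptions for illustration, not the exact code behind the plot.

```python
import pandas as pd

# Made-up sample rows; the real data would come from the cleaned archive.
tweets = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2017-01-02 16:05:00",
                "2017-01-02 16:40:00",
                "2017-01-03 03:10:00",
                "2017-01-03 17:20:00",
            ]
        ),
        "retweet_count": [3000, 5000, 400, 4200],
    }
)

# Group by the hour of day (0-23) and average the retweets in each hour.
retweets_by_hour = tweets.groupby(tweets["timestamp"].dt.hour)[
    "retweet_count"
].mean()

print(retweets_by_hour)
```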
Please keep in mind that the numbers 0, 1, and 6 indicate Monday, Tuesday, and Sunday, respectively. Compared with Saturday and Sunday, Tuesday and Friday perform better in terms of Retweets.
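The weekday numbering matches Python's own convention, where Monday is 0 and Sunday is 6. The short sketch below shows the mapping with the standard library; the helper name weekday_label is hypothetical.

```python
from datetime import date

# Python's date.weekday() numbers days from Monday (0) to Sunday (6).
DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def weekday_label(day):
    """Return (weekday number, weekday name) for a date."""
    number = day.weekday()
    return number, DAY_NAMES[number]


print(weekday_label(date(2017, 8, 1)))
```

pandas uses the same numbering for Series.dt.dayofweek, so the values on the plot axis can be translated with the same list.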
The Breeds with the most Retweets (those with over 23 Retweets over the time range specified in the DataFrame) are presented in the above plot. I observed that average retweet counts differ widely between breeds. The average retweet count for French Bulldogs is over 2500, while the average retweet count for Pugs and Toy Poodles is less than 1500.
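A sketch of how a per-breed average with a minimum-count filter might be computed. It assumes the threshold applies to the number of tweets per breed, and the column names breed and retweet_count, the sample values, and the small threshold are all made up for the example.

```python
import pandas as pd

# Made-up sample rows standing in for the merged, cleaned data.
tweets = pd.DataFrame(
    {
        "breed": [
            "french_bulldog", "french_bulldog", "french_bulldog",
            "pug", "pug", "pug",
            "chow",
        ],
        "retweet_count": [2000, 3000, 4000, 1000, 1200, 1400, 9000],
    }
)

MIN_TWEETS = 2  # the write-up uses 23; lowered here for the tiny sample

# Count tweets and average retweets per breed in one pass.
stats = tweets.groupby("breed")["retweet_count"].agg(["count", "mean"])

# Keep only breeds with enough tweets, ordered by average retweets.
popular = stats[stats["count"] > MIN_TWEETS].sort_values("mean", ascending=False)

print(popular)
```

Filtering on the count first keeps a breed with a single viral tweet (chow in the sample) from dominating the comparison.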
There are nearly 6000 tweets on WeRateDogs. I was able to examine approximately 1500 tweets. The most popular dog breeds are the Golden Retriever (143), Labrador Retriever (103), Pembroke (94), and Chihuahua (87).
As shown in the graph below, the page grew in popularity over time. The number of favourites seems to be increasing. We might assume that as the WeRateDogs account grew in popularity, its tweets became more and more popular.
Dogs are divided into four stages by WeRateDogs: doggo, pupper, puppo, and floof(er). According to the graph below, pupper is the most common stage, followed by doggo, and floofer is quite rare.
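To illustrate how stage counts like these could be tallied, here is a small sketch that looks for the stage words in tweet text. Matching on the text with a regular expression is an assumption for the example, as are the sample tweets and the helper name count_stages.

```python
import re
from collections import Counter

# Whole-word, case-insensitive match for the four stage names.
STAGE_RE = re.compile(r"\b(doggo|pupper|puppo|floofer)\b", re.IGNORECASE)


def count_stages(texts):
    """Count how many tweets mention each stage (once per tweet)."""
    counts = Counter()
    for text in texts:
        # A set ensures a stage repeated in one tweet is counted once.
        counts.update({stage.lower() for stage in STAGE_RE.findall(text)})
    return counts


sample = [
    "Here is a pupper. 12/10",
    "This doggo and this PUPPER are pals. 13/10",
    "Such a floofer. 11/10",
    "No stage mentioned here. 10/10",
]
print(count_stages(sample).most_common())
```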
Lower ratings were more common near the start of the account’s activity. With the passage of time, fewer and fewer dogs received a low rating, while more and more received a high one.
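A trend like this could be checked by averaging the rating numerator per year. The sketch below is one possible approach using only the standard library; extracting the rating from the tweet text with a simple pattern, the sample tweets, and the helper name mean_numerator_by_year are assumptions for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# First "number/number" pattern in the text, e.g. "13/10".
RATING_RE = re.compile(r"(\d+)/(\d+)")


def mean_numerator_by_year(tweets):
    """Average rating numerator per year for ratings out of 10.

    `tweets` is an iterable of (ISO date string, tweet text) pairs.
    """
    by_year = defaultdict(list)
    for timestamp, text in tweets:
        match = RATING_RE.search(text)
        # Skip tweets without a rating or with an unusual denominator.
        if match and int(match.group(2)) == 10:
            by_year[timestamp[:4]].append(int(match.group(1)))
    return {year: mean(values) for year, values in sorted(by_year.items())}


sample = [
    ("2015-11-16", "Meet Bo. 7/10 would still pet"),
    ("2015-12-01", "Not bad at all. 9/10"),
    ("2017-06-01", "This is Stella. 13/10"),
    ("2017-07-01", "Here we have a good boy. 12/10"),
]
print(mean_numerator_by_year(sample))
```

Restricting the average to a denominator of 10 keeps group ratings with larger denominators from skewing the yearly figures.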
Favorites ratio Vs Retweets
Two reports: