My first project as a Data Science student @ Flatiron

I’m currently a full time student at flatiron doing the online data science program and as part of the final project of Module 1, every student has to post a blog of some finding, some nifty code, or something that we found interesting related to the subject of the project.

My background is in geophysics and prior to starting this intensive program I had no coding experience, so I have found this a fascinating but very complex new world.

As you can imagine, in 3 weeks of online training, the chances of me finding a proper “nifty code” are vague, or if you want to see it in a different more positive way, every code line if nifty as it is new to me.

Now to the question: what do I write about? It has to be related to the project, but my coding skills are not great at this point so I won’t impress anyone with this blog. With that in mind, I decided to tell the story of how I created my first SQL database, tried to map it manually on a board to get a SQL schema, and then use it to run queries that would make my life easier when it comes to answering the questions of the project.

The project that I’m about to hand it later on tonight is a fictional situation where Microsoft has seen how all the big companies are creating original video content and are making large profits from it. Now, they have hired my team to help them better understand the movie industry. Our goal is to use data analysis to shortlist which films are doing the best at the box office and based on our findings make some recommendations to their CEO.

I won’t bother readers with everything I had to do to reach some obvious conclusions. What I will try to do is tell a story about one of the questions and along the way drop a few basic code lines and some cool plots that I manage to make.

The question that I will address is when would it be the best time of the year to release movies. So why did I choose this question? Because I am a movie freak and I know, or at least that is what I think, that the majority of the blockbuster movies are released during the summer holidays, Christmas and around April/May. I won’t go into details trying to justify this, so I will leave up to you to either trust me on that, or do your own research and have your own opinion.

Ok, so lets start with the project now. Where did I get the data to do this type of study? I didn’t have to do any online research, work with APIs or even scrape the web for it. As part of the package for the project, Microsoft gave us 11 files/dataset from different sources listed below:

Box Office Mojo

  • bom.movie_gross.csv.gz

IMDB

  • imdb.name.basics.csv.gz
  • imdb.title.akas.csv.gz
  • imdb.title.basics.csv.gz
  • imdb.title.crew.csv.gz
  • imdb.title.principals.csv.gz
  • imdb.titlee.ratings.csv.gz

TMDB

  • tmdb.movies.csv.gz

The Number

  • tn.movie_budgets.csv.gz

Rotten Tomatoes:

  • rt.movies_info.tsv.gz
  • rt_reviews.tsv.gz

Now we were told that this data came from different sources (other students) so it is quite messy and will need some cleaning. The advice that we received was to open the files, look at them and think about the questions that you want to answer, based on this build an SQL database or simply work in Pandas and then come up with your conclusions and recommendations for Microsoft.

When I did the module on SQL queries I found them to be very useful and friendly. They are fast and you really don’t need to do much in terms on joins as you call everything with your queries. So I decided on the following workflow:

1. Load the 11 datasets into Jupyter Notebooks

2. Clean the 11 datasets using the pandas library

3. Draw the map of the 11 datasets on a board to create a preliminary SQL schema, or at least attempt to do it

4. Create the SQL database

5. Run queries to create the DataFrames that I could use for my analysis

6. Generate a few plots

7. Analyse the plots and come up with conclusions

It doesn’t sound that hard doesn’t it? But remember, no coding experience and 7 days to do this plus a long list of deliverables which included a GitHub repository with all the files cleaned, a proper README, the notebook itself, a PowerPoint Presentation in a PDF format, a video giving the presentation to your client (Microsoft) and of course this blog. So if you ask me, quite challenging.

So with that plan in my head I thought naively that if I followed it once the files are cleaned and the SQL database is done and running, the rest will be a piece of cake. How wrong was I? but I will go into that later on this blog.

At this point I have an idea in my head and nothing will stop me. I opened Jupyter Notebook and call all the necessary libraries: Pandas, Numpy, Sqlite3, matplotlib, Seaborn, os, glob and of course the magical line that allows plots to be displayed on my Jupyter notebook: %matplotlib inline. Following this is the loading of my files and now the cleaning starts.

11 different datasets to be clean in one single notebook didn’t sound like a great idea to me. So I decided to create one notebook per dataset and then just a folder with all of them cleaned.

Here is where the fun started because of no previous experience. How do I perform the cleaning? Is there a workflow that I should follow? Yes there is. In fact there are many for free online that will guide you, so that will not be a problem. The only problem will be time. I watched a few videos here and there and also read a few documents and those were very helpful, but it is always a different story when you try on your own.

As I had 11 dataset I treated them as isolated cases with no recipes except for some standard functions to better understand the dimensions of each files, the columns and type of data that I was going to be working with. Because I have no proper general workflow I will just add some bits and pieces of what I did for one of the datasets.

The example that I will use corresponds to the bom.movie_gross.csv from Box Office Mojo.

1. The first step I did was to get to know the data. With the file loaded through the pd.read_csv(‘./filename’) I ran a .head() and a .tail() to see the top and bottom 5 rows. This will not only show me the names of the columns that I will be working with but also will give me a glance at any possible missing values. The output of both commands is on Figure 1 below:

Snapshot of my first attempt to clean a dataframe in Jupyter Notebook using pandas

By looking at that image you will see quite fast that there are some issues. The column called “foreign_gross” has a few NaN values, so I will definitely need to look into that and find a solution if I am to use this as one of my inputs for the SQL database.

Before I did anything I ran a df_bm.shape and a df_bm.info(). The first command gave me the shape of my file (rows=3387, Columns=5, see image below). The second gave me a more interesting output:

Info() output for the DataFrame df_bm

So why is it interesting? Well it is interesting because it not only gave me the number of rows again but also listed each column on the left side with the number of rows of data that each has and the type of each column. If you focus for example on the “foreign_gross” column we see that it has 2037 non-null values. So where are the rest? The table/dataframe has 3387 rows, therefore I am missing 1350 rows. That’s a high number of almost equivalent to 40% of the data. Another interesting information here is that the “foreign_gross”, which is supposed to be a number has a type equal to “object”. So more things to fix.

So where are those 1350 rows from the “foreign_gross”? I ran a .isna() followed by a .sum() to have the total number of na values and the results are once again interesting:

Snapshot from Jupyter Notebook showing the sum of NaN values in the df_bm dataframe

At this point I have a few things to address but all seem to coe from the same column, the “foreign_gross”. I didn’t want to just drop the column wihout before exploring the rest of the 10 files. By doing so I realized that I didn’t need this column because one other dataset had over 25000 rows of data and had enough information to calculate the “foreign_gross” by substracting the “domestic_gross” from the “worldwide_gross”. So yes, I can drop this column and the studio as I knew it was one I wasn’t going to be using. For this I used the following command:

df_bm.dropna(subset=[‘studio’, ‘domestic_gross’], inplace=True)

Then I double check for the sum of the na values and all seems to be ok now:

Snapshot from Jupyter Notebook showing the sum of NaN values per column

With no missing values, I ran a quick code lines for duplicates:

duplicates = df_bm[df.bm.duplicated()]

print(len(duplicates))

Luckily for this dataset, no duplicates were found.

For this file there was nothing complicated. It was all relatively straight forward and not much code was needed. The last thing I did was changed the name of one of the columns from “title” to “movie_title” with the following command:

df_bm = df_bm.rename(columns = {‘title’: ‘movie_title’})

Ok so this is one of the simplest files which was not intended as my choice but it happened. So just for the purpose of giving you a hint of the others I will list some of the other functions that I used to clean them:

1. Deleted duplicates using: df = df.drop_duplicates()

2. Replaced NaN values with the str “missing”: df = df.replace(np.nan, “missing”)

3. Changed the date format using: df[‘date’] = pd.to_datetime(df.date]

As you can see nothing too complicated but just a lot of work to go over all the files and them used them to create the SQL dataframe.

To create this database we used the following code:

Creating the SQL Database with the cleaned 11 dataset

Now with the databased created I will share what I thought I had as an SQL Schema:

Preliminary SQL Schema

That is just preliminary as in reality the links between the tables aren’t all working. Once you join tables using the SQL querries you end up with a new set of duplicate values and a bunch of inconsistencies that give you errors. Why? Because the tables are not consistent and therefore don’t really talk to each other. In order to get this finalized more time and work will be needed. So for me some benefits come from this database. Clearly not what I expected but I must admit that the process to reach to this point if challenging but fun and now that some queries do work, I can see the potential of having it work properly.

One of the queries that worked was the following:

SQL query with df output below

After a few modifications here and there I managed to end up with the following two data frames:

First dataframe showing the mean of the worldwide_gross and profit per season
Second dataframe showing mean of the worldwide_gross and profit per calendar month

Now that I have those two tables, the next step of creating plots using matplotlib and seaborn is relatively easy. As this was my first proper project I did quite a few different plots, but I will share only a couple here to cut this story soon.

The code for my first plot is below with the plots following:

Code for creating subplot using two different dataframes and the Seaborn library
Subplots created in Seaborn diplaying the mean profit per season (left) and mean profit per month (right)

The second set of plots that I will share are again subplots. This time 4x1 and corresponding to the seasons divided by month. They are all at the same scale so that a proper compasiron can be done and the colors match with the left plot for consistencies purposes. The code is the following:

Code for 4x1 subplots illustrating the mean profit per month separated by seasons
Mean profit per month plotted per season

So there you go, a simple process that took me the entire week and the conclusions as I mentioned at the beginning are very predictable as these last plots suggest. The best seasons for releasing movies are summer rand winter. What I was not expecting is that the best month is January. June as number 2 was predictable and April as well. December didn’t do too bad for a number 4, and of course there is Autumn as the worst season to release movies.

I won’t go into the conclusions of this study as it is not the intention of this blog.

I hope I didn’t bore you that much and for those that like technical stuff I promise on my next blog I will be putting a lot more code and less words

Until I guess my next project in Module 2

Javier

Data Scientist formerly an engineer with 20 years of experience in the energy industry. Fan of ML and DL and everything related to AI