This is the third project I am doing for my Data Science Bootcamp at Flatiron and, as you can read in the title, this one is about the NBA. For this project we had a series of datasets to choose from, or the freedom to gather our own data, so I decided to go with something that I really like, and that is basketball.
I am not an expert in basketball but I have been following it for almost 30 years now so I think I know a bit about it. As a fan of NBA statistics I always wanted to write my own code to predict game results, and now I can actually do it.
For this project I used a series of Machine Learning (ML) classifier algorithms to try to predict the winning team. Because I am using classifiers, I am predicting the chances that the Home Team has of winning or losing the game.
The workflow that I used is the typical data science project one which consists of the following steps:
- Data Gathering
- Data Cleaning and Conditioning
- EDA (Exploratory Data Analysis)
- Model(s) selection and Training
- Evaluation of the Model(s)
- Final Modeling
- Post-evaluation of the Model
So with no further ado, let’s expand on each one of these steps and add some results along the way.
1. Pre-Evaluation. Because all our projects need approval before we embark on them (and realise a week later that they are not doable), this first step is quite important.
Here I evaluated the project idea and tried to confirm if enough data was available for later training and testing.
The idea is quite simple to explain: given an NBA season (2018–2019), I will use the game results to engineer a new binary column that I called “Game_Result”, which indicates 1 if the home team (Team2) won the game or 0 if the home team lost it. The visiting team will be called Team1, and from now onwards I will use this nomenclature.
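As a quick illustration, engineering that column is essentially a one-liner in pandas. The score column names below are hypothetical stand-ins for whatever the scraped game log actually uses:

```python
import pandas as pd

# Hypothetical subset of a season's game log: Team1 is the visitor, Team2 the host
games = pd.DataFrame({
    "Team1":     ["Lakers", "Celtics", "Heat"],
    "Team2":     ["Spurs", "Bucks", "Raptors"],
    "Team1_PTS": [102, 110, 95],
    "Team2_PTS": [98, 115, 101],
})

# Game_Result = 1 if the home team (Team2) outscored the visitors, else 0
games["Game_Result"] = (games["Team2_PTS"] > games["Team1_PTS"]).astype(int)
print(games["Game_Result"].tolist())  # → [0, 1, 1]
```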
In terms of data, there are around 70 years of data available between the ABA (basically the NBA before it changed its name and other details, for those not familiar with it) and the NBA, which can be scraped or in some cases simply downloaded from the web, so data will certainly not be a problem. The real issue will be how many years I should be using to predict game winners. There is really no point in using data from 70, 40, or even 30 years ago. 20 still sounds like a lot, and maybe even 10. So there is definitely some testing to do in order to determine the best number of years to predict today’s home team results. However, there is also a minimum for me to test, as it is a requirement for the project to have a minimum of 30,000 rows.
There is a second problem that I will need to deal with, and it came to my mind when I started scraping the data. There are over 100 basketball team stats, so which ones should I be using? I only have a week to do this, so I can’t test them all; I need a plan to shortlist them.
I know that this is doable, as others have done it, but I will certainly need to do a lot of web scraping and probably come up with my own database, which should be model-ready. This takes us to the Data Gathering:
2. Data Gathering. Gathering the data is not an easy task if you want the most recent and accurate stats, and if you want them relatively clean. Even though there are plenty of free sources to play with, I decided to create a completely new database, taking bits and pieces from many, and use only the last 20 NBA seasons (from 2000 until 2020) for the predictions, and 30 years (1990 until 2020) to study the game’s evolution and understand if there is any particular stat, or combination of stats, that would translate into wins.
The sites I took my data from are the following:
Because the main objective of this project was to predict game results, I had to come up with a combination of statistics that would translate into wins. Now, for those who like basketball, you know that there are probably close to a hundred stats to play with. Some are pure stats, such as PPG (points per game), FG% (Field Goal Percentage), or 3PM (3-pointers made), and others are engineered, such as a player’s PER (Player Efficiency Rating), or for teams the AST/TO (Assist to Turnover Ratio), or even more advanced ones like EST.TOV% or EST.PACE (Estimated Pace). So in order to come up with a good enough combination I needed to look into a few of them, read about them, and then shortlist them to start testing algorithms that would ultimately filter them.
According to NBA analysts (stats.nba.com), some of the most relevant statistics that translate into wins are the following 9:
- Offensive Rating: estimated number of points a team scores per 100 possessions (higher is better)
- Defensive Rating: estimated number of points a team gives up per 100 possessions (lower is better)
- Field Goal Percentage
- Field Goal Attempts
- 3-point Percentage
- 3-point Attempts
- Assist/Turnover Ratio
- Rebound Differential
- Pace: estimated number of possessions per 48 minutes
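A few of these are simple ratios you can compute yourself from box-score totals. The sketch below uses one common possession estimate found in most analytics write-ups (POSS ≈ FGA − OREB + TOV + 0.44·FTA; other variants exist), with made-up game totals:

```python
# One common possession estimate (several variants exist in the literature):
# POSS ≈ FGA - OREB + TOV + 0.44 * FTA
def possessions(fga, oreb, tov, fta):
    return fga - oreb + tov + 0.44 * fta

def offensive_rating(pts, poss):
    return 100.0 * pts / poss        # points scored per 100 possessions

def ast_to_ratio(ast, tov):
    return ast / tov                 # assist-to-turnover ratio

# Illustrative, made-up team totals for a single game
poss = possessions(fga=88, oreb=10, tov=14, fta=25)   # 88 - 10 + 14 + 11 = 103
print(round(offensive_rating(112, poss), 1))           # → 108.7
print(round(ast_to_ratio(26, 14), 2))                  # → 1.86
```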
If you would like to read more about any of these, you can refer to the link below, where you will find some of the easiest to understand explanations:
When you start digging for these statistics you will find out that some of them belong to the “General or Traditional Stats”, others to the “Advanced Stats”, and the rest to what is called the “Four Factors”, which deals with efficiency in percentages.
In order to gather all this data, I had to scrape some of the sites listed above, download some monthly stats as Excel files, and also manually type some of the missing values. For the scraping, I used a Python package called Selenium. The link is below if by any chance you want to use it. It’s quite easy and fast and doesn’t involve any complex coding skills:
3. Data Cleaning and Conditioning. As mentioned above, I created my own datasets by scraping, downloading, and manually filling in missing values for 1 season (2018–2019), 10 seasons (2010–2020), and 20 seasons (2000–2020).
This is an interesting dataset with plenty of stats to explore (approximately 50), and for those not familiar with NBA basketball, an NBA Regular Season consists of 82 games per team, played over the course of 7 months (from October until April). So to get an idea of the size of the final DataFrame, multiply 82 games * 30 teams * 20 years and divide by 2, as each game is between two teams. That results in almost 25,000 rows with 50 columns of stats.
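The back-of-the-envelope size check looks like this:

```python
games_per_team = 82
teams = 30
seasons = 20

# Each game involves two teams, so divide by 2 to avoid double counting
total_rows = games_per_team * teams * seasons // 2
print(total_rows)  # → 24600
```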
The cleaning of the data was not too complicated because after scraping it I partially conditioned it in Excel, so by the time I loaded it into my Jupyter Notebook little had to be done. I won’t add the Data Cleaning details to this blog to avoid making it too long, but I will add the GitHub link just in case anyone is interested in seeing the work done.
After the cleaning was finalised I ended up with the following three dataframes:
With those three DataFrames, I engineered a new column called Team_id for each, which basically concatenated the team name + the year + the month, all as a string, and then I used those to merge the three together. This resulted in the DataFrame below:
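A minimal sketch of that merge, with toy tables and made-up column names standing in for the real ~50-column DataFrames:

```python
import pandas as pd

# Toy versions of the three cleaned stat tables (real ones hold ~50 columns each)
traditional = pd.DataFrame({"Team": ["Spurs", "Heat"], "Year": [2019, 2019],
                            "Month": ["Jan", "Jan"], "PTS": [110.2, 105.7]})
advanced = pd.DataFrame({"Team": ["Spurs", "Heat"], "Year": [2019, 2019],
                         "Month": ["Jan", "Jan"], "OffRtg": [111.3, 108.9]})
four_factors = pd.DataFrame({"Team": ["Spurs", "Heat"], "Year": [2019, 2019],
                             "Month": ["Jan", "Jan"], "EFG%": [0.53, 0.51]})

# Build the Team_id key: team name + year + month, all as one string
for df in (traditional, advanced, four_factors):
    df["Team_id"] = df["Team"] + df["Year"].astype(str) + df["Month"]

# Merge the three tables on the shared key
merged = (traditional
          .merge(advanced[["Team_id", "OffRtg"]], on="Team_id")
          .merge(four_factors[["Team_id", "EFG%"]], on="Team_id"))
print(merged[["Team_id", "PTS", "OffRtg", "EFG%"]])
```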
DataFrame 4 was the input for the 10-year models, but on top of these DataFrames, secondary ones were created for 2010–2020 and 2018–2019, plus 30 smaller ones, one for each team, but only for the 10 seasons (2010–2020). So quite a bit of data when you think about it.
4. EDA. The main goal/objective of this project was to predict the home team winning or losing games, but I also had a secondary objective, necessary to fulfil the first one, which was understanding the game’s evolution. By doing so, I might be able to get hints of which statistical values are the most representative in terms of win shares, and maybe come up with a combination that would translate into wins.
For this second study I used a smaller database that consists of the most relevant average stats per year for the last 30 years for all the teams together. I then plotted most of them with a fixed ‘Season’ on the x-axis and the stat to study on the y-axis. You can see a few of them in the next figures, and if you are interested in seeing the rest of them, once again, refer to my GitHub, where you will find a large list of them.
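A minimal sketch of how such a Season-vs-stat plot can be produced with pandas and matplotlib; the league averages below are made-up placeholders, not the real scraped values:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Made-up league-wide season averages; the real table holds ~30 seasons of stats
season_avg = pd.DataFrame({
    "Season": [1990, 2000, 2010, 2020],
    "PTS":    [107.0, 97.5, 100.4, 111.8],
})

# Season on the x-axis, the stat under study on the y-axis
fig, ax = plt.subplots()
ax.plot(season_avg["Season"], season_avg["PTS"], marker="o", label="Points per game")
ax.set_xlabel("Season")
ax.set_ylabel("PTS")
ax.legend()
fig.savefig("pts_per_season.png")
```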
This first plot (above) shows the age and the weight plotted against Seasons (year). What these two tell us is no surprise. With time, players are joining the NBA at a younger age, with many even coming straight from high school. After seeing the success that some big-name players have had at 18 or younger, this opens a lot of opportunities for many others, since teams are willing to take the risk if they believe the talent and maturity are there. Some good examples are Kevin Garnett, Kobe Bryant, and Lebron James, among others. Of course there is also a long list of low-impact players coming from high school, but it’s a chance that players and teams are taking more and more.
The second plot was expected as well. With all the advances that the fitness industry has made in the past 30 years, you can see athletes becoming stronger, faster, and more athletic in general. This is clearly seen in the NBA when you compare players from 20 years ago and 10 years ago with today’s players. However, this doesn’t translate into more physicality. On the contrary, the NBA rules have changed quite a lot since then, now allowing considerably less contact than before and, to a certain degree, loosening up defensive efforts.
The previous two plots show, on top, the average rebounds per game and, below, the points per game. The rebounds per game are somewhat of a surprise, as I wasn’t expecting the number of rebounds to have increased. When I looked at other plots, I got some hints that this could be due to the game having a faster pace. A faster pace would mean more shots per game, and without a change in field goal percentage, that would directly result in more rebounds per game. Then again, it could also mean that the field goal percentage has dropped. When I looked at the field goal percentage over the past 30 years, yes, it has dropped, but by 2%, which is not significant enough to produce such an increase in rebounds. So it has to be that the game is faster, players get to shoot more, and with the field goal percentage almost unchanged, scoring has to have increased as well, which is something we clearly see in the lower plot. Today, teams are scoring an average of 5 more points per game compared to 1990. There was a drastic drop, though, as you can see in the middle of the plot, so the defense has been relaxing more and more over the years, or at least that is what the plot suggests.
These last two plots are key for the interpretation of the defensive efforts. Steals and blocks per game are direct indicators of a team’s and a player’s defense, and as you can see from both plots, the numbers are dropping. Hence the interpretation given by almost every NBA analyst today: teams need defense to win, but it has lost relevance compared to the offense. We have reached the point where some coaches have the mentality that a good defense can be broken with a great offense, and that they can outscore the good defensive teams. That is not entirely wrong, and very good offensive teams have done quite well in terms of win shares; however, none well enough to win a championship. In this study the ultimate goal is not the championship but guessing the winning team per game, so these are good observations/suggestions as to which stats we should be looking at: we know that defense plays a very important role in winning a championship, but for a single game, a devastating offense might be key. This takes us to the next set of plots:
In the figure above, the plot on the left corresponds to data from 2000 until 2010, and the one on the right from 2010 until 2020. These plots illustrate how the game has shifted its approach from a defensive-minded one with a big weight on offense as well, to a more offensive-minded game, where defense is important but not as much as offense. Keep in mind that this is the average of the 30 NBA teams. If we do the same exercise for the top 5 or 6 teams today, let’s say the Milwaukee Bucks, LA Lakers, LA Clippers, Toronto Raptors, and Boston for example, what they all have in common is a very tight defense with a good offense. So following the offensive trend translates into wins, but not into championships. The best example of this is the Houston Rockets. This team has one of the best offenses in the game today, but so far they have not been able to win a championship with that strategy.
With all the previous analysis done, including the one not described here but present in the EDA notebook, I analyzed the stats listed in the longer column on the left of the previous figure, and shortlisted the 9 stats on the right, which offer a good balance and represent today’s game. These stats will hopefully translate into wins, but remember, not necessarily into championships. In order to predict the champion I believe there is more to add to these models. What exactly? I will go into that in the way-forward at the bottom of this document. And why didn’t I go that route? Mainly because of time, as I had only one week to do this project.
5. Model(s) Selection and Training. I mentioned at the beginning that I was going to test as many ML algorithms as possible in order to come up with the one that best fitted the project’s objective. So that’s exactly what I did. Below I have listed all the ML algorithms, and after the list there is a summary which I will go over to help you understand what it means.
- Logistic Regression Classification
- KNN Classification
- Bagged Trees
- Random Forest
- SVM Linear Classification
- SVC Prediction
- Bagging SVC Ensemble Classifier
- AdaBoost Classifier
- XGBoost Classification
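To give an idea of what the comparison loop looks like, here is a minimal sketch with a few of the classifiers above, trained on synthetic data (9 features standing in for the shortlisted stats, and a fabricated home-win label):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the real stats: 9 features per game, 1 = home win
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 9))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

# Fit each model and report its accuracy on the held-out games
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2%}")
```

On the real data the same loop would be fed the merged stats DataFrame and the engineered Game_Result column as the target.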
This first set of tests was done with inputs from all the teams together; the reason for this was to compare it with the NBA’s reported prediction accuracy, which you can see in the middle as 75%. This means that in 75% of the games, they predicted that the Home Court Advantage was defended and, as a result, the home team (Team2) won the game. As you can see, using 1 year of data, meaning the previous season, is always slightly below their predictions, with the exception of the Logistic Regression, which ended up with 76.85%, so an over-estimation, or probably it was just over-fitted. The results with the 10 years of data were the best ones; with one particular exception (Decision Tree), the rest were all quite close to the NBA predictions. Then, if we look at 20 years of data, the results are not bad, but I believe that this is not a great option, as the teams are literally not the same. Some of the owners have changed, some of the teams or home courts have changed, players have retired, the NBA rules have completely changed, and many other factors have changed that don’t make these stats the best to predict the results of today’s game. However, as you can see, somehow the accuracies aren’t bad at all. Still, out of these three samples I decided to keep the 10 years for the real-time predictions. With this said, let’s see what the model evaluations gave as results and how I went about evaluating them.
6. Evaluating the Model(s). To evaluate these results, two teams were randomly selected. One of the teams is the San Antonio Spurs, which is one of the most successful franchises of the past 20 years or so. They have won 5 championships within that time frame (1999, 2003, 2005, 2007, 2014) and have made the playoffs in over 95% of the seasons. Their time is over now, though, as they just entered a rebuilding process due to many players’ retirements (see image below):
The second team is the Miami Heat. This franchise has not been as successful. They have won 3 titles in the same time frame (2006, 2012, 2013), but they have gone through a series of rebuilding processes and have gained and lost very high-profile players such as Lebron James, Chris Bosh, and Dwyane Wade, so this is a tier-2 team compared to the tier-1 team that San Antonio is. The results of all of the models for the Miami Heat can be seen in the next image:
In these tables, in light green, is what I called the “True” or ground truth. These are the real percentages/margins of wins for the home team (Team2). At the top there is a description of the data used, this being the 1, 10, and 20 years, with their respective accuracy values below.
Now, I mentioned above that I chose to keep the 10-year data for the final predictions, but I’m sure you will be able to spot the KNN Classification with 100% accuracy. This is an unrealistic value, and the reason it happened is the small sample size (11).
This is the same exercise done for the Miami Heat. You can see that the NBA predicted 69% and the real value has been 68.7%, so extremely accurate for them. Once again, the 20 years doesn’t seem to be the best input, but the results aren’t bad.
This takes us to the real-time test. By the time this work was done the first game listed in the image above had not happened yet, so it is from that point onwards that the predictions start to be properly evaluated. Surprisingly, the accuracy is very high, and a lot higher than I was expecting. Now, I won’t lie: these are games that I could have predicted without an algorithm, and probably even with slightly more accuracy, as I would not have predicted the Boston Celtics beating the 76ers in their last match had Ben Simmons been playing, but due to injuries he wasn’t, and that changed everything for that matchup. Still, for someone without much basketball knowledge this can come in handy for betting and making some money. You most probably won’t get rich, but you can have some fun playing with all the models and modifying the variables.
As a basketball fan who has been following the game for over 30 years now, I know that using 20 years of data to predict games today is without a doubt not a good decision; in fact 10 years, even with the accuracy results that I got, is not the best either. On the other hand, 1 year is not enough. Why? Because the NBA is a business-driven league, and in today’s economic environment some of the owners can manage to put together a set of stars or superstars to make a run at winning a championship right away. So this is a big question for me today: how many years is ideal to account for these fast changes happening in the NBA and make the most accurate predictions possible? With this said, I will continue this project after handing my results to my cohort, and will update it with not only 2-, 3-, 4- and 5-year predictions, but also player stats that can contribute to a win or a loss. Below are the steps that I will take in the coming months, hopefully before the 2020–2021 season starts:
- Run models using 2-, 3-, 4- and 5-year data and compare them, not only among themselves, but with the 1 and 10 years that I have already done. For this I will also add a newly engineered feature: how well a team does against each of the other 29 teams when playing at home. That should have a very strong weight in the predictions.
- Run in parallel predictive models using player stats and not only team stats.
- Find a way to account for playoff experience and increases/decreases in performance under pressure, meaning in the playoffs.
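The head-to-head home feature from the first step above could be sketched with a pandas groupby; the game log below is hypothetical:

```python
import pandas as pd

# Hypothetical game log: each row is one game from the home team's perspective
games = pd.DataFrame({
    "Home":        ["Spurs", "Spurs", "Spurs", "Heat", "Heat"],
    "Visitor":     ["Heat", "Heat", "Lakers", "Spurs", "Spurs"],
    "Game_Result": [1, 0, 1, 1, 1],   # 1 = home team won
})

# Home win rate of each team against each specific opponent
h2h = (games.groupby(["Home", "Visitor"])["Game_Result"]
            .mean()
            .rename("home_win_rate")
            .reset_index())
print(h2h)
```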
I hope you enjoyed reading this, if you have made it this far of course, and if you want to see any of the code written to reach these results, feel free to go to my GitHub, or you can email me directly (firstname.lastname@example.org) to further discuss this or any basketball-related subject.