By: Wilson Yu, Dean Jones, and Lia Bergman-Turnbull
With the coronavirus holding the nation and world in its grasp and most human activity coming to a standstill, there is a huge void in our lives. All of us non-essential workers are starting to feel a bit restless, despite knowing we are fortunate to be safe at home. We long to eat at a restaurant or sit in a cafe. It feels like we have not seen our friends in too long, even as Zoom executives say otherwise. At this point, even watching The Office on Netflix is getting old. And like millions of NBA fans here and around the world, we miss basketball. Badly.
Things are slowly starting to return to normal, though. As approved by the NBA Board of Governors, the league intends to return on a tentative start date of July 31st with 22 teams participating. The 22 comprise the 16 teams currently holding playoff seeds and the six teams that are six games or fewer behind the eighth seed in their conference. In the East, the lone team that can still steal a playoff seed is the Washington Wizards. In the West, the Portland Trail Blazers, New Orleans Pelicans, Sacramento Kings, San Antonio Spurs and Phoenix Suns are sure to battle bitterly for a ticket to the postseason. The teams already eliminated include, most notably, the reigning Western Conference champion Golden State Warriors.
The returning teams will play eight seeding games based on their regular season schedule. The seven best teams in each conference will then be guaranteed a spot in the playoffs. If the eighth-best team is more than four games ahead of the ninth, it will also enter the playoffs. Otherwise, these two teams on the bubble will engage in a play-in tournament. The eighth-best team would only need to win once to claim the final seed, while the ninth would have to win twice.
Players, however, have not yet fully confirmed whether they intend to participate. Willingly entering the strict confines of Disney's ESPN Wide World of Sports Complex in Orlando for a prolonged period, potentially placing their health at risk for the sake of profits and entertainment, even the necessary yet surreal, post-apocalyptic experience of playing games with no fans… these conditions of the arrangement are all understandable sticking points. So nothing is definite just yet.
Yet while some may feel that this season’s champion would always be attached to a glaring asterisk, we know many are simply aching to watch hoops. The LA war between LeBron’s Lakers and Kawhi’s Clippers finally resolved in the Western Conference Finals. Giannis Antetokounmpo and the Bucks’ redemption campaign after a stunning 2019 defeat by the defending champion Raptors. Perhaps an underdog team that gets hot at the right time and rides that streak to crash the playoffs - the Miami Heat or Dallas Mavericks? Or even one of the teams not yet dead, like the Trail Blazers or Wizards? (Not seriously, but you never know.) And, well, maybe just seeing some team other than the Warriors represent the West.
In the interest of preparing for the playoffs if they finally do arrive and spending time during this break on something basketball-related, we decided to construct a playoff bracket based on the current standings and simulate outcomes for every round. Yes, we know that things could change dramatically if the approved format does become reality. And they probably would, with some rust and rest hurting and helping teams to ultimately shake up the playoff picture. We cannot predict how that will happen, though, so in fairness we will reward teams for what they have accomplished up to this point.
In the following sections, we will discuss how we constructed our model that generated win probabilities for every playoff series. We will delve deeper into some particularly intense matchups, considering angles that could not be accounted for in the mathematical calculations. We will announce who we expect to hoist the Larry O'Brien trophy triumphantly after all that has happened. Please enjoy the read, and let this get your mind off spending the thousandth day indoors.
Probability Model & Monte Carlo Simulation
We first created a model that would calculate the probability a team wins its series based on regular season team statistics. To do this, we used logistic regression, because it is a probabilistic model for binary dependent variables, and our dependent variable here is the won/lost result of the series. For our training data, we used the previous five years of team statistics and playoff outcomes. We decided on this time period because we believe it best represents today’s style of gameplay, noting factors such as increased three-point attempts and the rise of “small ball” play. Since our model is intended to predict the probability that one team beats another in a matchup, each observation in our data contained the difference between the two teams’ statistics and the outcome of the series. After we ran the regression and obtained the coefficients, we applied them to the 2020 playoff matchups to estimate the win probabilities of each series.
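A minimal sketch of this kind of model follows, assuming two predictors (the difference in two team statistics) and no intercept term, so that the probability of A beating B and of B beating A sum to one. The training data here is synthetic and in made-up standardized units, the true weights are invented, and the helper names `fit_logistic` and `series_win_prob` are hypothetical. It illustrates the technique and is not the regression actually used for the results in this article.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit logistic regression weights (no intercept) by batch gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj  # gradient of the log-likelihood
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def series_win_prob(w, diff):
    """Probability that team A wins, given (A's stats minus B's stats)."""
    return sigmoid(sum(wj * dj for wj, dj in zip(w, diff)))

# Synthetic training set: each row is a stat difference between two teams
# (ASSUMPTION: standardized units, invented "true" weights), and each label
# is 1 if the first team won the series.
rng = random.Random(0)
true_w = [2.0, 1.0]
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [1 if rng.random() < series_win_prob(true_w, xi) else 0 for xi in X]

w = fit_logistic(X, y)
print("fitted weights:", w)
print("P(A beats B):", series_win_prob(w, [0.5, 0.2]))
```

Working with differences and dropping the intercept is one design choice among several. A model with an intercept could, for example, absorb a home-court edge for the higher seed.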
We then ran a Monte Carlo simulation, using the results of our model, to estimate the probability that each team wins each round of the playoffs. We assigned the First Round matchups based on each team’s current seed in the regular season. We then performed ten thousand simulations of the First Round matchups to calculate the probability that each team wins the First Round and makes it to the Conference Semifinals. Based on these probabilities and the probabilities from our model, we again performed ten thousand simulations to estimate the probability that each team wins the Conference Semifinals and makes it to the Conference Finals. We repeated this step for the Conference Finals and then the Finals to estimate the probability that each team wins the championship.
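The simulation can be sketched as follows. This version plays out a whole bracket in each trial and counts how often every team survives each round, which is a common variant of the round-by-round procedure described above. The four placeholder teams, their `STRENGTH` values, and the strength-ratio `win_prob` function are all hypothetical stand-ins for real teams and for the series probabilities a fitted model would supply.

```python
import random
from collections import Counter

# ASSUMPTION: made-up strengths for four placeholder teams.
STRENGTH = {"Seed 1": 4.0, "Seed 2": 3.0, "Seed 3": 2.0, "Seed 4": 1.0}

def win_prob(a, b):
    """Hypothetical stand-in for a model's P(a beats b in a series)."""
    return STRENGTH[a] / (STRENGTH[a] + STRENGTH[b])

def simulate_bracket(teams, win_prob, rng):
    """Play one bracket; return the list of survivors after each round.

    `teams` is in bracket order: adjacent pairs meet, then adjacent winners.
    """
    alive = list(teams)
    rounds = []
    while len(alive) > 1:
        winners = []
        for a, b in zip(alive[0::2], alive[1::2]):
            winners.append(a if rng.random() < win_prob(a, b) else b)
        alive = winners
        rounds.append(list(alive))
    return rounds

def advancement_probs(teams, win_prob, n_sims=10_000, seed=0):
    """Estimate, per round, the share of trials in which each team advanced."""
    rng = random.Random(seed)
    n_rounds = len(teams).bit_length() - 1  # bracket size is a power of two
    counts = [Counter() for _ in range(n_rounds)]
    for _ in range(n_sims):
        for r, survivors in enumerate(simulate_bracket(teams, win_prob, rng)):
            counts[r].update(survivors)
    return [{t: counts[r][t] / n_sims for t in teams} for r in range(n_rounds)]

bracket = ["Seed 1", "Seed 4", "Seed 2", "Seed 3"]  # 1 vs 4, 2 vs 3
probs = advancement_probs(bracket, win_prob)
for r, row in enumerate(probs, start=1):
    print("round", r, row)
```

For a full conference, the same functions would take eight teams listed in bracket order (1 vs 8, 4 vs 5, 3 vs 6, 2 vs 7).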
Selection of Variables
For our model, our goal was to use statistically significant variables that would explain different contributions to a series win and yield reasonable results. The variables we looked at included Win% (percentage of games won), Simple Rating System (a rating that takes into account average point differential and strength of schedule), Offensive Rating (points scored by a team per 100 possessions), Defensive Rating (points a team allowed its opponents to score per 100 possessions), Net Rating (the difference between offensive and defensive rating), Offensive Effective Field Goal% (a statistic that adjusts field goal percentage by assigning a greater weight to 3-point field goals), Defensive Effective Field Goal% (the effective field goal% that a team allowed its opponents to achieve), and Rebound% (percentage of available offensive and defensive rebounds grabbed). We tried numerous combinations of these variables to see which would be the best fit for our model.
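For reference, a few of these statistics can be computed directly from box-score totals. The sketch below uses the standard effective field goal formula, (FGM + 0.5 × 3PM) / FGA, and the per-100-possessions definition of the ratings. The function names and the sample numbers are illustrative only, and the possession count is assumed to be given (in practice it is itself an estimate).

```python
def effective_fg_pct(fgm, fg3m, fga):
    """Effective FG%: each made three counts as 1.5 made field goals."""
    return (fgm + 0.5 * fg3m) / fga

def rating_per_100(points, possessions):
    """Offensive (or defensive) rating: points per 100 possessions."""
    return 100.0 * points / possessions

def net_rating(off_rating, def_rating):
    """Net rating: offensive rating minus defensive rating."""
    return off_rating - def_rating

# Illustrative numbers only, not real team totals.
print(effective_fg_pct(fgm=40, fg3m=12, fga=88))
print(rating_per_100(points=110, possessions=100))
print(net_rating(112.0, 108.5))
```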
Many of the combinations suffered from collinearity. Collinearity occurs when two or more of the predictor variables are highly correlated, which inflates the standard errors of the collinear variables and may also produce coefficients with unexpected signs. For example, when Win% and Simple Rating System (SRS) are both included, the coefficient assigned to the SRS variable punishes teams with a higher SRS. This is a problem because it is intuitively incorrect that a team with a higher SRS should have a lower win probability with all else held constant.
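One quick way to screen a pair of predictors for collinearity is to compute their Pearson correlation and the corresponding variance inflation factor (VIF), which for exactly two predictors reduces to 1 / (1 − r²). The sketch below does this on invented Win% and SRS values. The numbers are placeholders rather than real team data, and the common rule of thumb that a VIF above roughly 5 to 10 signals trouble is a convention, not a hard threshold.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# ASSUMPTION: made-up values for eight hypothetical teams.
win_pct = [0.80, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40]
srs = [9.1, 5.8, 4.9, 2.5, 1.7, 0.2, -1.1, -2.4]

print("correlation:", pearson_r(win_pct, srs))
print("VIF:", vif_two_predictors(win_pct, srs))
```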
A trend we saw was that the more variables we included in a combination, the more likely the model was to contain statistically insignificant variables. This is likely because, as more variables were added, there was more opportunity for the predictive power of some variables to be captured by others. The trend was especially pronounced when we included a “defensive” variable with two or more “offensive” variables. For example, when we tried Win%, Offensive Effective Field Goal% (OEFG%), and Defensive Effective Field Goal% (DEFG%) as our combination, the DEFG% variable was deemed insignificant by our regression.
Another factor that influenced our variable selection was the reasonableness of the final results. Although some combinations met our previous two criteria, some of their results were questionable, since we only used regular season statistics to project playoff win probabilities. This was either due to outliers among our 2020 playoff teams, or simply due to a lack of predictive power in the selected variables. For example, when we included Offensive Rating as one of the variables, the model placed a heavy weight on it. When we applied the regression to our playoff teams, the model generated extremely high win probabilities for the Dallas Mavericks, as they have the highest Offensive Rating in the league by far. Although the Mavericks lagged in the other statistics, their Offensive Rating essentially “overpowered” the rest of the variables, and they were projected to be the most probable winner of the Western Conference Finals. We judged this result to be a bit unreasonable, so we decided to change our variables. Taking all of these considerations into account, the combination of variables we deemed best for our model was Win% and Offensive Effective Field Goal% (OEFG%).
Model Shortcomings & Improvements
One of the major assumptions behind our premise is that regular season team statistics are strong predictors of a team’s playoff success. While these statistics do hold a degree of predictive power, they carry no information about individual players or other matchup-specific details. For example, if a team’s star player were to suffer a season-ending injury near the start of the playoffs, this would not affect the team’s statistics greatly, but it could cause the team to drastically underperform relative to its predicted performance. This approach also leaves out variables such as the difficulty of a team’s playoff route and the opponent’s style of play.
Another issue with our model is that we used regular season statistics from a 64-67 game season for our current playoff teams, while the model was trained on data from completed 82-game seasons. With about 20% of the season left unfinished, our results would undoubtedly be more accurate had the season not been suspended.
An improvement for our model could be to increase the number of variables we use. With only two variables, the model is a bit general, even with the use of team statistics. It would be a great improvement if we could find a combination of three or more variables that meets all the criteria explained in the previous section.
The tables below show the win probabilities of each team for each round of the playoffs using Win% and OEFG% as our two variables.
The Complete Bracket
The complete bracket we created contains the most probable outcome of the playoffs given the results of our model. As stated before, we set up the teams in the First Round series based on their current seeding. The winner of each series is simply the team whose win probability in that matchup is greater than 50%. We proceeded with this method for every series until the bracket was finished. To determine the number of games each series would last, we recorded the length and win probability of each series from the same five years of playoffs we used in our training data. We then fit a Poisson regression with win probability as the predictor variable and series length as the dependent variable. We applied this new model to our current playoff series’ win probabilities to get the series lengths shown in our bracket. The table below details the exact probability the winner of each series has of winning. With this, we will proceed to discuss all the matchups, with greater detail on some of the more intense and interesting ones, such as the 7-game series, the Los Angeles matchup, and the Finals.
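The series-length step can be illustrated with a small Poisson regression, log(E[length]) = b0 + b1 × (win probability), fit by Newton-Raphson. The data points below are invented for illustration. Rounding the prediction to the nearest whole game and clamping it to the 4-to-7 range is an assumption of this sketch, since the exact conversion used for the bracket is not spelled out above.

```python
import math

def fit_poisson(xs, ys, iters=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson on the Poisson likelihood."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)
            g0 += y - mu            # score with respect to b0
            g1 += (y - mu) * x      # score with respect to b1
            h00 += mu               # entries of the information matrix
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def predicted_length(b0, b1, p):
    """ASSUMPTION: round the fitted mean and clamp to a valid series length."""
    return max(4, min(7, round(math.exp(b0 + b1 * p))))

# Made-up training pairs: favorite's series win probability, games played.
win_probs = [0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
lengths = [7, 6, 7, 6, 6, 5, 5, 4, 4]

b0, b1 = fit_poisson(win_probs, lengths)
print("coefficients:", b0, b1)
print("predicted length at p = 0.58:", predicted_length(b0, b1, 0.58))
```

With data like this the slope comes out negative, matching the intuition that lopsided series tend to end sooner.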
Sources: basketball-reference.com, stats.nba.com, espn.com