By: Shail Mirpuri
Introduction
In the realm of franchise T20 cricket, there is no bigger league than the Indian Premier League (IPL). Touted as a competition where ‘talent meets opportunity’, the IPL is watched by over 100 million people around the globe. One thing that makes the IPL so entertaining is its unpredictable nature; we often see teams that finish in the bottom half one season come back very strong the next. In this article, we will analyze three different models to see if we can predict the outcome of each match in the IPL as well as the overall season. The first two models are simulation-based approaches, whereas the last model uses machine learning to predict these outcomes. Since the IPL was indefinitely suspended three weeks after starting due to the pandemic situation in India, we will focus on how each model has performed thus far, and see if we can use this to predict the rest of the tournament.
Data
The first step before developing models was to collect adequate data, so that we could learn from the past to predict future matches. Luckily, there are several data sources available covering the granular aspects of each game. We decided to primarily use two sources of data in developing the models. The first source was match-level data for every season since the inception of the IPL back in 2008. This dataset provided insight into head-to-head records as well as the overall performance of teams over time within the IPL. Besides this, we also used delivery-level data in order to analyze the breakdown of games through the number of wickets taken, the run rate, the number of boundaries hit, and many other meta-statistics. When developing the models, we only focused on the matches (and deliveries) played during or after the 2018 IPL season because there was a major revamp of the drafting system starting that season, which essentially saw all teams redraft their entire squads. Since the purpose of these models is to predict the outcome of the current IPL season, it would not make sense to include data from before the 2018 season, as each squad has changed drastically since the start of the IPL back in 2008. Apart from gathering data to train our models to perform better than picking by random chance (50%), we also web-scraped the current season's fixtures and results in order to apply and evaluate all of our models. With the data collection and preprocessing out of the way, we shall now move on to the methodology behind each model.
The Methodology
Model 1: Parametric Bootstrapping
The inspiration for the first model came when analyzing the overall distribution of runs scored per innings in the Indian Premier League. We decided to make the assumption that the total runs a team scores in a given innings is normally distributed. In order to confirm this assumption, we analyzed the QQ plot for the total runs scored in each IPL innings since 2008.
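As a quick illustration, a normality check of this kind could be produced with something like the following sketch, assuming a DataFrame `innings_totals` with one row per innings and a `total_runs` column (both names are hypothetical, not taken from our actual code):

```python
import matplotlib.pyplot as plt
from scipy import stats

# innings_totals: one row per IPL innings since 2008, with a 'total_runs'
# column holding the team's final score (hypothetical names)
stats.probplot(innings_totals["total_runs"], dist="norm", plot=plt)
plt.title("Normal QQ Plot of Total Runs per IPL Innings")
plt.show()
```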
From the Normal QQ plot, we can see that the normality assumption is mostly satisfied, as the majority of points lie on the QQ line. Based upon this validated assumption, we decided to model the runs a team scores in a given innings as a normal distribution, with mean and standard deviation equal to the mean and standard deviation of the runs that team has scored against their specific opposition. We would then sample the runs each team scores in a given match from their respective distributions, and compare these scores to determine the winner. Essentially, we are using parametric bootstrapping to create a stochastic model that accounts for the unpredictability of the Indian Premier League as well as the head-to-head records between different teams.
In order to better understand how this model works, we will consider the example of a match between the Chennai Super Kings (CSK) and the Mumbai Indians (MI). After subsetting our data to include only innings where CSK were batting and MI were bowling, we found that since the start of the 2018 IPL, CSK score an average of around 141 runs in this fixture with a standard deviation of 24.7. On the other hand, MI tend to score around 150 runs with a standard deviation of 21.0 against CSK. Therefore, we would randomly sample CSK's score from a normal distribution with a mean of 141 and a standard deviation of 24.7, and repeat the same process for MI with their corresponding mean and standard deviation. Once we get both values, we would round them to the nearest integers (since runs have to be whole numbers), and then do a simple comparison to determine the outcome of the particular match (CSK win, tie, or MI win).
Since this model uses stochastic random variables to determine the outcome of a given match, we wanted to look at the long-term expected outcome for each match in order to make a more reliable prediction. We did this by simulating each fixture 1000 times and counting the frequency of each outcome to obtain an overall prediction for the match. By replicating each match many times, we were also able to see which games were expected to be extremely close, and which were expected to be one-sided: the empirical probability of each team winning a given match is simply the frequency of that outcome divided by the total number of replicates (1000).
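To make this concrete, below is a minimal sketch of how one fixture could be simulated under this scheme, using the CSK vs MI head-to-head figures quoted above; the function and variable names are illustrative rather than taken from our actual code:

```python
import numpy as np

rng = np.random.default_rng(2021)

def simulate_fixture(mean_a, sd_a, mean_b, sd_b, n_sims=1000):
    """Parametric bootstrap of a single fixture: sample each team's total from
    its normal distribution, round to whole runs, and tally the outcomes."""
    runs_a = np.rint(rng.normal(mean_a, sd_a, n_sims))
    runs_b = np.rint(rng.normal(mean_b, sd_b, n_sims))
    return {
        "team_a_win": np.mean(runs_a > runs_b),  # empirical win probability
        "tie": np.mean(runs_a == runs_b),
        "team_b_win": np.mean(runs_a < runs_b),
    }

# CSK vs MI, using the head-to-head mean and standard deviation quoted above
print(simulate_fixture(141, 24.7, 150, 21.0))
```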
Model 2: Non-parametric Bootstrapping
The second model attempts to account for the complexity of the different stages of a T20 match by working at the delivery level. In this model, we have assumed that the outcomes of historical deliveries bowled during or after the 2018 IPL season are representative of the possible outcomes for each given ball. As in Model 1, for each match we subsetted the data to only include historical delivery-level data for a given team against their specific opposition. We wanted to break this data down further before randomly sampling all 120 balls (one innings) with replacement, because the outcome of a ball is highly dependent on factors such as the phase of play. For instance, in the first 6 overs of an IPL game (known as the powerplay) only two fielders are allowed outside the thirty-yard circle, meaning that there is a greater chance of boundaries being hit during this phase of an innings. Therefore, in order to account for this variation when randomly sampling, we first split the historical delivery data into three subsets: the powerplay overs (overs 1-6), the middle overs (overs 7-15), and the death overs (overs 16-20).

Using this breakdown, we then simulated each match through random sampling with replacement n times for each phase, where n is the number of balls in that phase, making sure to sample extra balls for illegal deliveries such as no-balls and wides. Using this non-parametric bootstrapping, in each match we would first simulate the powerplay phase, keep track of the runs scored and wickets taken, and pass this on to the next phase, stopping an innings early once 10 wickets had fallen in order to be consistent with the rules of cricket. Once we simulated the first innings, we would move on to the second innings and re-run the simulation with the batting and bowling teams swapped. This allows us to obtain the number of runs scored by each team and compare the results to determine the winner of the match. One additional feature of this model's simulation is that it also handles the case in which the scores are tied: it simulates a super over using the death-over data, since batsmen tend to be similarly aggressive in both situations, and repeats the super over until there is a winner, mimicking reality. Therefore, this model will always predict a winner rather than a tie.
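A rough sketch of how a single innings might be resampled under this scheme is shown below; the DataFrame and its columns (`phase`, `total_runs`, `is_wicket`, `is_legal`) are hypothetical placeholders for our delivery-level data, and the super-over logic described above is omitted for brevity:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)

# Balls per phase: powerplay (overs 1-6), middle (overs 7-15), death (overs 16-20)
PHASE_BALLS = {"powerplay": 36, "middle": 54, "death": 30}

def simulate_innings(deliveries: pd.DataFrame) -> int:
    """Non-parametric bootstrap of one innings from the historical deliveries
    of a given batting team against a specific opposition."""
    runs, wickets = 0, 0
    for phase, n_balls in PHASE_BALLS.items():
        pool = deliveries[deliveries["phase"] == phase]
        legal_balls = 0
        while legal_balls < n_balls and wickets < 10:
            ball = pool.iloc[rng.integers(len(pool))]  # sample with replacement
            runs += int(ball["total_runs"])
            wickets += int(ball["is_wicket"])
            # wides/no-balls do not count towards the over, so an extra ball is sampled
            legal_balls += int(ball["is_legal"])
    return runs

# A match is then simulated by running this twice with the teams swapped and
# comparing the two totals (with a super over resolving any tie).
```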
As with the previous model, Model 2 is a stochastic model that depends on random chance, so we again simulated 1000 trials for each match in order to gain more insight into the long-term expectation for each fixture. Using these 1000 trials, we would predict the winner of a match as the most frequent outcome across the replications. Apart from predicting the winner, we could also estimate the probability associated with each result by analyzing its relative frequency.
Model 3: Machine Learning
Moving on from simulations, we also developed a machine learning model to attempt to predict the outcome of each match. Before actually training the model, the first step was to select and engineer useful features from our data that may be indicative of a match's outcome. Below is a summary of the features we used within the model:
Each team's overall win record
The head-to-head record between the two teams in prior matches
The average runs scored per over by each team, broken down by the three phases of the game (powerplay, middle, and death overs)
The head-to-head average runs scored, runs conceded, wickets taken, wickets lost, boundaries hit, and boundaries conceded
The proportion of common dismissal types for each team (caught, bowled, LBW, run out)
The batting and bowling averages and strike rates for the most experienced members of each team
As we can see from the features listed above, we accounted for many different variables when building our model in order to capture any hidden trends that may play a role in the overall result of a match. One feature in particular that we wanted to add was the key player statistics for the most experienced members of each team. Since the IPL is based upon a draft system, the composition of each team can completely change from season to season, meaning that even though we're only considering data from IPL 2018 onwards, there still may have been transfers that make certain teams stronger in the current season. In order to account for actual player quality, we used web scraping to obtain the current squad of each IPL team, and then obtained each player's statistics since the 2018 IPL season. We decided to only include players who had played at least half (22 games) of the total IPL games played per team since 2018, because we wanted to capture the important/experienced players of a team, who tend to play a huge role in a team's performance. After restricting the dataset to these players, we aggregated the average batting and bowling statistics for each team, and used this as a feature to predict whether or not a team won a given match.
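As a sketch of this filtering and aggregation step (with hypothetical DataFrame and column names), the team-level player features could be built along these lines:

```python
import pandas as pd

# player_stats: one row per (team, player) with that player's matches played
# and batting/bowling numbers since IPL 2018 (hypothetical names)
MIN_MATCHES = 22  # at least half of the games each team has played since 2018

experienced = player_stats[player_stats["matches_played"] >= MIN_MATCHES]

# Aggregate the experienced players' statistics into one feature row per team
team_player_features = experienced.groupby("team").agg(
    avg_batting_average=("batting_average", "mean"),
    avg_batting_strike_rate=("batting_strike_rate", "mean"),
    avg_bowling_average=("bowling_average", "mean"),
    avg_bowling_strike_rate=("bowling_strike_rate", "mean"),
).reset_index()
```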
Besides feature engineering and selection, another important part of building a machine learning model is selecting the actual model that produces the final predictions. Through the use of cross-validation, we found that the K-Nearest Neighbours model performed significantly better than other models such as XGBoost and Random Forest.
The K-Nearest Neighbours (KNN) model was trained on a dataset with 35 features that were all normalized using min-max scaling. This meant that all features were scaled to between 0 and 1 in order to ensure that our model weighted each one relatively equally rather than being heavily influenced by features with larger magnitudes or ranges of values. Each row in our training dataset represented a match, with a set of features for 'Team A' and 'Team B' and a target column representing whether or not 'Team A' won that particular match. One limitation of the KNN model architecture is that it is hard to account for categorical variables properly. To combat this, we extracted key quantitative statistics to represent the categorical data. For instance, we used target encoding to transform the categorical variable of a team's name into their win percentage in previous seasons. Since this meant the problem no longer required any important categorical variables to be handled directly, we were able to use the KNN architecture effectively to predict the outcome of each match.
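As a small illustration of the target-encoding idea (again with hypothetical names), the team name column can be replaced by each team's historical win rate computed from past matches:

```python
import pandas as pd

# matches: one row per team per historical match, with 'team' and a 0/1 'won'
# column; features: the match-level feature table (hypothetical names)
win_rate = matches.groupby("team")["won"].mean()

# Replace the categorical team names with their numeric win percentages
features["team_a_win_pct"] = features["team_a"].map(win_rate)
features["team_b_win_pct"] = features["team_b"].map(win_rate)
```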
After deciding to use a KNN model, we performed hyperparameter optimization using a grid search that optimized for prediction accuracy under cross-validation. Grid search allowed us to tune different hyperparameters by evaluating every combination in the grid to see which one gave the best cross-validation results. The hyperparameters we focused on tuning were the value of k, the metric used to measure distance between data points, and the leaf size. We found that the optimal hyperparameters for this model were a k value of 13, the Manhattan distance metric, and a leaf size of 20. With this optimized and validated final model ready, we can now move on to seeing how it performed in predicting this year's tournament so far.
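A sketch of how this tuning step could look with scikit-learn is shown below; the grid itself is illustrative, and only the reported optimum (k = 13, Manhattan distance, leaf size 20) comes from our actual results:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# X: the 35 match-level features, y: 1 if 'Team A' won the match (assumed inputs)
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),        # scale every feature to the [0, 1] range
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__n_neighbors": list(range(3, 31, 2)),
    "knn__metric": ["euclidean", "manhattan", "minkowski"],
    "knn__leaf_size": [10, 20, 30, 40],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_)  # e.g. leaf_size=20, metric='manhattan', n_neighbors=13
```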
Defining Success
Before moving into the performance of each model in predicting the 2021 IPL, we first need to define the metrics of success we considered. The first metric is the accuracy of the model in terms of correctly predicting individual match results. In order to account for the probabilities produced by each model, we also computed, across all matches, the sum of the probability that the model assigned to the actual winner of each match. This metric acts as a means of weighting the success of each individual prediction. For instance, if a model gives a team only a 10% chance of winning a game that they go on to win, that is a much worse prediction than another model giving the same team a 49% chance. Even though our overall accuracy metric would count both of these predictions as equally inaccurate, the weighted success metric accounts for the certainty of each prediction. Since we only sum the probabilities assigned to the actual winners, the closer an individual probability is to 1, the better that prediction, and higher overall sums indicate better overall predictions.
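Concretely, given each model's predicted probability for the team that actually won each completed match, these two match-level metrics could be computed roughly as follows (column names are hypothetical):

```python
import pandas as pd

# results: one row per completed match, with the model's 'predicted_winner',
# the 'actual_winner', and 'prob_actual_winner' (the probability the model
# assigned to the team that actually won)
accuracy = (results["predicted_winner"] == results["actual_winner"]).mean()
weighted_success = results["prob_actual_winner"].sum()  # maximum = number of matches

print(f"Accuracy: {accuracy:.1%}, weighted success: {weighted_success:.2f}")
```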
Apart from a match-level breakdown, we also defined two additional metrics of success that consider how well the model predicts the overall league table. The first of these is the average deviation/error in the points predicted for each team, while the second is the average deviation in the rank predicted for each team. We realized that it may be difficult to consistently predict match-by-match performance for each team, so we wanted to see if any of the models gave us greater insight into a team's overall performance over the course of the season. It's also important to note that when making these comparisons, we treat teams with the same points as having the same rank in the league table rather than separating them by Net Run Rate (NRR), since none of our models account for NRR. Our plans to project the entire 2021 IPL were halted by the mid-season suspension announced on the 4th of May, and we have therefore decided to focus on how each model has performed on the 29 games that have been played.
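The two league-table metrics above can be computed along these lines, with tied points sharing the same rank (column names are again hypothetical):

```python
import pandas as pd

# table: one row per team, with 'actual_points' and 'predicted_points' after
# the 29 completed games (hypothetical names)
table["actual_rank"] = table["actual_points"].rank(method="min", ascending=False)
table["predicted_rank"] = table["predicted_points"].rank(method="min", ascending=False)

avg_points_error = (table["predicted_points"] - table["actual_points"]).abs().mean()
avg_rank_error = (table["predicted_rank"] - table["actual_rank"]).abs().mean()
```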
Table 1: Success Metrics For Each Model
From Table 1, we can see that our machine learning model performed the best in terms of predicting the outcome of each match in this year's IPL so far, with an accuracy of 65.5% and a weighted success score of 15.23 out of a maximum of 29 (the total number of games played). This model also most closely matched each team's current points and rank in comparison to Models 1 and 2. Besides highlighting Model 3's superior performance, Table 1 shows a slight contrast between the parametric and non-parametric bootstrapping models: the parametric model performs better when it comes to predicting the actual result of a match, while the non-parametric model generates closer predictions for overall points and rank. One reason for this is that the match-level predictions of the parametric bootstrapping model (Model 1) describe a much closer, more competitive season in which most teams earn a similar number of points. This can be seen in the standard deviation of the total points each model predicts across the season: 4.78 for Model 1, 7.71 for Model 2, and 8.0 for Model 3. Since Model 3 performed best on our success metrics, this suggests that this year's IPL has been dominated by the top four teams, who have pulled away from the bottom few.
Table 2: Predicted Rank By Each Model (After 29 Games)
In Table 2, we can compare the actual ranks in this year's IPL to the ranks predicted by each model after the first 29 matches. Note that for both the actual and predicted rankings, we only used points earned to determine ranking, meaning that teams tied on points are counted as sharing the same rank in the league table. We can see that Model 3 was able to correctly predict 50% of the current 2021 IPL league table. This is extremely impressive, especially considering that there are 40,320 (8!) different ways the league table could be arranged. This further emphasises the predictive superiority of Model 3 when compared to the other two models.
Table 3: Predicted Points By Each Model (After 29 Games)
In Table 3, we can see the points predicted by each model and compare these to reality. Here, we can again see the predictive ability of the K-Nearest Neighbours model (Model 3). One thing that distinguishes this model's performance from the rest is its ability to predict the downfall of the Sunrisers Hyderabad (SRH) in IPL 2021. Both Models 1 and 2 predict that SRH would be in the top 3 teams at this stage of the tournament, whereas Model 3 is able to identify SRH's fall-off. One potential reason is that Model 3 accounts for the player statistics of each team's most experienced members, which Models 1 and 2 do not, and the average batting strike rate of SRH's most experienced players was the lowest of all the teams.
Apart from this, we can also see that all models seemed to underestimate how well the Royal Challengers Bangalore (RCB) have performed this year. This is likely because, historically, RCB have been a team that underperforms in the IPL, as reflected by their 8th and 6th place finishes in the 2018 and 2019 seasons respectively. Since all three models are trained and/or simulated on data from the past three IPL seasons, this may have led to the prediction of a poor IPL season for RCB, which is a stark contrast to the flying start they've had in their opening 7 games. Another reason for the inaccurate predictions in RCB matches is the unexpected performance of new players in their team this season, such as Glenn Maxwell and current Purple Cap holder Harshal Patel. Even though the third model accounts for players transferred to each team at the start of this season, RCB players like Maxwell have truly had a change in fortune in terms of performance. This is one huge limitation of all three models: they assume that a team's or its players' past performance in previous seasons has a huge impact on the current season's performance. In reality, however, we often see amazing comebacks from experienced players and new rising stars bursting onto the scene. This is likely what makes the IPL so entertaining - no amount of data can tell us with certainty who will win a given match.
Other insights drawn from each model
With Models 1 and 2, we replicated each game 1000 times to analyze the distribution of wins predicted for each team. By analyzing this distribution, we can tell which games our model expected to be extremely tight, and which ones it expected to be one-sided: dividing the number of times a team won by the total number of replicates (1000) gives the proportion, or probability, of winning as predicted by our model. Certain games were predicted to be extremely one-sided and ended up being so in reality. For instance, Model 2 predicted that there was an 81.5% chance that the Royal Challengers Bangalore (RCB) would beat the Kolkata Knight Riders (KKR), and in the actual match RCB ended up winning by a whopping 38 runs. It is also important to note that some games were predicted very incorrectly. For example, according to Model 2, the Rajasthan Royals only had an 11.5% chance of beating the Kolkata Knight Riders (KKR), but they ended up doing so easily with 6 wickets and 1 over to spare. This again highlights the thrill of the IPL, in which there is no guarantee that the historical favourite will actually win the match.
Another interesting insight drawn from the machine learning model (Model 3) building process was that the K-Nearest Neighbours model tended to perform the best. Since KNN predicts a new observation by comparing it to the training data points with the most similar features, this suggests that accounting for similar past matches when predicting future matches is important. Although 65.5% accuracy is definitely not state of the art, it is a lot better than the 50% accuracy we would expect if we were to randomly guess the outcome of each match. This suggests that even though there is a great deal of unpredictability within the IPL, we can reduce this uncertainty by studying historical data such as run rates in each phase of play, past head-to-head records, and so on.
Since we've shown that Model 3 is the best model in terms of our success metrics, let's now consider its predictions for the rest of the IPL. We will take the actual results for the games that have already been played and predict the outcomes of the games that have not, in order to get an expected final league table.
Table 4: Final League Table Predictions for IPL 2021
According to Model 3, we should expect the top three teams qualifying for the playoffs to be the Chennai Super Kings, Delhi Capitals and Mumbai Indians. These teams are predicted to be joined by either the Rajasthan Royals, Punjab Kings, Royal Challengers Bangalore or Kolkata Knight Riders, which suggests that our model believes it will be a close battle for the fourth playoff slot. Something important to note is that since over half of the IPL is yet to be played, many of the teams will be playing each other again. Because Model 3 does not involve simulation or random sampling, it predicts exactly the same outcome for each repeated fixture as it did in the first half. In addition, the model is trained only on data from prior seasons and cannot account for the results of the first round of games in the current season. This is why RCB's predicted second half of the season is extremely poor, with the model expecting them to win only one of their seven remaining games. Despite this, it will definitely be interesting to look back on these predictions once the IPL season is completed at the end of the year.
Key Takeaways
Overall, in this article we've explored different methods of model building in the context of predicting the result of an Indian Premier League match. We've seen not only how cricket, like all other sports, is extremely unpredictable, but also how we can use data in creative ways to slightly reduce this uncertainty when making predictions. With the rapid rise in the application of model building in sports through techniques like machine learning, it is crucial to interpret the predictions and insights of a model with caution, because in sports, past events do not entirely determine future outcomes. In fact, the unpredictability of sports is one of the main reasons there is such a rise in demand for analytics within the field. Tournaments like the Indian Premier League may be full of uncertainty, but that is what makes them a thrilling experience adored by millions of fans worldwide.
Code: Github Repository