By: Sankeerth Gandhari
Introduction
The National Football League (NFL) season is an annual spectacle that captures the attention of millions of fans and analysts alike. Predicting the season’s outcome, particularly identifying the team most likely to win, is a complex and challenging task. This paper uses machine learning techniques to predict the winner of the 2024 NFL season, utilizing key metrics such as Expected Points Added (EPA), Red Zone Efficiency, Strength of Schedule (SOS), and other performance indicators.
The purpose of this project is to apply advanced analytics and machine learning to a highly unpredictable sport. While traditional sports predictions rely on expert opinion, this approach uses data-driven techniques to produce a predicted winner. Given the volatility of NFL outcomes, the goal is to present a prediction model for the winner of the 2024 season.
The NFL is a league characterized by high levels of competition, making it an ideal candidate for this analysis. The various aspects of team performance—ranging from red zone efficiency to point differential—allow machine learning models to uncover patterns and predict outcomes. In this report, I provide a comprehensive overview of the data, methods, and results that lead to the prediction of the 2024 NFL season winner.
Data Collection
The data for this project was sourced from multiple reputable sports databases, including:
NFL Stats: Data on team performance, including passing, rushing, and defensive stats, for the 2024 season.
Pro Football Reference: Team performance metrics, such as win-loss record percentages.
The dataset was compiled to include performance metrics (features) such as:
EPA (Expected Points Added Offense + Defense): A measure of a team's offensive and defensive efficiency, capturing the value added to the game in terms of expected points from each play.
Red Zone Efficiency: The ability of teams to convert opportunities inside the 20-yard line into touchdowns.
Strength of Schedule (SOS): Indicates the toughness of the teams' opponents, reflecting the difficulty of their season.
Current Record % (Win-Loss): The percentage of games a team has won out of the total games played.
Point Differential: The difference between points scored and points allowed, indicating overall team performance and dominance.
Analysis
Feature Engineering
To predict the 2024 NFL season winner, I created a set of team-level features designed to capture efficiency, scoring margin, and schedule context. The features used in the machine learning models included:
Offensive and Defensive EPA: To quantify how effectively a team performs on each play compared to expectations.
Point Differential: Highlights the margin of dominance in games.
Red Zone Efficiency: Reflects performance close to the end zone.
Strength of Schedule (SOS): Captures the context of performance against varying levels of competition.
Win-Loss Record as a Percentage: Serves as a basic measure of overall team success.
These metrics were chosen because they capture key aspects of team performance, such as efficiency, consistency, and competitiveness, which are critical when making predictions. Each feature was also scaled to the range 0 to 1 so that no single metric dominates.
In logistic regression, each feature (or variable) is assigned a weight, so some features can influence the prediction more than others. In this project, however, I chose to give all features equal weight, so no single variable drives the predicted NFL winner.
Why Logistic Regression?
Logistic regression was chosen for this project because it suits tasks that require probabilities as output, such as predicting the winner of the NFL.
Easy to Understand: The model gives weights to each feature (like EPA or SOS), showing how much each affects the prediction.
Efficient: It’s fast to run and works well with a dataset of NFL games.
Probability Scores: It tells you the chances of each team winning, not just a yes/no answer.
Downsides:
Simplicity: It assumes a linear relationship between the features and the log-odds of the outcome, which might miss complex patterns.
Feature Quality: The model’s accuracy depends on good and properly scaled features.
Model Mechanics:
Logistic regression predicts the probability of an outcome using the formula:
$$h_\theta(x) = \frac{1}{1 + e^{-(w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n)}}$$
hθ(x):
This is the output of the logistic regression model.
It represents the predicted probability of a team winning the NFL season.
The value ranges between 0 and 1.
If the linear score (w0 + w1X1 + … + wnXn) is large and positive, the probability of winning the season is closer to 1.
If the linear score is large and negative, the probability of winning is closer to 0.
w0 (Intercept):
This is the bias term and acts as the model's default prediction when all feature values are zero.
It ensures the model starts with an appropriate base prediction before accounting for feature values for each team.
w1, w2, …, wn (Weights):
These represent the importance or contribution of each feature to the prediction.
Larger weights (in absolute value) indicate that the corresponding feature has a stronger impact on predicting a team’s probability of winning.
In this project, these weights are equal, meaning all features contribute equally after scaling.
X1, X2, …, Xn (Features):
These are the input features used to predict the outcome.
In this project, the input features include:
EPA, Red Zone Eff, Point Differential, SOS, and Record
e:
The base of the natural logarithm (~2.718).
It’s part of the sigmoid function that converts the linear combination of features into a probability between 0 and 1.
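The pieces described above can be put together in a few lines. The following is a minimal sketch, not the project's actual code: the function name, the feature values, the weight of 1.0, and the intercept are all hypothetical and chosen only to illustrate the formula with equal weights.

```python
import math

def predict_probability(features, weights, intercept=0.0):
    """Apply the sigmoid to the linear score w0 + w1*x1 + ... + wn*xn."""
    linear_score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical scaled values for one team, in the order:
# EPA, Red Zone Eff, Point Differential, SOS, Record
example_features = [0.9, 0.8, 0.95, 0.5, 0.85]

# Assumption: the article says the weights are equal but not what they are;
# 1.0 and the intercept of -2.5 are placeholders for illustration.
equal_weights = [1.0] * len(example_features)
probability = predict_probability(example_features, equal_weights, intercept=-2.5)
print(f"Predicted probability: {probability:.3f}")
```

With equal weights, the linear score is just the intercept plus a constant multiple of the sum of the scaled features, so teams are ordered by that sum.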
Data Preparation
Scaling: Min-Max Scaling (For each feature)
Min-Max scaling is a data preprocessing technique used in machine learning to transform features to a specific range between 0 and 1. It is also known as normalization.
Purpose:
Scale Features to a Common Range: Min-Max scaling brings all features to a common scale, preventing features with larger values from dominating the model's end outcome.
Improve Algorithm Performance: Many machine learning algorithms, including Logistic Regression, perform better when all features are on a similar scale.
Data Interpretation: Scaled features are easier to interpret and compare, as they are all within a defined range.
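Bound Extreme Values: Min-Max scaling maps every value, including outliers, into the 0 to 1 range. Note that the minimum and maximum themselves define that range, so a single extreme value can compress the remaining values closer together.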
Formula:
$$X_{\text{scaled}} = \frac{X_i - \min(X)}{\max(X) - \min(X)}$$
where:
X_i is the original value of the feature.
min(X) is the minimum value of the feature in the dataset.
max(X) is the maximum value of the feature in the dataset.
Find Min and Max: The algorithm first identifies the minimum and maximum values in the dataset of each feature.
Apply Formula: The minimum value is subtracted from each feature value, and the result is divided by the range (max_value - min_value).
Scaled Values: This results in scaled values between 0 and 1. The minimum value of the feature will be scaled to 0, and the maximum value will be scaled to 1. All other values will fall within this range.
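The three steps above can be sketched in plain Python as follows. The function name and the point-differential values are hypothetical, and the handling of a constant feature (where max equals min) is an assumption, since the article does not say how that case is treated.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] using min-max scaling."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant feature carries no information, so map it to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    value_range = highest - lowest
    return [(value - lowest) / value_range for value in values]

# Hypothetical point differentials for five teams
point_differentials = [-120, -30, 0, 45, 150]
scaled = min_max_scale(point_differentials)
print(scaled)
```

The smallest input maps to 0 and the largest to 1; in the project, this would be applied separately to each feature column.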
Train-Test Split:
80% Training: To train the logistic regression model.
20% Test: To assess the model’s final performance.
Dataset Size: The model was trained on all 2024 NFL regular season games played before 11/30 (approximately 12 games for each team).
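An 80/20 split like the one described can be sketched with the standard library alone. This is an illustration under assumptions: the function name, the fixed seed, and the placeholder rows are hypothetical, and the project may well have used a library routine instead.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder rows standing in for game records (hypothetical data)
games = list(range(100))
train_rows, test_rows = train_test_split(games)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting keeps the test set from being, for example, only the most recent games; fixing the seed makes the split repeatable.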
What does the data say?
[Figures not reproduced: Red Zone Efficiency, EPA, Strength of Schedule]
Evaluation Metrics
Accuracy: The accuracy score quantifies how well the model's predictions align with the rankings based on the raw, scaled data.
The accuracy score is calculated by:
Selecting a value for k, representing the number of teams to consider (e.g., top 25).
Identifying the top k teams from the raw rankings and the model predictions.
Comparing these two lists of top k teams and counting the number of teams that appear in both lists (i.e., the common teams).
Dividing the number of common teams by k to obtain the accuracy score.
Higher accuracy scores indicate stronger agreement between the raw rankings and the model's predictions. This suggests that the model is effectively capturing the important factors that contribute to a team's overall performance and likelihood of winning.
Lower accuracy scores suggest potential discrepancies between the two ranking methods.
For example, with k = 25 (the top 25 teams in the NFL), the model has an accuracy of 72%, meaning 18 of the model's top 25 teams also appear in the raw top 25. This suggests that the model's predictions generally align with the rankings derived from the raw data, but there are some discrepancies.
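The top-k overlap described in this section can be sketched as a short function. The function name and the four toy teams below are hypothetical and exist only to show the calculation; they are not the project's data.

```python
def top_k_overlap(raw_scores, model_scores, k):
    """Fraction of teams that appear in the top k of both score dictionaries."""
    def top_k(scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:k])
    common_teams = top_k(raw_scores) & top_k(model_scores)
    return len(common_teams) / k

# Toy example with made-up teams and scores
raw_scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}
model_scores = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.3}
overlap = top_k_overlap(raw_scores, model_scores, k=2)
print(overlap)  # raw top 2 is {A, B}, model top 2 is {A, D}, so 1 of 2 match
```

Note that this metric only checks membership in the top k, not the order within it, so two rankings that contain the same k teams in a different order still score 1.0.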
Results
Raw Scaled Data (listed alphabetically by team):
Team | Average Score |
Arizona | 0.53108 |
Atlanta | 0.277542 |
Baltimore | 0.815673 |
Buffalo | 0.761685 |
Carolina | 0.159504 |
Chicago | 0.499231 |
Cincinnati | 0.516627 |
Cleveland | 0.386242 |
Dallas | 0.181421 |
Denver | 0.583324 |
Detroit | 0.889917 |
Green Bay | 0.662022 |
Houston | 0.565884 |
Indianapolis | 0.38772 |
Jacksonville | 0.27221 |
Kansas City | 0.528754 |
LA Chargers | 0.498004 |
LA Rams | 0.398637 |
Las Vegas | 0.226305 |
Miami | 0.428773 |
Minnesota | 0.667855 |
NY Giants | 0.227517 |
NY Jets | 0.399658 |
New England | 0.339189 |
New Orleans | 0.440161 |
Philadelphia | 0.666714 |
Pittsburgh | 0.56144 |
San Francisco | 0.530281 |
Seattle | 0.45758 |
Tampa Bay | 0.552521 |
Tennessee | 0.370679 |
Washington | 0.6129 |
Logistic Regression Prediction
Top 10 Teams in Rankings (Chances of Winning):
Team | Chances of Winning (%) |
Detroit | 7.60% |
Baltimore | 6.52% |
Buffalo | 5.92% |
Philadelphia | 5.62% |
Minnesota | 5.20% |
Green Bay | 5.19% |
Washington | 4.84% |
Pittsburgh | 4.72% |
Denver | 4.59% |
Houston | 4.37% |
The predicted winner of the 2024 NFL season is Detroit, with a 7.60% chance of winning. Detroit led in critical metrics such as EPA, record, and point differential, contributing to its top rank. It also placed in the top 5 in red zone efficiency.
Analysis of Results
While Detroit emerged as the predicted winner, the model also highlighted Baltimore and Buffalo as strong contenders due to their balanced performances across offensive and defensive metrics. Some discrepancies remain: Philadelphia ranks higher than Minnesota despite a slightly lower raw score (0.666714 versus 0.667855). The gap is only about 0.001, so this reordering most likely reflects small imprecision in the logistic regression fit rather than a meaningful difference between the teams.
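The agreement between the two rankings can be checked directly from the numbers published in the Results tables. The sketch below copies the raw average scores and the model's top 10 from those tables; the variable names are my own, and the comparison logic is an illustration rather than the project's code.

```python
# Raw average scores, copied from the "Raw Scaled Data" table
raw_scores = {
    "Arizona": 0.53108, "Atlanta": 0.277542, "Baltimore": 0.815673,
    "Buffalo": 0.761685, "Carolina": 0.159504, "Chicago": 0.499231,
    "Cincinnati": 0.516627, "Cleveland": 0.386242, "Dallas": 0.181421,
    "Denver": 0.583324, "Detroit": 0.889917, "Green Bay": 0.662022,
    "Houston": 0.565884, "Indianapolis": 0.38772, "Jacksonville": 0.27221,
    "Kansas City": 0.528754, "LA Chargers": 0.498004, "LA Rams": 0.398637,
    "Las Vegas": 0.226305, "Miami": 0.428773, "Minnesota": 0.667855,
    "NY Giants": 0.227517, "NY Jets": 0.399658, "New England": 0.339189,
    "New Orleans": 0.440161, "Philadelphia": 0.666714, "Pittsburgh": 0.56144,
    "San Francisco": 0.530281, "Seattle": 0.45758, "Tampa Bay": 0.552521,
    "Tennessee": 0.370679, "Washington": 0.6129,
}

# Model ranking, copied from the "Top 10 Teams in Rankings" table
model_top10 = [
    "Detroit", "Baltimore", "Buffalo", "Philadelphia", "Minnesota",
    "Green Bay", "Washington", "Pittsburgh", "Denver", "Houston",
]

# Rank teams by raw score, highest first, and keep the top 10
raw_top10 = sorted(raw_scores, key=raw_scores.get, reverse=True)[:10]

# Teams whose position differs between the two rankings
moved = [
    (position + 1, raw_team, model_team)
    for position, (raw_team, model_team) in enumerate(zip(raw_top10, model_top10))
    if raw_team != model_team
]

print("Same ten teams:", set(raw_top10) == set(model_top10))
for position, raw_team, model_team in moved:
    print(f"Position {position}: raw={raw_team}, model={model_team}")
```

Both lists contain the same ten teams and agree on the first three positions; the differences are reorderings among teams whose raw scores are close together.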
Possible Inaccuracies
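The model relies solely on scaled team metrics, so factors it does not see, such as player injuries or mid-season trades, could significantly change the actual outcomes. Additionally, logistic regression combines features as a simple weighted sum with no interaction terms, and several of the features used here (for example, point differential and win-loss record) are likely correlated, so the model might not fully capture the complexity of team dynamics.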
Conclusion
The machine learning model provided a data-driven prediction for the 2024 NFL season winner. The Detroit Lions emerged as the most likely winner, with a 7.60% probability of success. However, the unpredictable nature of sports means that real-world outcomes could differ from these predictions. Future versions of this project could use neural networks to capture more complex interactions and improve accuracy.
Sources:
Datasets/Code: https://github.com/Sunnyg83/BSA_NFL_Predictor