This study explores how in-game data, like as shots on goal, fouls, and cards, affect match outcomes by looking at 380 matches from the English Premier League’s 2021–2022 season. We examine the critical impact these variables play in deciding whether a home or away side wins using Evan Gower’s painstakingly selected dataset. We examine these data and use logistic regression to forecast match outcomes, providing information about the dynamics of the game. This work opens the door for predictive modelling in sports analytics while also improving our knowledge of football strategy.
Introduction
Evan Gower’s work on Kaggle has made it possible to obtain statistics from 380 matches, giving us a detailed glimpse at the English Premier League season of 2021–2022. Along with comprehensive statistics for both home and away sides, such as goals, shots, fouls, and cards, it also provides important game information such team names, match dates, and referees. This dataset, which contains data on halftime performance as well as full-time results, is an invaluable resource for anyone wishing to examine the factors that affect football match outcomes, from individual player affects to team plans.
Importing Dataset
import pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsfrom sklearn.model_selection import train_test_splitfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import confusion_matrix, accuracy_score# Load the datasetdf = pd.read_csv('data/soccer21-22.csv')# Creating a binary target variable 'Result' where 1 represents a home win ('H') and 0 otherwisedf['Result'] = (df['FTR'] =='H').astype(int)# Feature selectionfeatures = ['HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HY', 'AY', 'HR', 'AR']X = df[features]y = df['Result']# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Logistic Regression modelmodel = LogisticRegression(max_iter=1000)model.fit(X_train, y_train)# Predictionsy_pred = model.predict(X_test)# Evaluating the modelaccuracy = accuracy_score(y_test, y_pred)cm = confusion_matrix(y_test, y_pred)
Question 1
What is the connection between in-game metrics such as shots on goal, fouls committed, and cards received, and the outcomes of soccer matches? Can these metrics help in creating a predictive model to forecast whether the match results will favor the home or away team?
Introduction
The first query explores the complex connections between in-game metrics—like shots on goal, fouls, and cards—and how those relationships affect the results of football matches. Our goal is to use these factors to investigate if they can accurately forecast match outcomes that will benefit the home team or the away team. This question is especially fascinating since it addresses fundamental football dynamics and provides information on how different gameplay elements affect a team’s eventual success or failure.
Approach
First, we do a logistic regression analysis to see how in-game metrics might be used to predict a home team’s outcome. We choose relevant attributes such as shots, shots on goal, fouls committed, and cards both the home and away teams have earned. To guarantee the robustness of the model, the dataset is divided into training and testing sets. The logistic regression model is then trained, and its effectiveness is assessed using an accuracy score and a confusion matrix. We are able to measure the effect of every in-game indicator on the probability that the home team will win thanks to this scientific approach.
Data Preparation and Pre-processing We begin the data preparation step by loading the entire football dataset, which includes a wealth of match information from the 2021–2022 English Premier League season. We convert the full-time result (FTR) into a binary target variable called “Result,” where a “1” denotes a home team win, to make our analysis easier. Then, we choose crucial elements that could have an impact on how the game turns out, like shots, shots on goal, fouls, and cards both the home and away teams receive. We then divided the dataset into training and testing sets, which prepared the stage for creating and assessing our prediction model. The data is ready for perceptive examination thanks to this careful preparation.
Plot 1: Feature importance visualization
# Plot 1: Feature importance visualizationcoefficients = pd.DataFrame(model.coef_[0], X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False)plt.figure(figsize=(10, 8))sns.barplot(x=coefficients['Coefficient'], y=coefficients.index)plt.title('Feature Importance in Predicting Home Team Wins')plt.xlabel('Coefficient Value')plt.ylabel('In-game Metrics')plt.show()
Feature Importance Visualization by Logistic Regression
A positive coefficient value (bar extending to the right) suggests that higher values of the corresponding metric increase the likelihood of the home team winning. For instance, a positive coefficient for Number of shots on Target by Home Team (HTS) indicates that the home side has a greater chance of winning if they have more shoots on target.
On the other hand, a negative coefficient value (bar extending to the left) indicates that the probability of the home side winning is decreased at greater levels of that statistic. For example, if Number of redcards recieved Home Team (HR) has a negative coefficient, it means that the home team’s chances of winning are likely to be lowered by committing more red cards received.
Plot 2: Confusion Matrix
# Plot 2: Confusion Matrixsns.heatmap(cm, annot=True, fmt='d', cmap='Blues')plt.title('Confusion Matrix for Home Team Win Prediction')plt.xlabel('Predicted Label')plt.ylabel('True Label')plt.show()# Print accuracyprint(f'Model Accuracy: {accuracy:.2f}')
Model Accuracy: 0.78
Confusion Matrix for Home Team Win Prediction
This chart shows how well the model guessed if the home team would win a soccer game. It’s like a report card for the model’s predictions. For example, if the model guessed that the home team wouldn’t win (maybe it thought the game would be a draw or the away team would win), and that was correct, that count goes in the top left corner. If the model thought the home team wouldn’t win, but they actually did, that count goes in the bottom left corner.
On the other hand, if the model predicted a home win and was right, that goes in the bottom right corner. But if it predicted a home win and was wrong, that goes in the top right corner.
Heat Map
# Define the target variable - Home win (1) or not (0)df['HomeWin'] = df['FTR'].apply(lambda x: 1if x =='H'else0)# Select features of interestfeatures = ['HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HY', 'AY']# Compute correlation matrix for the featurescorr = df[features].corr()# Heatmap of correlation between featuresplt.figure(figsize=(8, 6))sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")plt.title("Feature Correlation Heatmap")plt.show()
Feature Correlation Heatmap This heatmap visualizes the correlation between different features (metrics) from the soccer dataset. Each cell in the heatmap shows the correlation coefficient between two features, ranging from -1 to 1. A correlation coefficient close to 1 implies a strong positive relationship (as one feature increases, the other tends to increase), close to -1 implies a strong negative relationship (as one feature increases, the other tends to decrease), and around 0 implies no linear relationship.
For example, if ‘HS’ (Home Shots) and ‘HST’ (Home Shots on Target) have a correlation coefficient close to 1, it means that when the home team takes more shots, they also tend to have more shots on target. Conversely, if ‘HS’ and ‘AF’ (Away Fouls) have a correlation coefficient close to -1, it suggests that as the home team takes more shots, the away team tends to commit fewer fouls, indicating a possible defensive strategy or play style difference when under pressure.
Question 2
How do the outcomes of soccer matches vary under the premise that all games concluded at halftime? How does this disparity differ from team to team, and what is the influence of the overall outcome of the league championship?
Introduction
In question two, we analyze a hypothetical scenarios to offer practical insights derived from the European Premier League’s 22nd season. By exploring how matches might end at halftime, we aim to analyze the real-world dynamics of team performance. Specifically, we will investigate whether successful teams exhibit early aggression in the first half or stage comebacks in the second half, and whether team rankings reflect consistent performance or fluctuate based on halftime strategies. This analysis offers practical implications for coaching strategies, player development, and tactical approaches, providing actionable insights to enhance team performance and strategic decision-making in professional football.
Approach
Initially, we will employ traditional ranking methods based on full-time match results. Per the European Premier League, rank will be determined based on points per game, where a win is three points, a draw is one point, and a loss is zero points. When necessary, we will employ goal difference and total goals scored to determine team standings. Subsequently, we will simulate halftime outcomes for each match and recalculate team rankings using the same criteria applied for full-time results. An analysis will be conducted in two parts: first, comparing the disparities between the rankings derived from full-time and halftime outcomes; and secondly, focusing on the performance of the top three and bottom three teams in both scenarios to assess the influence of halftime outcomes on promotion, relegation, and the overall league championship. Through this comprehensive approach, we aim to provide insights into the dynamics of soccer match outcomes and their implications for team standings and league competitiveness.
Data Preparation and Pre-processing Data preparation for the analysis of the 2021-2022 European Premier League (EPL) season began with the calculation of points awarded to home and away teams based on match outcomes, adhering to the EPL’s point allocation system. Subsequently, the dataset was organized through grouping operations, which revealed instances of point ties among teams. To address this, a feature engineering process was employed to derive the goal difference metric, representing the difference between goals scored and goals conceded by each team. This involved aggregating match statistics to determine the net goals accumulated by each team throughout the season. By iteratively refining the dataset through these preprocessing steps, we ensured the readiness of the data for subsequent analysis, laying the groundwork for examining match outcomes and team dynamics during the 2021-2022 EPL season.
With an overview of the preparation and pre-processing steps, we begin by calculating team rankings using full time results. The code below shows the general structure that is used to calculate team rankings using the win/draw/loss point systems. Although the code works for both full time results and half time results, the instance below illustrates an example for full time results.
Show code to rank teams based on points
# Import librariesimport pandas as pd# Load the datasetdf = pd.read_csv('data/soccer21-22.csv')# Function to determine the winner based on pointsdef calculate_points(row):if row['FTR'] =='H':return3elif row['FTR'] =='D':return1else:return0# Apply the function to calculate points for each matchdf['HomePoints'] = df.apply(lambda row: calculate_points(row), axis =1)df['AwayPoints'] = df.apply(lambda row: 3- calculate_points(row) if row['FTR'] !='D'else1, axis =1)# Aggregate points for each teamhome_points = df.groupby('HomeTeam')['HomePoints'].sum().reset_index()away_points = df.groupby('AwayTeam')['AwayPoints'].sum().reset_index()# Combine home and away pointsteam_points = pd.merge(home_points, away_points, how ='outer', left_on ='HomeTeam', right_on ='AwayTeam')team_points['TotalPoints'] = team_points['HomePoints'] + team_points['AwayPoints']# Sort team_points DataFrame based on TotalPointsteam_points = team_points.sort_values(by ='TotalPoints', ascending =False)# Create ranking DataFrameft_ranking = pd.DataFrame({'Team': team_points['HomeTeam'], # You can choose 'HomeTeam' or 'AwayTeam' because they are the same after merging'Points': team_points['TotalPoints'],'Ranking': range(1, len(team_points) +1)})print(ft_ranking)
Team Points Ranking
11 Man City 93 1
10 Liverpool 92 2
5 Chelsea 74 3
16 Tottenham 71 4
0 Arsenal 69 5
12 Man United 58 6
18 West Ham 56 7
9 Leicester 52 8
3 Brighton 51 9
19 Wolves 51 10
13 Newcastle 49 11
6 Crystal Palace 48 12
2 Brentford 46 13
1 Aston Villa 45 14
15 Southampton 40 15
7 Everton 39 16
8 Leeds 38 17
4 Burnley 35 18
17 Watford 23 19
14 Norwich 22 20
Show code to plot team ranks
# Set the style to tickssns.set_style("ticks")plt.figure(figsize=(12, 6))bars = plt.bar(ft_ranking['Team'], ft_ranking['Points'], color=[team_colors.get(team, 'gray') for team in ft_ranking['Team']],alpha=1,width=0.5)plt.title('Full-Time Rankings Based on Points')plt.xlabel('Team')plt.ylabel('Total Points')plt.xticks(rotation=45, ha='center')for bar, rank inzip(bars, ft_ranking['Ranking']): plt.text(bar.get_x() + bar.get_width() /2-0.15, bar.get_height() +0.1, str(rank), ha='left', color='black', fontsize=12)plt.show()
The table and bar plot above show the rankings of all the teams based on the point system. Although Brighton and Wolves are ranked as 9th and 10th, we see from the table that they have tied in the number of points earned. The EPL further breaks ties by considering goal difference, which is the number of goals scored in each season minus the number of goals conceded in the season. The below code shows how to further break down ties using goal difference.
Show code to rank teams based on goal difference
# Aggregate goals scored and conceded for each teamhome_goals_scored = df.groupby('HomeTeam')['FTHG'].sum().reset_index()home_goals_conceded = df.groupby('HomeTeam')['FTAG'].sum().reset_index()away_goals_scored = df.groupby('AwayTeam')['FTAG'].sum().reset_index()away_goals_conceded = df.groupby('AwayTeam')['FTHG'].sum().reset_index()# Merge home and away goals scored and conceded with team_pointsteam_points = pd.merge(team_points, home_goals_scored, how ='left', left_on ='HomeTeam', right_on ='HomeTeam')team_points = pd.merge(team_points, home_goals_conceded, how ='left', left_on ='HomeTeam', right_on ='HomeTeam')team_points = pd.merge(team_points, away_goals_scored, how ='left', left_on ='HomeTeam', right_on ='AwayTeam')team_points = pd.merge(team_points, away_goals_conceded, how ='left', left_on ='HomeTeam', right_on ='AwayTeam')# Rename columns during mergeteam_points.rename(columns={'FTHG_x': 'HomeGoalsScored', 'FTAG_x': 'HomeGoalsConceded','FTAG_y': 'AwayGoalsScored', 'FTHG_y': 'AwayGoalsConceded'}, inplace =True)# Fill NaN values with 0team_points.fillna(0, inplace =True)# Calculate total goals scored and concededteam_points['TotalGoalsScored'] = team_points['HomeGoalsScored'] + team_points['AwayGoalsScored']team_points['TotalGoalsConceded'] = team_points['HomeGoalsConceded'] + team_points['AwayGoalsConceded']# Calculate goal differenceteam_points['GoalDifference'] = team_points['TotalGoalsScored'] - team_points['TotalGoalsConceded']# Sort team_points DataFrame based on TotalPoints and GoalDifferenceteam_points = team_points.sort_values(by = ['TotalPoints', 'GoalDifference'], ascending = [False, False])# Create ranking DataFrameft_ranking = pd.DataFrame({'Team': team_points['HomeTeam'], # You can choose 'HomeTeam' or 'AwayTeam' because they are the same after merging'Points': team_points['TotalPoints'],'GoalDifference': team_points['GoalDifference'],'Ranking': range(1, len(team_points) +1)})print(ft_ranking)
Team Points GoalDifference Ranking
0 Man City 93 73 1
1 Liverpool 92 68 2
2 Chelsea 74 43 3
3 Tottenham 71 29 4
4 Arsenal 69 13 5
5 Man United 58 0 6
6 West Ham 56 9 7
7 Leicester 52 3 8
8 Brighton 51 -2 9
9 Wolves 51 -5 10
10 Newcastle 49 -18 11
11 Crystal Palace 48 4 12
12 Brentford 46 -8 13
13 Aston Villa 45 -2 14
14 Southampton 40 -24 15
15 Everton 39 -23 16
16 Leeds 38 -37 17
17 Burnley 35 -19 18
18 Watford 23 -43 19
19 Norwich 22 -61 20
Show code to plot team ranks when considering goal difference
# Set the style to whitegridsns.set_style("whitegrid")fig, ax = plt.subplots(figsize = (12, 6))bars = ax.bar(ft_ranking['Team'], ft_ranking['GoalDifference'], color = [team_colors.get(team, 'gray') for team in ft_ranking['Team']], alpha=1, width=0.5)plt.title('Bar Plot of Goal Difference for Teams')plt.xlabel('Team')plt.ylabel('Goal Difference')plt.xticks(rotation=45, ha='right')plt.show()
With the point system and goal difference, we no longer have ties between any teams and have the official ranking of teams in the 2021-2022 EPL Season. Now we repeat the process again with the half time results before beginning our discussion. The below table and plot show the half-time result ranking.
Team Points Ranking
10 Liverpool 79 1
11 Man City 77 2
5 Chelsea 68 3
16 Tottenham 65 4
0 Arsenal 63 5
15 Southampton 52 6
12 Man United 51 7
18 West Ham 50 8
13 Newcastle 49 9
6 Crystal Palace 48 10
19 Wolves 48 11
1 Aston Villa 47 12
9 Leicester 45 13
3 Brighton 44 14
8 Leeds 40 15
7 Everton 36 16
2 Brentford 36 17
4 Burnley 34 18
14 Norwich 29 19
17 Watford 28 20
Again, there are ties with the rankings when only looking at points based on half-time results. Therefore, we must look at goal difference as well. The table and plot below show the updated rankings when considering points and goal difference.
Team Points GoalDifference Ranking
0 Liverpool 79 30 1
1 Man City 77 34 2
2 Chelsea 68 24 3
3 Tottenham 65 16 4
4 Arsenal 63 12 5
5 Southampton 52 -8 6
6 Man United 51 -3 7
7 West Ham 50 6 8
8 Newcastle 49 1 9
10 Wolves 48 3 10
9 Crystal Palace 48 -4 11
11 Aston Villa 47 -4 12
12 Leicester 45 -3 13
13 Brighton 44 -6 14
14 Leeds 40 -19 15
16 Brentford 36 -13 16
15 Everton 36 -14 17
17 Burnley 34 -10 18
18 Norwich 29 -23 19
19 Watford 28 -19 20
Discussion
The analysis of halftime outcomes in the context of team rankings provides valuable insights into the dynamics of team performance and strategic decision-making in professional football. In particular, focusing on the top three teams before and after halftime reveals interesting patterns. Despite the structural changes in the game, the top three teams largely maintain their positions, with only minor adjustments observed. This consistency suggests that these teams possess strategies or training regimens that consistently place them at the forefront of the league, regardless of variations in match structure or halftime strategies.
When examining the overall changes in rankings, notable shifts are observed, particularly in the swapping of first and second place. This exchange of positions between the top two teams, while subtle, may indicate differing strengths in the first and second halves of matches. For instance, while Manchester City emerges as the eventual winners of the EPL league, the swapping of positions with Liverpool could imply a potential area for improvement in either team’s gameplay strategy. It suggests the need for Liverpool to enhance their stamina or second-half strategies, despite their strong performance in the initial phases of matches.
Moreover, significant changes in rank, such as Southampton’s improvement from 15th to 6th place, underscore the importance of halftime strategies and team conditioning. The notable rise in Southampton’s ranking may signify either a highly aggressive starting strategy that may not be sustainable or potential deficiencies in second-half performance due to inadequate conditioning or strategic adjustments. These findings highlight the complexity of team dynamics and the multifaceted factors influencing match outcomes and league standings.
Overall, the insights derived from this analysis can offer valuable guidance for teams undergoing significant shifts in rankings. By identifying potential areas for improvement, such as halftime strategies, conditioning, or tactical adjustments, teams can make informed decisions to enhance their performance and competitiveness in professional football leagues.
References
Pandas documentation used to assist with joining dataframes and resetting index. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
We also checked the official Premier League website to determine how rankings are calculated. https://www.premierleague.com/premier-league-explained