INFO 523 - Spring 2024 - Project Final
Our project predicts match results using historical ODI cricket data and machine learning. It tracks forecast accuracy closely and is designed to encourage user engagement.
We treat the data from 2002 to 2022 as historical data. Using this data, we trained the machine learning models described below to predict the match winners for the current year (taking 2023 as the current year).
The primary goal of our project is to train classification models on past and current match data, select the best model, and then use it to predict the winner of a cricket match from live match statistics (e.g., runs scored, wickets fallen, overs bowled) that update at regular intervals.
The model's prediction will be displayed on a webpage that refreshes every time a new statistic arrives, so the prediction updates along with it.
ODI_Match_Data.csv: Provides the location and season of the cricket matches, along with team information and the play results for each team member. We need this file to investigate partnerships between batsmen. Its dimensions are 155432 rows by 23 columns.
ODI_Match_info.csv: Overlaps with the file above but also provides information on the umpires, performance, and the city where the match took place. We need this file to analyze the batting and bowling performance of each player. Its dimensions are 2380 rows by 18 columns.
Step 1: Data Preprocessing - We merged the match_data and match_info files, then carried out Exploratory Data Analysis, using pairplots to look for correlations between variables.
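The merge described above can be illustrated with a minimal, dependency-free sketch. It assumes the ball-by-ball file has a `match_id` column that points at an `id` column in the match-info file. Those two key names, the other columns, and the sample rows are hypothetical stand-ins, not taken from the project files. A dataframe library would do the same thing with a single merge call.

```python
import csv
import io

# Tiny in-memory stand-ins for the two CSV files (hypothetical rows and columns).
MATCH_DATA = """match_id,ball,batting_team,runs_off_bat
1,0.1,India,4
1,0.2,India,1
2,0.1,Australia,0
"""
MATCH_INFO = """id,city,toss_winner,winner
1,Mumbai,India,India
2,Sydney,England,Australia
"""


def merge_on_match(data_text, info_text):
    """Left-join every ball-by-ball row with its match-level row."""
    info_by_id = {row["id"]: row for row in csv.DictReader(io.StringIO(info_text))}
    merged = []
    for ball in csv.DictReader(io.StringIO(data_text)):
        info = info_by_id.get(ball["match_id"], {})
        combined = dict(ball)
        # Copy match-level fields onto the ball row, skipping the duplicate key.
        combined.update({k: v for k, v in info.items() if k != "id"})
        merged.append(combined)
    return merged


merged_rows = merge_on_match(MATCH_DATA, MATCH_INFO)
```

Each merged row carries both the per-ball fields and the match-level fields, which is the shape the later steps work on.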
Step 2: Data Manipulation - During data cleaning, we dropped columns that were largely empty and replaced some of the remaining missing values with ‘Unknown’.
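A rough sketch of this cleaning rule follows. The 50% cutoff for "largely empty" is an assumption, since the report does not state the threshold used, and the column names and rows are hypothetical.

```python
def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def clean_rows(rows, max_missing_ratio=0.5, placeholder="Unknown"):
    """Drop columns that are mostly empty, then fill the remaining gaps."""
    if not rows:
        return []
    keep = []
    for col in rows[0].keys():
        missing = sum(1 for r in rows if is_missing(r.get(col)))
        # Keep a column only if its share of missing values is small enough.
        if missing / len(rows) <= max_missing_ratio:
            keep.append(col)
    return [
        {c: (placeholder if is_missing(r.get(c)) else r[c]) for c in keep}
        for r in rows
    ]


# Hypothetical sample: 'umpire3' is always empty, the other columns have one gap each.
sample = [
    {"city": "Mumbai", "umpire3": "", "player_dismissed": ""},
    {"city": "", "umpire3": "", "player_dismissed": "V Kohli"},
    {"city": "Sydney", "umpire3": "", "player_dismissed": "SPD Smith"},
]
cleaned = clean_rows(sample)
```

Here `umpire3` is dropped entirely, while the single gaps in `city` and `player_dismissed` are filled with the placeholder.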
Step 3: Feature Engineering - We normalized the continuous data and encoded categorical variables.
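The report does not say which normalization or encoding scheme was used. As one possible reading, the sketch below uses min-max scaling for continuous values and integer label encoding for categories. Both choices, and the sample values, are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


# Hypothetical sample columns.
runs_scaled = min_max_scale([0, 1, 4, 6])
team_codes, team_mapping = label_encode(["India", "Australia", "India", "England"])
```

Keeping the mapping returned by `label_encode` matters at prediction time, because live data has to be encoded with the same codes the model was trained on.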
Step 4: Feature Selection - We select the top 10 columns, excluding winnerTeam, as the input features of the model, and winnerTeam itself as the target feature.
The input features are - [‘ball’, ‘batting_team’, ‘bowling_team’, ‘non_striker’, ‘bowler’, ‘runs_off_bat’, ‘team1’, ‘team2’, ‘toss_winner’, ‘dl_applied’]
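To make the split between inputs and target concrete, here is a small sketch that builds a feature matrix and a target list from row dictionaries. The feature and target names come from the list above; the example row and its values are hypothetical.

```python
FEATURES = [
    "ball", "batting_team", "bowling_team", "non_striker", "bowler",
    "runs_off_bat", "team1", "team2", "toss_winner", "dl_applied",
]
TARGET = "winnerTeam"


def split_features_target(rows):
    """Return (X, y): one list of feature values per row, plus the targets."""
    X = [[row[name] for name in FEATURES] for row in rows]
    y = [row[TARGET] for row in rows]
    return X, y


# One hypothetical merged row; real rows would come from the cleaned dataset.
example_rows = [{
    "ball": 0.1, "batting_team": "India", "bowling_team": "Australia",
    "non_striker": "Player B", "bowler": "Player C", "runs_off_bat": 4,
    "team1": "India", "team2": "Australia", "toss_winner": "India",
    "dl_applied": 0, "winnerTeam": "India", "city": "Mumbai",
}]
X, y = split_features_target(example_rows)
```

Columns outside the feature list, such as `city` in the example row, are simply ignored.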
With Logistic Regression, the accuracy and precision are around 55% and the area under the ROC curve is around 0.5, which is close to random guessing. Since these metrics suggest that this is not the best model, we move on and train another model.
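For reference, the three metrics quoted above can be computed as in the following sketch. It assumes a binary target (1 when a chosen team wins, 0 otherwise), which is a simplification for illustration. The labels and scores are made up and are not the project's results.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Share of positive predictions that are actually positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted win probabilities.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

A model whose scores carry no information about the outcome lands at an AUC of 0.5 under this definition, which is why a value near 0.5 is read as chance-level performance.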
Next we train a Random Forest Classifier. Its accuracy and precision are around 96% and the area under the ROC curve is around 0.96. These metrics show that it clearly outperforms Logistic Regression, so we move forward with this model.
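The model choice can be written as a simple comparison over the reported metrics. The sketch below uses the approximate figures quoted above for both models. The dictionary layout and the `select_best` helper are hypothetical, not part of the project code.

```python
# Approximate metrics as reported in the text (rounded values).
reported = {
    "logistic_regression": {"accuracy": 0.55, "roc_auc": 0.50},
    "random_forest": {"accuracy": 0.96, "roc_auc": 0.96},
}


def select_best(results, metric="roc_auc"):
    """Return the name of the model with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


best_by_auc = select_best(reported)
best_by_accuracy = select_best(reported, metric="accuracy")
```

Both metrics point to the same model here, so the choice does not depend on which one is used for ranking.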
The following links lead to the webpage demonstrating our model, which predicts the winners of 2023 ODI cricket matches.