Prediction of ODI Outcomes

Proposal

Our project predicts the match results, player performances, and significant events by using historical ODI data and state-of-the-art machine learning. It redefines cricket analysis by closely monitoring forecast accuracy and encouraging user engagement.

Author

Affiliation

DaakuDataSingh

School of Information, University of Arizona

Introduction

Since its inception in the late 16th century, cricket has established itself as one of the most popular sports on the planet, with yearly viewers numbering in the billions. With so many viewers, predicting match outcomes has become a hot topic for enthusiasts and sports bettors alike. We plan to use “live” data from 2023 to predict the outcome of a match with a logistic regression model whose features change as the match progresses. With each match “update” we will feed the new data into the model and refresh its prediction. This lets us see how the predicted outcome of a game shifts over the course of the match.

Why the ODI dataset

Cricket enjoys widespread popularity worldwide, so predictions about its matches have a broad potential audience.
The ODI dataset is rich in detail, encompassing player statistics, match outcomes, team performances, and other relevant information.
With the ODI dataset, we can conduct thorough comparative analyses across players, teams, and historical eras. These comparisons can uncover trends, patterns, and changes, offering valuable insights into the sport’s development over time.

High-level Goal

In this project we will create a dashboard for a live, ongoing cricket match that updates its plots, statistics, and outcome prediction at regular intervals.

Goal Details

Motivation:

For this project, we wanted to showcase our prediction skills on the challenge of a live dataset. Sports are a natural source of live data, so we chose our favorite sport: cricket. Our idea is to compare a predicted trend with the actual live trend. For demonstration purposes we will not use actual real-time data, because we found that it would incur unwanted costs. Instead, we will use a recent ready-made dataset: our back-end server will automatically update the data table and send it to the front end at regular intervals, essentially mocking a live match.

Description of datasets:

We’ll use two datasets:

ODI_Match_Data.csv: Provides facts about the location and season of each cricket match, along with team information and ball-by-ball results for each player. We’ll need this one to investigate partnerships between batsmen. Its dimensions are 155,432 rows by 23 columns. The data that appears in this proposal is a truncated version for ease of storage, but the project will use an API that supplies the entire dataset.

Variables that we’ll use:

Variable Name	Data type	Description
match_id	double	A unique identifier for each ODI cricket match.
season	character	The season in which the match took place.
start_date	character	The date on which the match started.
innings	double	The innings number (1st innings or 2nd innings).
ball	double	A numeric representation of the ball number bowled in the innings.
batting_team	character	The name of the batting team for the current innings.
bowling_team	character	The name of the bowling team for the current innings.
striker	character	The batsman who is currently facing the ball.
non_striker	character	The batsman at the non-striker’s end.
bowler	character	The bowler who is delivering the ball.
runs_off_bat	double	The number of runs scored off the bat (excluding extras).
extras	double	The total number of extra runs (wides, no-balls, byes, leg-byes, penalty) in the current ball.
wides	double	The number of wide deliveries bowled in the current ball.
noballs	double	The number of no-ball deliveries bowled in the current ball.
byes	double	The number of byes scored in the current ball.
legbyes	double	The number of leg-byes scored in the current ball.

ODI_Match_info.csv: Overlaps with the dataset above but adds information on the umpire, performance, and the city where the match took place. We’ll need this one to analyze the batting and bowling performance of each player. Its dimensions are 2,380 rows by 18 columns.

Variables that we’ll use:

Variable Name	Data type	Description
id	double	A unique identifier for each cricket match.
season	character	The season in which the match took place.
date	date	The date on which the match was played.
team1	character	The name of the first cricket team participating in the match.
team2	character	The name of the second cricket team participating in the match.
result	character	The result of the match (e.g., “normal,” “tie,” “no result”).
winner	character	The winning team of the match.
win_by_runs	double	The margin of victory in runs (0 if the match was won by wickets or the margin does not apply).
win_by_wickets	double	The margin of victory in wickets (0 if the match was won by runs or the margin does not apply).

import pandas as pd

# Load the ball-by-ball data and preview the first five rows
data = pd.read_csv('data/ODI_Match_Data.csv')
print(data.head())

   match_id   season  start_date                           venue  innings  \
0   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
1   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
2   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
3   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
4   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   

   ball batting_team bowling_team     striker   non_striker  ... wides  \
0   0.1        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
1   0.2        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
2   0.3        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
3   0.4        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
4   0.5        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   

   noballs  byes  legbyes  penalty  wicket_type  player_dismissed  \
0      NaN   NaN      NaN      NaN          NaN               NaN   
1      NaN   NaN      NaN      NaN          NaN               NaN   
2      NaN   NaN      NaN      NaN          NaN               NaN   
3      NaN   NaN      NaN      NaN          NaN               NaN   
4      NaN   NaN      NaN      NaN          NaN               NaN   

   other_wicket_type other_player_dismissed cricsheet_id  
0                NaN                    NaN      1389389  
1                NaN                    NaN      1389389  
2                NaN                    NaN      1389389  
3                NaN                    NaN      1389389  
4                NaN                    NaN      1389389  

[5 rows x 23 columns]
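
The two files share a match identifier (match_id in ODI_Match_Data.csv, id in ODI_Match_info.csv), so each ball-by-ball row can be labeled with whether the batting team went on to win. The following is a minimal sketch in plain Python, not the project's actual pipeline. The sample rows, the helper name label_rows, and the batting_team_won field are illustrative assumptions; only the column names come from the dataset.

```python
import csv
import io

# Made-up stand-ins for the two CSV files; only the column names come from the dataset.
BALLS_CSV = """match_id,innings,ball,batting_team,bowling_team,runs_off_bat,extras
1,1,0.1,Team A,Team B,4,0
1,2,0.1,Team B,Team A,0,1
2,1,0.1,Team B,Team A,1,0
"""
INFO_CSV = """id,team1,team2,result,winner
1,Team A,Team B,normal,Team A
2,Team A,Team B,no result,
"""

def label_rows(balls_text, info_text):
    """Attach batting_team_won (1/0) to each ball row; skip matches without a winner."""
    winners = {row["id"]: row["winner"] for row in csv.DictReader(io.StringIO(info_text))}
    labeled = []
    for row in csv.DictReader(io.StringIO(balls_text)):
        winner = winners.get(row["match_id"], "")
        if not winner:  # ties and no-results carry no usable label
            continue
        row["batting_team_won"] = int(row["batting_team"] == winner)
        labeled.append(row)
    return labeled

rows = label_rows(BALLS_CSV, INFO_CSV)
for r in rows:
    print(r["match_id"], r["innings"], r["batting_team"], r["batting_team_won"])
```

With the real files loaded in pandas, the same join could be expressed as a DataFrame merge on match_id and id.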

Research goal:

Match Outcome Prediction -

Use historical and current match data with logistic regression classification to predict the winner of a cricket match from live match statistics (e.g., runs scored, wickets fallen, overs bowled) that update every minute. Each time a statistic is updated, the prediction will update as well.
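
The live statistics named above (runs scored, wickets fallen, overs bowled) can be derived from the ball-by-ball rows seen so far. The sketch below shows one way to do that. The function name match_state and the sample deliveries are hypothetical. It also assumes that a non-empty wicket_type marks a dismissal and that missing values are absent or None, so pandas NaN values would need converting first.

```python
def match_state(deliveries):
    """Summarize an innings so far from ball-by-ball rows (dicts keyed by dataset column names)."""
    runs = sum(d["runs_off_bat"] + d["extras"] for d in deliveries)
    # Assumption: a non-empty wicket_type marks a dismissal on that delivery.
    wickets = sum(1 for d in deliveries if d.get("wicket_type"))
    # Wides and no-balls are not legal deliveries, so they do not advance the over.
    legal_balls = sum(1 for d in deliveries if not d.get("wides") and not d.get("noballs"))
    overs_completed, balls_in_over = divmod(legal_balls, 6)
    return {
        "runs": runs,
        "wickets": wickets,
        "overs": f"{overs_completed}.{balls_in_over}",
        "run_rate": round(runs * 6 / legal_balls, 2) if legal_balls else 0.0,
    }

# Made-up deliveries for illustration: a four, a wide, a wicket, and a single.
sample = [
    {"runs_off_bat": 4, "extras": 0},
    {"runs_off_bat": 0, "extras": 1, "wides": 1},
    {"runs_off_bat": 0, "extras": 0, "wicket_type": "bowled"},
    {"runs_off_bat": 1, "extras": 0},
]
state = match_state(sample)
print(state)
```

Recomputing this summary after every new delivery gives the model a fresh feature vector each time the mocked feed advances.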

Implementation

Analysis Plan

We will first preprocess the dataset by cleaning and converting it, then analyze the results of past matches between the two teams in the match we are currently focused on. Using feature engineering, we will extract pertinent features such as the location, the bowling and batting teams, the number of runs scored, and the number of wickets taken. After dividing the dataset into training and testing sets, we will train the selected model on the training set. Metrics such as accuracy, precision, recall, and F1-score will be used to measure the model’s predictive power on the testing data. Finally, we will use the trained model to forecast the match result from the features extracted from the data, which will shed light on the likely outcome of the match. The model’s prediction will be displayed on a webpage that refreshes every time a new stat arrives, so the prediction updates along with it.
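
The split, train, and evaluate steps of this plan can be sketched end to end. The example below is a dependency-free illustration on synthetic data, not the project's model: the two features, the labeling rule, the learning rate, and all function names are assumptions made for the sketch. In practice a library implementation such as scikit-learn's LogisticRegression would likely replace the hand-written gradient descent.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class (here: the batting team wins)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Synthetic data (not from the ODI dataset): two features scaled to [-1, 1],
# with a made-up rule deciding the label.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if x0 + x1 > 0 else 0 for x0, x1 in X]

# Hold out the last 25% of rows for testing.
split = int(0.75 * len(X))
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

w, b = train_logistic(X_train, y_train)
y_pred = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X_test]
metrics = classification_metrics(y_test, y_pred)
print(metrics)
```

For the real data, splitting by match rather than by row would keep deliveries from the same match out of both sets at once, which is worth considering when interpreting the test metrics.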

Mocking Live Data:

To avoid unnecessary costs associated with real-time data, we will split the data into two parts: past data and live data.

The past data will cover the years 2002 to 2022, while the live data will consist of matches from 2023. Each 2023 entry will be read from the CSV file and inserted into a database table, with an interval of 10 to 20 seconds between consecutive entries. These entries will be treated as live data and returned to the API caller.
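
The replay described above can be sketched with the standard library alone. In this illustration the sample rows, the table name live_deliveries, and the helper names split_past_live and replay are all hypothetical, and start_date is assumed to be in YYYY-MM-DD form as in the data preview. The delays are set to zero in the demo so it finishes immediately; the defaults reflect the 10 to 20 second interval.

```python
import csv
import io
import random
import sqlite3
import time

# Made-up rows standing in for ODI_Match_Data.csv.
SAMPLE_CSV = """match_id,start_date,innings,ball,runs_off_bat
10,2022-11-30,1,0.1,1
11,2023-01-10,1,0.1,4
11,2023-01-10,1,0.2,0
"""

def split_past_live(rows, live_year=2023):
    """Rows dated before live_year are training history; the rest are replayed as live."""
    past = [r for r in rows if int(r["start_date"][:4]) < live_year]
    live = [r for r in rows if int(r["start_date"][:4]) >= live_year]
    return past, live

def replay(live_rows, min_delay=10.0, max_delay=20.0, sleep=time.sleep):
    """Yield rows one at a time, pausing a random interval between consecutive rows."""
    for i, row in enumerate(live_rows):
        if i:  # no pause before the first row
            sleep(random.uniform(min_delay, max_delay))
        yield row

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
past, live = split_past_live(rows)

# In-memory database for the demo; a real back end would use a persistent one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE live_deliveries (match_id TEXT, innings TEXT, ball TEXT, runs_off_bat TEXT)"
)
for row in replay(live, min_delay=0, max_delay=0):
    conn.execute(
        "INSERT INTO live_deliveries VALUES (?, ?, ?, ?)",
        (row["match_id"], row["innings"], row["ball"], row["runs_off_bat"]),
    )
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM live_deliveries").fetchone()[0]
print(len(past), "past rows;", count, "live rows inserted")
```

Passing the sleep function as a parameter keeps the generator easy to test, since a no-op can stand in for the real delay.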

Back-End API

GET /current-performance

Live Data: Shows the live (mocked) stats of an ongoing match between Team A and Team B
Prediction: The predicted winner between Team A and Team B

These APIs are subject to change, additions, or removals as we analyze the data.
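
To make the two parts of the GET /current-performance response concrete, here is one possible shape for its JSON body. Every field name below (live_data, prediction, predicted_winner, win_probability) and the function name are assumptions for illustration, as are the sample numbers; the proposal itself only specifies that the endpoint returns live data and a prediction.

```python
import json

def current_performance_payload(team_a, team_b, live_stats, team_a_win_prob):
    """Build a hypothetical response body for GET /current-performance."""
    predicted = team_a if team_a_win_prob >= 0.5 else team_b
    return {
        "live_data": {"team_a": team_a, "team_b": team_b, **live_stats},
        "prediction": {
            "predicted_winner": predicted,
            "win_probability": {
                team_a: round(team_a_win_prob, 3),
                team_b: round(1 - team_a_win_prob, 3),
            },
        },
    }

# Made-up match state and probability, for illustration only.
payload = current_performance_payload(
    "Team A", "Team B", {"runs": 120, "wickets": 3, "overs": "20.0"}, 0.64
)
body = json.dumps(payload, indent=2)
print(body)
```

Returning the probability alongside the predicted winner would let the dashboard plot how the forecast moves during the match, not just the current favorite.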

Schedule

Task	Priority (as of 4/10)	Responsible	Due
Finalize proposal w/ changes	High	Christian, Srinivasan, Akash	4/10
EDA/Data Selection	High	Rajitha, Praveen, Christian	4/13
Feature Engineering	Medium	Akash, Rajitha, Gowtham, Tejas	4/17
Model Training, Validation, and Comparison	Medium	Christian, Praveen, Srinivasan, Gowtham	4/19
Model Integration with API from 526	Low	Srinivasan, Tejas, Praveen	4/22
Write Up Outline	Low	Rajitha, Tejas, Akash, Gowtham	4/24
Presentation/ Write Up Final Draft	Low	Entire Team	4/31
Review	Low	Entire Team	4/31-5/5
Presentation	Low	Entire Team	5/6

References

Utkarsh Tomar. (2023). ODI Men’s Cricket Match Data (2002-2023) [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DS/3780212
