Prediction of ODI Outcomes

Proposal

Our project predicts the match results, player performances, and significant events by using historical ODI data and state-of-the-art machine learning. It redefines cricket analysis by closely monitoring forecast accuracy and encouraging user engagement.
Author
Affiliation

DaakuDataSingh

School of Information, University of Arizona

Introduction

Since its inception in the late 16th century, cricket has established itself as once of the most popular sports on the planet with yearly viewers numbering in the billions. With so many viewers, prediction of match outcomes have become a hot topic for enthusiasts and sports bettors alike. We plan to use “live” data from 2023 to predict the outcome of the match based on a logistic regression model that takes into account features that are dynamically changing. With each match “update” we will feed the new data into our model to update the prediction it is making. By doing so we can dynamically see how the outcome of a game changes as the match progresses.

Why the ODI dataset

  • Cricket enjoys widespread popularity worldwide, particularly among cricket enthusiasts, making it a sport with broad appeal.

  • The ODI dataset is rich in detail, encompassing player statistics, match outcomes, team performances, and other relevant information.

  • With the ODI dataset, we can conduct thorough comparative analyses across players, teams, and historical eras. These comparisons can uncover trends, patterns, and changes, offering valuable insights into the sport’s development over time.

High-level Goal

In this project we will create a dashboard of a live, ongoing cricket match that updates and predict its plots and statistics at regular intervals.

Goal Details

Motivation:

For this project we wanted to showcase our prediction skill-set with the challenge of a live dataset, and what better live dataset to use than sports, so we went with our favorite sport - cricket. Our idea is to compare and contrast a predicted trend with the actual live trend. For the purposes of demonstration we won’t use an actual real-time data as we found that there would be unwanted associated costs, so we will instead use a ready made latest dataset, where our back-end server will automatically update the data table and communicate it to the front-end at regular intervals, essentially mocking a live match.

Description of datasets:

We’ll use two datasets:

  1. ODI_Match_Data.csv: Provides facts about the location and season of the cricket matches along with team information and the play results from each team member. We’ll need this one to investigate partnerships between batsmen. It’s dimensions are 155432 rows of data by 23 variable columns. The data that appears in this proposal is a truncated version for ease of storage, but the project will utilize an API that will supply the entire dataset.

    Variables that we’ll use:

    Variable Name Data type Description
    match_id double A unique identifier for each ODI cricket match.
    season character The season in which the match took place
    start_date character The date on which the match started.
    innings double The innings number (1st innings or 2nd innings).
    ball double A numeric representation of the ball number bowled in the innings.
    batting_team character The name of the batting team for the current innings.
    bowling_team character The name of the bowling team for the current innings.
    striker character The batsman who is currently facing the ball.
    non_striker character The batsman at the non-striker’s end.
    bowler character The bowler who is delivering the ball.
    runs_off_bat double The number of runs scored off the bat (excluding extras).
    extras double The total number of extra runs (wides, no-balls, byes, leg-byes, penalty) in the current ball.
    wides double The number of wide deliveries bowled in the current ball.
    noballs double The number of no-ball deliveries bowled in the current ball.
    byes double The number of byes scored in the current ball.
    legbyes double The number of leg-byes scored in the current ball.
  2. ODI_Match_info.csv: Overlaps in data with the above but provides information on the umpire, performance, and the city the match took place. We’ll need this one to analyze the batting and bowling performance of each player. It’s dimensions are 2380 rows of data by 18 variable columns.

    Variables that we’ll use:

    Variable Name Data type Description
    id double A unique identifier for each cricket match.
    season character The season in which the match took place
    date date The date on which the match was played.
    team1 character The name of the first cricket team participating in the match.
    team2 character The name of the second cricket team participating in the match.
    result character The result of the match (e.g., “normal,” “tie,” “no result”).
    winner character The winning team of the match.
    win_by_runs double The margin of victory in runs (0 for wickets, if not applicable).
    win_by_wickets double The margin of victory in wickets (0 for runs, if not applicable).
import pandas as pd
data = pd.read_csv('data/ODI_Match_Data.csv')
print(data.head())
   match_id   season  start_date                           venue  innings  \
0   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
1   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
2   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
3   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   
4   1389389  2023/24  2023-09-24  Holkar Cricket Stadium, Indore        1   

   ball batting_team bowling_team     striker   non_striker  ... wides  \
0   0.1        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
1   0.2        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
2   0.3        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
3   0.4        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   
4   0.5        India    Australia  RD Gaikwad  Shubman Gill  ...   NaN   

   noballs  byes  legbyes  penalty  wicket_type  player_dismissed  \
0      NaN   NaN      NaN      NaN          NaN               NaN   
1      NaN   NaN      NaN      NaN          NaN               NaN   
2      NaN   NaN      NaN      NaN          NaN               NaN   
3      NaN   NaN      NaN      NaN          NaN               NaN   
4      NaN   NaN      NaN      NaN          NaN               NaN   

   other_wicket_type other_player_dismissed cricsheet_id  
0                NaN                    NaN      1389389  
1                NaN                    NaN      1389389  
2                NaN                    NaN      1389389  
3                NaN                    NaN      1389389  
4                NaN                    NaN      1389389  

[5 rows x 23 columns]

Research goal:

Match Outcome Prediction -

Use historical and current match data with logistic regression classification to predict the winner of a cricket match based on the live match statistics (e.g., runs scored, wickets fallen, overs bowled), that update every minute. Each time a statistic is updated, the prediction will update as well.

Implementation :

Analysis Plan

We will first preprocess the dataset by cleaning and converting it before analyzing the results of matches between two teams, depending on the current match we are focused on. Using feature engineering, we will extract pertinent aspects like the location, the bowling and batting teams, the number of runs scored, the number of wickets taken, etc. After dividing the dataset into training and testing sets, the selected model will be trained using the training set. Metrics uch as accuracy, precision, recall, or F1-score will be used to measure the model’s predictive power on the testing data after training. In the end, we will use the trained model to forecast the match result based on the attributes that were taken out of the data. This strategy will shed light on the likely outcome of the match. This model will be displayed on a webpage which will refresh every time a new stat is updated, and the prediction from the model will update.

Mocking Live Data :

To avoid unnecessary costs associated with real-time data, we will split the data into two parts: past data and live data.

The past data will include information from years 2002 to 2022, while the live data will consist of data from the year 2023. Each entry from 2023 will be read from the actual CSV file and entered into a database table with an interval of 10 to 20 seconds between two consecutive entries. These entries will be considered as live data and will be sent to the API caller.

Back-End API

GET /current-performance

  • Live Data: Will show the live stats (mocked) of an ongoing match between team A and B

  • Prediction: Who will win out of Team A and B

These APIs are subject to change, additions, or removals as we analyze the data.

Schedule

Task Priority (as of 4/10) Responsible Due
Finalize proposal w/ changes High Christian, Srinivasan, Akash 4/10
EDA/Data Selection High Rajitha, Praveen, Christian 4/13
Feature Engineering Medium Akash, Rajitha, Gowtham, Tejas 4/17
Model Training , Validation, and Comparison Medium Christian, Praveen, Srinivasan, Gowtham 4/19
Model Integration with API from 526 Low Srinivasan, Tejas, Praveen 4/22
Write Up Outline Low Rajitha, Tejas, Akash, Gowtham 4/24
Presentation/ Write Up Final Draft Low Entire Team 4/31
Review Low Entire Team 4/31-5/5
Presentation Low Entire Team 5/6

References

  1. Utkarsh Tomar. (2023). ODI Men’s Cricket Match Data (2002-2023) [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DS/3780212