Cricket Match Results Forecasting

INFO 523 - Project Final

Using Prediction Model to predict Live ODI Cricket Match Winners
Author: Daaku Data Singh
Affiliation: School of Information, University of Arizona

Abstract

Our project predicts match results using historical ODI cricket data and machine learning classifiers, and it monitors forecast accuracy along the way. The primary goal is to train classification models (logistic regression and a random forest) on historical match data, select the best-performing model, and use it to predict the winner of a cricket match from live match statistics (e.g., runs scored, wickets fallen, overs bowled) that update at regular intervals. The prediction is displayed on a webpage that refreshes whenever a new statistic arrives, so the predicted winner updates with it.

Introduction

For this project we wanted to showcase our prediction skill set on a live dataset, and sports data is a natural fit, so we chose our favorite sport: cricket. The historical data covers ODI matches from 2002 to 2022, while the live data consists of matches from 2023. Each 2023 entry is read from the CSV file and inserted into a database table, with an interval of 10 to 20 seconds between consecutive entries. These entries are treated as live data and sent to the API caller.
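The simulated live feed described above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the table name `live_balls`, the function `stream_rows`, the column subset, and the sample rows are all assumptions made for the example, and SQLite stands in for whichever database the application uses.

```python
import csv
import io
import random
import sqlite3
import time


def stream_rows(csv_file, conn, min_delay=10.0, max_delay=20.0):
    """Insert CSV rows into a table one at a time, pausing between inserts."""
    reader = csv.DictReader(csv_file)
    # Hypothetical table holding a small subset of the ball-by-ball columns
    conn.execute(
        "CREATE TABLE IF NOT EXISTS live_balls "
        "(match_id INTEGER, innings INTEGER, ball REAL, runs_off_bat INTEGER)"
    )
    inserted = 0
    for row in reader:
        conn.execute(
            "INSERT INTO live_balls VALUES (?, ?, ?, ?)",
            (
                int(row["match_id"]),
                int(row["innings"]),
                float(row["ball"]),
                int(row["runs_off_bat"]),
            ),
        )
        conn.commit()
        inserted += 1
        # Wait a random 10 to 20 seconds (by default) before the next "live" entry
        time.sleep(random.uniform(min_delay, max_delay))
    return inserted


# Made-up stand-in for the 2023 CSV; delays are set to 0 so the demo returns at once
sample_csv = io.StringIO(
    "match_id,innings,ball,runs_off_bat\n"
    "1,1,0.1,4\n"
    "1,1,0.2,0\n"
    "1,1,0.3,1\n"
)
conn = sqlite3.connect(":memory:")
rows_inserted = stream_rows(sample_csv, conn, min_delay=0.0, max_delay=0.0)

# The API caller would read the most recent entry in a similar way
latest = conn.execute(
    "SELECT ball, runs_off_bat FROM live_balls ORDER BY rowid DESC LIMIT 1"
).fetchone()
print(rows_inserted, latest)
```

Committing after every insert means a separate reader process sees each ball as soon as it lands, which is what makes the table behave like a live source.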

We’ll use two datasets:

ODI_Match_Data.csv: Provides facts about the location and season of the cricket matches, along with team information and the play results from each team member. We’ll need this one to investigate partnerships between batsmen. Its dimensions are 155432 rows by 23 columns. The data that appears in this proposal is a truncated version for ease of storage, but the project will utilize an API that will supply the entire dataset.

ODI_Match_info.csv: Overlaps with the file above but also provides information on the umpires, performance, and the city where the match took place. We’ll need this one to analyze the batting and bowling performance of each player. Its dimensions are 2380 rows by 18 columns.

Setup

Code
import pandas as pd
import glob
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import pickle
import joblib

EDA

Here, we first carried out Data Preprocessing where we merged the match_data and match_info files, then carried out Exploratory Data Analysis by using pairplots to find correlations between different variables, if any. Additionally, we determined the size, data types, sum of NAs, and some descriptive statistics to provide a deeper understanding of the dataset and guide feature engineering.

Code
info = pd.read_csv('data/ODI_Match_info.csv')
info = info.rename(columns = {'id':'match_id'})

#append all files together
csv_files = ['data/output_1.csv','data/output_2.csv','data/output_3.csv','data/output_4.csv','data/output_5.csv','data/output_6.csv','data/output_7.csv','data/output_8.csv','data/output_9.csv']

matchData = pd.concat([pd.read_csv(f, low_memory=False) for f in csv_files], ignore_index=True) #we import the data as separate csvs to overcome GitHub's size limitations, then concatenate them all

#merge frames on match ID column

totalData = pd.merge(matchData, info, on = 'match_id') #merge by identical column 'match_id'
totalData.drop(totalData.filter(regex='_y$').columns, axis=1, inplace=True) #drop duplicate columns

totalData = totalData.rename(columns = {'season_x':'season', 'venue_x':'venue'})

from02to22 = totalData[~totalData['season'].astype(str).str.startswith(('2023/2024','2023', '2022/23'))] #exclude seasons 2022/23, 2023 and 2023/2024 (held out as the live data)

from02to22
print(type(from02to22)) #confirm data is read in as a df
print(from02to22.shape) #confirm data shape
print(from02to22.dtypes) #understand the types of data in the df
print(from02to22.isna().sum()) #count NA values in columns
print(from02to22.describe()) #descriptive statistics for the numeric columns


winners = sns.countplot(data = from02to22, y = 'winner', order=from02to22['winner'].value_counts().index)
winners

corr = sns.pairplot(from02to22)
corr
<class 'pandas.core.frame.DataFrame'>
(1170917, 38)
match_id                    int64
season                     object
start_date                 object
venue                      object
innings                     int64
ball                      float64
batting_team               object
bowling_team               object
striker                    object
non_striker                object
bowler                     object
runs_off_bat                int64
extras                      int64
wides                     float64
noballs                   float64
byes                      float64
legbyes                   float64
penalty                   float64
wicket_type                object
player_dismissed           object
other_wicket_type         float64
other_player_dismissed    float64
cricsheet_id                int64
city                       object
date                       object
team1                      object
team2                      object
toss_winner                object
toss_decision              object
result                     object
dl_applied                  int64
winner                     object
win_by_runs                 int64
win_by_wickets              int64
player_of_match            object
umpire1                    object
umpire2                    object
umpire3                    object
dtype: object
match_id                        0
season                          0
start_date                      0
venue                           0
innings                         0
ball                            0
batting_team                    0
bowling_team                    0
striker                         0
non_striker                     0
bowler                          0
runs_off_bat                    0
extras                          0
wides                     1144180
noballs                   1166089
byes                      1169151
legbyes                   1158834
penalty                   1170904
wicket_type               1139201
player_dismissed          1139201
other_wicket_type         1170917
other_player_dismissed    1170917
cricsheet_id                    0
city                       164797
date                            0
team1                           0
team2                           0
toss_winner                     0
toss_decision                   0
result                          0
dl_applied                      0
winner                      31969
win_by_runs                     0
win_by_wickets                  0
player_of_match             47608
umpire1                         0
umpire2                         0
umpire3                    126959
dtype: int64
           match_id       innings          ball  runs_off_bat        extras  \
count  1.170917e+06  1.170917e+06  1.170917e+06  1.170917e+06  1.170917e+06   
mean   6.599334e+05  1.457323e+00  2.268423e+01  7.865784e-01  4.903934e-02   
std    4.018850e+05  4.982441e-01  1.382769e+01  1.249957e+00  2.941420e-01   
min    6.481400e+04  1.000000e+00  1.000000e-01  0.000000e+00  0.000000e+00   
25%    2.990100e+05  1.000000e+00  1.060000e+01  0.000000e+00  0.000000e+00   
50%    5.730140e+05  1.000000e+00  2.210000e+01  0.000000e+00  0.000000e+00   
75%    1.104478e+06  2.000000e+00  3.420000e+01  1.000000e+00  0.000000e+00   
max    1.331370e+06  4.000000e+00  4.990000e+01  7.000000e+00  6.000000e+00   

              wides      noballs         byes       legbyes  penalty  \
count  26737.000000  4828.000000  1766.000000  12083.000000     13.0   
mean       1.202304     1.038318     2.063420      1.369941      5.0   
std        0.789166     0.328190     1.314314      0.884506      0.0   
min        1.000000     1.000000     1.000000      1.000000      5.0   
25%        1.000000     1.000000     1.000000      1.000000      5.0   
50%        1.000000     1.000000     1.000000      1.000000      5.0   
75%        1.000000     1.000000     4.000000      1.000000      5.0   
max        5.000000     5.000000     4.000000      5.000000      5.0   

       other_wicket_type  other_player_dismissed  cricsheet_id    dl_applied  \
count                0.0                     0.0  1.170917e+06  1.170917e+06   
mean                 NaN                     NaN  6.599334e+05  7.751531e-02   
std                  NaN                     NaN  4.018850e+05  2.674075e-01   
min                  NaN                     NaN  6.481400e+04  0.000000e+00   
25%                  NaN                     NaN  2.990100e+05  0.000000e+00   
50%                  NaN                     NaN  5.730140e+05  0.000000e+00   
75%                  NaN                     NaN  1.104478e+06  0.000000e+00   
max                  NaN                     NaN  1.331370e+06  1.000000e+00   

        win_by_runs  win_by_wickets  
count  1.170917e+06    1.170917e+06  
mean   3.503768e+01    2.651942e+00  
std    5.257946e+01    3.139602e+00  
min    0.000000e+00    0.000000e+00  
25%    0.000000e+00    0.000000e+00  
50%    0.000000e+00    0.000000e+00  
75%    5.900000e+01    6.000000e+00  
max    2.750000e+02    1.000000e+01  

Data Cleaning

Here, during data cleaning, we dropped columns that were largely empty (at least 1,000,000 missing values). We also created a new variable, winnerTeam, that records whether team1 or team2 won the match, and recoded toss_winner the same way, so that the outcome we predict is binary. Missing values in the categorical columns city, player_of_match, and umpire3 are replaced with ‘Unknown’ so that they can be encoded as an additional category rather than dropping the row entirely, while rows with no recorded winner are dropped. Finally, we dropped match identifiers and columns that are only supplied after the match, which would not be available when predicting a live, ongoing match, and created another pairplot with the newly cleaned data to look for correlations.

Code
#drop columns that have more than 1Million NaNs

colNaCounts = from02to22.isna().sum()


columns_to_drop = colNaCounts[colNaCounts >= 1000000].index.tolist()

# Drop identified columns from the DataFrame
from02to22 = from02to22.drop(columns=columns_to_drop)


#revalue new winner column

from02to22['winnerTeam'] = from02to22.apply(lambda row: 'team1' if row['winner'] == row['team1'] else 'team2', axis=1)

from02to22['toss_winner'] = from02to22.apply(lambda row: 'team1' if row['toss_winner'] == row['team1'] else 'team2', axis=1)



#convert Nan cities to 'Unknown'
#drop rows where winner is NA
#convert NA player of match to 'unknown'
#convert NA umpire 3 to 'unknown'

from02to22['city'] = from02to22['city'].fillna('Unknown') 
from02to22['player_of_match'] = from02to22['player_of_match'].fillna('Unknown') 
from02to22['umpire3'] = from02to22['umpire3'].fillna('Unknown') 
from02to22 = from02to22.dropna(subset=['winner'])
from02to22 = from02to22.drop(columns = ['match_id', 'start_date', 'date', 'winner', 'cricsheet_id', 'season', 'venue', 'city', 'player_of_match', 'win_by_runs', 'win_by_wickets', 'umpire1', 'umpire2', 'umpire3', 'result']) #is date specific data really useful? also drop continuous match identifiers. We want the match stats
corr = sns.pairplot(from02to22)
corr
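The row-wise `apply` calls above run a Python lambda once per row, which is slow on more than a million rows. A vectorized equivalent can be sketched with `np.where`. The frame below is made-up toy data, and dropping the undecided matches before recoding is a choice made for this sketch so that a missing winner is never labelled 'team2' by default.

```python
import numpy as np
import pandas as pd

# Made-up toy matches; the last one has no recorded winner
matches = pd.DataFrame({
    "team1": ["India", "Australia", "Kenya"],
    "team2": ["England", "India", "Nepal"],
    "winner": ["India", "India", None],
    "toss_winner": ["England", "Australia", "Nepal"],
})

# Drop undecided matches first, then recode on the remaining rows
decided = matches.dropna(subset=["winner"]).copy()

# Compare whole columns at once instead of calling a lambda per row
decided["winnerTeam"] = np.where(
    decided["winner"] == decided["team1"], "team1", "team2"
)
decided["toss_winner"] = np.where(
    decided["toss_winner"] == decided["team1"], "team1", "team2"
)
print(decided[["winnerTeam", "toss_winner"]])
```

Working on an explicit `.copy()` also avoids the chained-assignment warning pandas can raise when new columns are assigned to a filtered slice.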

Below we perform encoding and data scaling to use later in our web application.

Feature engineering for model implementation

Code
import joblib

to_norm = (from02to22.select_dtypes(include =['int64', 'float64']))
to_encode = (from02to22.select_dtypes(include =['object']))

cricDataSet = from02to22.copy()

label_encoder = LabelEncoder()
min_max_scaler = MinMaxScaler()

label_encoders = {}

for col in cricDataSet.columns:
    if cricDataSet[col].dtype == 'object':  # Check if column is categorical
        label_encoders[col] = LabelEncoder()
        label_encoders[col].fit(cricDataSet[col])
        cricDataSet[col] = label_encoders[col].transform(cricDataSet[col])

# Save label encoder and scaler objects to disk
for col, encoder in label_encoders.items():
    joblib.dump(encoder, f'data/label_encoder_{col}.joblib')

cricDataSet[['innings', 'ball', 'runs_off_bat', 'extras', 'dl_applied']] = min_max_scaler.fit_transform(cricDataSet[['innings', 'ball', 'runs_off_bat', 'extras', 'dl_applied']])


joblib.dump(label_encoders, 'data/label_encoder.joblib') #save the dict of fitted per-column encoders; the standalone label_encoder above is never fitted
joblib.dump(min_max_scaler, 'data/min_max_scaler.joblib')

from02to22.head()
       innings  ball  batting_team      bowling_team     striker  non_striker    bowler  runs_off_bat  extras    team1             team2  toss_winner  toss_decision  dl_applied  winnerTeam
94186        1   0.1       Namibia  Papua New Guinea  L Louwrens    D la Cock  N Pokana             4       0  Namibia  Papua New Guinea        team1            bat           0       team1
94187        1   0.2       Namibia  Papua New Guinea  L Louwrens    D la Cock  N Pokana             0       1  Namibia  Papua New Guinea        team1            bat           0       team1
94188        1   0.3       Namibia  Papua New Guinea  L Louwrens    D la Cock  N Pokana             0       0  Namibia  Papua New Guinea        team1            bat           0       team1
94189        1   0.4       Namibia  Papua New Guinea  L Louwrens    D la Cock  N Pokana             0       0  Namibia  Papua New Guinea        team1            bat           0       team1
94190        1   0.5       Namibia  Papua New Guinea  L Louwrens    D la Cock  N Pokana             0       0  Namibia  Papua New Guinea        team1            bat           0       team1
Code
cricDataSet
         innings      ball  batting_team  bowling_team  striker  non_striker  bowler  runs_off_bat    extras  team1  team2  toss_winner  toss_decision  dl_applied  winnerTeam
  94186      0.0  0.000000            13            19      819          339     810      0.571429  0.000000     11     18            0              0         0.0           0
  94187      0.0  0.002008            13            19      819          339     810      0.000000  0.166667     11     18            0              0         0.0           0
  94188      0.0  0.004016            13            19      819          339     810      0.000000  0.000000     11     18            0              0         0.0           0
  94189      0.0  0.006024            13            19      819          339     810      0.000000  0.000000     11     18            0              0         0.0           0
  94190      0.0  0.008032            13            19      819          339     810      0.000000  0.000000     11     18            0              0         0.0           0
    ...      ...       ...           ...           ...      ...          ...     ...           ...       ...    ...    ...          ...            ...         ...         ...
1265098      1.0  0.853414            10            16      622         1600     867      0.000000  0.000000     14      9            1              1         0.0           0
1265099      1.0  0.863454            10            16     1611          618     583      0.142857  0.000000     14      9            1              1         0.0           0
1265100      1.0  0.865462            10            16      622         1600     583      0.000000  0.000000     14      9            1              1         0.0           0
1265101      1.0  0.867470            10            16     1611           15     583      0.285714  0.000000     14      9            1              1         0.0           0
1265102      1.0  0.869478            10            16     1611           15     583      0.000000  0.000000     14      9            1              1         0.0           0

1138948 rows × 15 columns
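The scaled values in the table above follow from the min-max formula x' = (x - min) / (max - min). The short check below reproduces three of them by hand. The helper name `min_max_scale` is made up for this example, and the column ranges are taken from the `describe()` output in the EDA section, on the assumption that they are unchanged after cleaning.

```python
def min_max_scale(value, col_min, col_max):
    """Map a value from the range [col_min, col_max] onto [0, 1]."""
    return (value - col_min) / (col_max - col_min)


# Ranges from the describe() output: ball 0.1 to 49.9, runs_off_bat 0 to 7, extras 0 to 6
ball_scaled = min_max_scale(0.2, 0.1, 49.9)   # second ball of row 94187
runs_scaled = min_max_scale(4, 0, 7)          # four runs in row 94186
extras_scaled = min_max_scale(1, 0, 6)        # one extra in row 94187

print(round(ball_scaled, 6), round(runs_scaled, 6), round(extras_scaled, 6))
# prints 0.002008 0.571429 0.166667
```

Because the scaler stores the training minimum and maximum, live values have to be transformed with the saved scaler rather than a freshly fitted one, otherwise the same ball would map to a different number.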

Feature Engineering for initial model testing

The next step is feature engineering, where we:
- Normalized the continuous data
- Encoded the categorical variables

Code
#normalize continuous data because we don't have a normal distribution

to_norm = (from02to22.select_dtypes(include =['int64', 'float64'])) #select continuous variables

continuous = MinMaxScaler().fit_transform(to_norm) #fit and transform min max scaler (normalizes)
continuous = pd.DataFrame(continuous, index = from02to22.index, columns = list(to_norm))

#encode categorical variables

to_encode = (from02to22.select_dtypes(include =['object'])) #test if anything is an object or category variable

label_encoders = [] #make new encoder for each column
encoded_data = pd.DataFrame()

encoding_dicts = {}

# Iterate over each column in 'to_encode' and encode using a separate LabelEncoder
for col in to_encode:
    # Create a new instance of LabelEncoder for the current column
    encoder = LabelEncoder()
    
    # Fit and transform the data in 'from02to22[col]'
    encoded_data[col] = encoder.fit_transform(from02to22[col])
    
    # Store the encoder in the list
    label_encoders.append(encoder)

    encoding_dicts[col] = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))


#patch the columns back together

data = pd.concat([encoded_data.reset_index(drop=True), continuous.reset_index(drop=True)], axis = 1) #reset indices to avoid errors in concat

Feature Selection

Having removed unnecessary columns during preprocessing and cleaning, we now use SelectKBest with the f_classif scoring function to pick the 10 best predictors of winnerTeam from the remaining 14 columns. winnerTeam itself is the target feature.

Code
xTemp = cricDataSet.drop('winnerTeam', axis = 1)
yTemp = cricDataSet['winnerTeam']

selector = SelectKBest(score_func= f_classif, k = 10)
top10 = selector.fit_transform(xTemp,yTemp) #create variable that is the top 10 best features
cols_idxs = selector.get_support(indices=True) #grab indices from feature cols, get_support is from **sklearn**
top10 = xTemp.iloc[:,cols_idxs] #add columns from whole dataset to the selected columns dataset https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
top10.columns
Index(['ball', 'batting_team', 'bowling_team', 'non_striker', 'bowler',
       'runs_off_bat', 'team1', 'team2', 'toss_winner', 'dl_applied'],
      dtype='object')
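To show how the selected column names can be carried forward into training, here is a small sketch on synthetic data. The column names `signal_1`, `noise_1`, and so on are invented for the example; only `SelectKBest`, `f_classif`, and `get_support` come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: two columns depend on the target, two are pure noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = pd.DataFrame({
    "signal_1": y * 5 + rng.normal(size=200),
    "noise_1": rng.normal(size=200),
    "signal_2": y * 3 + rng.normal(size=200),
    "noise_2": rng.normal(size=200),
})

# Score every column with the ANOVA F-test and keep the two highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected_cols = X.columns[selector.get_support()]

# Subset the frame so that a model fitted on X_selected uses only the chosen features
X_selected = X[selected_cols]
print(list(selected_cols))
```

Selecting by the boolean mask keeps the column names attached, which makes it easy to apply the same subset to the test data and to live rows later on.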
Code
cricDataSet[['innings', 'ball', 'runs_off_bat', 'extras', 'dl_applied']] = min_max_scaler.fit_transform(cricDataSet[['innings', 'ball', 'runs_off_bat', 'extras', 'dl_applied']])

joblib.dump(min_max_scaler, 'data/min_max_scaler.joblib')

cricDataSet.head()
       innings      ball  batting_team  bowling_team  striker  non_striker  bowler  runs_off_bat    extras  team1  team2  toss_winner  toss_decision  dl_applied  winnerTeam
94186      0.0  0.000000            13            19      819          339     810      0.571429  0.000000     11     18            0              0         0.0           0
94187      0.0  0.002008            13            19      819          339     810      0.000000  0.166667     11     18            0              0         0.0           0
94188      0.0  0.004016            13            19      819          339     810      0.000000  0.000000     11     18            0              0         0.0           0
94189      0.0  0.006024            13            19      819          339     810      0.000000  0.000000     11     18            0              0         0.0           0
94190      0.0  0.008032            13            19      819          339     810      0.000000  0.000000     11     18            0              0         0.0           0

Model Training (Logistic Regression)

The input features and target feature are now split into training and testing sets and passed to our first classification model, logistic regression. We chose an 80% training and 20% testing split so that the model sees as many potential relationships among the predictors as possible. Note that the code below trains on all 14 columns of cricDataSet other than winnerTeam, not only the 10 features selected above.

Code
X = cricDataSet.drop('winnerTeam', axis = 1) #input
y = cricDataSet['winnerTeam'] #target

#split data into training and testing


#break into 4 groups for testing and training, make the training dataset 80% of the data and the testing dataset 20% https://builtin.com/data-science/train-test-split
X_train, X_test, y_train, y_test = tts(X, y, test_size = 0.2) 

#training and testing models

#Logistic Regression
lr = LogisticRegression()

lr.fit(X_train, 
       y_train)
predictLR = lr.predict(X_test)

outcomeLR = pd.DataFrame ({'Actual': y_test, 'Predicted': predictLR})
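Because each row is a single ball, the random row-level split above places balls from the same match in both the training and testing sets, which can inflate test scores. One alternative is to split by match, sketched below with scikit-learn's `GroupShuffleSplit`. The toy arrays are made up, and the sketch assumes a copy of `match_id` is kept aside as the grouping key, since that column is dropped during cleaning.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up toy data: 10 matches with 6 balls each
match_ids = np.repeat(np.arange(10), 6)         # stands in for the match_id column
X = np.arange(len(match_ids)).reshape(-1, 1)    # placeholder feature matrix
y = match_ids % 2                               # placeholder binary target

# Hold out 20% of the matches (not 20% of the rows) for testing
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=match_ids))

train_matches = set(match_ids[train_idx])
test_matches = set(match_ids[test_idx])
print(sorted(test_matches))
```

With this split every test ball comes from a match the model has never seen, which is closer to the live setting where an entirely new match is being predicted.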

Model Validation

The results of logistic regression are now cross-validated and its performance metrics displayed. Accuracy is about 57% and weighted precision about 55%, while the area under the ROC curve is about 0.53, barely better than chance. Since these metrics suggest this is not the best model, we move on and train another. One benefit of this model is its training speed (roughly 4 seconds per fold), so with more predictive variables it might become a stronger candidate. Finally, the model does not converge: the maximum number of iterations is reached. Increasing the iteration limit did not help, as the same warning was raised.

Code
#Evaluating models
#logistic regression
cvLR = cross_validate(lr,X_train, y_train)
print("cross validation of logistic regression", cvLR)
print("accuracy of logistic regression:", metrics.accuracy_score(y_test, predictLR)) #test-set accuracy of the model
print("precision of logistic regression:", metrics.precision_score(y_test, predictLR, average = 'weighted'))
print("recall of logistic regression:", metrics.recall_score(y_test, predictLR, average = 'weighted'))
print("ROCAUC macro of logistic regression:", metrics.roc_auc_score(y_test, predictLR))
print("ROCAUC micro of logistic regression:", metrics.roc_auc_score(y_test, predictLR, average = 'micro'))


y_pred_proba = lr.predict_proba(X_test)[:, 1] #predicted probability of class 1; https://www.statology.org/plot-roc-curve-python/
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve Logistic Regression')
plt.show()
cross validation of logistic regression {'fit_time': array([4.39633799, 4.20979023, 4.32797837, 4.39308453, 4.1381259 ]), 'score_time': array([0.02576947, 0.02948046, 0.02303123, 0.02139449, 0.0218761 ]), 'test_score': array([0.57073401, 0.56974626, 0.56873107, 0.57021582, 0.56715378])}
accuracy of logistic regression: 0.5682207296193863
precision of logistic regression: 0.5548387218330451
recall of logistic regression: 0.5682207296193863
ROCAUC macro of logistic regression: 0.5313543518731261
ROCAUC micro of logistic regression: 0.5313543518731261
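The ROC AUC values printed above are computed from the hard class predictions (`predictLR`) rather than from predicted probabilities, and for a binary target the `average` argument has no effect, which is why the macro and micro values are identical. The sketch below uses made-up labels and scores to show how the two inputs can give different AUC values. It illustrates the metric only and does not re-run the model above.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted probabilities for the positive class
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.45, 0.4, 0.48, 0.9]

# Thresholding at 0.5 throws away the ranking information in the probabilities
hard_labels = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(round(auc_from_proba, 3), round(auc_from_labels, 3))
# prints 0.889 0.667
```

In the notebook the analogous call would pass `lr.predict_proba(X_test)[:, 1]` as the score, the same array already used for the ROC curve plot.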

Random Forest Classifier

Next we train a random forest classifier, cross-validate its results, and display its performance metrics. Accuracy and precision are about 99%, and the area under the ROC curve is also about 0.99. These metrics are far better than those of logistic regression, so we move forward with the random forest classifier. One drawback is that the model takes a long time to train (roughly 100 seconds per cross-validation fold), so we have also printed the accuracy metrics as markdown text and an image to save time.

Code
#random forest
rfc = RandomForestClassifier(criterion='gini')
rfc.fit(X_train,
        y_train)
predictRFC = rfc.predict(X_test)
outcomeRFC = pd.DataFrame ({'Actual': y_test, 'Predicted': predictRFC})


#Evaluating models
#random forest
cvRF = cross_validate(rfc,X_train, y_train)
print("cross validation of random forest", cvRF)
print("accuracy of random forest:", metrics.accuracy_score(y_test, predictRFC)) #test-set accuracy of the model
print("precision of random forest:", metrics.precision_score(y_test, predictRFC, average = 'weighted'))
print("recall of random forest:", metrics.recall_score(y_test, predictRFC, average = 'weighted'))
print("f1 of random forest:", metrics.f1_score(y_test, predictRFC, average = 'weighted'))
print("ROCAUC macro of random forest:", metrics.roc_auc_score(y_test, predictRFC))
print("ROCAUC micro of random forest:", metrics.roc_auc_score(y_test, predictRFC, average = 'micro'))


y_pred_proba = rfc.predict_proba(X_test)[:, 1] #predicted probability of class 1
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve Random Forest')
plt.show()
cross validation of random forest {'fit_time': array([114.77010131, 100.95979095, 100.71152282, 100.81103635,
       103.81996179]), 'score_time': array([4.20399642, 4.05240583, 4.09542727, 4.13592744, 4.09596825]), 'test_score': array([0.98771895, 0.98767505, 0.98768603, 0.9884597 , 0.98787802])}
accuracy of random forest: 0.9891654594143728
precision of random forest: 0.9891646541862661
recall of random forest: 0.9891654594143728
f1 of random forest: 0.9891647876625119
ROCAUC macro of random forest: 0.9889520519879526
ROCAUC micro of random forest: 0.9889520519879526

accuracy of random forest: 0.9899249308573687
precision of random forest: 0.9899241933153199
recall of random forest: 0.9899249308573687
f1 of random forest: 0.9899243403563959
ROCAUC macro of random forest: 0.9897132632925527
ROCAUC micro of random forest: 0.9897132632925527

Model Recreation for Main Application

From this point onwards, the code builds the files that are fed into our web application. We retrain the model with manually encoded variables so that each unique team and player receives the same code in the application as it did during training. This ensures the trees in the random forest receive the values they were trained on. Most of the process below is identical to the process above.

Label Encoding

We encode the columns that contain team names and player names and save the mappings to JSON files.

Code
import json

teams = pd.concat([from02to22['team1'], from02to22['team2']]).unique()

# Create a dictionary mapping each team to a unique number
team_map = {team: i+1 for i, team in enumerate(teams)}

players = pd.concat([from02to22['striker'], from02to22['non_striker'], from02to22['bowler']]).unique()

player_map = {player: i+1 for i, player in enumerate(players)}

with open('data/team_map.json', 'w') as f:
    json.dump(team_map, f)

with open('data/player_map.json', 'w') as f:
    json.dump(player_map, f)
Code
# Load the team map dictionary from the saved JSON file
with open('data/team_map.json', 'r') as f:
    team_map_loaded = json.load(f)

with open('data/player_map.json', 'r') as f:
    player_map_loaded = json.load(f)

temp = from02to22.copy()

team_map_loaded = dict(team_map_loaded)
player_map_loaded = dict(player_map_loaded)

temp['team1'] = temp['team1'].apply(lambda x: team_map_loaded.get(x))

temp['team2'] = temp['team2'].apply(lambda x: team_map_loaded.get(x))

temp['batting_team'] = temp['batting_team'].apply(lambda x: team_map_loaded.get(x))

temp['bowling_team'] = temp['bowling_team'].apply(lambda x: team_map_loaded.get(x))

temp['striker'] = temp['striker'].apply(lambda x: player_map_loaded.get(x))

temp['non_striker'] = temp['non_striker'].apply(lambda x: player_map_loaded.get(x))

temp['bowler'] = temp['bowler'].apply(lambda x: player_map_loaded.get(x))

temp['toss_winner'] = temp['toss_winner'].apply(lambda x: 0 if x == "team1" else 1)

temp['winnerTeam'] = temp['winnerTeam'].apply(lambda x: 0 if x == "team1" else 1)

temp['toss_decision'] = temp['toss_decision'].apply(lambda x: 0 if x == "bat" else 1)
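In the lookups above, `dict.get` returns None for any team or player that is missing from the saved maps, which can happen when a 2023 match features someone who never appeared in the 2002 to 2022 data. The sketch below shows one way to handle this. It is an assumption for illustration, not part of the original pipeline: because the maps number entries from 1, code 0 is unused and can be reserved for unknown names. The `encode` helper and the sample map are made up.

```python
import json

UNKNOWN_CODE = 0  # assumption: free to use because the saved maps start numbering at 1


def encode(name, mapping):
    """Return the saved code for a name, or UNKNOWN_CODE if the name was never seen."""
    return mapping.get(name, UNKNOWN_CODE)


# Made-up map, round-tripped through JSON the same way the saved files are
team_map = {"India": 1, "Australia": 2}
team_map_loaded = json.loads(json.dumps(team_map))

codes = [encode(name, team_map_loaded) for name in ["India", "Australia", "Some New Team"]]
print(codes)
# prints [1, 2, 0]
```

The model has never seen code 0 during training, so predictions for such rows should be treated with caution. The fallback only keeps a missing value from reaching the scaler or the model.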

Min - Max scaling

Scaling all the numerical features and saving the MinMaxScaler object as a .joblib file.

Code
min_max_scaler = MinMaxScaler()

temp[['innings', 'ball', 'runs_off_bat', 'extras', 'dl_applied']] = min_max_scaler.fit_transform(temp[['innings', 'ball', 'runs_off_bat', 'extras', 'dl_applied']])

joblib.dump(min_max_scaler, 'data/min_max_scaler.joblib')
['data/min_max_scaler.joblib']

Model retraining

We retrain the model with the encoded and scaled features and save it as a pickle file. This file will be used within the web application.

Code
cricDataSet = temp.copy()

X = cricDataSet.drop('winnerTeam', axis = 1) #input
y = cricDataSet['winnerTeam'] #target

X_train, X_test, y_train, y_test = tts(X, y, test_size = 0.2) 

# rfc = RandomForestClassifier(criterion='gini')
# rfc.fit(X_train, 
#           y_train)
Code
# Step 2: Save the trained model to a pickle file
# with open("data/cricketPrediction.pkl", "wb") as f:
#     pickle.dump(rfc, f)
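The commented-out save step above can be checked with a small round trip: train, pickle, reload, and confirm that the reloaded model gives the same predictions. The sketch below uses a tiny synthetic dataset and a temporary directory instead of the real `data/` folder. The data and the reduced `n_estimators` are assumptions chosen to keep the example fast.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic dataset standing in for the encoded and scaled cricket features
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "cricketPrediction.pkl"
    # Save the fitted model, then load it back as the web application would
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        reloaded = pickle.load(f)

same_predictions = bool((reloaded.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled scikit-learn models are only reliably loadable with the library version that wrote them, so the application environment should pin the same scikit-learn version as the notebook.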