Fraud Detection using Ensemble Learning

Proposal

Credit Card Fraud Detection Using Anomaly Detection Techniques
Authors

Roxana Akbarsharifi
Omid Zandi
Deema Albluwi
Gowtham Gopalakrishnan
Nandhini Anne
Sai Navya Reddy Busireddy

School of Information, University of Arizona

High Level Goal

The overarching goal is to evaluate how much the performance of machine learning models improves when they are combined through an ensemble method known as stacked generalization. To this end, we will apply the stacked model to a complex classification problem: detecting fraudulent credit card transactions.

Project Description and Motivation

Our project is motivated by the goal of accurately identifying fraudulent transactions using ensemble learning approaches (bagging, boosting, and stacking). We plan to use data mining and machine learning techniques to build a model capable of flagging transactions that stand out as potentially fraudulent. A key part of the project is comparing different machine learning models to see which performs best on the dataset we study, and understanding why certain models are more effective in specific situations.

We also aim to test the effectiveness of stacked generalization, commonly known as “stacking,” an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or meta-regressor. The base-level models are trained on the complete training set, and the meta-model is then trained on the outputs of the base-level models, used as features. Figure 1 illustrates the overall workflow of the stacking method; a minimal code sketch of this setup follows the figure.

Through this work, we can make a meaningful contribution to both the field of study and the practical efforts to secure financial transactions.

Fig 1. Overall flowchart of the stacking method, as implemented in the mlxtend library
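
The following is a minimal sketch of that workflow, assuming scikit-learn base estimators and mlxtend's StackingClassifier; the concrete base learners and meta-learner are placeholders and will be chosen later in the project.

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier

# Base-level models: each is trained on the complete training set.
base_learners = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
]

# Meta-model: trained on the base models' outputs used as features.
stack = StackingClassifier(
    classifiers=base_learners,
    meta_classifier=LogisticRegression(max_iter=1000),
    use_probas=True,  # pass predicted class probabilities to the meta-learner
)
# stack.fit(X_train, y_train); stack.predict(X_test)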

Dataset

We chose a dataset comprising over 550,000 credit card transactions made by European cardholders in 2023. This dataset is particularly appealing because of its substantial size, the anonymity of its features, and its real-world relevance. It includes 31 features: a unique transaction ID, 28 anonymized attributes (V1-V28) that could encompass various transaction details such as time, location, and merchant category, the transaction amount, and a binary class indicating whether the transaction was fraudulent. The data’s anonymization ensures privacy and ethical compliance, while the binary classification makes it suitable for supervised learning approaches in fraud detection.

One disadvantage of the dataset is that, because the features are anonymized, feature importance analysis such as SHAP (SHapley Additive exPlanations) or permutation feature importance provides little insight into the interpretability of the machine learning models. The class imbalance of the data is another issue that makes this a challenging problem; a quick check of the class distribution appears after the data summary below.

import numpy as np
import pandas as pd

# Load the 2023 credit card transactions dataset and preview the first rows
df = pd.read_csv('./data/creditcard_2023.csv')
df.head()
id V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0 -0.260648 -0.469648 2.496266 -0.083724 0.129681 0.732898 0.519014 -0.130006 0.727159 ... -0.110552 0.217606 -0.134794 0.165959 0.126280 -0.434824 -0.081230 -0.151045 17982.10 0
1 1 0.985100 -0.356045 0.558056 -0.429654 0.277140 0.428605 0.406466 -0.133118 0.347452 ... -0.194936 -0.605761 0.079469 -0.577395 0.190090 0.296503 -0.248052 -0.064512 6531.37 0
2 2 -0.260272 -0.949385 1.728538 -0.457986 0.074062 1.419481 0.743511 -0.095576 -0.261297 ... -0.005020 0.702906 0.945045 -1.154666 -0.605564 -0.312895 -0.300258 -0.244718 2513.54 0
3 3 -0.152152 -0.508959 1.746840 -1.090178 0.249486 1.143312 0.518269 -0.065130 -0.205698 ... -0.146927 -0.038212 -0.214048 -1.893131 1.003963 -0.515950 -0.165316 0.048424 5384.44 0
4 4 -0.206820 -0.165280 1.527053 -0.448293 0.106125 0.530549 0.658849 -0.212660 1.049921 ... -0.106984 0.729727 -0.161666 0.312561 -0.414116 1.071126 0.023712 0.419117 14278.97 0

5 rows × 31 columns

# Print the shape of the DataFrame
print("Shape of the DataFrame:", df.shape)


# Use .info() to get a concise summary of column types and non-null counts
print("DataFrame Information:")
df.info()
Shape of the DataFrame: (568630, 31)
DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568630 entries, 0 to 568629
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      568630 non-null  int64  
 1   V1      568630 non-null  float64
 2   V2      568630 non-null  float64
 3   V3      568630 non-null  float64
 4   V4      568630 non-null  float64
 5   V5      568630 non-null  float64
 6   V6      568630 non-null  float64
 7   V7      568630 non-null  float64
 8   V8      568630 non-null  float64
 9   V9      568630 non-null  float64
 10  V10     568630 non-null  float64
 11  V11     568630 non-null  float64
 12  V12     568630 non-null  float64
 13  V13     568630 non-null  float64
 14  V14     568630 non-null  float64
 15  V15     568630 non-null  float64
 16  V16     568630 non-null  float64
 17  V17     568630 non-null  float64
 18  V18     568630 non-null  float64
 19  V19     568630 non-null  float64
 20  V20     568630 non-null  float64
 21  V21     568630 non-null  float64
 22  V22     568630 non-null  float64
 23  V23     568630 non-null  float64
 24  V24     568630 non-null  float64
 25  V25     568630 non-null  float64
 26  V26     568630 non-null  float64
 27  V27     568630 non-null  float64
 28  V28     568630 non-null  float64
 29  Amount  568630 non-null  float64
 30  Class   568630 non-null  int64  
dtypes: float64(29), int64(2)
memory usage: 134.5 MB
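
To quantify the class balance issue mentioned earlier, a quick check of the target column can be run on the loaded DataFrame:

# Count and proportion of legitimate (0) vs. fraudulent (1) transactions
print(df['Class'].value_counts())
print(df['Class'].value_counts(normalize=True).round(4))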

Research Questions

Research Question 1:

What is the comparative performance of anomaly detection algorithms, including Random Forest, XGBoost, and KNN, for fraud detection on this specific dataset?

Analysis Plan:

  • Anomaly detection datasets are highly imbalanced, and because the rare class (anomalies) is often the more important one, they require special sampling techniques. The most plausible approach is oversampling the minority class and undersampling the majority class.
  • Split the dataset into training and testing sets and train the individual models (Random Forest, XGBoost, KNN) on the training set.
  • Tune the hyperparameters of the trained models.
  • Evaluate the performance of each model on the testing set using metrics such as precision, recall, F1-score, and area under the ROC curve (a code sketch of this pipeline follows the list).
  • Analyze the reasons behind the observed performance differences, considering factors such as model complexity, feature importance, and dataset characteristics.
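
A sketch of this plan is shown below. It assumes scikit-learn, imbalanced-learn, and xgboost; the resampling targets and model settings are placeholders to be finalized after inspecting the actual class distribution.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from xgboost import XGBClassifier

X = df.drop(columns=['id', 'Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
}

for name, clf in models.items():
    # Oversample the minority class, then undersample the majority class,
    # inside a pipeline so resampling only touches the training data.
    # The 0.5 / 0.8 targets are placeholders for an imbalanced dataset.
    pipe = ImbPipeline([
        ('oversample', SMOTE(sampling_strategy=0.5, random_state=42)),
        ('undersample', RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
        ('model', clf),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    y_prob = pipe.predict_proba(X_test)[:, 1]
    print(name)
    print(classification_report(y_test, y_pred, digits=4))
    print('ROC AUC:', roc_auc_score(y_test, y_prob))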

Research Question 2:

How does the stacked generalization technique, implemented with the mlxtend library, improve fraud detection performance by leveraging the synergy between base classifiers?

Analysis Plan:

  • Implement stacked generalization with the mlxtend library, using the trained models from the previous question as base learners.
  • Split the base learners' outputs into training and testing sets.
  • Combine the predictions from the base classifiers using the stacking ensemble approach and train a meta-classifier on the combined predictions (see the sketch after this list).
  • Evaluate the performance of the stacked model and compare it with the base learners.
  • Analyze the reasons behind any performance improvement, considering factors such as model diversity, ensemble learning principles, and the dataset's characteristics.
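
The sketch below reuses the train/test split and the models dictionary from the RQ1 sketch above, and assumes mlxtend's StackingCVClassifier with a logistic regression meta-learner as a placeholder.

from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

rf, xgb, knn = models['RandomForest'], models['XGBoost'], models['KNN']

# The meta-learner is trained on out-of-fold predictions of the base models.
stack = StackingCVClassifier(
    classifiers=[rf, xgb, knn],
    meta_classifier=LogisticRegression(max_iter=1000),
    use_probas=True,  # feed class probabilities to the meta-learner
    cv=5,
    random_state=42,
)

# Compare each base learner against the stacked model on the same split.
for name, clf in [('RandomForest', rf), ('XGBoost', xgb),
                  ('KNN', knn), ('Stacked', stack)]:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: F1={f1_score(y_test, y_pred):.4f}, "
          f"ROC AUC={roc_auc_score(y_test, y_prob):.4f}")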

Plan of Attack

| Task Name | Assignee | Due | Summary |
|---|---|---|---|
| Exploratory Data Analysis | Sai & Nandhini | 04/07/2024 | Compare the statistical distributions of the anonymized features; explore the relationship between transaction amount and fraudulent transactions |
| Feature Selection and Engineering | Gowtham | 04/09/2024 | Perform PCA and select the number of components; explore random forest feature importance |
| Training the Base Learners | Deema | 04/14/2024 | Train one machine learning algorithm from each ensemble learning approach (bagging, boosting, and stacking) along with artificial neural networks |
| Hypertuning Base Learners | Roxana | 04/20/2024 | Tune the base learners' hyperparameters using grid search or random search (see the sketch below) |
| Model Evaluation | Sai | 04/24/2024 | Evaluate models using categorical metrics, confusion matrices, and ROC curves |
| Developing Stacked Generalization | Omid | 04/30/2024 | Select the meta-learner and test the potential improvement over the base learners |
| Preparing the Final Report and Presentation | Nandhini | 04/05/2024 | Finalize the results and practice the oral presentation |
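
As referenced in the hypertuning row above, one possible setup is a random search with scikit-learn's RandomizedSearchCV; the estimator and parameter ranges below are illustrative only.

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space for one base learner (random forest).
param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 20),
    'min_samples_leaf': randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring='f1',   # emphasize performance on the fraud class
    cv=3,
    n_jobs=-1,
    random_state=42,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)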

Repo Organization

The project repository comprises the following folders and files:

  • .github/: Reserved for GitHub-related files like workflows, actions, and customized templates tailored for issue management.

  • _extra/: Acts as a repository for miscellaneous files that don’t fit into other project sections, offering flexibility for various supplementary documents.

  • _freeze/: Stores Quarto's frozen computation results so documents can be re-rendered without re-executing code.

  • data/: Houses essential data files crucial for project operations, including input files, datasets, and other vital resources.

  • images/: Serves as a central repository for visual assets such as diagrams, charts, and screenshots essential for project documentation and presentation.

  • .gitignore: Defines exclusions from version control, streamlining the versioning process.

  • README.md: Serves as the primary source of project information, encompassing setup instructions, usage guidelines, and an overview of project objectives and scope.

  • _quarto.yml: Functions as the configuration file for Quarto, governing the construction and rendering of Quarto documents.

  • about.qmd: Provides supplementary contextual information about the project’s purpose and team members.

  • index.qmd: Acts as the main hub for the project, offering detailed descriptions including code snippets, visualizations, and outcomes.

  • presentation.qmd: Serves as a Quarto file for presenting the final project results in slideshow format.

  • project-final.Rproj: Project file for organization and management within the R environment.

  • proposal.qmd: Contains the project proposal, encompassing dataset details, metadata, project description, questions, and weekly plan updates.

  • requirements.txt: Specifies project dependencies and their versions essential for successful execution.

References

[1] Data source: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023

[2] GitHub repository: https://github.com/INFO523-S24/project-final-MiningMinds