Fraud Detection using Ensemble Learning
Proposal
High Level Goal
The overarching goal is to evaluate how much the performance of machine learning models improves when several models are combined through an ensemble method known as stacked generalization. To this end, we will apply the stacked model to a complex classification problem: detecting fraudulent credit card transactions.
Project Description and Motivation
Our project is motivated by the goal of accurately identifying fraudulent transactions using ensemble learning approaches (bagging, boosting, and stacking). We plan to use advanced data mining and machine learning techniques to create a model capable of spotting transactions that stand out as potentially fraudulent. A key part of our project involves comparing different machine learning models to see which one performs best on each dataset we study. We’re committed to understanding why certain models are more effective in specific situations.
We also aim to test the effectiveness of stacked generalization, commonly known as “stacking,” an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. The base-level models are trained on the complete training set; the meta-model is then trained on the outputs of the base-level models as features. Figure 1 illustrates the overall workflow of the stacking method.
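To make the stacking workflow concrete, here is a minimal sketch using scikit-learn's `StackingClassifier`. It is an illustration under stated assumptions, not the project's implementation: the data is synthetic (generated with `make_classification`), the two base learners and the logistic regression meta-classifier are arbitrary choices, and scikit-learn is used here in place of whichever library the project adopts. Note that scikit-learn fits the meta-classifier on cross-validated predictions of the base learners and then refits the base learners on the full training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the transaction table (assumption: not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base-level models (illustrative choices only).
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The meta-classifier receives the base learners' predicted probabilities as features.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked model accuracy on held-out data: {stack_accuracy:.3f}")
```

The fitted base learners are available as `stack.estimators_` and the meta-classifier as `stack.final_estimator_`, which is convenient when comparing the stacked model against its components.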
Through this work, we hope to make a meaningful contribution both to the study of fraud detection and to the practical efforts to secure financial transactions.
Dataset
We chose a dataset comprising over 550,000 credit card transactions made by European cardholders in 2023. This dataset is particularly appealing because of its substantial size, the anonymity of its features, and its real-world relevance. It includes 31 features: a unique transaction ID, 28 anonymized attributes (V1-V28) that could encompass various transaction details such as time, location, and merchant category, the transaction amount, and a binary class indicating whether the transaction was fraudulent. The data’s anonymization ensures privacy and ethical compliance, while the binary classification makes it suitable for supervised learning approaches in fraud detection.
One disadvantage of the dataset is that, because the features are anonymized, feature importance analysis such as SHAP (SHapley Additive exPlanations) or permutation feature importance does not provide useful information about the interpretability of the machine learning models. Also, the imbalance of the data is an issue, making it a challenging problem.
 | id | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | -0.260648 | -0.469648 | 2.496266 | -0.083724 | 0.129681 | 0.732898 | 0.519014 | -0.130006 | 0.727159 | ... | -0.110552 | 0.217606 | -0.134794 | 0.165959 | 0.126280 | -0.434824 | -0.081230 | -0.151045 | 17982.10 | 0 |
1 | 1 | 0.985100 | -0.356045 | 0.558056 | -0.429654 | 0.277140 | 0.428605 | 0.406466 | -0.133118 | 0.347452 | ... | -0.194936 | -0.605761 | 0.079469 | -0.577395 | 0.190090 | 0.296503 | -0.248052 | -0.064512 | 6531.37 | 0 |
2 | 2 | -0.260272 | -0.949385 | 1.728538 | -0.457986 | 0.074062 | 1.419481 | 0.743511 | -0.095576 | -0.261297 | ... | -0.005020 | 0.702906 | 0.945045 | -1.154666 | -0.605564 | -0.312895 | -0.300258 | -0.244718 | 2513.54 | 0 |
3 | 3 | -0.152152 | -0.508959 | 1.746840 | -1.090178 | 0.249486 | 1.143312 | 0.518269 | -0.065130 | -0.205698 | ... | -0.146927 | -0.038212 | -0.214048 | -1.893131 | 1.003963 | -0.515950 | -0.165316 | 0.048424 | 5384.44 | 0 |
4 | 4 | -0.206820 | -0.165280 | 1.527053 | -0.448293 | 0.106125 | 0.530549 | 0.658849 | -0.212660 | 1.049921 | ... | -0.106984 | 0.729727 | -0.161666 | 0.312561 | -0.414116 | 1.071126 | 0.023712 | 0.419117 | 14278.97 | 0 |
5 rows × 31 columns
# Print the shape of the DataFrame
print("Shape of the DataFrame:", df.shape)
# Print the data types of each column
# print("Data types of the columns:")
# print(df.dtypes)
# Use .info() to get a concise summary of the DataFrame
print("DataFrame Information:")
df.info()
Shape of the DataFrame: (568630, 31)
DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568630 entries, 0 to 568629
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 568630 non-null int64
1 V1 568630 non-null float64
2 V2 568630 non-null float64
3 V3 568630 non-null float64
4 V4 568630 non-null float64
5 V5 568630 non-null float64
6 V6 568630 non-null float64
7 V7 568630 non-null float64
8 V8 568630 non-null float64
9 V9 568630 non-null float64
10 V10 568630 non-null float64
11 V11 568630 non-null float64
12 V12 568630 non-null float64
13 V13 568630 non-null float64
14 V14 568630 non-null float64
15 V15 568630 non-null float64
16 V16 568630 non-null float64
17 V17 568630 non-null float64
18 V18 568630 non-null float64
19 V19 568630 non-null float64
20 V20 568630 non-null float64
21 V21 568630 non-null float64
22 V22 568630 non-null float64
23 V23 568630 non-null float64
24 V24 568630 non-null float64
25 V25 568630 non-null float64
26 V26 568630 non-null float64
27 V27 568630 non-null float64
28 V28 568630 non-null float64
29 Amount 568630 non-null float64
30 Class 568630 non-null int64
dtypes: float64(29), int64(2)
memory usage: 134.5 MB
Research Questions
Research Question 1:
What is the comparative performance of anomaly detection algorithms, including Random Forest, XGBoost, and KNN, for fraud detection on this specific dataset?
Analysis Plan:
- Because anomaly detection datasets are highly imbalanced and the rare class (anomalies) is often the more important one, they require special sampling techniques. The most plausible techniques are oversampling the minority class and undersampling the majority class.
- Split the dataset into training and testing sets and train individual anomaly detection models (Random Forest, XGBoost, KNN) on the training set.
- Tune the hyperparameters of the trained models.
- Evaluate the performance of each model on the testing set using metrics such as precision, recall, F1-score, and area under the ROC curve.
- Analyze the reasons behind the performance differences observed, potentially considering factors such as model complexity, feature importance, and dataset characteristics.
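The sampling and evaluation steps in this plan can be sketched in plain Python. The helper names `random_oversample`, `classification_metrics`, and `roc_auc` are hypothetical and written for illustration only; in practice a library implementation would likely be used. The sketch assumes binary labels (0 = legitimate, 1 = fraud), that both classes are present, and that resampling is applied to the training split only, so the test set keeps its original class distribution.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return list(rows) + extra, list(labels) + [minority] * len(extra)


def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1-score for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as one half).
    Quadratic in the number of rows, so suitable for small examples only."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four legitimate and two fraudulent transactions with model scores.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
balanced_rows, balanced_labels = random_oversample([[s] for s in scores], y_true)

print(metrics, auc, len(balanced_labels))
```

In the toy example the threshold of 0.5 catches both fraudulent transactions but also flags one legitimate one, which is why recall is perfect while precision is not. This is the kind of trade-off the precision, recall, and F1 metrics are meant to expose.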
Research Question 2:
How does the stacked generalization technique, implemented with the mlxtend library, improve fraud detection performance by leveraging the synergy between base classifiers?
Analysis Plan:
- Implement stacked generalization using the mlxtend library with the trained models from the previous question.
- Split the base learners’ output into training and testing sets.
- Combine predictions from base classifiers using the stacking ensemble approach and train a meta-classifier on the combined predictions.
- Evaluate the performance of the stacked model and compare it with the base learners.
- Analyze the reasons behind the performance improvement, considering factors such as model diversity, ensemble learning principles, and the dataset’s characteristics.
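The steps above can be sketched by hand to show how the base learners' outputs become the meta-classifier's features. This is only an illustration under several assumptions: it uses scikit-learn rather than mlxtend so that each step is visible, the data is synthetic, `GradientBoostingClassifier` stands in for XGBoost, and logistic regression is an arbitrary choice of meta-classifier. Out-of-fold predictions are used for the meta-training features so that the meta-classifier never sees a base learner's prediction on a row that learner was trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the transaction table (assumption: not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

base_learners = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=1),
    "gb": GradientBoostingClassifier(random_state=1),  # stand-in for XGBoost
    "knn": KNeighborsClassifier(n_neighbors=5),
}

meta_train_cols, meta_test_cols, base_auc = [], [], {}
for name, model in base_learners.items():
    # Out-of-fold fraud probabilities: each training row is scored by a
    # copy of the model that was fitted without that row.
    oof = cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    meta_train_cols.append(oof[:, 1])
    # Refit on the full training set to score the held-out test rows.
    model.fit(X_train, y_train)
    test_proba = model.predict_proba(X_test)[:, 1]
    meta_test_cols.append(test_proba)
    base_auc[name] = roc_auc_score(y_test, test_proba)

# One column per base learner: these are the meta-classifier's features.
meta_train = np.column_stack(meta_train_cols)
meta_test = np.column_stack(meta_test_cols)

meta_model = LogisticRegression()
meta_model.fit(meta_train, y_train)
stacked_auc = roc_auc_score(y_test, meta_model.predict_proba(meta_test)[:, 1])

for name, value in base_auc.items():
    print(f"{name}: AUC = {value:.3f}")
print(f"stacked: AUC = {stacked_auc:.3f}")
```

Printing the base learners' scores next to the stacked score gives the comparison the plan calls for. Whether stacking actually improves on the best base learner depends on the data and on how diverse the base learners' errors are, so no improvement should be assumed in advance.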
Plan of Attack
Task Name | Assignee | Due | Summary |
---|---|---|---|
Exploratory Data Analysis | Sai & Nandhini | 04/07/2024 | Comparing the statistical distributions of the anonymized features; exploring the relationship between the transaction amount and fraudulent transactions |
Feature Selection and Engineering | Gowtham | 04/09/2024 | Performing PCA and selecting the number of principal components; exploring random forest feature importance |
Training the Base Learners | Deema | 04/14/2024 | Training one machine learning algorithm from each of the ensemble learning approaches (bagging, boosting, and stacking) along with artificial neural networks |
Hyperparameter Tuning of Base Learners | Roxana | 04/20/2024 | Tuning the hyperparameters of the base learners using grid search or random search |
Model Evaluation | Sai | 04/24/2024 | Evaluating models using classification metrics, the confusion matrix, and the ROC curve |
Developing Stacked Generalization | Omid | 04/30/2024 | Selecting the meta-learner and testing the potential improvement over the base learners |
Preparing the Final Report and Presentation | Nandhini | 04/05/2024 | Finalizing the results and practicing the oral presentation |
Repo Organization
The project repository comprises the following folders and files:
.github/: Reserved for GitHub-related files like workflows, actions, and customized templates tailored for issue management.
_extra/: Acts as a repository for miscellaneous files that don’t fit into other project sections, offering flexibility for various supplementary documents.
_freeze/: Stores frozen environment files detailing the project’s setup and dependencies.
data/: Houses essential data files crucial for project operations, including input files, datasets, and other vital resources.
images/: Serves as a central repository for visual assets such as diagrams, charts, and screenshots essential for project documentation and presentation.
.gitignore: Defines exclusions from version control, streamlining the versioning process.
README.md: Serves as the primary source of project information, encompassing setup instructions, usage guidelines, and an overview of project objectives and scope.
_quarto.yml: Functions as the configuration file for Quarto, governing the construction and rendering of Quarto documents.
about.qmd: Provides supplementary contextual information about the project’s purpose and team members.
index.qmd: Acts as the main hub for the project, offering detailed descriptions including code snippets, visualizations, and outcomes.
presentation.qmd: Serves as a Quarto file for presenting the final project results in slideshow format.
project-final.Rproj: Project file for organization and management within the R environment.
proposal.qmd: Contains the project proposal, encompassing dataset details, metadata, project description, questions, and weekly plan updates.
requirements.txt: Specifies project dependencies and their versions essential for successful execution.
References
[1] Data source: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023
[2] GitHub repository: https://github.com/INFO523-S24/project-final-MiningMinds