Personal Data Detection

INFO 523 - Project Final

Our team is participating in a Kaggle competition whose primary objective is to create a model capable of detecting personally identifiable information (PII) in student writing, with a secondary goal of evaluating and comparing our models' efficiency to identify the best one.
Author: KG Competitors

Affiliation: School of Information, University of Arizona

Introduction

The primary objective of this project is to develop a model capable of detecting personally identifiable information (PII) in student writing. This is crucial because PII presents significant obstacles to creating publicly accessible educational datasets, as it could potentially expose students to various risks. To address these concerns, our team is participating in a Kaggle competition focusing on PII detection in a dataset of approximately 22,000 student essays. Each essay was submitted in response to a single assignment prompt, requiring students to apply course concepts to real-world scenarios. The dataset has been pre-processed to ensure student privacy, with original PII replaced by surrogate identifiers through a partially automated process.

Our approach to this problem is centered on word-level classification, commonly known as ‘Named Entity Recognition’, where each word in a text is assigned to one of 13 predefined categories. The key variables in the dataset include tokens representing simplified word forms, each associated with these labels. We chose to use three DeBERTa models of varying sizes—extra small, small, and large—to conduct our analysis. The models were fine-tuned using KerasNLP with JAX as the backend and evaluated based on their F5-scores, which prioritize recall to reduce the impact of false negatives.

Our analysis plan consisted of several key steps. First, the raw text was tokenized using the DeBERTa tokenizer to ensure proper segmentation of named entities. Next, each model was fine-tuned with a small learning rate (6e-6), using different batch sizes and training durations to reflect their respective resource requirements. The models were then evaluated based on their performance on a validation set, focusing on the F5-score, where recall is considered five times more important than precision. We also compared the performance and resource utilization of each model to determine which was most efficient and effective.

This report outlines our strategy, methods, and conclusions for tackling PII identification in student work. We walk through our approach, the features of the dataset, and our analytical plan, explaining how we explored several DeBERTa model configurations, training processes, and evaluation criteria. Through careful fine-tuning and thorough performance evaluation of these models, we aim to strike the best possible balance between computational efficiency and accuracy. Ultimately, our results offer useful guidance for academics and practitioners working with educational data, adding to the ongoing conversation about protecting student privacy while expanding the boundaries of data-driven educational research.

Abstract

Our team is participating in a Kaggle competition in which the primary goal is to develop a model capable of detecting personally identifiable information (PII) in student writing. PII creates significant obstacles to the analysis and creation of publicly accessible datasets designed to enhance education, as sharing such data could expose students to potential risks. To address these risks, educational data must be thoroughly screened and cleansed of PII before it is publicly released, a task that could be optimized with data science techniques. Typically, text classification tasks involve categorizing entire sentences; our task, however, focuses on word-level classification, where each word in a text belongs to one of several predefined categories, akin to a multiclass classification problem. This approach is commonly known as ‘Named Entity Recognition’. Our process involved building and running three DeBERTa models of various sizes (extra small, small, and large) to address the Kaggle objective, then comparing the three models to determine which performs best.

Questions

  1. Can we develop a model that successfully detects personally identifiable information (PII) in student writing?
  2. How can we evaluate the model’s performance effectively? Which metrics are most appropriate for PII detection tasks?

Dataset

We are using a dataset sourced from the Kaggle competition we are participating in, which comprises roughly 22,000 essays submitted by students. Each essay was written in response to a single assignment prompt, asking students to apply course concepts to a real-world situation. We chose this dataset and competition because some of our team members have experience in NLP tasks, providing an opportunity to share knowledge with the rest of the team while addressing a practical problem. The competition’s goal is to detect personally identifiable information (PII) within these essays. To ensure student privacy, any original PII in the dataset has been replaced with surrogate identifiers of similar types through a partially automated process.

  • Dataset Used: The Learning Agency Lab - PII Data Detection
  • Description: Contains token-level annotations for named entities across 13 label categories.
  • Preprocessing: The text is tokenized using DeBERTa’s tokenizer to ensure that named entities are correctly identified and segmented.
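For illustration, here is a minimal sketch of that tokenization step with KerasNLP; the preset string follows KerasNLP’s published DeBERTa-v3 presets, and the sample sentence is hypothetical:

```python
# A minimal sketch of DeBERTa tokenization with KerasNLP.
import keras_nlp

tokenizer = keras_nlp.models.DebertaV3Tokenizer.from_preset(
    "deberta_v3_extra_small_en"
)

text = "Alina Shende applied design thinking to the challenge."  # hypothetical
token_ids = tokenizer(text)              # integer subword IDs
print(token_ids)
print(tokenizer.detokenize(token_ids))   # round-trips back to text
```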

GitHub does not allow files over 100 MB, so we store a compressed version of the dataset and decompress it temporarily to show what the data looks like.

The code for displaying JSON data is adapted from medium.com/@nslhnyldz.
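Below is a minimal sketch of the decompress-and-load step; the archive path (data/pii_dataset.json.zip), the member name (train.json), and the field names (tokens, labels) are assumptions about the competition’s JSON layout:

```python
# A minimal sketch: decompress the archive and flatten the JSON into a
# one-row-per-token DataFrame. The file paths are assumptions.
import json
import zipfile

import pandas as pd

with zipfile.ZipFile("data/pii_dataset.json.zip") as archive:
    with archive.open("train.json") as f:
        records = json.load(f)

# Each record holds parallel lists of tokens and labels; explode them
# so that every row is a single (label, token) pair.
df = pd.DataFrame(records)[["tokens", "labels"]]
df = df.explode(["tokens", "labels"]).reset_index(drop=True)

# For display, keep only tokens that actually carry a PII label.
pii = df[df["labels"] != "O"]
print(pii[["labels", "tokens"]].sample(10))
```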

      labels          tokens
864   I-NAME_STUDENT  Shende
2707  B-NAME_STUDENT  Alina
2216  B-NAME_STUDENT  Om
1422  I-NAME_STUDENT  Ciccone
816   B-NAME_STUDENT  Dawn
863   B-NAME_STUDENT  Vijay
2206  B-NAME_STUDENT  Rachel
1330  B-NAME_STUDENT  Luis
2662  I-NAME_STUDENT  Rodriguez
914   B-NAME_STUDENT  Mohamed

JSON is not the most convenient format for displaying information, so we transform it into a DataFrame. To convey what the data looks like, we will transform the dataset into a CSV file and display it. It is important to note that the central part of the dataset is the tokens (simplified word forms), each of which can carry one of the following labels:

  • O - Not a part of an entity constituting personal information
  • NAME_STUDENT - The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names.
  • EMAIL - A student’s email address.
  • USERNAME - A student’s username on any platform.
  • ID_NUM - A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
  • PHONE_NUM - A phone number associated with a student.
  • URL_PERSONAL - A URL that might be used to identify a student.
  • STREET_ADDRESS - A full or partial street address that is associated with the student, such as their home address.

Token labels are presented in BIO (Beginning, Inside, Outside) format. The PII type is prefixed with “B-” when the token is the beginning of an entity and with “I-” when the token is a continuation of an entity; tokens that are not PII are labeled “O”. For example, the student name “Alina Shende” would be tagged as B-NAME_STUDENT followed by I-NAME_STUDENT, with the surrounding non-PII words tagged O.

We can also view the full entities rather than separate tokens. To do so, we remove the label prefixes and merge adjacent PII tokens that follow the prefix pattern “B-, I-, I-, …”, as sketched below.
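A minimal sketch of that merge, assuming the exploded df from the loading step above with its parallel tokens and labels columns:

```python
# Merge BIO-tagged tokens back into full entities: "B-" opens a new
# entity, "I-" extends the open one, and "O" closes it.
import pandas as pd

entities = []
current_tokens, current_label = [], None

def close_entity():
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))

for token, label in zip(df["tokens"], df["labels"]):
    if label.startswith("B-"):
        close_entity()                    # finish any open entity
        current_tokens, current_label = [token], label[2:]
    elif label.startswith("I-") and current_tokens:
        current_tokens.append(token)      # continuation of the open entity
    else:
        close_entity()                    # an "O" token ends the entity
        current_tokens, current_label = [], None
close_entity()                            # flush the final entity

entities_df = pd.DataFrame(entities, columns=["entity", "label"])
print(entities_df.sample(10))
```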

      entity                                   label
783   Abdul Khan                               NAME_STUDENT
180   Fabio Sameh                              NAME_STUDENT
1243  https://www.facebook.com/pamela23        URL_PERSONAL
1244  Lisa Santiago                            NAME_STUDENT
205   Rhiannon Karim                           NAME_STUDENT
1544  https://hernandez.com/exploremain.html   URL_PERSONAL
1040  Brenda Garcia                            NAME_STUDENT
1571  David                                    NAME_STUDENT
1429  264945858442                             ID_NUM
988   Christina Bennett                        NAME_STUDENT

The vast majority of tokens are labelled “O”, meaning they do not constitute PII, so the label distribution is heavily imbalanced. For display purposes, we selected only those tokens and labels that do constitute personal information.

Analysis Plan

  • Methodology:
    1. Tokenization: Tokenize the raw text using the appropriate DeBERTa tokenizer.
    2. Fine-Tuning: Models were fine-tuned on the same training set with a very small learning rate (6e-6). The KerasNLP Python package was used with the JAX backend, and models were trained on an Nvidia 4070 Ti GPU.
    3. Evaluation: Assessed each model’s performance on the validation set using the F5-score, in which recall is weighted five times as heavily as precision. This makes the F5-score particularly useful in our case, where missing positive cases (false negatives) is far more problematic than incorrectly labeling negative cases as positive (false positives), given the huge class imbalance; see the sketch after this list.
    4. Comparison: Compared the performance and resource utilization of each model.
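For reference, the general formula is Fβ = (1 + β²) · P · R / (β² · P + R), so β = 5 weights recall 25× relative to precision inside the harmonic mean. Below is a minimal sketch of the metric computed with scikit-learn; the flattened label sequences y_true and y_pred are hypothetical stand-ins for real validation data:

```python
# A minimal sketch of the F5 computation with scikit-learn.
# y_true / y_pred are hypothetical flattened token-label sequences.
from sklearn.metrics import fbeta_score

y_true = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]
y_pred = ["O", "B-NAME_STUDENT", "O", "O", "B-EMAIL"]

# beta=5 makes recall count 25x (beta^2) as much as precision; "O" is
# excluded from the label set so only PII labels are scored.
pii_labels = sorted(set(y_true) - {"O"})
f5 = fbeta_score(y_true, y_pred, beta=5, labels=pii_labels, average="micro")
print(f"F5 = {f5:.3f}")
```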

Model 1: Extra Small DeBERTa

  • Configuration: DeBERTa-v3 Extra Small, a 70.68-million-parameter model.
  • Training Details: Tuned with a smaller batch size due to resource constraints, focusing on maximizing model efficiency.
  • Expected Advantage: Lower computational requirements, making it suitable for environments with strict resource limitations.

Model 2: Small DeBERTa

  • Configuration: DeBERTa-v3 Small, a 141.30-million-parameter model.
  • Training Details: Balance between computational demand and performance, using a moderate batch size and learning rate.
  • Expected Advantage: Expected to perform better than the extra small model with manageable resource use.

Model 3: Large DeBERTa

  • Configuration: DeBERTa-v3 Large, a 434.01-million-parameter model.
  • Training Details: Utilizes larger batch sizes and more extended training periods to exploit the model’s capacity fully.
  • Expected Advantage: Highest accuracy and performance, suitable for scenarios where computational resources are less of a constraint.
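Below is a minimal sketch of how one of these configurations can be assembled for token classification with KerasNLP on the JAX backend; swap the preset string for “deberta_v3_small_en” or “deberta_v3_large_en” for the larger variants. The per-token Dense head is our own assumption, since KerasNLP does not ship a dedicated token-classification task model:

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # Keras 3 backend selection

import keras
import keras_nlp

NUM_LABELS = 13  # the label categories described earlier

# Pretrained encoder; swap the preset for the small or large variant.
backbone = keras_nlp.models.DebertaV3Backbone.from_preset(
    "deberta_v3_extra_small_en"
)

# Per-token softmax head over the encoder's sequence output (our
# assumption; not a built-in KerasNLP task model).
outputs = keras.layers.Dense(NUM_LABELS, activation="softmax")(backbone.output)
model = keras.Model(backbone.input, outputs)

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=6e-6),  # LR from our plan
    loss="sparse_categorical_crossentropy",
)
model.summary()
```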

Results & Conclusion

  • Performance Analysis:
    • The Extra Small DeBERTa model offers the fastest inference time, but at the cost of a lower F5-score.
    • The Small DeBERTa model provides balanced performance, with reasonable inference times and an improved F5-score over the extra small variant.
    • The Large DeBERTa model achieves the highest F5-score, reflecting its greater model capacity.

  • Conclusion:
    • Regarding our research questions, we can indeed construct a model capable of accurately detecting personally identifiable information (PII) in student writing. To assess efficiency, we designed three models of varying sizes and evaluated their performance, comparing their F5-scores and taking inference times into consideration. The large DeBERTa model achieved the highest F5-score, emerging as the most effective model overall.
    • However, the choice of model depends on the specific needs for efficiency and performance. For limited-resource environments, the Extra Small or Small models are recommended. For maximum accuracy, where resources are plentiful, the Large model is the best choice.

References

[1] The Learning Agency Lab - PII Data Detection (Kaggle competition): https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/overview