Personal Data Detection

Proposal

Our team is participating in a Kaggle competition whose goal is to create a model capable of detecting personally identifiable information (PII) in student writing. The presence of PII is a formidable barrier to the analysis and creation of openly accessible datasets aimed at advancing education, because releasing such data publicly exposes students to potential risks. To mitigate these risks, educational data must be thoroughly screened and cleansed of any PII before public release, a process that data science techniques can optimize and facilitate.
Author
Affiliation

KG Competitors

School of Information, University of Arizona

import json
from zipfile import ZipFile

import numpy as np
import pandas as pd
import seaborn as sns

sns.set(font_scale=1.25)
sns.set_style("white")

Dataset

GitHub does not allow files over 100 MB, so we store a compressed version of the dataset and decompress it temporarily to show what the data looks like.

# Extract the zipped essays into a temporary folder.
with ZipFile("data/essays.zip", 'r') as z:
    z.extractall(path="data/temp")

JSON is not the most convenient format for displaying information, so we need to transform it into a DataFrame and display a sample to convey what the data looks like. It’s important to note that the central part of the dataset is the tokens (simplified word forms), each of which can carry one of the following labels:

  • O - Not a part of an entity constituting personal information
  • NAME_STUDENT - The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names.
  • EMAIL - A student’s email address.
  • USERNAME - A student’s username on any platform.
  • ID_NUM - A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
  • PHONE_NUM - A phone number associated with a student.
  • URL_PERSONAL - A URL that might be used to identify a student.
  • STREET_ADDRESS - A full or partial street address that is associated with the student, such as their home address.

Token labels are presented in BIO (Beginning, Inside, Outside) format. The PII type is prefixed with “B-” when the token is the beginning of an entity. If the token is a continuation of an entity, it is prefixed with “I-”. Tokens that are not PII are labeled “O”.

The majority of tokens are labelled “O”, meaning they do not constitute PII. For display purposes, we will select only those tokens and labels that do constitute personal information.
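
For instance, under the BIO scheme a short sentence would be labelled as follows (the sentence and the name are invented purely for illustration):

# Hypothetical example of BIO labelling; not taken from the dataset.
tokens = ["My", "name", "is", "Jane", "Doe", "."]
labels = ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]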

The code for displaying the JSON data is adapted from medium.com/@nslhnyldz.

# Load the training data and keep only the tokens labelled as PII.
with open('data/temp/train.json') as f:
    data = json.load(f)

labels = []
tokens = []
for essay in data:
    # Collect label/token pairs whose label is not "O" (i.e. actual PII).
    labels.extend([label for label in essay['labels'] if label != 'O'])
    tokens.extend([token for label, token in zip(essay['labels'], essay['tokens'])
                   if label != 'O'])

data = pd.DataFrame({'labels': labels, 'tokens': tokens})
data.sample(n=10)
              labels            tokens
732         B-ID_NUM  Iz.:999893751750
845   B-NAME_STUDENT            Karina
2166  I-NAME_STUDENT           Bergamo
111   B-NAME_STUDENT             Ahmed
200   I-NAME_STUDENT             Ahmed
2664  I-NAME_STUDENT           Roberto
437   I-NAME_STUDENT                Ka
423   I-NAME_STUDENT              Khan
1343  I-NAME_STUDENT           Padilla
20    B-NAME_STUDENT             Sakir

The histogram below shows how frequently we see each of the labels in our dataset. Some PII categories are underrepresented, so we may eventually need to source additional data to achieve better model performance.

# Plot how frequently each label occurs, ordered from most to least common.
hist = sns.countplot(data=data, x="labels", order=data["labels"].value_counts().index)
hist.set(xlabel="Label", ylabel="Number of occurrences", title="Label frequency")

hist.tick_params(axis="x", labelrotation=70)

We can also view the full entities, rather than just separate tokens. To do that, we remove the label prefixes and merge adjacent PII tokens that follow the prefix pattern “B-, I-, I-, …”.

def transform_dataframe(df):
    """Merge consecutive B-/I- tokens into full entities and strip the prefixes."""
    new_rows = []
    current_entity = ''
    current_label = ''
    for index, row in df.iterrows():
        label = row['labels']
        token = row['tokens']
        if label.startswith('B-'):
            # A "B-" token starts a new entity: flush the previous one first.
            if current_entity != '':
                new_rows.append({'entity': current_entity, 'label': current_label})
            current_entity = token
            current_label = label[2:]  # drop the "B-" prefix
        elif label.startswith('I-'):
            # An "I-" token continues the current entity.
            current_entity += ' ' + token
    # Flush the final entity, if any.
    if current_entity != '':
        new_rows.append({'entity': current_entity, 'label': current_label})
    return pd.DataFrame(new_rows)

transformed_df = transform_dataframe(data)
transformed_df.sample(n=10)
               entity         label
914           Jamshed  NAME_STUDENT
625              Luca  NAME_STUDENT
791   Michele Schmidt  NAME_STUDENT
345      Gabriel Lara  NAME_STUDENT
286        Rosa Perna  NAME_STUDENT
410     Ernesto Young  NAME_STUDENT
543             Karla  NAME_STUDENT
248             Irina  NAME_STUDENT
237  Marcelo Caroline  NAME_STUDENT
1187   Fabian Silvera  NAME_STUDENT

Our dataset comes from the Kaggle competition we are competing in. It consists of around 22,000 essays submitted by students, all written in response to a single assignment prompt that asked students to apply course concepts to a real-world scenario. We selected this dataset/competition because certain team members have experience with NLP tasks, and it presents an opportunity to impart new skills to the rest of the team while tackling a tangible problem aligned with Kaggle’s objective of detecting personally identifiable information (PII) within these essays.

To safeguard student privacy, the original PII in the dataset has been substituted with surrogate identifiers of similar types using a partially automated procedure. Because 70% of the essays have been set aside for the test set, our team will use other datasets to supplement our work on this project (the search for supplemental datasets is in progress).

The competition data is provided in JSON format, containing various components such as a document identifier, the complete essay text, a token list, details about whitespace, and token annotations. The documents were tokenized using the spaCy English tokenizer.
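
To make that layout concrete, a single record plausibly looks like the sketch below. The tokens and labels fields match the code above; the remaining field names and all values are illustrative assumptions based on the description, not verbatim from the competition files.

# Illustrative shape of one record in train.json. Field names other than
# "tokens" and "labels" are assumptions; all values are invented.
record = {
    "document": 0,                         # document identifier
    "full_text": "My name is Jane Doe.",   # complete essay text
    "tokens": ["My", "name", "is", "Jane", "Doe", "."],
    "trailing_whitespace": [True, True, True, True, False, False],
    "labels": ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"],
}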

Questions

  1. Can we develop a model that successfully detects personally identifiable information (PII) in student writing?
  2. How can we evaluate the model’s performance effectively? Which metrics are most appropriate for PII detection tasks?

Analysis plan

Problem Introduction

Our team is participating in a Kaggle competition whose goal is to create a model capable of detecting personally identifiable information (PII) in student writing. Generally, text classification tasks focus on sentence classification. Our task, however, focuses on word classification, where each word appearing in a text is assigned one of a set of predefined categories, similar to a multiclass classification problem. This formulation is known as named entity recognition (NER).

Problem formulation

  • Data Processing: Since computational models do not understand raw text, we will tokenize it using a language-model tokenizer, which converts words into integer IDs and adds special tokens such as [CLS], [SEP], and [PAD] for input handling. Each token is then mapped to an n-dimensional vector representation. Leveraging KerasNLP and TensorFlow’s tf.data.Dataset API, we will create separate data loaders for training and validation. (A minimal tokenization sketch follows this list.)

  • Model Training: We will use transformer-based neural language models such as DeBERTa and RoBERTa. DeBERTa in particular has two distinct features. The first is its disentangled attention mechanism, which represents each word with two vectors, one encoding its content and the other its position, and computes attention weights between words using disentangled matrices based on their content and relative positions. The second is its enhanced mask decoder, which replaces the output softmax layer to predict masked tokens during pre-training. The model will be trained using cross-entropy loss, evaluated with the Fβ-score metric (defined after this list), and will utilize a Dense layer with softmax activation for prediction.

Training will be conditionally executed based on the CFG.train flag (as set in the Jupyter notebook), with progress monitored through epoch updates.

  • Evaluation: Validation data will undergo the same preprocessing to ensure uniform sample sizes and tokenization, with token IDs generated for inference.

  • Example: In the “analysis” folder of our GitHub repository, a Jupyter notebook contains the data and task description along with a DeBERTa model run, our first attempt at creating a model to detect PII in student writing.
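
For reference, the Fβ-score mentioned above generalizes the F1-score by weighting recall β times as heavily as precision:

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

The sketch below illustrates the tokenization and label-alignment step from the Data Processing bullet. It uses the Hugging Face transformers tokenizer purely for illustration, a stand-in for the KerasNLP path described above; the checkpoint name and max_length are placeholder assumptions.

from transformers import AutoTokenizer

# Placeholder checkpoint, chosen for illustration only.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

tokens = ["My", "name", "is", "Jane", "Doe", "."]
labels = ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]

# Tokenize pre-split words; special tokens are added automatically and a
# single word may be split into several sub-word pieces.
encoding = tokenizer(tokens, is_split_into_words=True,
                     truncation=True, max_length=64)

# Align the word-level labels to the sub-word tokens. Special tokens return
# None from word_ids(); frameworks typically map these to an ignored index
# so they do not contribute to the cross-entropy loss.
aligned = [labels[w] if w is not None else "IGNORE"
           for w in encoding.word_ids()]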

Plan of Attack

| Task Name | Status | Assignee | Due | Priority | Summary |
|---|---|---|---|---|---|
| Proposal Description | Completed | Remi Hendershott | 03 Apr 2024 | Moderate | Concise summary outlining the main idea of the proposal |
| Dataset | Completed | Makism Kulki | 03 Apr 2024 | Moderate | Uploading and loading the dataset |
| Questions | Completed | Everyone | 03 Apr 2024 | Moderate | Team consolidates findings to generate comprehensive questions aimed at exploring deeper insights |
| Proposal Peer Review | Completed | Everyone | 03 Apr 2024 | Moderate | Reviewing other teams’ proposals |
| Analysis | Completed | Monica Tejaswi, Remi Hendershott, G Sai Laasya, Shashank Yadav | 08 Apr 2024 | Moderate | Team consolidates findings to generate comprehensive questions aimed at exploring deeper insights |
| DeBERTa model | Completed | Shashank Yadav | 08 Apr 2024 | Moderate | Created the Jupyter notebook containing the data and task description along with a model run |
| Revising Proposal | Completed | Everyone | 08 Apr 2024 | Moderate | Revising the proposal after receiving constructive feedback from peers |
| Proposal Instructor Review | Completed | Everyone | 08 Apr 2024 | Moderate | Obtain feedback on the proposal from the instructor |
| Code Peer Review | Completed | Everyone | 01 May 2024 | Moderate | Reviewing other teams’ code in class together |
| Write-Up | Completed | Everyone | 06 May 2024 | High | Completing a detailed written document explaining our project, our model, data visualizations of our code chunks and results, as well as our conclusions and findings |
| Presentation | Completed | Everyone | 06 May 2024 | High | Creating a short but informative slideshow to present our project and model results to the class |

Repo Organization

  • .github/: specifically designated for GitHub-related files, including workflows, actions, and templates customized for managing issues.

  • _extra/: dedicated to storing miscellaneous files that do not categorically align with other project sections, serving as a versatile repository for various supplementary documents.

  • _freeze/: houses Quarto’s frozen computation outputs, which allow documents to be re-rendered without re-executing their code.

  • _analysis/: contains the Jupyter notebooks for the project’s analysis plan.

  • data/: designated for storing the data files the project relies on, including input files, datasets, and other essential data resources.

  • images/: a central repository for visual assets used across the project, such as diagrams, charts, and screenshots, supporting project documentation and presentation.

  • .gitignore: designed to define exclusions from version control, ensuring that specific files and directories are not tracked by Git, thereby simplifying the versioning process.

  • README.md: functioning as the central repository of project information, this README document provides vital details covering project setup, usage instructions, and an overarching summary of project objectives and scope.

  • _quarto.yml: functioning as a crucial configuration file for Quarto, this document encapsulates a range of settings and options that dictate the construction and rendering of Quarto documents. It enables customization and provides control over the output of the document.

  • about.qmd: Quarto file that supplements the project documentation with additional context, describing our project’s purpose as well as the names and backgrounds of individual team members.

  • index.qmd: serves as the main page for our project, where our write-up will eventually live. This Quarto file offers in-depth descriptions of our project, encompassing all code and visualizations, and eventually our results.

  • presentation.qmd: Quarto file containing the slideshow that presents our project’s final results.

  • project-final.Rproj: RStudio project file that stores workspace and environment settings for the repository.

  • proposal.qmd: the Quarto file responsible for our project proposal, housing our dataset, metadata, description, and questions, as well as our plan of attack, which is updated weekly.

  • requirements.txt: specifies the dependencies and their respective versions required for the project to run successfully.

References

[1] Our Kaggle Competition Info can be found here: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/overview