Exploring European Drug Development: COVID Vaccine Revisions and Recent Hepatitis B Medications

Proposal

Code
import numpy as np
import pandas as pd

Dataset

Code
# Read in the data
#url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-03-14/drugs.csv'
url = 'data/drugs_dataset.csv'
drugs = pd.read_csv(url)

# # Display dimensions of the dataset
# dimensions = drugs.shape
# print(f"Dimensions of the dataset: {dimensions}")

# # Display features (column names) of the dataset
# features = drugs.columns.tolist()
# print("Features of the dataset:")
# print(features)

# # Display data types of each column
# data_types = drugs.dtypes

# # Count the number of numerical and categorical variables
# numerical_vars = data_types[data_types != 'object'].index.tolist()
# categorical_vars = data_types[data_types == 'object'].index.tolist()

# # Display the counts and names of numerical and categorical variables
# num_numerical_vars = len(numerical_vars)
# num_categorical_vars = len(categorical_vars)

# print(f"Number of numerical variables: {num_numerical_vars}")
# print(f"Number of categorical variables: {num_categorical_vars}")

# print("\nNumerical variables:")
# print(numerical_vars)

# print("\nCategorical variables:")
# print(categorical_vars)

drugs.head()
category medicine_name therapeutic_area common_name active_substance product_number patient_safety authorisation_status atc_code additional_monitoring ... marketing_authorisation_holder_company_name pharmacotherapeutic_group date_of_opinion decision_date revision_number condition_indication species first_published revision_date url
0 human Adcetris Lymphoma, Non-Hodgkin; Hodgkin Disease brentuximab vedotin brentuximab vedotin 2455 False authorised L01XC12 False ... Takeda Pharma A/S Antineoplastic agents 2012-07-19 2022-11-17 34.0 Hodgkin lymphomaAdcetris is indicated for adul... NaN 2018-07-25T13:58:00Z 2023-03-13T11:52:00Z https://www.ema.europa.eu/en/medicines/human/E...
1 human Nityr Tyrosinemias nitisinone nitisinone 4582 False authorised A16AX04 False ... Cycle Pharmaceuticals (Europe) Ltd Other alimentary tract and metabolism products, 2018-05-31 2023-03-10 4.0 Treatment of adult and paediatric patients wit... NaN 2018-07-26T14:20:00Z 2023-03-10T17:29:00Z https://www.ema.europa.eu/en/medicines/human/E...
2 human Ebvallo Lymphoproliferative Disorders tabelecleucel tabelecleucel 4577 False authorised NaN True ... Pierre Fabre Medicament NaN 2022-10-13 2023-03-09 2.0 Ebvallo is indicated as monotherapy for treatm... NaN 2022-10-12T16:13:00Z 2023-03-10T13:40:00Z https://www.ema.europa.eu/en/medicines/human/E...
3 human Ronapreve COVID-19 virus infection casirivimab, imdevimab casirivimab, imdevimab 5814 False authorised J06BD True ... Roche Registration GmbH Immune sera and immunoglobulins, 2021-11-11 2023-02-24 3.0 Ronapreve is indicated for:Treatment of COVID-... NaN 2021-11-12T16:30:00Z 2023-03-10T12:29:00Z https://www.ema.europa.eu/en/medicines/human/E...
4 human Cosentyx Arthritis, Psoriatic; Psoriasis; Spondylitis... secukinumab secukinumab 3729 False authorised L04AC10 False ... Novartis Europharm Limited Immunosuppressants 2014-11-20 2023-01-26 30.0 Plaque psoriasisCosentyx is indicated for the ... NaN 2018-06-07T11:59:00Z 2023-03-09T18:53:00Z https://www.ema.europa.eu/en/medicines/human/E...

5 rows × 28 columns

A brief description of the dataset:

The dataset[1] consists of 1988 records and 28 features, providing a comprehensive overview of various pharmaceutical products and medicines. It encompasses diverse information, including the medicine’s category, name, therapeutic area, common name, active substance, and unique product number. Additionally, details about patient safety, authorization status, ATC code, and whether additional monitoring or conditional approval is required are included. The dataset captures essential regulatory information, such as the marketing authorization date, refusal date, and the company holding the marketing authorization. With variables indicating generic status, biosimilarity, orphan medicine designation, and accelerated assessment, the dataset offers a rich source for exploring the landscape of pharmaceutical products. Alongside pharmacotherapeutic group information, details about the indication for specific conditions, target species, and revision history provide a holistic view. The dataset’s diverse nature, coupled with numerical, categorical, and date variables, makes it a valuable resource for conducting analyses in the fields of pharmaceutical research, healthcare, and regulatory affairs.

Dataset Dimensions:
Number of Rows (Observations): 1988
Number of Columns (Features): 28

Reasons of choosing the dataset:

The dataset covers a wide range of drug applications, providing a holistic view of the European drug development landscape. This includes both successful and unsuccessful applications, offering a complete picture of the regulatory environment. Data from the European Medicines Agency is highly reliable, as it is collected and maintained by an authoritative regulatory body. This ensures the accuracy of our analysis.

Understanding how drug development has evolved over time, including which areas (therapeutic, disease focus) are gaining attention. Identifying the success rate of drug applications, which can offer insights into regulatory challenges or the quality of drug development. We’ll gain experience in managing large datasets, including cleaning, filtering, and organizing data to prepare it for analysis. This is a fundamental skill in data science. The dataset provides an opportunity to apply different techniques to identify trends, correlations, and patterns in drug development and approval processes.

Column Data Type Description
medicine_name String The brand name of the medicine.
therapeutic_area String The therapeutic area(s) for which the medicine is authorized.
authorisation_status String The current authorization status of the medicine (e.g., Approved, Refused).
revision_number Integer The number of times the medicine’s authorization details have been revised.
conditional_approval String Indicator if conditional approval is applied.
category String The category (human or veterinary) of the medicine.
marketing_authorisation_holder_company_name String The company holding the marketing authorization for the medicine.
revision_date Date The date of the latest revision for the medicine.

Questions

Question 1: Which COVID vaccines have undergone the most revisions while maintaining an approved authorization status with no conditions applied?

Question 2: What are the most recently released medicines (name and company) authorized for human usage for ‘Hepatitis B’?

Analysis plan

  • A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Variables involved to answer question 1:

  1. ‘medicine_name’ : The brand name of the medicine.
  2. ‘therapeutic_area’ : The therapeutic area for which the medicine is authorized.
  3. ‘authorisation_status’ : The authorization status of the medicine.
  4. ‘revision_number’ : The number of revisions for the medicine.
  5. ‘conditional_approval’ : Indicator if conditional approval is applied.

Plan for answering question 1:

  1. Filter the dataset to include only COVID vaccines based on ‘therapeutic_area’ variable.
  2. Exclude medicines with conditions applied in the ‘conditional_approval’ variable.
  3. Sort the dataset based on the ‘revision_number’ in descending order to get the vaccines with the most revisions.
  4. Extract and display the relevant columns (‘medicine_name’ and ‘revision_number’) for the vaccines.

No external data is needed for question 1, and no new variables need to be created.

Variables involved to answer question 2:

  1. ‘category’ : The category (human or veterinary) of the medicine.
  2. ‘medicine_name’ : The brand name of the medicine.
  3. ‘therapeutic_area’ : List of therapeutic areas for which the medicine is authorized.
  4. ‘authorisation_status’ : The authorization status of the medicine.
  5. ‘marketing_authorisation_holder_company_name’ : The company holding the marketing authorization for the medicine.
  6. ‘revision_date’: The date of the latest revision for the medicine.

Plan for answering question 2:

  1. Filter the dataset to include only medicines related to ‘Hepatitis B’ in the ‘therapeutic_area’ variable.
  2. Filter the dataset to include only medicines for humans in the ‘category’ variable.
  3. Sort the dataset based on the ‘revision_date’ in descending order to get the most recently revised medicines.
  4. Extract and display the relevant columns (‘medicine_name’ and ‘marketing_authorisation_holder_company_name’) for the most recently revised medicines.

No external data or new variables are needed for question 2.

Project Timeline

  1. By Feb 22nd: Clarify dataset details, set clear expectations for the project and finalize workflow.

  2. By Feb 26th: Complete analysis for Question 1 and incorporate findings into presentation.qmd

  3. By March 4th: Finish analysis for Question 2 and update presentation.

  4. March 5th-10th: Make final adjustments to the project, presentation and writeup, practice for presentation.

References

  1. Title: European Drug Development

    Author: jonthegeek

    Date: 2023-03-14

    Link: https://github.com/rfordatascience/tidytuesday/tree/master/data/2023/2023-03-14