import numpy as np
import pandas as pd
Proposal
# Read in the data
#url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-03-14/drugs.csv'
url = 'data/drugs_dataset.csv'
drugs = pd.read_csv(url)
# # Display dimensions of the dataset
# dimensions = drugs.shape
# print(f"Dimensions of the dataset: {dimensions}")
# # Display features (column names) of the dataset
# features = drugs.columns.tolist()
# print("Features of the dataset:")
# print(features)
# # Display data types of each column
# data_types = drugs.dtypes
# # Count the number of numerical and categorical variables
# numerical_vars = data_types[data_types != 'object'].index.tolist()
# categorical_vars = data_types[data_types == 'object'].index.tolist()
# # Display the counts and names of numerical and categorical variables
# num_numerical_vars = len(numerical_vars)
# num_categorical_vars = len(categorical_vars)
# print(f"Number of numerical variables: {num_numerical_vars}")
# print(f"Number of categorical variables: {num_categorical_vars}")
# print("\nNumerical variables:")
# print(numerical_vars)
# print("\nCategorical variables:")
# print(categorical_vars)
drugs.head()
 | category | medicine_name | therapeutic_area | common_name | active_substance | product_number | patient_safety | authorisation_status | atc_code | additional_monitoring | ... | marketing_authorisation_holder_company_name | pharmacotherapeutic_group | date_of_opinion | decision_date | revision_number | condition_indication | species | first_published | revision_date | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | human | Adcetris | Lymphoma, Non-Hodgkin; Hodgkin Disease | brentuximab vedotin | brentuximab vedotin | 2455 | False | authorised | L01XC12 | False | ... | Takeda Pharma A/S | Antineoplastic agents | 2012-07-19 | 2022-11-17 | 34.0 | Hodgkin lymphomaAdcetris is indicated for adul... | NaN | 2018-07-25T13:58:00Z | 2023-03-13T11:52:00Z | https://www.ema.europa.eu/en/medicines/human/E... |
1 | human | Nityr | Tyrosinemias | nitisinone | nitisinone | 4582 | False | authorised | A16AX04 | False | ... | Cycle Pharmaceuticals (Europe) Ltd | Other alimentary tract and metabolism products, | 2018-05-31 | 2023-03-10 | 4.0 | Treatment of adult and paediatric patients wit... | NaN | 2018-07-26T14:20:00Z | 2023-03-10T17:29:00Z | https://www.ema.europa.eu/en/medicines/human/E... |
2 | human | Ebvallo | Lymphoproliferative Disorders | tabelecleucel | tabelecleucel | 4577 | False | authorised | NaN | True | ... | Pierre Fabre Medicament | NaN | 2022-10-13 | 2023-03-09 | 2.0 | Ebvallo is indicated as monotherapy for treatm... | NaN | 2022-10-12T16:13:00Z | 2023-03-10T13:40:00Z | https://www.ema.europa.eu/en/medicines/human/E... |
3 | human | Ronapreve | COVID-19 virus infection | casirivimab, imdevimab | casirivimab, imdevimab | 5814 | False | authorised | J06BD | True | ... | Roche Registration GmbH | Immune sera and immunoglobulins, | 2021-11-11 | 2023-02-24 | 3.0 | Ronapreve is indicated for:Treatment of COVID-... | NaN | 2021-11-12T16:30:00Z | 2023-03-10T12:29:00Z | https://www.ema.europa.eu/en/medicines/human/E... |
4 | human | Cosentyx | Arthritis, Psoriatic; Psoriasis; Spondylitis... | secukinumab | secukinumab | 3729 | False | authorised | L04AC10 | False | ... | Novartis Europharm Limited | Immunosuppressants | 2014-11-20 | 2023-01-26 | 30.0 | Plaque psoriasisCosentyx is indicated for the ... | NaN | 2018-06-07T11:59:00Z | 2023-03-09T18:53:00Z | https://www.ema.europa.eu/en/medicines/human/E... |
5 rows × 28 columns
A brief description of the dataset:
The dataset[1] consists of 1988 records and 28 features describing pharmaceutical products and medicines. Each record includes the medicine’s category, name, therapeutic area, common name, active substance, and unique product number, along with its patient safety flag, authorization status, ATC code, and whether additional monitoring or conditional approval applies. Regulatory fields include the marketing authorization date, the refusal date, and the company holding the marketing authorization. Further variables indicate generic status, biosimilarity, orphan medicine designation, and accelerated assessment. The pharmacotherapeutic group, the indicated condition, the target species, and the revision history complete each record. Because it combines numerical, categorical, and date variables, the dataset is well suited to analyses in pharmaceutical research, healthcare, and regulatory affairs.
Dataset Dimensions:
Number of Rows (Observations): 1988
Number of Columns (Features): 28
Reasons for choosing the dataset:
The dataset covers a wide range of drug applications, providing a broad view of the European drug development landscape. It includes both successful and unsuccessful applications, offering a fuller picture of the regulatory environment. The data comes from the European Medicines Agency, an authoritative regulatory body that collects and maintains it, so we consider it a reliable basis for our analysis.
With this dataset we can examine how drug development has evolved over time, including which areas (therapeutic, disease focus) are gaining attention. We can also estimate the success rate of drug applications, which can offer insights into regulatory challenges or the quality of drug development. Along the way we’ll gain experience in managing real-world data, including cleaning, filtering, and organizing it to prepare for analysis, which is a fundamental skill in data science. Finally, the dataset provides an opportunity to apply different techniques to identify trends, correlations, and patterns in drug development and approval processes.
Column | Data Type | Description |
---|---|---|
medicine_name | String | The brand name of the medicine. |
therapeutic_area | String | The therapeutic area(s) for which the medicine is authorized. |
authorisation_status | String | The current authorization status of the medicine (e.g., authorised, refused). |
revision_number | Integer | The number of times the medicine’s authorization details have been revised (loaded as a float, e.g., 34.0). |
conditional_approval | String | Indicates whether conditional approval applies. |
category | String | The category (human or veterinary) of the medicine. |
marketing_authorisation_holder_company_name | String | The company holding the marketing authorization for the medicine. |
revision_date | Date | The date of the latest revision for the medicine. |
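The table lists `revision_date` as a date and `revision_number` as an integer, but the preview above shows them loaded as text (`2023-03-13T11:52:00Z`) and as floats (`34.0`). A minimal sketch of the conversions, assuming a tiny made-up frame in place of `drugs`; the sample rows are illustrative, and treating unparseable dates as missing (`errors="coerce"`) is an assumption rather than a decision made in this proposal:

```python
import pandas as pd

# Toy stand-in for `drugs`; the values are illustrative only.
sample = pd.DataFrame({
    "medicine_name": ["A", "B"],
    "revision_date": ["2023-03-13T11:52:00Z", "2023-03-10T17:29:00Z"],
    "revision_number": [34.0, None],
})

# Parse the ISO 8601 timestamps; utc=True makes the trailing "Z" explicit.
sample["revision_date"] = pd.to_datetime(
    sample["revision_date"], utc=True, errors="coerce"
)

# The nullable integer dtype keeps missing revision counts as <NA>
# instead of forcing the whole column to float.
sample["revision_number"] = sample["revision_number"].astype("Int64")

print(sample.dtypes)
```

The same two calls could be applied to the corresponding columns of `drugs` after `pd.read_csv`.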
Question 1: Which COVID vaccines have undergone the most revisions while maintaining an approved authorization status with no conditions applied?
Question 2: What are the most recently released medicines (name and company) authorized for human use for ‘Hepatitis B’?
Variables involved to answer question 1:
Plan for answering question 1:
No external data is needed for question 1, and no new variables need to be created.
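One possible approach to Question 1 is sketched below. It is an illustration under stated assumptions, not the final analysis: the toy rows are made up, the helper name `most_revised_covid` is hypothetical, and matching on the text `COVID-19` in `therapeutic_area` is an assumption that does not by itself separate vaccines from other COVID-19 medicines. The status value `authorised` is taken from the preview above.

```python
import pandas as pd

def most_revised_covid(df: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Rank authorised, non-conditional COVID-19 products by revision count."""
    is_covid = df["therapeutic_area"].str.contains("COVID-19", case=False, na=False)
    is_authorised = df["authorisation_status"].eq("authorised")
    # Compare as lowercase text so the filter works whether the flag
    # was read as a boolean or as a string.
    no_conditions = df["conditional_approval"].astype(str).str.lower().eq("false")
    cols = ["medicine_name", "revision_number"]
    return (
        df.loc[is_covid & is_authorised & no_conditions, cols]
        .sort_values("revision_number", ascending=False)
        .head(top_n)
        .reset_index(drop=True)
    )

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "medicine_name": ["VaxA", "VaxB", "VaxC", "VaxD", "OtherDrug"],
    "therapeutic_area": ["COVID-19 virus infection"] * 4 + ["Psoriasis"],
    "authorisation_status": ["authorised", "authorised", "withdrawn",
                             "authorised", "authorised"],
    "conditional_approval": [False, True, False, False, False],
    "revision_number": [12.0, 30.0, 8.0, 20.0, 40.0],
})

q1_result = most_revised_covid(toy)
print(q1_result)
```

On the real data the call would be `most_revised_covid(drugs)`; restricting the result to vaccines would need an additional filter that this sketch leaves out.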
Variables involved to answer question 2:
Plan for answering question 2:
No external data or new variables are needed for question 2.
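A possible approach to Question 2 is sketched below, again on made-up rows. The helper name `latest_hepatitis_b` is hypothetical, and ordering by `revision_date` is an assumption: it is the only date column in the variable table above, but it records the latest revision rather than a release, so another date column may turn out to be the better choice and can be passed as `date_col`. No filter on `authorisation_status` is applied here.

```python
import pandas as pd

def latest_hepatitis_b(
    df: pd.DataFrame, date_col: str = "revision_date", top_n: int = 5
) -> pd.DataFrame:
    """List human medicines for Hepatitis B, newest first by `date_col`."""
    is_human = df["category"].eq("human")
    is_hep_b = df["therapeutic_area"].str.contains("Hepatitis B", case=False, na=False)
    subset = df.loc[is_human & is_hep_b].copy()
    # Dates arrive as text; parse them so sorting is chronological.
    subset[date_col] = pd.to_datetime(subset[date_col], utc=True, errors="coerce")
    cols = ["medicine_name", "marketing_authorisation_holder_company_name", date_col]
    return (
        subset.sort_values(date_col, ascending=False)[cols]
        .head(top_n)
        .reset_index(drop=True)
    )

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "medicine_name": ["HepA1", "HepB1", "HepB2", "VetHep"],
    "category": ["human", "human", "human", "veterinary"],
    "therapeutic_area": ["Hepatitis A", "Hepatitis B",
                         "Hepatitis B; HIV Infections", "Hepatitis B"],
    "marketing_authorisation_holder_company_name": ["Co1", "Co2", "Co3", "Co4"],
    "revision_date": ["2021-01-05T10:00:00Z", "2020-06-01T09:00:00Z",
                      "2023-02-01T12:00:00Z", "2023-03-01T08:00:00Z"],
})

q2_result = latest_hepatitis_b(toy)
print(q2_result)
```

On the real data the call would be `latest_hepatitis_b(drugs)`, optionally with a different `date_col`.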
Project Timeline
By Feb 22nd: Clarify dataset details, set clear expectations for the project and finalize workflow.
By Feb 26th: Complete analysis for Question 1 and incorporate findings into presentation.qmd
By March 4th: Finish analysis for Question 2 and update presentation.
March 5th-10th: Make final adjustments to the project, presentation and writeup, practice for presentation.
Title: European Drug Development
Author: jonthegeek
Date: 2023-03-14
Link: https://github.com/rfordatascience/tidytuesday/tree/master/data/2023/2023-03-14