import numpy as np
import pandas as pd
Season’s Screenings
Exploring Trends in Holiday Movies
Dataset
= pd.read_csv('data/holiday_movies.csv') data
Metadata
Column | Data type | Description |
---|---|---|
tconst | string | Alphanumeric unique identifier of the title |
title_type | categorical | The type/format of the title (movie, video, or tvMovie) |
primary_title | string | The more popular title / the title used by the filmmakers on promotional materials at the point of release |
original_title | string | Original title, in the original language |
year | numerical | The release year of a title |
runtime_minutes | numerical | Primary runtime of the title, in minutes |
genres | categorical | Includes up to three genres associated with the title (comma-delimited) |
simple_title | string | The title in lowercase, with punctuation removed, for easier filtering and grouping |
average_rating | numerical | Weighted average of all the individual user ratings on IMDb |
num_votes | numerical | Number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset) |
christmas | boolean | Whether the title includes “christmas”, “xmas”, “x mas”, etc |
hanukkah | boolean | Whether the title includes “hanukkah”, “chanukah”, etc |
kwanzaa | boolean | Whether the title includes “kwanzaa” |
holiday | boolean | Whether the title includes the word “holiday” |
Description
The Holiday Movies dataset is a collection that focuses on movies with themes or titles related to various holidays. It spans various genres, years, and holiday themes, looking at how holiday movies have evolved over time and how audiences have received them. By selecting the Holiday Movies dataset, we leverage a rich compilation of data to explore changes in popularity, genre preferences, and financial success over time.Our goal is to uncover patterns and insights that could inform future film production and marketing strategies.
Provenance
The dataset is courtesy of the Tidy Tuesday project and can be found on their official GitHub repository. The origin of the data is IMDb’s Non-Commercial datasets. The criteria for including a movie in this dataset were very simple. All works that had “Holiday” in the title were included, as well as those that had specific types of holidays in the title (“Christmas”, “Hannukah” and “Kwanzaa”).
Structure & Types of Data
Let us look at the first 5 lines of the data:
data.head()
tconst | title_type | primary_title | original_title | year | runtime_minutes | genres | simple_title | average_rating | num_votes | christmas | hanukkah | kwanzaa | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | tt0020356 | movie | Sailor's Holiday | Sailor's Holiday | 1929 | 58.0 | Comedy | sailors holiday | 5.4 | 55 | False | False | False | True |
1 | tt0020823 | movie | The Devil's Holiday | The Devil's Holiday | 1930 | 80.0 | Drama,Romance | the devils holiday | 6.0 | 242 | False | False | False | True |
2 | tt0020985 | movie | Holiday | Holiday | 1930 | 91.0 | Comedy,Drama | holiday | 6.3 | 638 | False | False | False | True |
3 | tt0021268 | movie | Holiday of St. Jorgen | Prazdnik svyatogo Yorgena | 1930 | 83.0 | Comedy | holiday of st jorgen | 7.4 | 256 | False | False | False | True |
4 | tt0021377 | movie | Sin Takes a Holiday | Sin Takes a Holiday | 1930 | 81.0 | Comedy,Romance | sin takes a holiday | 6.1 | 740 | False | False | False | True |
Let us look at the shape of the data:
data.shape
(2265, 14)
The dataset contains 2265 rows and 14 columns. The data contains both numerical and categorical features. The genres column is essentially a set that contains multiple values, since a movie can belong to multiple genres.
Rationale for Selecting
The dataset offers rich data with both numerical and categorical features that span over multiple decades. The data is broad enough to allow for creative research yet does not require deep domain knowledge. This makes it a good choice for a team with varied backgrounds.
Questions:
Question 1
How have popular movies been changing over the decades?
We will look at the trends in popular holiday movies over the decades. Considering the movies with average ratings that fall in the range between the 3rd and 4th quartiles, we will show in which directions holiday movies have been evolving in terms of genres, duration, and lengths of titles.
Question 2
What kind of holiday movies earn the most money?
We will look at the correlations between movie features (such as year of production, duration, length of title, rating, genres, and type of holiday) and the amount of earnings the movie generated. For that, we will supplement the existing dataset with box office information using IMDb’s API. These observations may be useful in developing a model for estimating the investment worthiness of a movie (whether in production stage or adding an existing movie to a portfolio).
Analysis plan
Question 1
Data Preparation:
Variables Involved: Title, release year, genre(s), duration, ratings.
Variables to be Created: Title Length(Calculate the number of characters in each movie title).
Rating Quartile: Categorize movies based on their rating quartile for easier analysis.
Genre Count: Count of genres per movie, as movies might belong to multiple genres.
Data Cleaning: cleaning the data for inconsistencies, missing values, and outliers.
Trend Analysis:
Genre Analysis: Determine the most common genres among holiday movies in the selected rating range.
Duration Analysis: Examine if there’s a preferred movie length that correlates with higher ratings within the 3rd to 4th quartile range and how this preference has changed.
Title Length Analysis: Investigate the relationship between the length of movie titles and their ratings.
Yearly Rating Analysis: Assess how the average ratings of holiday movies in the 3rd to 4th quartile have fluctuated over time, looking for any patterns or significant changes.
Analytical method: we plan to employ statistical methods and machine learning algorithms to uncover underlying patterns and correlations. Initially, descriptive statistics and visualization tools (such as line graphs and bar charts) will be used to observe changes over time.
Question 2
Data Preparation:
Variables Involved: Title, release year, genre(s), duration, ratings, earnings.
Variables to be Created: Title Length(Calculate the number of characters in each movie title)
Genre Count: Count of genres per movie, as movies might belong to multiple genres.
External Data to be Merged: Earnings data for each movie from IMDb’s API.
Data Cleaning: cleaning the data for inconsistencies, missing values, and outliers.
Trend Analysis:
Earnings and Duration Analysis: Investigate the relationship between the movie’s duration and its earnings. Determine if there’s a preferred movie length that correlates with higher earnings.
Earnings and Title Length Analysis: Explore whether the length of the movie title has any correlation with its earnings. This could include examining if shorter or longer titles are associated with higher earnings.
Earnings and Rating Analysis: Assess the relationship between a movie’s rating and its earnings. Determine if higher-rated movies tend to earn more.
Earnings and Release Year Analysis: Look at how the earnings of movies have changed over time. This could involve analyzing trends in movie earnings across different years.
Analytical Method: In addressing the financial success of holiday movies and its correlation with various features, we will integrate linear regression analysis to explore the relationships between movie characteristics (such as genre, duration, and ratings) and box office earnings.
Project Timeline
Task | Desciption | Start | End |
---|---|---|---|
1 | Choose a dataset and write up proposal | 2024-01-30 | 2024-02-06 |
2 | Receive feedback on proposal and make necessery changes | 2024-02-06 | 2024-02-13 |
3 | Perform data analysis and work on presentation and write-up | 2024-02-13 | 2024-02-27 |
4 | Finalize analysis, write-up, presentation and project website | 2024-02-27 | 2024-03-12 |