Season’s Screenings

Exploring Trends in Holiday Movies

import numpy as np
import pandas as pd

Dataset

data = pd.read_csv('data/holiday_movies.csv')

Metadata

Column	Data type	Description
tconst	string	Alphanumeric unique identifier of the title
title_type	categorical	The type/format of the title (movie, video, or tvMovie)
primary_title	string	The more popular title / the title used by the filmmakers on promotional materials at the point of release
original_title	string	Original title, in the original language
year	numerical	The release year of a title
runtime_minutes	numerical	Primary runtime of the title, in minutes
genres	categorical	Includes up to three genres associated with the title (comma-delimited)
simple_title	string	The title in lowercase, with punctuation removed, for easier filtering and grouping
average_rating	numerical	Weighted average of all the individual user ratings on IMDb
num_votes	numerical	Number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)
christmas	boolean	Whether the title includes “christmas”, “xmas”, “x mas”, etc
hanukkah	boolean	Whether the title includes “hanukkah”, “chanukah”, etc
kwanzaa	boolean	Whether the title includes “kwanzaa”
holiday	boolean	Whether the title includes the word “holiday”

Description

The Holiday Movies dataset is a collection of movies with themes or titles related to various holidays. It spans a range of genres, release years, and holiday themes, which lets us look at how holiday movies have evolved over time and how audiences have received them. With this dataset we can explore changes in popularity, genre preferences, and financial success over time. Our goal is to uncover patterns and insights that could inform future film production and marketing strategies.

Provenance

The dataset is courtesy of the Tidy Tuesday project and can be found on their official GitHub repository. The origin of the data is IMDb’s Non-Commercial datasets. The inclusion criteria were simple: every work with “holiday” in its title was included, as were works whose titles name a specific holiday (“Christmas”, “Hanukkah”, or “Kwanzaa”).

Structure & Types of Data

Let us look at the first five rows of the data:

data.head()

	tconst	title_type	primary_title	original_title	year	runtime_minutes	genres	simple_title	average_rating	num_votes	christmas	hanukkah	kwanzaa	holiday
0	tt0020356	movie	Sailor's Holiday	Sailor's Holiday	1929	58.0	Comedy	sailors holiday	5.4	55	False	False	False	True
1	tt0020823	movie	The Devil's Holiday	The Devil's Holiday	1930	80.0	Drama,Romance	the devils holiday	6.0	242	False	False	False	True
2	tt0020985	movie	Holiday	Holiday	1930	91.0	Comedy,Drama	holiday	6.3	638	False	False	False	True
3	tt0021268	movie	Holiday of St. Jorgen	Prazdnik svyatogo Yorgena	1930	83.0	Comedy	holiday of st jorgen	7.4	256	False	False	False	True
4	tt0021377	movie	Sin Takes a Holiday	Sin Takes a Holiday	1930	81.0	Comedy,Romance	sin takes a holiday	6.1	740	False	False	False	True

Let us look at the shape of the data:

data.shape

(2265, 14)

The dataset contains 2265 rows and 14 columns, with both numerical and categorical features. The genres column is multi-valued: it stores a comma-delimited string of up to three genres, since a movie can belong to several genres at once.
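Because genres is stored as a comma-delimited string, it usually has to be split before genres can be counted individually. The following is a minimal sketch of one way to do that, using a small stand-in frame built from the rows shown by data.head() above; the names sample, genres_long, and genre_counts are our own and not part of the dataset.

```python
import pandas as pd

# Small stand-in for the real data; values copied from data.head() above.
sample = pd.DataFrame({
    "primary_title": ["Sailor's Holiday", "The Devil's Holiday", "Holiday"],
    "genres": ["Comedy", "Drama,Romance", "Comedy,Drama"],
})

# Split the comma-delimited string into a list, then give each genre its own row.
genres_long = (
    sample.assign(genre=sample["genres"].str.split(","))
          .explode("genre")
          .drop(columns="genres")
)

# Count how many titles carry each genre.
genre_counts = genres_long["genre"].value_counts()
print(genre_counts)
```

In the long format a title appears once per genre, so per-genre counts add up to more than the number of titles.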

Rationale for Selecting

The dataset offers rich data with both numerical and categorical features that span over multiple decades. The data is broad enough to allow for creative research yet does not require deep domain knowledge. This makes it a good choice for a team with varied backgrounds.

Questions:

Question 1

How have popular movies been changing over the decades?

We will look at the trends in popular holiday movies over the decades. Considering the movies whose average ratings fall between the third quartile and the maximum (the top 25% by rating), we will show how holiday movies have been evolving in terms of genres, duration, and title length.

Question 2

What kind of holiday movies earn the most money?

We will look at the correlations between movie features (such as year of production, duration, length of title, rating, genres, and type of holiday) and the earnings the movie generated. For that, we will supplement the existing dataset with box office information using IMDb’s API. These observations may be useful in developing a model for estimating the investment worthiness of a movie, whether it is still in production or is an existing title being added to a portfolio.

Analysis plan

Question 1

Data Preparation:

Variables Involved: Title, release year, genre(s), duration, ratings.
Variables to be Created: Title Length (the number of characters in each movie title).
Rating Quartile: Categorize movies based on their rating quartile for easier analysis.
Genre Count: Count of genres per movie, as movies might belong to multiple genres.
Data Cleaning: Check for and handle inconsistencies, missing values, and outliers.
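The derived variables listed above could be built along the following lines. This is a sketch on a five-row stand-in frame taken from data.head(), not the final pipeline. The column names title_length, genre_count, and rating_quartile are our own choices, and we assume that the movies of interest are those labeled Q4 (ratings above the third quartile).

```python
import pandas as pd

# Stand-in rows copied from data.head() above.
sample = pd.DataFrame({
    "primary_title": [
        "Sailor's Holiday", "The Devil's Holiday", "Holiday",
        "Holiday of St. Jorgen", "Sin Takes a Holiday",
    ],
    "genres": ["Comedy", "Drama,Romance", "Comedy,Drama", "Comedy", "Comedy,Romance"],
    "average_rating": [5.4, 6.0, 6.3, 7.4, 6.1],
})

prepared = sample.copy()

# Title length: number of characters in the title.
prepared["title_length"] = prepared["primary_title"].str.len()

# Genre count: number of commas plus one (stays missing if genres is missing).
prepared["genre_count"] = prepared["genres"].str.count(",") + 1

# Rating quartile: four equal-frequency bins over average_rating.
prepared["rating_quartile"] = pd.qcut(
    prepared["average_rating"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Assumed selection for Question 1: the top quartile by rating.
top_quartile = prepared[prepared["rating_quartile"] == "Q4"]
print(prepared)
```

On the full dataset the quartile boundaries would be computed from all 2265 titles, so the labels assigned to these five rows here are only illustrative.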

Trend Analysis:

Genre Analysis: Determine the most common genres among holiday movies in the selected rating range.
Duration Analysis: Examine if there’s a preferred movie length that correlates with higher ratings within the 3rd to 4th quartile range and how this preference has changed.
Title Length Analysis: Investigate the relationship between the length of movie titles and their ratings.
Yearly Rating Analysis: Assess how the average ratings of holiday movies in the 3rd to 4th quartile have fluctuated over time, looking for any patterns or significant changes.
Analytical Method: We will start with descriptive statistics and visualizations (such as line graphs and bar charts) to observe changes over time, and then apply statistical methods and machine learning algorithms to uncover underlying patterns and correlations.
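As a rough illustration of the descriptive step, the sketch below groups titles by decade and summarizes them. It again uses stand-in rows from data.head(); the decade column and the summary names (n_titles, median_runtime, mean_rating) are our own assumptions about how the summaries might be organized.

```python
import pandas as pd

# Stand-in rows copied from data.head() above.
sample = pd.DataFrame({
    "year": [1929, 1930, 1930, 1930, 1930],
    "runtime_minutes": [58.0, 80.0, 91.0, 83.0, 81.0],
    "average_rating": [5.4, 6.0, 6.3, 7.4, 6.1],
})

# Map each release year to the first year of its decade (1929 -> 1920).
sample["decade"] = (sample["year"] // 10) * 10

# One row per decade with a few descriptive statistics.
by_decade = sample.groupby("decade").agg(
    n_titles=("year", "size"),
    median_runtime=("runtime_minutes", "median"),
    mean_rating=("average_rating", "mean"),
)
print(by_decade)
```

A table of this shape can be passed directly to a line or bar chart, with decade on the horizontal axis.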

Question 2

Data Preparation:

Variables Involved: Title, release year, genre(s), duration, ratings, earnings.
Variables to be Created: Title Length (the number of characters in each movie title).
Genre Count: Count of genres per movie, as movies might belong to multiple genres.
External Data to be Merged: Earnings data for each movie from IMDb’s API.
Data Cleaning: Check for and handle inconsistencies, missing values, and outliers.
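Once earnings have been collected, they could be joined to the movie table on tconst. The sketch below shows one possible shape for that merge. The earnings table, its box_office_usd column, and the values in it are hypothetical placeholders for illustration only; they are not real box office figures and do not reflect the actual format of any IMDb response.

```python
import pandas as pd

# Stand-in movie rows copied from data.head() above.
movies = pd.DataFrame({
    "tconst": ["tt0020356", "tt0020823", "tt0020985"],
    "primary_title": ["Sailor's Holiday", "The Devil's Holiday", "Holiday"],
})

# HYPOTHETICAL earnings table: column name and values are placeholders,
# not real box office data.
earnings = pd.DataFrame({
    "tconst": ["tt0020356", "tt0020985"],
    "box_office_usd": [100.0, 200.0],
})

# A left merge keeps every movie, even when no earnings were found for it.
# validate raises an error if tconst is duplicated on either side.
merged = movies.merge(
    earnings, on="tconst", how="left", validate="one_to_one", indicator=True
)

# Titles without earnings will need to be dropped or handled before modeling.
n_missing = int((merged["_merge"] == "left_only").sum())
print(merged)
print("titles without earnings:", n_missing)
```

Keeping the _merge indicator column makes it easy to report how much of the dataset actually has earnings information.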

Trend Analysis:

Earnings and Duration Analysis: Investigate the relationship between the movie’s duration and its earnings. Determine if there’s a preferred movie length that correlates with higher earnings.
Earnings and Title Length Analysis: Explore whether the length of the movie title has any correlation with its earnings. This could include examining if shorter or longer titles are associated with higher earnings.
Earnings and Rating Analysis: Assess the relationship between a movie’s rating and its earnings. Determine if higher-rated movies tend to earn more.
Earnings and Release Year Analysis: Look at how the earnings of movies have changed over time. This could involve analyzing trends in movie earnings across different years.
Analytical Method: To study how the financial success of holiday movies relates to their features, we will use linear regression to model box office earnings as a function of movie characteristics (such as genre, duration, and ratings).
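To make the regression step concrete, here is a minimal ordinary least squares sketch using NumPy. All data in it are synthetic: we generate runtimes, ratings, and an earnings-like response from coefficients we chose arbitrarily, then check that the fit recovers them. It says nothing about real holiday movie earnings, and the final analysis may use a dedicated statistics library instead.

```python
import numpy as np

# SYNTHETIC data for illustration only; not drawn from the dataset.
rng = np.random.default_rng(0)
n = 200
runtime = rng.uniform(60, 120, n)   # minutes
rating = rng.uniform(3, 9, n)       # IMDb-style rating

# Arbitrary "true" coefficients: intercept, runtime slope, rating slope.
true_coefs = np.array([1.0, 0.05, 0.5])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), runtime, rating])
y = X @ true_coefs + rng.normal(0, 0.1, n)

# Ordinary least squares fit.
coefs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", coefs)
```

Categorical features such as genre would first need to be encoded as indicator columns (for example with pandas.get_dummies) before being added to the design matrix.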

Project Timeline

Task	Description	Start	End
1	Choose a dataset and write up proposal	2024-01-30	2024-02-06
2	Receive feedback on proposal and make necessary changes	2024-02-06	2024-02-13
3	Perform data analysis and work on presentation and write-up	2024-02-13	2024-02-27
4	Finalize analysis, write-up, presentation and project website	2024-02-27	2024-03-12