Season’s Screenings

Exploring Trends in Holiday Movies

import numpy as np
import pandas as pd

Dataset

data = pd.read_csv('data/holiday_movies.csv')

Metadata

Column Data type Description
tconst string Alphanumeric unique identifier of the title
title_type categorical The type/format of the title (movie, video, or tvMovie)
primary_title string The more popular title / the title used by the filmmakers on promotional materials at the point of release
original_title string Original title, in the original language
year numerical The release year of a title
runtime_minutes numerical Primary runtime of the title, in minutes
genres categorical Includes up to three genres associated with the title (comma-delimited)
simple_title string The title in lowercase, with punctuation removed, for easier filtering and grouping
average_rating numerical Weighted average of all the individual user ratings on IMDb
num_votes numerical Number of votes the title has received on IMDb (titles with fewer than 10 votes were not included in this dataset)
christmas boolean Whether the title includes “christmas”, “xmas”, “x mas”, etc
hanukkah boolean Whether the title includes “hanukkah”, “chanukah”, etc
kwanzaa boolean Whether the title includes “kwanzaa”
holiday boolean Whether the title includes the word “holiday”

Description

The Holiday Movies dataset is a collection that focuses on movies with themes or titles related to various holidays. It spans various genres, years, and holiday themes, looking at how holiday movies have evolved over time and how audiences have received them. By selecting the Holiday Movies dataset, we leverage a rich compilation of data to explore changes in popularity, genre preferences, and financial success over time.Our goal is to uncover patterns and insights that could inform future film production and marketing strategies.

Provenance

The dataset is courtesy of the Tidy Tuesday project and can be found on their official GitHub repository. The origin of the data is IMDb’s Non-Commercial datasets. The criteria for including a movie in this dataset were very simple. All works that had “Holiday” in the title were included, as well as those that had specific types of holidays in the title (“Christmas”, “Hannukah” and “Kwanzaa”).

Structure & Types of Data

Let us look at the first 5 lines of the data:

data.head()
tconst title_type primary_title original_title year runtime_minutes genres simple_title average_rating num_votes christmas hanukkah kwanzaa holiday
0 tt0020356 movie Sailor's Holiday Sailor's Holiday 1929 58.0 Comedy sailors holiday 5.4 55 False False False True
1 tt0020823 movie The Devil's Holiday The Devil's Holiday 1930 80.0 Drama,Romance the devils holiday 6.0 242 False False False True
2 tt0020985 movie Holiday Holiday 1930 91.0 Comedy,Drama holiday 6.3 638 False False False True
3 tt0021268 movie Holiday of St. Jorgen Prazdnik svyatogo Yorgena 1930 83.0 Comedy holiday of st jorgen 7.4 256 False False False True
4 tt0021377 movie Sin Takes a Holiday Sin Takes a Holiday 1930 81.0 Comedy,Romance sin takes a holiday 6.1 740 False False False True

Let us look at the shape of the data:

data.shape
(2265, 14)

The dataset contains 2265 rows and 14 columns. The data contains both numerical and categorical features. The genres column is essentially a set that contains multiple values, since a movie can belong to multiple genres.

Rationale for Selecting

The dataset offers rich data with both numerical and categorical features that span over multiple decades. The data is broad enough to allow for creative research yet does not require deep domain knowledge. This makes it a good choice for a team with varied backgrounds.

Questions:

Question 1

How have popular movies been changing over the decades?

We will look at the trends in popular holiday movies over the decades. Considering the movies with average ratings that fall in the range between the 3rd and 4th quartiles, we will show in which directions holiday movies have been evolving in terms of genres, duration, and lengths of titles.

Question 2

What kind of holiday movies earn the most money?

We will look at the correlations between movie features (such as year of production, duration, length of title, rating, genres, and type of holiday) and the amount of earnings the movie generated. For that, we will supplement the existing dataset with box office information using IMDb’s API. These observations may be useful in developing a model for estimating the investment worthiness of a movie (whether in production stage or adding an existing movie to a portfolio).

Analysis plan

Question 1

Data Preparation:

  • Variables Involved: Title, release year, genre(s), duration, ratings.

  • Variables to be Created: Title Length(Calculate the number of characters in each movie title).

  • Rating Quartile: Categorize movies based on their rating quartile for easier analysis.

  • Genre Count: Count of genres per movie, as movies might belong to multiple genres.

  • Data Cleaning: cleaning the data for inconsistencies, missing values, and outliers.

Trend Analysis:

  • Genre Analysis: Determine the most common genres among holiday movies in the selected rating range.

  • Duration Analysis: Examine if there’s a preferred movie length that correlates with higher ratings within the 3rd to 4th quartile range and how this preference has changed.

  • Title Length Analysis: Investigate the relationship between the length of movie titles and their ratings.

  • Yearly Rating Analysis: Assess how the average ratings of holiday movies in the 3rd to 4th quartile have fluctuated over time, looking for any patterns or significant changes.

  • Analytical method: we plan to employ statistical methods and machine learning algorithms to uncover underlying patterns and correlations. Initially, descriptive statistics and visualization tools (such as line graphs and bar charts) will be used to observe changes over time.

Question 2

Data Preparation:

  • Variables Involved: Title, release year, genre(s), duration, ratings, earnings.

  • Variables to be Created: Title Length(Calculate the number of characters in each movie title)

  • Genre Count: Count of genres per movie, as movies might belong to multiple genres.

  • External Data to be Merged: Earnings data for each movie from IMDb’s API.

  • Data Cleaning: cleaning the data for inconsistencies, missing values, and outliers.

Trend Analysis:

  • Earnings and Duration Analysis: Investigate the relationship between the movie’s duration and its earnings. Determine if there’s a preferred movie length that correlates with higher earnings.

  • Earnings and Title Length Analysis: Explore whether the length of the movie title has any correlation with its earnings. This could include examining if shorter or longer titles are associated with higher earnings.

  • Earnings and Rating Analysis: Assess the relationship between a movie’s rating and its earnings. Determine if higher-rated movies tend to earn more.

  • Earnings and Release Year Analysis: Look at how the earnings of movies have changed over time. This could involve analyzing trends in movie earnings across different years.

  • Analytical Method: In addressing the financial success of holiday movies and its correlation with various features, we will integrate linear regression analysis to explore the relationships between movie characteristics (such as genre, duration, and ratings) and box office earnings.

Project Timeline

Task Desciption Start End
1 Choose a dataset and write up proposal 2024-01-30 2024-02-06
2 Receive feedback on proposal and make necessery changes 2024-02-06 2024-02-13
3 Perform data analysis and work on presentation and write-up 2024-02-13 2024-02-27
4 Finalize analysis, write-up, presentation and project website 2024-02-27 2024-03-12