Season’s Screenings: Exploring Trends in Holiday Movies

INFO 523 - Project 1

Author

Season’s Screenings

Abstract

The provided dataset is a structured compilation of films with a central theme or element of holidays, spanning almost a century from 1929 to 2022. The dataset includes various attributes such as the original title of the movie, its release year, the runtime in minutes, the genre(s), a simplified title, the average rating, the number of votes the movie received, and boolean indicators for its association with major holidays like Christmas, Hanukkah, Kwanzaa, and whether it is related to a holiday in general. We performed exploratory data analysis to discover trends in the dataset, then analyzed it further to answer two questions: How have popular movies changed over the decades? What kind of holiday movies earn the most money?

What kind of holiday movies earn the most money?

Introduction

  • We will look at the correlations between movie features (such as year of production, duration, length of title, rating, genres, and type of holiday) and the movie's earnings. To do so, we will supplement the existing dataset with box office information from IMDb’s API. These observations may be useful in developing a model for estimating whether a movie is worth investing in (whether at the production stage or when adding an existing movie to a portfolio).

Approach

  • At first glance, the tconst column in the dataset may seem redundant: we already have row indices, so why would we need another identifier? It becomes useful when we need additional data about the movies, because tconst is the IMDb title identifier. Using IMDb’s API, we will populate a new column called earnings, which will allow us to answer our question and make additional observations.

Analysis

  • To avoid exposing the API key and spending many minutes on real-time execution, the code below is not executed in this report. Instead, the earnings column is generated once and saved in a copy of our dataset in the /data folder. The function below shows how the lookup works: given a tconst value, it calls IMDb’s API and returns the movie's gross earnings.
import json
import boto3


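def get_earnings(tconst):
    """Return the international opening-weekend gross amount for one title."""
    # Instantiate DataExchange client
    CLIENT = boto3.client('dataexchange', region_name='us-east-1')

    DATA_SET_ID = '<Dataset ID>'
    REVISION_ID = '<Revision ID>'
    ASSET_ID    = '<Asset ID>'
    API_KEY     = '<API Key>'

    # Insert the title id into the GraphQL query as a quoted string
    query = """
    {title(id: "%s") {
        # Get the international opening weekend gross for the title
        openingWeekendGross(boxOfficeArea: INTERNATIONAL) {
        gross {
            total {
            amount
            currency
            }
        }
        # Get the opening weekend end date
        weekendEndDate
        }
    }
    }
    """ % tconst

    METHOD = 'POST'
    PATH = '/v1'
    # Serialize the GraphQL request as a JSON body
    BODY = json.dumps({'query': query})

    response = CLIENT.send_api_asset(
        DataSetId=DATA_SET_ID,
        RevisionId=REVISION_ID,
        AssetId=ASSET_ID,
        Method=METHOD,
        Path=PATH,
        Body=BODY,
        RequestHeaders={
            'x-api-key': API_KEY
        },
    )

    # The response body is a JSON string; we assume it mirrors the query shape
    data = json.loads(response['Body'])['data']
    return data['title']['openingWeekendGross']['gross']['total']['amount']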

Disclaimer: IMDb’s API is expensive, so we do not make actual calls and instead simulate the earnings data for demonstration purposes. The functions shown follow IMDb’s API documentation and should work given a valid API key.
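Given any lookup function, populating the earnings column amounts to mapping it over tconst and saving the result once. The following is a minimal sketch under stated assumptions: the helper add_earnings_column, the stand-in fake_fetch, and the demo identifiers are all hypothetical and not part of the project code.

```python
import pandas as pd


def add_earnings_column(movies, fetch):
    """Return a copy of `movies` with an added `earnings` column.

    `fetch` is any callable mapping a tconst string to a number
    (hypothetical interface; a real API lookup would play this role).
    Failed lookups become missing values instead of aborting the run.
    """
    def safe_fetch(tconst):
        try:
            return fetch(tconst)
        except Exception:
            return float("nan")

    result = movies.copy()
    result["earnings"] = result["tconst"].map(safe_fetch)
    return result


def fake_fetch(tconst):
    """Stand-in for a real API call; the values are made up."""
    made_up = {"tt_demo_1": 3_000_000, "tt_demo_2": 4_500_000}
    return made_up[tconst]  # raises KeyError for unknown ids


movies = pd.DataFrame({"tconst": ["tt_demo_1", "tt_demo_2", "tt_demo_3"]})
with_earnings = add_earnings_column(movies, fake_fetch)
print(with_earnings)

# Saving once lets later runs skip the lookups entirely:
# with_earnings.to_csv("data/holiday_movies_earnings.csv", index=False)
```

Writing the file with index=False keeps the row index out of the CSV, which avoids extra "Unnamed: 0" columns when the file is read back.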

We load the data and convert the values into millions of USD by floor-dividing every value by 1,000,000, so amounts are expressed in whole millions. We then explore a few relationships to see whether any features in our dataset correlate with earnings.

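import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df_earnings = pd.read_csv('data/holiday_movies_earnings.csv')

# earnings in whole millions of USD (floor division drops the remainder)
df_earnings['earnings'] = df_earnings['earnings'] // 1000000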
df_earnings.head()
|   | Unnamed: 0.1 | Unnamed: 0 | tconst | title_type | primary_title | original_title | year | runtime_minutes | genres | simple_title | average_rating | num_votes | christmas | hanukkah | kwanzaa | holiday | earnings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | tt0020356 | movie | Sailor's Holiday | Sailor's Holiday | 1929 | 58.0 | Comedy | sailors holiday | 5.4 | 55 | False | False | False | True | 3.0 |
| 1 | 1 | 1 | tt0020823 | movie | The Devil's Holiday | The Devil's Holiday | 1930 | 80.0 | Drama,Romance | the devils holiday | 6.0 | 242 | False | False | False | True | 4.0 |
| 2 | 2 | 2 | tt0020985 | movie | Holiday | Holiday | 1930 | 91.0 | Comedy,Drama | holiday | 6.3 | 638 | False | False | False | True | 18.0 |
| 3 | 3 | 3 | tt0021268 | movie | Holiday of St. Jorgen | Prazdnik svyatogo Yorgena | 1930 | 83.0 | Comedy | holiday of st jorgen | 7.4 | 256 | False | False | False | True | 23.0 |
| 4 | 4 | 4 | tt0021377 | movie | Sin Takes a Holiday | Sin Takes a Holiday | 1930 | 81.0 | Comedy,Romance | sin takes a holiday | 6.1 | 740 | False | False | False | True | 74.0 |

data_copy = df_earnings.copy()
averages = data_copy.groupby('year')['earnings'].mean()

scatter = sns.scatterplot(data=averages.to_frame(), x = "year", y = "earnings")
labels = scatter.set(xlabel = "Year", ylabel = "Earnings (in mln USD)", title = "Year-earnings relationship")
plt.show()

The plot shows that yearly average earnings vary much more before 1980, with both very high and very low averages. Because the dataset contains fewer older movies, each of those averages rests on only a few data points, so this pattern should be taken with a grain of salt.
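One way to check whether the spread in early years comes from sparse data is to count how many films stand behind each yearly average. The following is a minimal sketch on made-up numbers; the sample frame and the min_films threshold are illustrative assumptions, not values from the analysis.

```python
import pandas as pd

# Made-up sample; in the report the real data lives in df_earnings.
sample = pd.DataFrame({
    "year": [1930, 1975, 1975, 2010, 2010, 2010],
    "earnings": [74.0, 2.0, 90.0, 20.0, 25.0, 30.0],
})

# Mean earnings and number of films per year.
per_year = sample.groupby("year")["earnings"].agg(["mean", "count"])

# Keep only years whose average rests on enough films (arbitrary threshold).
min_films = 3
reliable = per_year[per_year["count"] >= min_films]

print(per_year)
print(reliable)
```

Plotting only the rows in reliable, or sizing points by count, would make it easier to tell real variation from small-sample noise.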

data_copy = df_earnings.copy()
averages = data_copy.groupby('average_rating')['earnings'].mean()

scatter = sns.scatterplot(data=averages.to_frame(),x = "average_rating", y = "earnings")
labels = scatter.set(xlabel ="Rating", ylabel = "Earnings (in mln USD)", title = "Rating-earnings relationship")
plt.show()

As one might expect, the plot suggests a positive correlation between a movie's average rating and its earnings: better-liked movies tend to sell more tickets.
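A visual impression like this can be backed by a correlation coefficient. The following is a minimal sketch on made-up numbers that computes Pearson's r and Spearman's rank correlation (obtained here as Pearson's r on the ranks); the sample values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up sample; in the report the real columns live in df_earnings.
sample = pd.DataFrame({
    "average_rating": [4.0, 5.0, 6.0, 7.0, 8.0],
    "earnings": [2.0, 3.0, 10.0, 12.0, 40.0],
})

# Pearson's r measures the strength of a linear relationship.
pearson = sample["average_rating"].corr(sample["earnings"])

# Spearman's rho is Pearson's r computed on ranks; it captures any
# monotonic relationship and is less sensitive to extreme earners.
spearman = sample["average_rating"].rank().corr(sample["earnings"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

Comparing the two is informative: a Spearman value well above the Pearson value hints at a monotonic but non-linear relationship, which is common for skewed quantities such as earnings.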

# Merge the genre data with the earnings data on the common key (assuming it's 'tconst' here)
merged_data = pd.merge(new_genres, df_earnings, on='tconst', how='inner')

# Split the genres into lists if they're combined in a single string, assuming they're separated by commas
merged_data['genres_list'] = merged_data['genres_y'].str.split(',')

# Explode the DataFrame so each genre gets its own row
merged_data_exploded = merged_data.explode('genres_list')

# Calculate the average earnings for each genre
average_earnings_by_genre = merged_data_exploded.groupby('genres_list')['earnings'].mean().reset_index()

# Create a bar plot to show average earnings by genre
plt.figure(figsize=(12, 8))
sns.barplot(x='earnings', y='genres_list', data=average_earnings_by_genre.sort_values('earnings', ascending=False))
plt.xlabel('Average Earnings (in million USD)')
plt.ylabel('Genre')
plt.title('Average Earnings by Movie Genre')
plt.xticks(rotation=45)
plt.show()

The chart ranks genres by average earnings, from “Reality-TV” at the top (highest) to “Thriller” at the bottom (lowest).

# Merge the data on a common key, let's say 'tconst' which should be present in both dataframes
merged_data1 = pd.merge(df[['tconst', 'runtime_minutes']], df_earnings[['tconst', 'earnings']], on='tconst', how='inner')

# Now that we have a merged DataFrame with runtime and earnings, we can group by runtime and calculate average earnings
duration_earnings = merged_data1.groupby('runtime_minutes')['earnings'].mean().reset_index()

# Create a scatter plot to visualize the relationship between movie duration and average earnings
plt.figure(figsize=(12, 8))
scatter = sns.scatterplot(x="runtime_minutes", y="earnings", data=duration_earnings)
labels = scatter.set(xlabel="Runtime Minutes", ylabel="Earnings (in million USD)", title="Runtime-Earnings Relationship")
plt.show()

There does not appear to be a strong linear relationship between the runtime of a movie and its earnings. The scatter plot does not indicate a clear trend that would suggest longer or shorter movies earn more money consistently.
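Grouping by exact runtime gives many groups with a single film each, which makes the scatter noisy. One alternative is to bin runtimes into ranges before averaging. The following is a minimal sketch on made-up numbers; the bin edges and labels are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Made-up sample; in the report the real data lives in merged_data1.
sample = pd.DataFrame({
    "runtime_minutes": [58, 80, 83, 91, 95, 118, 125],
    "earnings": [3.0, 4.0, 23.0, 18.0, 10.0, 7.0, 30.0],
})

# Bin edges are arbitrary; each bin includes its right edge.
bins = [0, 60, 90, 120, 240]
labels = ["<=60", "61-90", "91-120", ">120"]
sample["runtime_bin"] = pd.cut(sample["runtime_minutes"], bins=bins, labels=labels)

# Average earnings and number of films per runtime range.
binned = sample.groupby("runtime_bin", observed=True)["earnings"].agg(["mean", "count"])
print(binned)
```

If the binned means show no steady rise or fall across ranges, that supports the reading that runtime has little bearing on earnings.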

# Initialize an empty list to store average earnings by holiday type
holiday_earnings_list = []

# Calculate the average earnings for each holiday type
for holiday in ['christmas', 'hanukkah', 'kwanzaa', 'holiday']:
    # Filter the DataFrame to include only movies associated with the holiday
    holiday_df = merged_data[merged_data[holiday] == True]
    # Calculate the mean earnings for the holiday
    average_earnings = holiday_df['earnings'].mean()
    # Create a dictionary of the results and append to the list
    holiday_earnings_list.append({'Holiday Type': holiday, 'Average Earnings': average_earnings})

# Convert the list of dictionaries to a DataFrame
holiday_earnings = pd.DataFrame(holiday_earnings_list)

# Earnings are already expressed in millions of USD, so no further conversion is needed

# Create a bar plot to show average earnings by holiday type
plt.figure(figsize=(12, 8))
sns.barplot(x = 'Average Earnings', y = 'Holiday Type', data = holiday_earnings)
plt.xlabel('Average Earnings (in million USD)')
plt.ylabel('Holiday Type')
plt.title('Average Earnings by Holiday Type')
plt.show()

The bar chart shows that Hanukkah-themed movies have the highest average earnings, followed by Christmas films. Movies with the generic holiday flag and Kwanzaa films have notably lower average earnings, suggesting that Hanukkah films are the most commercially successful among holiday movies.
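Averages per holiday type are easier to interpret alongside the number of films in each group and a median, since a single high earner can dominate a small group. The following is a minimal sketch on made-up numbers; the sample frame and the summary column names are illustrative assumptions.

```python
import pandas as pd

# Made-up sample with the same boolean flags as the dataset.
sample = pd.DataFrame({
    "christmas": [True, True, True, False, False],
    "hanukkah": [False, False, False, True, False],
    "kwanzaa": [False, False, False, False, False],
    "holiday": [False, False, False, False, True],
    "earnings": [10.0, 20.0, 60.0, 90.0, 5.0],
})

rows = []
for flag in ["christmas", "hanukkah", "kwanzaa", "holiday"]:
    # Earnings of the films carrying this holiday flag.
    subset = sample.loc[sample[flag], "earnings"]
    rows.append({
        "holiday_type": flag,
        "n_films": len(subset),
        "mean": subset.mean(),      # NaN when the group is empty
        "median": subset.median(),  # NaN when the group is empty
    })

summary = pd.DataFrame(rows).set_index("holiday_type")
print(summary)
```

In this made-up sample the Hanukkah mean comes from a single film, which illustrates why group sizes matter when ranking holiday types by average earnings.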

Discussion

The analysis reveals several insights into the holiday film industry. First, film releases spike noticeably between 2000 and 2020, suggesting that production has risen in recent decades. Second, audience tastes matter: higher-rated films are associated with higher earnings, whereas there is no clear pattern connecting movie runtime or title length to earnings. Additionally, the most popular holiday films are mostly in the comedy and drama genres, indicating a preference for happy, uplifting stories during the holiday season. Unexpectedly, the highest-grossing films are those with a Hanukkah theme, demonstrating the commercial viability of a variety of holiday storylines.