Global Energy Trends

INFO 523 - Project Final

The goal of this project is to analyze the complex relationships between economic and population growth, sustainable energy practices, and energy consumption.

Author

Affiliation

data detectives - Ayesha, Abhishek, Shreemithra, Toluwanimi, Valerie, Alyssa

School of Information, University of Arizona

Abstract

This project uses the comprehensive energy dataset from Our World in Data, spanning 1900 to 2022, to examine global energy consumption trends in relation to economic growth, population dynamics, and the adoption of sustainable energy practices. The primary goal is to design a predictive dashboard that models a nation’s energy consumption from essential factors such as population size, GDP, and the proportion of electricity derived from renewable sources. The analysis uses a range of statistical and machine learning techniques, including time series decomposition and linear regression on the key predictors. We evaluate the regression models using R-squared and Root Mean Squared Error (RMSE) to gauge their explanatory power and accuracy. This evaluation matters for making predictions reliable enough to inform energy policy formulation and planning. The project also analyzes trends in renewable energy use at the regional and national levels, with an emphasis on countries that lead in sustainable energy practices and those making progress toward lower greenhouse gas emissions. The analysis is intended to provide insights for industry and researchers dedicated to promoting energy sustainability and economic growth.

Question 1

Is it possible to predict a nation’s power consumption from its population size, gross domestic product (GDP), and the percentage of electricity generated from renewable sources, and from how these change over the years?

Density Plot

The distribution is right-skewed: most values are low, and the frequency drops off quickly as values rise. The historical data represented in this histogram is what the model would be trained on to identify patterns, and what its predictions would be validated against by comparing predicted values with the actual values shown in the plot.
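The skew described for this plot can also be checked numerically. The following is a minimal sketch, assuming synthetic log-normal values stand in for the `primary_energy_consumption` column (the project data is not reproduced here). The helper `sample_skewness` is a hypothetical name introduced for illustration. The sketch computes the sample skewness and shows that a log transform reduces it.

```python
import numpy as np


def sample_skewness(values):
    """Return the (biased) sample skewness: the mean cubed z-score."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    std = values.std()
    return float(np.mean(centered ** 3) / std ** 3)


# Synthetic stand-in for a right-skewed consumption column (assumption).
rng = np.random.default_rng(0)
consumption = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

raw_skew = sample_skewness(consumption)
log_skew = sample_skewness(np.log1p(consumption))

print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness after log1p:   {log_skew:.2f}")
```

A clearly positive skewness agrees with the long right tail seen in a histogram. A transform such as `log1p` is one common option to consider before fitting a linear model, though whether it helps depends on the data.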

Scatter Plot

This plot shows the relationship between primary energy consumption and each of GDP, population size, and the proportion of power derived from renewable sources. If the spread and trend of the scatter points show persistent patterns, they suggest correlations between these factors and a country’s power consumption, which could in turn be used to anticipate energy usage.
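One way to put numbers on what a scatter plot suggests is a Pearson correlation between each factor and the target. The sketch below uses synthetic data with an assumed toy relationship (consumption grows with GDP plus noise). The column names mirror those used in the project, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors (assumption: not the real OWID values).
population = rng.uniform(1e6, 1e8, size=n)
gdp = population * rng.uniform(5_000, 20_000, size=n)
renewables_percentage = rng.uniform(0, 100, size=n)

# Assumed toy relationship: consumption scales with GDP, plus noise.
primary_energy_consumption = 1e-9 * gdp + rng.normal(0, 50, size=n)

df = pd.DataFrame({
    "population": population,
    "gdp": gdp,
    "renewables_percentage": renewables_percentage,
    "primary_energy_consumption": primary_energy_consumption,
})

# Pearson correlation of every predictor with the target column.
correlations = (
    df.corr(method="pearson")["primary_energy_consumption"]
    .drop("primary_energy_consumption")
)
print(correlations.sort_values(ascending=False))
```

Correlation only captures linear association, so a low value does not rule out a nonlinear relationship that is visible in the scatter plot.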

Time Series Analysis Plot

The graphic shows how population, GDP, the percentage of renewable electricity, and energy consumption have changed over time, indicating possible relationships between these variables and a country’s power consumption. For example, if GDP and population growth are accompanied by an upward trend in the primary energy consumption curve, this could indicate that energy demand is driven by economic activity and demographic growth. A rise in the proportion of power derived from renewable sources, on the other hand, may not translate into reduced energy usage, but it may suggest a change in the composition of energy sources. Periods in which the growth of energy consumption slows or deviates from the trends in GDP and population may be linked to advances in energy efficiency or structural adjustments in the economy. Where the data show a strong historical association between these variables, that association can be used to develop a predictive model for forecasting future power consumption.
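The regression and evaluation step named in the abstract (linear regression scored with R-squared and RMSE) can be sketched as follows. This is a minimal illustration on synthetic data, not the project's fitted model. The feature scales and the coefficients used to generate `y` are assumptions made only so the example runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic features (assumption): population, GDP, renewables share.
X = np.column_stack([
    rng.uniform(1, 100, n),     # population in millions
    rng.uniform(10, 2000, n),   # GDP in billions
    rng.uniform(0, 100, n),     # renewable electricity share in percent
])

# Assumed linear relationship plus Gaussian noise (std 10).
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] - 0.1 * X[:, 2] + rng.normal(0, 10, n)

# Hold out 20% of the rows so the metrics reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print(f"R-squared: {r2:.3f}")
print(f"RMSE:      {rmse:.2f}")
```

R-squared reports the fraction of variance explained, while RMSE is in the units of the target. Reading both together guards against a model that explains the trend well but still misses by a large absolute amount.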

Question 2

What countries or regions are engaging in sustainable energy practices and relying more on renewable energy compared to nonrenewable energy? Which countries are moving towards the trajectory of relying more on renewable energy and producing less greenhouse gas emissions?

The animated map shows the share of renewable energy consumption over two decades, from 2000 to 2022. Playing the animation reveals the countries with the highest share of renewable consumption, such as China, India, Brazil, and the United States. Over the two decades, the set of countries with a high renewable share has not changed much.

The second plot explores the countries with the highest mean renewable energy share. China has the highest share at 22.75%, followed by India at 14.23%, Brazil at 9%, and the United States at 3.28%. Over the period, the top countries remain the same, and only the top country’s share increases, indicating that it is moving further toward renewable practices.
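Ranking countries by mean renewable share is a group-and-aggregate operation. The sketch below uses a toy table with placeholder countries `A`, `B`, and `C`. It defines the share as renewable consumption divided by primary energy consumption, which is an assumption made for illustration and may differ from the definition used in the project code.

```python
import pandas as pd

# Toy data with placeholder countries (assumption: illustrative values only).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2001, 2000, 2001, 2000, 2001],
    "renewables_consumption": [10.0, 12.0, 5.0, 5.0, 1.0, 3.0],
    "primary_energy_consumption": [100.0, 100.0, 20.0, 25.0, 50.0, 50.0],
})

# Share of renewables in total consumption, in percent (assumed definition).
toy["renewable_share_pct"] = (
    100 * toy["renewables_consumption"] / toy["primary_energy_consumption"]
)

# Mean share per country over all years, then the two highest.
mean_share = toy.groupby("country")["renewable_share_pct"].mean()
top_two = mean_share.nlargest(2)
print(top_two)
```

Because numerator and denominator share the same unit under this definition, the result is a true percentage bounded by 0 and 100.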

This plot shows the mean greenhouse gas emissions over two decades. China, despite its higher renewable energy consumption and share, emits the most greenhouse gases. The same pattern is seen for the United States and India.

This visualization shows the consumption of wind, solar, and hydro energy over the years. The line plot shows that hydroelectric power has always led in terms of energy consumed, and its consumption appears to have climbed steadily. Wind and solar consumption appear to have taken off in the early 2000s and have continued to climb steadily since, with more wind energy consumed than solar energy.

Data Wrangling for Density Plot

Density plots for renewable and non-renewable energy for the continents

Ratio of Non-renewable Energy Consumption to Renewable Energy Consumption Plot

Renewables Consumption Plot

Visualizing the density plots for renewable consumption

The density plots for renewable energy consumption show its distribution over the years from 1965 to 2022, across the seven continents, and how that distribution changes with time. The y-axis shows that relatively little renewable energy is consumed, especially compared with the density plot for non-renewable energy, which shows much larger amounts. In recent years the world has focused on shifting from non-renewable to renewable energy as a sustainable energy practice, and the animated density plot shows renewable consumption growing over the years. The continents that have increased their renewable consumption the most are Europe, Asia, and North America; these three have increased their consumption more than the others. Renewable consumption rose from the hundreds to the thousands per year. This is a good sign that most of these continents are beginning to exploit renewable energy, as recommended under sustainable energy practices.

Non-renewables consumption Plot

The visualization for the non-renewable consumption of energy

The consumption of non-renewable energy has also increased across the continents over the years. This could be attributed to population growth, among other factors. Sustainable energy practices recommend relying more on renewable than on non-renewable sources, so that future generations will still have some non-renewable sources left to use. The continents that saw major increases in non-renewable consumption over the years are the same ones that were the major consumers of renewable energy: Europe, Asia, and North America.

The results of this analysis reveal that the increase in renewable energy consumption is matched by an increase in non-renewable energy consumption. Over time, we hope that the consumption of renewable energy will surpass that of non-renewable energy, as recommended under sustainable energy practices.

Cluster

Here, patterns in the data are identified and used to form clusters based on the consumption of renewable and non-renewable energy. The goal is to predict which class a country or regional boundary belongs to in relation to its consumption of renewable and non-renewable energy, and so to determine which countries or regional boundaries are moving their energy consumption toward renewable energy, as sustainable energy practices require.

The required packages for the cluster analysis are imported.

Missing values and duplicates in the data are handled appropriately. For this analysis, the numerical features in the data set are also standardized and the categorical variables encoded.

Selecting the Features for clustering

The necessary features required for developing the clusters in the data set for renewable and non-renewable energy consumption are selected.

The data set is scaled and encoded so that the clustering results are not distorted by differences in feature scale.
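The preparation described here can be sketched with scikit-learn's `StandardScaler` and `LabelEncoder`. This is a minimal illustration on a toy table. The column `non_renewables_consumption` follows the naming used in the project's wrangling code, while the values and the `country_code` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data (assumption: placeholder countries and values).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A"],
    "renewables_consumption": [10.0, 200.0, 35.0, 12.0],
    "non_renewables_consumption": [500.0, 900.0, 80.0, 520.0],
})

# Encode the categorical column as integer codes (sorted label order).
toy["country_code"] = LabelEncoder().fit_transform(toy["country"])

# Standardize numeric features to zero mean and unit variance.
feature_cols = ["renewables_consumption", "non_renewables_consumption"]
scaled = StandardScaler().fit_transform(toy[feature_cols])

print(toy[["country", "country_code"]])
print(np.round(scaled, 3))
```

Standardizing matters for distance-based methods such as K-means and DBSCAN, because a feature measured in larger numbers would otherwise dominate the distance calculation.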

K-means clustering is performed on the selected features to divide the data set into clusters. This method is chosen for its simplicity, speed, and efficiency.

The Calinski-Harabasz method is used to determine the optimal number of clusters for this analysis.

A visualization is provided below to help determine what the appropriate number of clusters should be based on this method.

According to this visualization, 8 is the optimal number of clusters for this analysis. K-means clustering is then performed with this number of clusters, as selected by the Calinski-Harabasz method.
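Selecting the number of clusters with the Calinski-Harabasz index amounts to fitting K-means for several candidate values of k and keeping the one with the highest score. The sketch below assumes synthetic data with three well-separated groups, so it recovers 3 rather than the 8 found on the project data. The candidate range 2 to 6 is also an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data with three well-separated groups (assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=0,
)

# Fit K-means for each candidate k and record the Calinski-Harabasz index.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# A higher index means denser, better separated clusters.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: {score:.1f}")
print("best k:", best_k)
```

Plotting `scores` against k gives the kind of selection chart referred to in this section, with the peak marking the chosen number of clusters.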

The results from the k means cluster algorithm are visualized below.

DBSCAN is then performed to refine the clusters and to ensure that they are well separated from outliers.

This is done by first calculating the distance to the nearest neighbour and then visualizing the plot.

The minimum number of samples is chosen, and DBSCAN is performed to determine whether there are any outliers and to refine the clusters for greater accuracy.

The number of outliers is also calculated after DBSCAN is performed.

Number of outliers detected: 1
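The nearest-neighbour distance step and the outlier count can be sketched as below. The data is synthetic (two tight groups plus one distant point), and the values `eps=1.0` and `min_samples=5` are assumptions chosen for this toy data, not the parameters used in the project. Note that scikit-learn's `DBSCAN` assigns the label -1 to noise points; any relabeling for display is a separate step.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Synthetic data: two tight groups and one far-away point (assumption).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
outlier = np.array([[20.0, -20.0]])
X = np.vstack([blob_a, blob_b, outlier])

min_samples = 5  # assumed value for this toy example

# k-distance curve: querying the training set returns each point as its
# own first neighbour, so the last column is the distance to the
# (min_samples - 1)-th other point.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = neighbors.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# eps is normally read off the "elbow" of the sorted k-distance curve;
# 1.0 is an assumed value that suits this synthetic data.
labels = DBSCAN(eps=1.0, min_samples=min_samples).fit_predict(X)

n_outliers = int(np.sum(labels == -1))  # -1 marks noise in scikit-learn
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("Number of clusters:", n_clusters)
print("Number of outliers detected:", n_outliers)
```

The isolated point shows up as the sharp jump at the end of `k_distances`, which is what the nearest-neighbour plot is used to spot.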

The silhouette score is computed as a form of cluster validation for this analysis.

Silhouette Score: 0.96

The silhouette score of 0.96 implies that the clustering is good: the data points are well clustered and are closer to the center of their own cluster than to other clusters. This indicates clear separation between clusters and good cohesion within them.
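For reference, a silhouette score can be computed from any set of cluster labels with scikit-learn's `silhouette_score`. The sketch below uses synthetic, well-separated data and K-means labels. Both are assumptions for illustration, so the printed value will not match the 0.96 reported for the project data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with two well-separated groups (assumption).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The score ranges from -1 to 1; values near 1 mean points sit much
# closer to their own cluster than to the nearest other cluster.
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.2f}")
```

When the labels come from DBSCAN, it is worth deciding explicitly whether noise points are included, because they are otherwise treated as one more cluster in the score.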

The analysis concludes with two clusters that indicate whether a regional boundary consumes more renewable or more non-renewable energy. A regional boundary assigned to cluster 1 consumes more renewable than non-renewable energy, and one assigned to cluster -1 consumes more non-renewable than renewable energy. The DBSCAN visualization shows that more regional boundaries currently consume more non-renewable than renewable energy. The number 0 in the visualization represents the outlier.

Repo Organization

The project repository comprises the following folders and files:

.github/: This directory is designated for files associated with GitHub, encompassing workflows, actions, and templates tailored for issues.
_extra/: Reserved for miscellaneous files that don’t neatly fit into other project categories, providing a catch-all space for various supplementary documents.
_freeze/: Within this directory lie frozen environment files containing comprehensive information regarding the project’s environment configuration and dependencies.
data/: Specifically allocated for storing data files crucial for the project’s functionality, encompassing input files, datasets, and other essential data resources.
images/: Serving as a repository for visual assets employed throughout the project, including diagrams, charts, and screenshots, this directory maintains visual elements integral to project documentation and presentation.
.gitignore: This file functions to specify exclusions from version control, ensuring that designated files and directories remain untracked by Git, thus streamlining the versioning process.
README.md: Serving as the primary hub of project information, this README document furnishes essential details encompassing project setup, usage instructions, and an overarching overview of project objectives and scope.
_quarto.yml: Acting as a pivotal configuration file for Quarto, this document encapsulates various settings and options governing the construction and rendering of Quarto documents, facilitating customization and control over document output.
about.qmd: This Quarto Markdown file supplements project documentation by providing additional contextual information, elucidating project purpose, contributor insights, and other pertinent project details.
index.qmd: This serves as the main documentation page for our project. This Quarto Markdown file provides detailed descriptions of our project, including all code and visualizations.

---
title: "Global Energy Trends"
subtitle: "INFO 523 - Project Final"
author:
  - name: "data detectives - Ayesha, Abhishek, Shreemithra, Toluwanimi, Valerie, Alyssa"
    affiliations:
      - name: "School of Information, University of Arizona"
description: "The goal of this project is to analyze the complex relationships between economic and population growth, sustainable energy practices, and energy consumption"
format:
  html:
    code-tools: true
    code-overflow: wrap
    embed-resources: true
editor: visual
execute:
  warning: false
  echo: false
jupyter: python3
---

# Abstract

This project uses the comprehensive energy dataset from Our World in Data, spanning 1900 to 2022, to examine global energy consumption trends in relation to economic growth, population dynamics, and the adoption of sustainable energy practices. The primary goal is to design a predictive dashboard that models a nation’s energy consumption from essential factors such as population size, GDP, and the proportion of electricity derived from renewable sources. The analysis uses a range of statistical and machine learning techniques, including time series decomposition and linear regression on the key predictors. We evaluate the regression models using R-squared and Root Mean Squared Error (RMSE) to gauge their explanatory power and accuracy. This evaluation matters for making predictions reliable enough to inform energy policy formulation and planning. The project also analyzes trends in renewable energy use at the regional and national levels, with an emphasis on countries that lead in sustainable energy practices and those making progress toward lower greenhouse gas emissions. The analysis is intended to provide insights for industry and researchers dedicated to promoting energy sustainability and economic growth.

## Question 1 Is it possible to predict a nation's power consumption by considering its population size, gross domestic product (GDP), and the percentage of electricity generated from renewable sources and changes across the years? ```{python} #| label: libraries #| echo: false #| warning: false # cell library import pandas as pd import matplotlib.pyplot as plt from matplotlib.animation import FuncAnimation from matplotlib.ticker import FuncFormatter from matplotlib.patches import Ellipse from IPython.display import HTML import seaborn as sns import statsmodels.api as sm from statsmodels.tsa.seasonal import seasonal_decompose from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error import matplotlib.dates as mdates import geopandas as gpd import plotly.graph_objects as go import plotly.express as px from sklearn.preprocessing import MinMaxScaler import scipy.stats as stats import matplotlib.pyplot as plt import seaborn as sns from sklearn.preprocessing import LabelEncoder from sklearn.metrics import calinski_harabasz_score from sklearn.cluster import KMeans from scipy.cluster import hierarchy from scipy.spatial.distance import pdist from sklearn.neighbors import NearestNeighbors from sklearn.cluster import DBSCAN from sklearn.mixture import GaussianMixture from sklearn.metrics import silhouette_score from sklearn.metrics import adjusted_rand_score ``` ```{python} #| label: question #| echo: false # Load the dataset data = pd.read_csv('data/owid-energy-data.csv') #columns to keep keep = (['year', 'population', 'gdp', 'electricity_generation', 'primary_energy_consumption', 'renewables_electricity']) #data for q1 q1_data = data[keep] # Drop rows with any empty values q1_data_cleaned = q1_data.dropna() # Save the cleaned dataset to a new CSV file q1_data_cleaned.to_csv('data/q1_energy_data_cleaned.csv', index = False) ``` ```{python} #| label: question1 #| echo: false # Load 
the clean dataset data = pd.read_csv('data/q1_energy_data_cleaned.csv') # Calculate the percentage of electricity generated from renewable sources data['renewables_percentage'] = (data['renewables_electricity'] / data['electricity_generation']) * 100 columns_to_normalize = ['electricity_generation', 'primary_energy_consumption', 'renewables_electricity'] for column in columns_to_normalize: data[column] = (data[column] - data[column].min()) / (data[column].max() - data[column].min()) # Save the updated dataset to a new CSV file data.to_csv('data/q1_energy_data_processed.csv', index = False) ``` ### Density Plot ```{python} #| label: question2 #| echo: false # Load the processed dataset data = pd.read_csv('data/q1_energy_data_processed.csv') plt.figure(figsize=(10, 8)) sns.histplot(data['primary_energy_consumption'], kde = True) plt.title('Distribution of Target Variable "primary_energy_consumption"') plt.xlabel('Primary Energy Consumption') plt.ylabel('Frequency') plt.xlim(0, 0.065) plt.ylim(0, 90) plt.show() ``` - A right-skewed distribution is shown by the distribution's shape, which shows that most values are low and that frequency drops off quickly as values rise. In order to train the model to identify patterns and to validate its correctness by contrasting the projected values with the actual data shown in the plot, the historical data represented in this histogram would be crucial. 
### Scatter Plot ```{python} #| label: question3 #| echo: false # Create a dictionary that maps feature names to desired labels feature_labels = { 'population': 'Population (in millions)', 'gdp': 'GDP (in millions)', 'renewables_electricity': 'Renewable Electricity' } # Set up the plot area to have 1 row and 3 columns fig, axes = plt.subplots(1, 3, figsize=(12, 6)) for i, column in enumerate(feature_labels): if column in ['population', 'gdp']: sns.scatterplot(ax = axes[i], x = data[column] / 1e6, y = 'primary_energy_consumption', data = data, color = 'green') axes[i].set_xlabel(feature_labels[column]) else: sns.scatterplot(ax = axes[i], x = column, y = 'primary_energy_consumption', data = data, color = 'green') axes[i].set_xlabel(feature_labels[column]) # Remove the y-axis label for individual plots and y-tick labels for a cleaner look axes[i].set_ylabel('') axes[i].set_yticklabels([]) # Adding a common y-axis label on the left side of the subplots fig.text(0.00, 0.5, 'Primary Energy Consumption', va = 'center', rotation = 'vertical', fontsize = 9) # Adjust layout and add a common title plt.tight_layout(pad = 3.0, w_pad = 2.5, h_pad = 2.0) fig.suptitle('Scatter plot of Primary Energy Consumption vs Other Factors', fontsize = 12) plt.show() ``` - The relationship between primary energy consumption and the GDP, population size, and the proportion of power derived from renewable sources is represented graphically in this plot. If persistent patterns are seen over time, one might utilize the spread and trends of the scatter points to deduce that there might be correlations between these parameters and a country's power consumption, which could be used to anticipate energy usage. 
### Time Series Analysis Plot ```{python} #| label: question4 #| echo: false #| warning: false # Load and preprocess the data data = pd.read_csv('data/q1_energy_data_processed.csv') data['year'] = pd.to_datetime(data['year'], format = '%Y') data.set_index('year', inplace = True) # Create the plot with customized settings fig, ax1 = plt.subplots(figsize = (12, 6)) # Plotting with different line styles and markers ax1.plot(data.index, data['population'], label = 'Population', color = 'blue', linestyle = '-', marker = 'o', linewidth = 1) ax1.plot(data.index, data['gdp'], label='GDP', color='red', linestyle='--', marker='x', linewidth=1) ax2 = ax1.twinx() ax2.plot(data.index, data['renewables_electricity'], label = 'Renewables Electricity', color = 'green', linestyle = '-.', marker = '^', linewidth = 1, alpha = 0.7) ax2.plot(data.index, data['primary_energy_consumption'], label = 'Primary Energy Consumption', color = 'purple', linestyle = ':', marker = 's', linewidth = 1, alpha = 0.7) # Configure the x-axis with date formatting ax1.xaxis.set_major_locator(mdates.YearLocator(3)) ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y')) # Labeling and titles ax1.set_xlabel('Year') ax1.set_ylabel('Population and GDP') ax2.set_ylabel('Renewables and Energy Consumption') ax1.set_title('Time Series of Population, GDP, Renewable Electricity, and Energy Consumption') # Combine legends from both axes lines, labels = ax1.get_legend_handles_labels() lines2, labels2 = ax2.get_legend_handles_labels() ax1.legend(lines + lines2, labels + labels2, loc='upper left') plt.xticks(rotation = 45) # Adjust layout plt.tight_layout(pad = 5.0, w_pad = 1.5, h_pad = 1.0) plt.show() ``` - Indicating possible relationships between these variables and a country's power consumption, the graphic shows how the population, GDP, percentage of renewable electricity, and energy consumption have changed over time. 
For example, if GDP and population growth are accompanied by an upward trend in the primary energy consumption curve, this could indicate that energy demand is driven by economic activity and demographic growth. A rise in the proportion of power derived from renewable sources, on the other hand, may not translate into reduced energy usage, but it may suggest a change in the composition of energy sources. Periods in which the growth of energy consumption slows or deviates from the trends in GDP and population may be linked to advances in energy efficiency or structural adjustments in the economy. Where the data show a strong historical association between these variables, that association can be used to develop a predictive model for forecasting future power consumption.

# Question 2

What countries or regions are engaging in sustainable energy practices and relying more on renewable energy compared to nonrenewable energy? Which countries are moving towards the trajectory of relying more on renewable energy and producing less greenhouse gas emissions?

```{python} #| label: MAP1 #| echo: false #| warning: false # Loading data data = pd.read_csv("data/owid-energy-data.csv") world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) # Merging world shapefile with energy data world = world.merge(data, how='left', left_on='iso_a3', right_on='iso_code') # Droping NaN values from columns used in calculations columns_to_drop_na = ['solar_consumption', 'wind_consumption', 'hydro_consumption', 'other_renewable_consumption', 'energy_per_capita'] # Filtering data from 2000 to 2023 world = world[(world['year'] >= 2000) & (world['year'] <= 2023)] # Grouping by year and country, summing renewable energy consumption and energy per capita world = world.groupby(['iso_a3', 'year', 'name']).agg({ 'solar_consumption': 'sum', 'wind_consumption': 'sum', 'hydro_consumption': 'sum', 'other_renewable_consumption': 'sum', 'energy_per_capita': 'mean' }).reset_index() # Calculating total renewable energy consumption world['total_renewable_consumption'] = world['solar_consumption'] + world['wind_consumption'] + world['hydro_consumption'] + world['other_renewable_consumption'] # Calculate renewable energy share world['renewable_energy_share'] = (world['total_renewable_consumption'] / world['energy_per_capita']) * 100 # Setting year column as date world['year'] = pd.to_datetime(world['year'], format='%Y') # Plotting the animated map fig = px.choropleth(world, locations='iso_a3', color='renewable_energy_share', hover_name='name', hover_data={'iso_a3': False, 'renewable_energy_share': True}, animation_frame=world['year'].dt.year, range_color=(0, 7), projection='natural earth', color_continuous_scale=px.colors.sequential.Plasma, title='Share of Renewable Energy Consumption (%)') # Setting x-axis format to display only the year fig.update_xaxes(dtick='M1', tickformat='%Y') fig.show() ``` - The animated map shows the share of renewable energy consumption over two decades, from 2000 to 2022. 
Playing the animation reveals the countries that have the highest share of renewable consumption, such as China, India, Brazil, and the United States. Over the two decades, there hasn’t been much change in countries that practice renewable energy ```{python} #| label: BAR1 #| echo: false #| warning: false world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) # Merging world shapefile with energy data world = world.merge(data, how='left', left_on='iso_a3', right_on='iso_code') # Droping NaN values from columns used in calculations columns_to_drop_na = ['solar_consumption', 'wind_consumption', 'hydro_consumption', 'other_renewable_consumption', 'energy_per_capita'] # Filtering data from 2000 to 2023 world = world[(world['year'] >= 2000) & (world['year'] <= 2023)] # Grouping by year and country, summing renewable energy consumption and energy per capita world = world.groupby(['iso_a3', 'year', 'name']).agg({ 'solar_consumption': 'sum', 'wind_consumption': 'sum', 'hydro_consumption': 'sum', 'other_renewable_consumption': 'sum', 'energy_per_capita': 'mean' }).reset_index() # Calculating total renewable energy consumption world['total_renewable_consumption'] = world['solar_consumption'] + world['wind_consumption'] + world['hydro_consumption'] + world['other_renewable_consumption'] # Calculating renewable energy share world['renewable_energy_share'] = (world['total_renewable_consumption'] / world['energy_per_capita']) * 100 # Making year column as date world['year'] = pd.to_datetime(world['year'], format='%Y') # Calculating mean renewable energy share by country country_stats = world.groupby('name')['renewable_energy_share'].mean().reset_index() # Identifying top 10 countries with the highest mean renewable energy share top_10_countries = country_stats.nlargest(10, 'renewable_energy_share')['name'].tolist() # Filtering data for top 10 countries top_10_world = world[world['name'].isin(top_10_countries)] # Sorting data by renewable energy share top_10_world = 
top_10_world.sort_values(by='renewable_energy_share', ascending=False) # Plotting the animated bar plot with Plotly Graph Objects fig = go.Figure() # Defining custom color scale from light green to dark green custom_color_scale = [ [0, 'rgba(0, 255, 0, 0.5)'], [0.2, 'rgba(0, 255, 0, 0.6)'], [0.4, 'rgba(0, 255, 0, 0.7)'], [0.6, 'rgba(0, 255, 0, 0.8)'], [0.8, 'rgba(0, 255, 0, 0.9)'], [1, 'rgba(0, 255, 0, 1.0)'] ] # Creating bar traces for each year for year, df in top_10_world.groupby('year'): fig.add_trace(go.Bar( x=df['name'], y=df['renewable_energy_share'], name=str(year.year), hoverinfo='x+y', hovertemplate='%{x} Renewable Energy Share: %{y:.2f}%<extra></extra>', visible=False if year != top_10_world['year'].min() else True, marker=dict( color=df['renewable_energy_share'], colorscale=custom_color_scale, colorbar=dict(title='Renewable Energy Share (%)') ) )) # Adding play button and slider fig.update_layout( updatemenus=[dict( type="buttons", buttons=[dict(label="Play", method="animate", args=[None, {"frame": {"duration": 500, "redraw": True}, "fromcurrent": True, "transition": {"duration": 300, "easing": "quadratic-in-out"}}] )] )], title='Top 10 Countries with Highest Mean Renewable Energy Share', xaxis=dict(title='Country'), yaxis=dict(title='Mean Renewable Energy Share (%)', range=[0, 20]) ) # Setting initial layout fig.update_layout(showlegend=False) # Creating frames for each year frames = [go.Frame( data=[go.Bar( x=df['name'], y=df['renewable_energy_share'], name=str(year.year), hoverinfo='x+y', hovertemplate='%{x} Renewable Energy Share: %{y:.2f}%<extra></extra>', marker=dict( color=df['renewable_energy_share'], colorscale=custom_color_scale, colorbar=dict(title='Renewable Energy Share (%)') ) )], name=str(year.year) ) for year, df in top_10_world.groupby('year')] # Adding frames to the figure fig.frames = frames fig.show() ``` -The second plot explores the country with the highest mean renewable energy share. 
China has the highest renewable share of 22.75%, followed by India with 14.23%, Brazil and the United States with 9% and 3.28% respectively. Over the decade, the top countries remain the same, with only the top country’s percentage share increasing, indicating that they are working more towards renewable practices ```{python} #| label: BAR2 #| echo: false #| warning: false # Read world shapefile world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) # Merging world shapefile with energy data world = world.merge(data, how='left', left_on='iso_a3', right_on='iso_code') # Dropping NaN values from columns used in calculations columns_to_drop_na = ['solar_consumption', 'wind_consumption', 'hydro_consumption', 'other_renewable_consumption', 'energy_per_capita'] world = world.dropna(subset=columns_to_drop_na) # Filtering data from 2000 to 2023 world = world[(world['year'] >= 2000) & (world['year'] <= 2023)] # Grouping by year and country, summing renewable energy consumption and energy per capita world = world.groupby(['iso_a3', 'year', 'name']).agg({ 'solar_consumption': 'sum', 'wind_consumption': 'sum', 'hydro_consumption': 'sum', 'other_renewable_consumption': 'sum', 'energy_per_capita': 'mean', 'greenhouse_gas_emissions': 'mean' }).reset_index() # Calculating total renewable energy consumption world['total_renewable_consumption'] = world['solar_consumption'] + world['wind_consumption'] + world['hydro_consumption'] + world['other_renewable_consumption'] # Calculating renewable energy share world['renewable_energy_share'] = (world['total_renewable_consumption'] / world['energy_per_capita']) * 100 # Setting year column as date world['year'] = pd.to_datetime(world['year'], format='%Y') # Calculating mean renewable energy share and greenhouse gas emissions by country country_stats = world.groupby('name').agg({ 'renewable_energy_share': 'mean', 'greenhouse_gas_emissions': 'mean' }).reset_index() # Sorting countries by greenhouse gas emissions and selecting top 
10 top_emissions_countries = country_stats.nlargest(10, 'greenhouse_gas_emissions') # Creating vertical bar plot with color gradient for greenhouse gas emissions fig_high_emissions = px.bar(top_emissions_countries, x='name', y='greenhouse_gas_emissions', color='greenhouse_gas_emissions', labels={'greenhouse_gas_emissions': 'Mean Greenhouse Gas (megatonnes of CO₂ equivalents)'}, title='Top 10 Countries with Highest Mean Greenhouse Gas Emissions', orientation='v', color_continuous_scale='Emrld') # Updating x-axis label for emissions plot fig_high_emissions.update_xaxes(title='Country') # Updating color bar title fig_high_emissions.update_coloraxes(colorbar_title='Mean Greenhouse Gas (megatonnes of CO₂ equivalents)') # Showing plot for most greenhouse gas emissions fig_high_emissions.show() ``` - This plot shows the mean greenhouse gas emissions over two decades. Here, China, despite practicing more renewable energy consumption and share, emits the most greenhouse gases. The same trend is seen for the United States and India ```{python} # Create renewable energy dataset as a subset of the data wind_data = data[['country', 'year', 'wind_consumption']] solar_data = data[['country', 'year', 'solar_consumption']] hydro_data = data[['country', 'year', 'hydro_consumption']] # remove all missing values wind_data_clean = wind_data.dropna() solar_data_clean = solar_data.dropna() hydro_data_clean = hydro_data.dropna() # Group the data by year and sum the wind consumption for each year across all countries grouped_wind_data = wind_data_clean.groupby('year')['wind_consumption'].sum() grouped_solar_data = solar_data_clean.groupby('year')['solar_consumption'].sum() grouped_hydro_data = hydro_data_clean.groupby('year')['hydro_consumption'].sum() # Plot each energy type with appropriate labels and colors plt.figure(figsize=(12, 10)) plt.plot(grouped_wind_data.index, grouped_wind_data.values, marker='o', color='skyblue', label='Wind') plt.plot(grouped_solar_data.index, 
    grouped_solar_data.values, marker='o', color='goldenrod', label='Solar')
plt.plot(grouped_hydro_data.index, grouped_hydro_data.values, marker='o', color='seagreen', label='Hydro')
plt.xlabel('Year')
plt.ylabel('Energy Consumption (in terawatt-hours)')
plt.title('Global Renewable Energy Consumption')
plt.grid(True)
plt.legend()  # Add a legend to differentiate the lines
plt.tight_layout()  # Adjust the plot to ensure everything fits without overlap
plt.show()
```

This visualization shows the consumption of wind, solar, and hydro energy over the years. The line plot shows that hydroelectric power has consistently led in terms of energy consumed, and its consumption appears to have climbed steadily. Wind and solar energy consumption appear to have taken off in the early 2000s and have continued to climb steadily since, with more wind energy being consumed than solar energy.

### Data Wrangling for Density Plot

Density plots for renewable and non-renewable energy for the continents

```{python}
#| label: data_wrang_density
#| message: false
#| warning: false

## loading the data set
energy = pd.read_csv('data/owid-energy-data.csv')

## performing data wrangling operations on the data set to make it suitable for the analysis
## selecting the necessary columns for this analysis
energy2 = energy[['year', 'country', 'iso_code', 'biofuel_consumption', 'coal_consumption',
                  'fossil_fuel_consumption', 'gas_consumption', 'hydro_consumption',
                  'nuclear_consumption', 'oil_consumption', 'other_renewable_consumption',
                  'renewables_consumption', 'solar_consumption', 'wind_consumption']]

### removing the rows with only missing values in it with the exception of the year column
energy2_cleaned = energy2.dropna(subset = energy2.columns.difference(['year']), how = 'all')

### we remove the rows in the country column that are not countries but rather regions or continents
### loading the data set for the country
### code
country_codes = pd.read_csv('data/country_codes.csv')

# filtering the energy data set by removing regions which are not countries from the country column in the data set
## first we select the valid country codes
valid_country_codes = set(country_codes['Alpha-3 code'])

## filtering the country column in the data set
energy_filtered = energy2_cleaned[energy2_cleaned['iso_code'].isin(valid_country_codes)]

### removing all rows that have only NA values outside the iso_code, year and country columns
energy_wrangled = energy_filtered.dropna(subset = energy_filtered.columns.difference(['year', 'iso_code', 'country']), how = 'all')

### replacing the remaining NA values in the data set with 0
energy_wrangled = energy_wrangled.fillna(0)

# Select columns to check for 0 values
columns_to_check = energy_wrangled.columns.difference(['year', 'iso_code', 'country'])

# Keep rows with at least one non-zero value outside 'year', 'country', and 'iso_code'
energy_wrangled = energy_wrangled[(energy_wrangled[columns_to_check] != 0).any(axis = 1)]

### The next step is feature engineering where new columns in the data set are created
## note that there is a column for the renewable energy consumption total
### creating a column for the total consumption of non-renewable energy
# Caution: fossil_fuel_consumption in the OWID data is itself coal + gas + oil,
# so this sum counts those fuels twice.
energy_wrangled['non_renewables_consumption'] = (
    energy_wrangled['coal_consumption'] + energy_wrangled['fossil_fuel_consumption']
    + energy_wrangled['gas_consumption'] + energy_wrangled['nuclear_consumption']
    + energy_wrangled['oil_consumption']
)

### rounding the values in the new column to 3 decimal places
energy_wrangled['non_renewables_consumption'] = energy_wrangled['non_renewables_consumption'].round(3)

### a new column for the total consumption is created
energy_wrangled['total_consumption'] = energy_wrangled['non_renewables_consumption'] + energy_wrangled['renewables_consumption']

#### making sure that the total consumption is rounded to three decimal places
energy_wrangled['total_consumption'] = \
    energy_wrangled['total_consumption'].round(3)

# Ratio of non-renewable to renewable consumption (infinite where renewables_consumption is 0)
energy_wrangled['consumption_ratio'] = (energy_wrangled['non_renewables_consumption'] / energy_wrangled['renewables_consumption']).round(3)

### the next step is to feature engineer and group the data into continents
### generating a function that will group the countries into their respective continents
def group_countries_by_continent(iso_code):
    continents = {
        'Africa': ['DZA', 'AGO', 'BEN', 'BWA', 'BFA', 'BDI', 'CPV', 'CMR', 'CAF', 'TCD', 'COM', 'COG', 'COD', 'DJI', 'EGY', 'GNQ', 'ERI', 'SWZ', 'ETH', 'GAB', 'GMB', 'GHA', 'GIN', 'GNB', 'CIV', 'KEN', 'LSO', 'LBR', 'LBY', 'MDG', 'MWI', 'MLI', 'MRT', 'MUS', 'MAR', 'MOZ', 'NAM', 'NER', 'NGA', 'RWA', 'STP', 'SEN', 'SYC', 'SLE', 'SOM', 'ZAF', 'SSD', 'SDN', 'TZA', 'TGO', 'TUN', 'UGA', 'ZMB', 'ZWE'],
        'Asia': ['AFG', 'ARM', 'AZE', 'BHR', 'BGD', 'BTN', 'BRN', 'KHM', 'CHN', 'CYP', 'GEO', 'IND', 'IDN', 'IRN', 'IRQ', 'ISR', 'JPN', 'JOR', 'KAZ', 'KWT', 'KGZ', 'LAO', 'LBN', 'MYS', 'MDV', 'MNG', 'MMR', 'NPL', 'PRK', 'OMN', 'PAK', 'PSE', 'PHL', 'QAT', 'SAU', 'SGP', 'KOR', 'LKA', 'SYR', 'TWN', 'TJK', 'THA', 'TLS', 'TKM', 'ARE', 'UZB', 'VNM', 'YEM', 'TUR'],
        'Europe': ['ALB', 'AND', 'AUT', 'BLR', 'BEL', 'BIH', 'BGR', 'HRV', 'CZE', 'DNK', 'EST', 'FIN', 'FRA', 'DEU', 'GRC', 'HUN', 'ISL', 'IRL', 'ITA', 'XKX', 'LVA', 'LIE', 'LTU', 'LUX', 'MLT', 'MDA', 'MCO', 'MNE', 'NLD', 'MKD', 'NOR', 'POL', 'PRT', 'ROU', 'RUS', 'SMR', 'SRB', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'UKR', 'GBR', 'VAT'],
        'North America': ['ATG', 'BHS', 'BRB', 'BLZ', 'CAN', 'CRI', 'CUB', 'DMA', 'DOM', 'SLV', 'GRL', 'GRD', 'GTM', 'HTI', 'HND', 'JAM', 'MEX', 'NIC', 'PAN', 'PRI', 'KNA', 'LCA', 'VCT', 'TTO', 'USA'],
        'South America': ['ARG', 'BOL', 'BRA', 'CHL', 'COL', 'ECU', 'FLK', 'GUF', 'GUY', 'PRY', 'PER', 'SUR', 'URY', 'VEN'],
        'Antarctica': ['ATA'],
        'Australia': ['AUS', 'NZL', 'PNG', 'FJI', 'SLB', 'VUT']
    }
    for continent, countries in continents.items():
        if iso_code in countries:
            return continent
    return None  # Return None if ISO code not found in any continent

# Apply the function to create the 'continents' column in energy_wrangled DataFrame
energy_wrangled['continents'] = energy_wrangled['iso_code'].apply(group_countries_by_continent)

### removing hong kong since it is not a country from the data set
energy_wrangled = energy_wrangled[energy_wrangled['iso_code'] != 'HKG']
```

Ratio of Non-renewable Energy Consumption to Renewable Energy Consumption Plot

### Renewables Consumption Plot

Visualizing the density plots for renewable consumption

```{python}
#| label: density1_redone
#| output: false

# Extract unique years from the 'year' column, in chronological order
years = sorted(energy_wrangled['year'].unique())

# Function to format the x-axis labels with commas
def format_with_commas(value, pos):
    return "{:,}".format(int(value))

# Function to update the plot for the selected year
def update_plot(year):
    # Filter data for the selected year
    data_year = energy_wrangled[energy_wrangled['year'] == year]

    # Clear the previous plot
    plt.clf()

    # Plot the KDE plot for the selected year
    sns.kdeplot(data=data_year, x='renewables_consumption', hue='continents',
                multiple='stack', alpha=0.3, linewidth=0.2)

    # Set title
    plt.title(f'Density Plot of Renewables Consumption by Continent for Year {year}')

    # Set the x label
    plt.xlabel('Renewable Energy Consumption')

    # Set the x-axis tick formatter
    plt.gca().xaxis.set_major_formatter(FuncFormatter(format_with_commas))

# Initialize the plot with the first year
update_plot(years[0])

# Create a FuncAnimation object without controls (slider, play/pause button)
ani2 = FuncAnimation(plt.gcf(), update_plot, frames=years, interval=1000)

# Convert the animation to HTML
# format
html_animation2 = ani2.to_jshtml()
```

```{python}
#| label: hmtl_animation2
HTML(html_animation2)
```

The density plots for renewable energy show the distribution of renewable energy consumption from 1965 to 2022. They show the distribution for the seven continents and how it changes with time. From the y axis, we notice that relatively little renewable energy is consumed, especially in relation to the density plot of non-renewable energy consumption, which shows much larger amounts of energy being consumed. The world has recently been focusing on a shift from non-renewable to renewable energy as a form of sustainable energy practice. From the animated density plot, we can see that more and more renewable energy is being consumed as time progresses. The major continents that have increased their consumption of renewable energy over time are Europe, Asia, and North America. These three continents have increased their consumption of renewable energy more than the others. Consumption of renewable energy increased from hundreds to thousands of terawatt-hours per year. This is a good sign that most of these continents are beginning to exploit renewable energy as recommended under sustainable energy practices.
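
One way to check this reading of the animation numerically is to total renewable consumption per continent and year and compare the first and last years. The sketch below is only an illustration: the `toy` table and its values are made up (they are not taken from the OWID data), and it merely reuses the column names of `energy_wrangled`.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns of energy_wrangled
toy = pd.DataFrame({
    "year": [1965, 1965, 2022, 2022, 2022],
    "continents": ["Europe", "Asia", "Europe", "Asia", "Asia"],
    "renewables_consumption": [100.0, 50.0, 900.0, 1200.0, 300.0],
})

# Total renewable consumption per year (rows) and continent (columns)
totals = (
    toy.groupby(["year", "continents"])["renewables_consumption"]
    .sum()
    .unstack("continents")
)

# Change between the first and the last year for each continent
change = totals.loc[totals.index.max()] - totals.loc[totals.index.min()]

print(totals)
print(change)
```

Applied to the real data, the same two steps would give one number per continent that can be set against what the animation suggests.
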

### Non-renewables consumption Plot

The visualization for the non-renewable consumption of energy

```{python}
#| label: density2_redone
#| output: false

# Extract unique years from the 'year' column, in chronological order
years = sorted(energy_wrangled['year'].unique())

# Function to format the x-axis labels with commas
def format_with_commas(value, pos):
    return "{:,}".format(int(value))

# Function to update the plot for the selected year
def update_plot(year):
    # Filter data for the selected year
    data_year = energy_wrangled[energy_wrangled['year'] == year]

    # Clear the previous plot
    plt.clf()

    # Plot the KDE plot for the selected year
    sns.kdeplot(data=data_year, x='non_renewables_consumption', hue='continents',
                multiple='stack', alpha=0.3, linewidth=0.2)

    # Set title
    plt.title(f'Density Plot of Non-Renewables Consumption by Continent for Year {year}')

    # Set the x label
    plt.xlabel('Non-Renewable Energy Consumption')

    # Set the x-axis tick formatter
    plt.gca().xaxis.set_major_formatter(FuncFormatter(format_with_commas))

# Initialize the plot with the first year
update_plot(years[0])

# Create a FuncAnimation object without controls (slider, play/pause button)
ani = FuncAnimation(plt.gcf(), update_plot, frames=years, interval=1000)

# Convert the animation to HTML format (displayed in the next chunk)
html_animation = ani.to_jshtml()
```

```{python}
#| label: html_animation
HTML(html_animation)
```

There has also been an increase in the consumption of non-renewable energy across the continents over the years. This could be attributed to population growth, among other factors. Under sustainable energy practices, it is recommended to rely more on renewable than on non-renewable energy sources, so that future generations will also have some non-renewable energy sources left to use.
The continents that saw major increases in the use of non-renewable energy over the years are the very same continents that were the major consumers of renewable energy, that is, Europe, Asia, and North America. The results from this analysis reveal that the increase in renewable energy consumption is matched by an increase in the consumption of non-renewable energy. Over time, we hope that the consumption of renewable energy will surpass that of non-renewable energy, as recommended under sustainable energy practices.

## Cluster

Here, patterns in the data are identified to form clusters based on the consumption of renewable and non-renewable energy. The goal is to predict which class a country or regional boundary belongs to in relation to its consumption of renewable and non-renewable energy, and so determine which of these countries or regional boundaries are moving their energy consumption in the direction of renewable energy, as sustainable energy practices require. Clusters are created to categorize these regional boundaries by how much renewable and non-renewable energy they consume.

The required packages for the cluster analysis are imported.

```{python}
#| label: setup
import pandas as pd
import numpy as np
```

The data is checked for missing values and duplicates. For this analysis, the numerical features in the data set are also scaled to a common range and the categorical variables are encoded.
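
As a minimal illustration of these two preprocessing steps, the sketch below rescales a numeric column to the range 0 to 1 and replaces text categories with integer codes. It is a hand-rolled stand-in written for clarity, not the project's code: the helper names `min_max_scale` and `label_encode` and the toy values are hypothetical.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to the [0, 1] range: (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)  # constant column: map every value to 0
    return (values - values.min()) / span

def label_encode(labels):
    """Map each distinct label to an integer code, assigned in sorted label order."""
    classes = sorted(set(labels))
    lookup = {label: code for code, label in enumerate(classes)}
    return [lookup[label] for label in labels], classes

# Hypothetical toy inputs
scaled = min_max_scale([10.0, 20.0, 40.0])
codes, classes = label_encode(["Kenya", "Brazil", "Kenya", "India"])

print(scaled)
print(codes, classes)
```

Integer codes impose an arbitrary ordering on the categories, so a distance-based method such as k-means will treat countries with nearby codes as similar. That is worth keeping in mind whenever encoded identifiers are passed to a clustering step as features.
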

```{python}
#| label: wrangling_clusters
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

energy = pd.read_csv('data/owid-energy-data.csv')
energy_copy = energy.copy()

### to handle missing values we first replace all NA values with the number 0
energy_copy = energy_copy.fillna(0)

### now we remove all rows with only 0 values, ignoring the key columns
# Select columns to check for 0 values
columns_to_check = energy_copy.columns.difference(['year', 'iso_code', 'country', 'population', 'gdp'])

# Keep rows with at least one non-zero value outside 'year', 'iso_code', 'country', 'population', and 'gdp'
energy_copy = energy_copy[(energy_copy[columns_to_check] != 0).any(axis = 1)]

# Check for duplicate rows
duplicate_rows = energy_copy[energy_copy.duplicated()]

### creating new columns in the energy dataset for the renewable and non renewable energy consumption
### creating a column for the total consumption of non-renewable energy
energy_copy['non_renewables_consumption'] = (
    energy_copy['coal_consumption'] + energy_copy['fossil_fuel_consumption']
    + energy_copy['gas_consumption'] + energy_copy['nuclear_consumption']
    + energy_copy['oil_consumption']
)

### a new column for the total consumption is created
energy_copy['total_consumption'] = energy_copy['non_renewables_consumption'] + energy_copy['renewables_consumption']

##### selecting the numerical columns in the data set
# Note: the columns are taken from `energy`, so the two columns created above are not scaled
numerical_cols = energy.select_dtypes(include = 'number').columns

### scaling the numerical columns in the data set
minmax_scaler = MinMaxScaler()
energy_copy[numerical_cols] = minmax_scaler.fit_transform(energy_copy[numerical_cols])

# Encoding the categorical variables
categorical_columns = energy_copy.select_dtypes(include = ['object', 'category']).columns.tolist()
energy_copy[categorical_columns] = energy_copy[categorical_columns].astype(str)

# looping through the categorical variables to encode them
label_encoders = {columns: LabelEncoder() for columns in categorical_columns}
for columns in categorical_columns:
    energy_copy[columns] = label_encoders[columns].fit_transform(energy_copy[columns])

### selecting the relevant columns for this analysis.
energy_features = energy_copy[['year', 'country', 'iso_code', 'renewables_consumption',
                               'non_renewables_consumption', 'total_consumption',
                               'carbon_intensity_elec', 'electricity_generation']]
```

#### Selecting the Features for clustering

The features required for developing the clusters for renewable and non-renewable energy consumption are selected. The data set is scaled and encoded so that the clustering can operate on numeric inputs.

K-means clustering is performed on the features to divide the data set into clusters. This method is chosen for its simplicity, speed, and efficiency. The Calinski-Harabasz method is used to determine the optimal number of clusters for this analysis.

```{python}
#| label: calinski
#| warning: false
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

#### determining the optimal number of clusters using the Calinski-Harabasz method
### initializing the list to store scores
ch_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(energy_features)
    ch_score = calinski_harabasz_score(energy_features, kmeans.labels_)
    ch_scores.append(ch_score)
```

A visualization is provided below to help determine the appropriate number of clusters based on this method.

```{python}
#| label: calinski_plot
### creating the plot for the ch scores
plt.figure(figsize = (10, 6))
sns.lineplot(x = range(2, 11), y = ch_scores, marker = 'o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Calinski-Harabasz Score')
plt.title('Calinski-Harabasz Score vs. Number of Clusters')
plt.grid(True)
plt.show()
```

From this visualization, 8 is the optimal number of clusters for this analysis.

Using the optimal number of clusters from the Calinski-Harabasz method, the k-means clustering is performed.
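
For reference, the Calinski-Harabasz score used above is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom, so a higher score means tighter and better separated clusters. The sketch below computes it by hand with NumPy on hypothetical toy points; the function name `calinski_harabasz` and the data are assumptions for illustration, and the analysis itself relies on scikit-learn's `calinski_harabasz_score`.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion (higher is better)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = X[labels == c]
        centre = members.mean(axis=0)
        # Size-weighted squared distance of the cluster centre from the overall mean
        between += len(members) * np.sum((centre - overall_mean) ** 2)
        # Squared distances of the members from their own centre
        within += np.sum((members - centre) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: two tight, well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

good = calinski_harabasz(X, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
bad = calinski_harabasz(X, [0, 1, 0, 1, 0, 1])   # labels ignore the two groups

print(good, bad)
```

Labels that follow the two groups score far higher than labels that cut across them, which is the behaviour the selection of k relies on.
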

```{python}
#| label: k_means
from sklearn.cluster import KMeans

#### using the k means algorithm
k = 8

### performing the k means clustering
kmeans = KMeans(n_clusters = k, random_state = 12)
clusters = kmeans.fit_predict(energy_features)
energy_copy['cluster'] = clusters
```

The results from the k-means clustering algorithm are visualized below.

```{python}
#| label: k_means_plot
plt.figure(figsize=(10, 8))
sns.scatterplot(x = 'renewables_consumption', y = 'non_renewables_consumption', hue = 'cluster',
                data = energy_copy, palette = 'Set1', legend = 'full')
plt.xlabel('Renewables Consumption')
plt.ylabel('Non-Renewables Consumption')
plt.title('K-Means Clustering of Energy Consumption')
plt.legend(title='Cluster')
plt.show()
```

DBSCAN is then performed to refine the clusters and to ensure that they are well separated from outliers. This is done by first calculating the distance to the nearest neighbours and then visualizing the result.

```{python}
#| label: nearest_neighbor
from sklearn.neighbors import NearestNeighbors

## calculating the average distance to the closest neighbors
#### the number of neighbors taken into consideration for this analysis is 5
neighbors = NearestNeighbors(n_neighbors = 5)
neighbors_fit = neighbors.fit(energy_features)
distances, indices = neighbors_fit.kneighbors(energy_features)

# Column 0 is each point's distance to itself, so it is excluded from the average
avg_distances = np.mean(distances[:, 1:], axis = 1)

## sorting the average distances
avg_distances_sorted = np.sort(avg_distances)
```

The neighbourhood radius and the minimum number of samples are chosen, and DBSCAN is performed to determine whether there are any outliers and to refine the clusters.
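
A common heuristic for choosing the DBSCAN radius is to sort the neighbour distances and look for the point where they jump sharply. The sketch below illustrates the idea on hypothetical toy points using only NumPy. It is a variant of the calculation above rather than a copy of it: it uses the distance to the k-th nearest neighbour instead of the average over neighbours, the function name `sorted_k_distances` is an assumption, and the full pairwise distance matrix it builds is only practical for small data sets.

```python
import numpy as np

def sorted_k_distances(X, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))  # full pairwise distance matrix
    dist.sort(axis=1)                          # column 0 is the distance of a point to itself
    return np.sort(dist[:, k])

# Hypothetical toy data: a tight group of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [20.0, 20.0]])

k_dist = sorted_k_distances(X, k=2)
print(k_dist)
```

In this toy example the sorted distances stay at 1.0 for the four grouped points and then jump to about 27.6 for the isolated one. A jump of that kind marks the distance scale that separates dense regions from isolated points.
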

```{python}
#| label: minval&eps
from sklearn.cluster import DBSCAN

### setting the eps parameter
eps = 17500

### the number 5 is selected as min_samples for this analysis
min_samples = 5

### initializing the dbscan with the chosen parameters
dbscan = DBSCAN(eps = eps, min_samples = min_samples)

## fitting the DBSCAN to the features of the energy data set
clusters = dbscan.fit_predict(energy_features)
```

```{python}
#| label: dbscan_viz
### visualizing the clusters
plt.figure(figsize = (10, 6))
sns.scatterplot(x = 'renewables_consumption', y = 'non_renewables_consumption', hue = clusters,
                data = energy_copy, palette = 'viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Renewables Consumption')
plt.ylabel('Non-Renewables Consumption')
plt.legend(title = 'Clusters')
plt.show()
```

The number of outliers is also calculated after DBSCAN is performed.

```{python}
#| label: outliers
outliers = energy_features[clusters == -1]
num_outliers = len(outliers)
print(f'Number of outliers detected: {num_outliers}')
```

The silhouette score is computed as a form of cluster validation for this analysis.

```{python}
#| label: sillhouette_score
from sklearn.metrics import silhouette_score

# Note: points labelled -1 (noise) are treated here as one more cluster
silhouette_avg = silhouette_score(energy_features, clusters)
rounded_score = round(silhouette_avg, 2)
print("Silhouette Score:", rounded_score)
```

The silhouette score of 0.96 implies that the clustering here is a good one. It means that the data points are much closer to the other points in their own cluster than to the points in other clusters. This indicates that there is clear separation between clusters, with good cohesion within clusters.

This analysis is concluded with the creation of two clusters that determine whether a regional boundary is consuming more renewable energy or more non-renewable energy.
In the DBSCAN output from scikit-learn, the label -1 marks noise points, which is how the outliers are counted above, and the clusters themselves are numbered from 0. The two groups are interpreted as regional boundaries consuming more renewable than non-renewable energy and regional boundaries consuming more non-renewable than renewable energy. From the DBSCAN visualization, it is observed that more regional boundaries are currently consuming more non-renewable than renewable energy.

# Repo Organization

The following folders comprise the project repository

- **.github/:** This directory is designated for files associated with GitHub, encompassing workflows, actions, and templates tailored for issues.
- **\_extra/:** Reserved for miscellaneous files that don't neatly fit into other project categories, providing a catch-all space for various supplementary documents.
- **\_freeze/:** Within this directory lie frozen environment files containing comprehensive information regarding the project's environment configuration and dependencies.
- **data/:** Specifically allocated for storing data files crucial for the project's functionality, encompassing input files, datasets, and other essential data resources.
- **images/:** Serving as a repository for visual assets employed throughout the project, including diagrams, charts, and screenshots, this directory maintains visual elements integral to project documentation and presentation.
- **.gitignore:** This file functions to specify exclusions from version control, ensuring that designated files and directories remain untracked by Git, thus streamlining the versioning process.
- **README.md:** Serving as the primary hub of project information, this README document furnishes essential details encompassing project setup, usage instructions, and an overarching overview of project objectives and scope.
- **\_quarto.yml:** Acting as a pivotal configuration file for Quarto, this document encapsulates various settings and options governing the construction and rendering of Quarto documents, facilitating customization and control over document output.
- **about.qmd:** This Quarto Markdown file supplements project documentation by providing additional contextual information, elucidating project purpose, contributor insights, and other pertinent project details.
- **index.qmd:** This serves as the main documentation page for our project. This Quarto Markdown file provides detailed descriptions of our project, including all code and visualizations.